http://goo.gl/U4b70r
29 October 2014
Ceph Developer Summit: Hammer
Day 2
OSD (Tiering): Fine-Grained Promotion Unit
Zhiqiang Wang
A
I think so, but it was your blueprint, so if you want to give us a little bit of an overview of what you were thinking, then we can go from there.

B
Sure.
C
Hey, one simple comment here: that's okay, I saw it come up previously, right after last week's weekly performance meeting.
B
Okay, this is some tests we did before on cache tiering performance. We did four cases: the first three are with cache tiering and the last one is without cache tiering, but with the SSDs used for the journal. The performance of all four cases is shown in the table in this chart. As you can see, without cache tiering the write performance is about 1000 IOPS, the read is about 1500 IOPS, and for the random read it is...
B
It is 1200 IOPS. Okay, the difference between the with-cache-tiering cases, case one, case two and case three, is this: in case one we set the dirty ratio and the full ratio of the cache tiering so that we can hold all of the data in the cache tier, so we don't need to do flush, we don't need to do evict, and there is also no promotion. In case two there is flush work, but there is no promotion, and in case three we have both.
B
We have all of the eviction, flush and promotion in case three, and as you can see from the results, the case one and case two results are good compared with the without-cache-tiering result, especially for the read. But in case three, in which we have promotions, flushes and evictions, the result is very poor. Compared with the without-cache-tiering result, for the write it is just about thirty to thirty-five percent, and for the read it is about sixty percent. So the result for case three is very poor.
F
In addition to the tests in case three that you did, it would be interesting as well to have a Zipf distribution with fio, which would let you look at what kind of performance you see when you have a combination of hot and cold data. My guess is it will still be bad, just based on the tests I've done, but maybe it won't be quite as bad, actually.
B
Okay, and that was the IOPS data we just looked at. You can also look at some of the latency data, as shown in the picture, for these four cases. The first one is without cache tiering, and the latency is around 200 milliseconds. Then for case one and case two the read response time is very good compared with the without-cache-tiering result, but for case three the latency is huge: it's above 1000 milliseconds. I think this is not acceptable. Yes.
E
So this is the random read case. When we promote due to a read, it does the promotion first: it reads the object from the backing pool, then it writes it to the cache pool, and then it does the read. This patch will instead forward the read to the backing pool and then promote, so the first read, at least, on that object won't block at all; only a second read to that same object that comes right after it would. Yeah.
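The ordering difference just described can be compared with a toy model. This is purely illustrative, not Ceph code, and the latency constants are invented round numbers:

```python
# Toy model of a cache-tier read miss (illustration only, not Ceph code).
# The latency constants are made-up values chosen just for comparison.

BACKING_READ_MS = 10.0   # read the object from the backing pool
CACHE_WRITE_MS = 2.0     # write the promoted copy into the cache pool
CACHE_READ_MS = 0.5      # serve the read from the cache pool

def promote_then_read():
    """Current behavior: the client read blocks behind the whole promotion."""
    return BACKING_READ_MS + CACHE_WRITE_MS + CACHE_READ_MS

def forward_then_promote():
    """Patched behavior: forward the read to the backing pool immediately,
    promoting off the critical path, so the first read sees only the
    backing-pool latency."""
    return BACKING_READ_MS
```

Under this model the first read on a cold object pays only the backing-pool latency; the promotion cost only shows up for a second read issued before the promote completes.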
E
It's actually different than that: once you decide to promote, it'll both promote and forward, instead of promoting and then retrying the operation. So I think in a random read case it'll probably help a lot, because you're usually going to touch that object once and then you're going to touch other objects. In other workloads where you have two reads to the same object, it will only help with the first one; the second one will be slower, but it'll at least pipeline them a little bit.
B
Yes, we did some latency breakdown data for the promotion. We divided the whole time spent in the cache OSD into the promotion read, the promotion copy and the promotion replication, and the total time is the cache tier OSD latency, as shown in the table. We can see, for random read, when queue depth is one...
B
The promotion time is above ninety percent of the total time, and for queue depth 16 it is about 15 percent, right. The ratio is even higher for the write: at queue depth 1 the promotion time is above ninety percent, ninety-two percent, and at queue depth 16 it is above 76 percent for the promote. So you can see that most of the time is spent on the promotion.
G
Well, okay. So, well, so that's true with this design. I think all of these slides are basically just saying cache misses are more expensive than they should be with this design, right? So that's true. The random distribution is actually a good way to measure the overhead of a cache miss.
E
So yeah, so promotions. Yes, promotions are bad, but I think we should be careful about focusing just on promotion without also having some effort to look at what the actual impact on a more realistic workload is. Unfortunately, though, what "more realistic" means has been hard to figure out, because we don't have good models for what a typical, I don't know, cloud VM workload looks like, and what the actual distribution is. How skewed is it?
E
Is it a good model, and how skewed is it? Because, I mean, the key variables are whatever the Zipf parameters are, that is, basically how skewed the distribution is toward the hot data, and then the size of the cache relative to the base, or the total data set. Those are the two key variables, and there are obviously going to be some combinations that are good and bad, and again, ideally you'd make a little graph...
E
...that shows where it's going to be good and where it's not going to be good. Just being able to do that experiment would be helpful. That's one direction: just to understand it. But then obviously the other direction is: yes, when you do promote, you want it to go as fast as possible, and it's expensive right now. So I think we sort of need to pay attention to both of those. Does that make sense?
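The two key variables named here, the skew of the Zipf distribution and the cache-to-data-set ratio, can be explored with a small model. This is a hypothetical sketch for building the kind of good/bad-region graph suggested above, not tied to any Ceph tooling:

```python
def zipf_hit_rate(n_objects, skew, cache_fraction):
    """Fraction of requests served from cache, assuming the cache pins the
    hottest objects and object popularity follows a Zipf(skew) distribution.
    skew=0 is a uniform workload; larger skew means hotter hot data."""
    weights = [1.0 / (rank ** skew) for rank in range(1, n_objects + 1)]
    cached = int(n_objects * cache_fraction)
    return sum(weights[:cached]) / sum(weights)
```

For example, with a cache holding 20% of 1000 objects, a uniform workload (skew 0) hits the cache exactly 20% of the time, while a skew-1 Zipf workload hits it closer to 80%.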
B
Okay, I put the link in the etherpad. We also tested a feature where the promotion doesn't happen on the first read but on the second read. This is a small feature we did before, and the pictures show that with this feature the IOPS improves from about 200 to 250 IOPS to around 1800 IOPS, the latency improves about two to three times, and the read improves about sixty percent.
E
Yeah, okay, so I think this is a good example, because if you really have a random workload, then you actually never want to promote, ever. If your workload actually was random, then you would basically want to promote just a random subset, however big your cache is. Say your cache is twenty percent of your total size: you just take a random twenty percent and promote it, and then never change the contents of your cache.
E
That would be the winning strategy, and so you would actually not want to promote on the second read; you would want to promote on, like, the 100th read or something. So again, yes, promote on second read is a really good idea and it helps, but that workload isn't the right one to show the situation where it's really going to help, and how much it really helps in the real situation is actually...
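The point that no promotion policy beats a pinned random subset under a truly random workload can be checked with a toy simulation. This is illustrative only; the simple LRU cache here is not how the cache tier actually chooses evictions:

```python
import random

def uniform_hit_rate(n_objects, cache_size, n_requests, seed=0):
    """Simulate eager promote-on-first-read with LRU eviction under a
    uniform random workload. The hit rate converges to
    cache_size / n_objects, i.e. no better than pinning a random subset
    and never promoting again."""
    rng = random.Random(seed)
    cache = {}  # object id -> tick of last use (simple LRU)
    hits = 0
    for tick in range(n_requests):
        obj = rng.randrange(n_objects)
        if obj in cache:
            hits += 1
        elif len(cache) >= cache_size:
            # evict the least recently used object before promoting
            del cache[min(cache, key=cache.get)]
        cache[obj] = tick
    return hits / n_requests
```

With 100 objects, a 20-object cache and 20,000 uniform requests, the hit rate comes out near 0.2, the cache fraction, no matter how eagerly the policy promotes; every promotion cost is pure overhead.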
G
So I'm wondering whether the whole thing here is to see how many operations we can serve while the promote is going on. We can serve reads in the general case while a promote is happening, right, by redirecting to the back end. If we want to make sure of ordering, all we have to do is proxy it, right? The write, that is. Yeah, right.
E
Reads, we can also... I mean, so there are two things: we can forward reads or we can proxy reads. So far we've only been forwarding, but just being able to proxy reads in general will probably also have a sort of across-the-board improvement in latency, although it'll cost some CPU on the cache tier OSDs, right, because of the base pool ops.
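The forward-versus-proxy distinction can be sketched like this; `Pool` and the handler names are hypothetical stand-ins for illustration, not Ceph interfaces:

```python
class Pool:
    """Toy stand-in for a RADOS pool (illustration only, not Ceph code)."""
    def __init__(self, name, objects):
        self.name = name
        self.objects = objects

    def read(self, obj):
        return self.objects[obj]

def handle_read_forward(backing_pool, obj):
    # Forward/redirect: tell the client to retry against the backing pool,
    # costing the client a second round trip.
    return ("redirect", backing_pool.name)

def handle_read_proxy(backing_pool, obj):
    # Proxy: the cache-tier OSD fetches the data itself and replies
    # directly, saving the client hop at the cost of cache-tier CPU.
    return ("data", backing_pool.read(obj))
```

Forwarding costs the client a second round trip to the backing pool; proxying keeps a single client round trip but spends cache-tier OSD CPU doing the backing-pool read on the client's behalf.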
G
Writes are a little trickier. I think you're getting to that later in this talk, but we would need some kind of partial promote where we keep deltas of the object. That's...
E
All right, I'm making a couple of notes in the pad, but I think so: step one would be the wip-promote-forward branch; step two, or just a refinement of that, is to always forward when a promote is in progress. Actually, that should probably just be a todo item in there. And then a third step, which would be more complicated, would be to proxy the reads instead of redirecting the reads, but that'll be a little bit more code. I won't do that now, but yeah, it does look complicated.
B
Okay, the first idea I'm going to propose is to use a fine-grained promotion unit. As we know, in the current setup the default object size is four megabytes, and when we do promotions we promote the whole object. I think four megabytes is too big for promotion, thus I propose to use a smaller promotion unit: we can divide the object into 4K units, and we call this unit a page.
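One way to picture the proposed bookkeeping: a 4 MB object split into 1024 pages of 4 KB, with a per-object bitmap of which pages have been promoted. This is a hypothetical sketch of the idea, not the proposed implementation:

```python
OBJECT_SIZE = 4 * 1024 * 1024        # default RADOS object size: 4 MB
PAGE_SIZE = 4 * 1024                 # proposed promotion unit: 4 KB
N_PAGES = OBJECT_SIZE // PAGE_SIZE   # 1024 pages per object

class PartialObject:
    """Hypothetical bookkeeping for a partially promoted object: a bitmap
    records which 4 KB pages are already present in the cache tier."""
    def __init__(self):
        self.present = [False] * N_PAGES

    def promote_range(self, offset, length):
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.present[page] = True

    def can_serve(self, offset, length):
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        return all(self.present[p] for p in range(first, last + 1))
```

A read can be served from the cache tier only if every page it touches is present; otherwise it has to go to (or be proxied from) the backing pool.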
G
I'm also wondering how much of it you get for free just by proxying the read. Keep in mind, even if you're promoting a 4K page instead of a 4 megabyte object, on a 4K read you're still having the read wait behind the promote and commit, which is a network round trip and a disk operation. So you kind of still have to proxy the read; that's not optional.
E
I think the crummy part is the read latency with promote, and we can do these much simpler things that I think will help a lot: basically proxying the reads and forwarding while we're promoting. So I think there are simpler things we can do first that help with random reads a lot. But the random writes are the hard ones, because we can't, in a simple way, either forward them while the promote is in progress or do them locally, and I think...
G
It's just hard. This idea: if there's no read in the transaction, then once you have the object info, you have a license to perform any writes you want, as long as you don't actually serve any reads. So you need to buffer all of the writes you haven't applied yet, and then once you... what do you do, do you forward?
E
I think the thing that worries me is the way that promotion works right now in a cache tier: during the promote we're writing to a temporary object, and it's going to be a series of reads, and only when we're done do we move it into place and actually complete the promotion. Whereas in this case, in order to persist the client write, we have to have something that's permanently persisted that has enough data to overlay onto the object.
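The overlay requirement can be illustrated with a minimal sketch: persisted client-write deltas are applied on top of whatever the promote later reads back from the backing pool. This is a hypothetical illustration, not Ceph code:

```python
def overlay(base, deltas):
    """Apply persisted write deltas, each an (offset, bytes) pair in
    arrival order, on top of the object data read from the backing pool."""
    data = bytearray(base)
    for offset, chunk in deltas:
        data[offset:offset + len(chunk)] = chunk
    return bytes(data)
```

As long as the deltas are durably persisted before each write is acknowledged, the promote can complete later and replay them to reconstruct the up-to-date object.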
E
I think probably any of those operations that are non-trivial partial writes, we would just block until the promotion completes, at least until we decide that that particular category of operation is so important and common that we should add the complexity to deal with it. And once you block one of those ops, then you have to block all ops, at least from the same client. Yeah, all the same writers, yeah; the client ordering, whatever it is, will still remain, right.
F
So right now, if you're doing a cache tier, you might use like 3x replication, right? So you've got write amplification when promoting for a read. Is there any reason that you can't do something like separating out read promotions versus writes, so that read promotions go to some kind of pool that's not replicated?
G
You'd have to move it over... I mean, you could do that if you happen to have that information a priori. Right now, all you have to do is configure your cache pool to be one replica, and...
G
Actually, I'm not so sure, because, well, I mean, how many people... well, okay. So RBD will pretty much never use a cache pool for read caching; there'd be no point, since it's got page caches all over the place. So RBD will always be a write caching situation, whereas radosgw may genuinely be a read caching situation a lot of the time, and for a read caching situation that might make sense. Yeah.
E
Okay, so, sorry, coming back to your actual proposal: I think the general feeling is that page-granularity write caching is possible, and there are situations where it would help, but it's really hard. This is my take. And there are a whole bunch of other things that we can do that are going to help a lot of the same cases, and that we should do first, before we do the really hard thing.
G
Of this, the interaction with, ah, the interaction with snapshots comes to mind as the most difficult. For one thing, when you receive a read on an object that's in this partial state, let's say you're reading the first megabyte of the object, and you've written random 4K chunks throughout that one-megabyte piece. When you perform the read, the primary OSD will then have to perform the same read on the blocks that have been promoted, then overlay...
G
Object classes can't do asynchronous reads, so that's a thing, but we can get around that by just not allowing it. Snapshots are tricky because you might receive a write with an updated snap ID while you're in this state, so you'll have a snapshot of a partial object. Then, when you finish the promote... well, let's say later, when you go to write them out, you have to write out diffs, or if you chose to promote the whole object, you'd be promoting into an otherwise immutable snapshot. Oh yeah!
G
So if you only had parts of the clone written in, because that was the state it was in when you got the new snapshot, then when you receive reads on other pieces of that clone that you can't satisfy, you either have to keep going to the base pool, or, if you choose to promote the whole object because it's really hot for some reason, you'd have to modify the clone where it's sitting in the cache tier, which doesn't sound hard, but it surprisingly is. It's all doable, though.
H
For example, when we have a promote for a whole object and the client op is a 4K write, we can issue the promote, and once the 4K touched by the client side has been promoted, we can reply to the client for the 4K data first, and then fill in the 4M of data in the background.
E
We either block and wait for a full promote, or do another partial promote, or whatever, and then we just have to... then the only thing you have to implement is the complexity of the partial object and the promote that sort of merges the two together. That seems to me like kind of the simplest thing, and then you can always add more on to that to handle more complicated types of operations. But I think, even with that sort of minimal strategy...
E
...I think these are the things we should do first, because they're way, way, way easier, and I think they're going to help a lot. So that's changing the read behavior to forward during the promote, which will completely avoid the latency cost for promotions on read; I think that'll basically go away, right. And then also we can have the ability to proxy writes, so we have the choice not to promote on a single 4K write.