From YouTube: CDS Infernalis (Day 2.2) -- OSD: Tiering
Description
Videos from Ceph Developer Summit: Infernalis (Day 2.2)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
A
All right, now we're on to the next one. This is a double whammy here, with the next take on tiering and also the dynamic data relocation for cache tiering: all things tiering. So Sam, do you want to start with yours, and then we can hear from the Intel guys on theirs as well? Sure.
B
Okay. Can you guys hear me? Yep. An oft-requested feature is the ability for the OSD to offload cold data, not necessarily to another RADOS pool in the same data center, but to something completely different and hopefully cheaper: for example, a cheaper, crappier RADOS cluster in a different data center, or, I don't know, S3 or something. The way the current cache tiering system works is that we pretty much just make librados calls to the next pool down, but in principle we could use really anything else to write the objects out.
B
B
This wouldn't really be a cache tier; it would be more of a tier. The tier would use this opaque plugin interface to actually do the object demotion, and it would store within the local RADOS tier a metadata object, a redirect, indicating which plugin was used to offload it and some information from the backing system about where it went. I have here a very low-thought crack at what the interface might look like. The details aren't particularly important.
B
B
B
C
B
C
B
Anyway, yeah, there was no reason for us to impose a name; at least, I can't think of a good reason for us to impose the name on the backend, so we might as well let the backend choose the name. I also don't want these to be overwritable. Again, I think it would probably be easier, for a system like a RADOS erasure-coded pool, not to allow partial overwrites. So the objects are append-only, and they're immutable once closed.
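A minimal sketch of the write-once plugin idea described above, assuming an in-memory stand-in for the cold backend. The class name `InMemoryColdBackend`, the method names, and the layout of the redirect record are all hypothetical illustrations; none of them comes from the blueprint or from Ceph itself.

```python
import uuid


class InMemoryColdBackend:
    """Hypothetical cold-storage plugin: the backend picks the object name,
    and objects are append-only until closed, then immutable."""

    def __init__(self):
        self._open = {}    # handle -> bytearray still being appended to
        self._closed = {}  # backend-chosen name -> immutable bytes

    def create(self):
        # The caller gets an opaque handle, not a name it chose itself.
        handle = uuid.uuid4().hex
        self._open[handle] = bytearray()
        return handle

    def append(self, handle, data):
        if handle not in self._open:
            raise ValueError("object is closed or unknown; no overwrites allowed")
        self._open[handle] += data

    def close(self, handle):
        # The backend decides the final name when the object is sealed.
        name = "cold-" + handle
        self._closed[name] = bytes(self._open.pop(handle))
        return name

    def read(self, name):
        return self._closed[name]


def demote(backend, plugin_name, payload, chunk_size=4):
    """Write payload out through the plugin and return the redirect record
    that the local tier would keep in place of the data."""
    handle = backend.create()
    for start in range(0, len(payload), chunk_size):
        backend.append(handle, payload[start:start + chunk_size])
    return {"plugin": plugin_name, "backend_name": backend.close(handle)}
```

The redirect here only records which plugin was used and the name the backend returned, which is roughly the information the metadata object is described as holding.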
B
So one thing that is a question for me: there are slow backends like S3, and then there are hyper-slow backends like Glacier or a tape bot. Do we care about extending this plugin interface to be capable of handling something like a tape bot? It would mean that RADOS would have to be able to propagate some kind of an EINPROGRESS error code to clients.
B
B
D
B
C
I think for a tape bot, I mean, the latencies are like tens of seconds to minutes. I think that's, you know, excruciating but tolerable [inaudible]. But I mean, if you have an HSM-type system, where you have just random files in your file system that are archived off to tape, and then you go try to read them, users kind of expect that it's going to take a while.
C
B
B
I think the librados call itself returns EINPROGRESS, and anything above that that wants to block can block. Yeah, okay. I don't want librados even keeping track of outstanding in-progress object promotions either. If that's something the client thinks is important, it can probably just do that. I mean, for something like this, polling on an interval of like tens of seconds is probably fine, and that's good enough. We don't need to keep enough state for a notify, at least. That's my feeling.
B
It seems simple, except that we're building up all these requests. That happens enough on the OSD as it is. I think, going forward, any designs we come up with should really, really minimize anything the OSD holds in memory. And it's not really that much more complicated: it just means that librados returns EINPROGRESS, and if the client library wants to poll, it can do that. Polling is easy to do anyway, right?
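The client-side polling idea can be sketched as follows. This is only an illustration under stated assumptions: `SlowBackend` and `read_with_polling` are hypothetical names, the backend is a fake that needs a fixed number of attempts before data is ready, and real librados calls are not used. The only real API relied on is Python's `errno.EINPROGRESS` constant.

```python
import errno
import time


class SlowBackend:
    """Hypothetical stand-in for a tape-like tier: the first few reads
    fail with EINPROGRESS, then the data becomes available."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    def read(self, name):
        if self._remaining > 0:
            self._remaining -= 1
            raise OSError(errno.EINPROGRESS, "retrieval in progress")
        return b"payload of " + name.encode()


def read_with_polling(backend, name, interval_s=0.0, max_polls=10, sleep=time.sleep):
    """Keep no state on the server side: the caller simply retries on an
    interval until the read stops reporting EINPROGRESS."""
    for _ in range(max_polls):
        try:
            return backend.read(name)
        except OSError as exc:
            if exc.errno != errno.EINPROGRESS:
                raise  # some other failure; do not mask it
            sleep(interval_s)
    raise TimeoutError("object still not retrieved after %d polls" % max_polls)
```

In the discussion the suggested interval is on the order of tens of seconds; `interval_s` defaults to zero here only so the sketch runs instantly.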
E
B
Okay, so another question is how this interacts with snapshots, if at all. In principle, it's not really that hard. We just write all the relevant snapshot information into the metadata object, and that even allows us to do snapshot trims without promoting. But it'll be kind of tedious, so that's extra work we'll have to do. Also, [inaudible] snapshots. Another one is, because of the anonymous object-name nature of this, if you open an object, start writing, and then there's a peering interval...
B
...we will have lost that information. So we probably need to write into the PG log that we're starting a demotion to one of these things, so that the next primary can look backwards through the log and clean up after any canceled promotions or demotions. Yep, yep. Those are all the gotchas I've got. I don't know if anyone else has any other ones, or use cases that I missed. That's the big one.
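A rough sketch of the log-the-intent idea, assuming the PG log can be modeled as a plain list of tuples. The entry kinds `demote_start` and `demote_done` and the function names are hypothetical; the real PG log format is not shown in the discussion.

```python
def log_demote_start(pg_log, object_id, backend_handle):
    # Record the intent before any data is written to the backend.
    pg_log.append(("demote_start", object_id, backend_handle))


def log_demote_done(pg_log, object_id, backend_handle):
    # Record that the backend object was closed and the redirect stored.
    pg_log.append(("demote_done", object_id, backend_handle))


def unfinished_demotions(pg_log):
    """What a new primary could do after a peering interval: scan the log
    and list demotions that were started but never completed, so the
    half-written backend objects can be cleaned up."""
    started = []
    finished = set()
    for kind, object_id, backend_handle in pg_log:
        if kind == "demote_start":
            started.append((object_id, backend_handle))
        elif kind == "demote_done":
            finished.add((object_id, backend_handle))
    return [entry for entry in started if entry not in finished]
```

Because the backend object is anonymous until the demotion completes, the log entry is the only place the new primary could learn that an orphan exists.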
E
E
B
[inaudible] Okay, so the main reason is that there isn't a reason to do that. If you're using S3, for example, it already handles object placement. There's no reason to partition the namespace of S3 objects into PGs that only proxy off to S3; that doesn't seem worthwhile to me. We might as well have the cache tier, or rather the tier, simply write directly to the backing store. The other reason is this is a little bit different.
B
B
E
B
That's another question. The way I've described this interface, there isn't really any relationship between the name that the backend gets or generates and the name for the object in the front end. And, as in the RADOS pool, there's also not necessarily any really good relationship between the name the object has in RADOS and its actual user-facing manifestation. If it's a RADOS Gateway (RGW) object, for example, the head object is named after the user-visible S3 name.
B
That's the RGW S3, not the backend S3; sorry for choosing that example. But the other objects, if it's a big object, are named after some kind of monotonically increasing sequence number and a something else. So the only way to do that would be to propagate some kind of information about what this object is and how it relates to the higher-level user-facing object [inaudible].
B
B
Okay, so basically, whatever process was looking at the cold storage and wanted to perform reads would have to, one, be capable of tolerating the fact that some of the pieces of the thing that it's looking for might not be there, because they haven't been demoted. Yep. And two, it needs to be able to reconstruct the jigsaw puzzle. So it would have to know what the user-facing application was and how it was naming objects.
F
B
F
Yes, this may be a dumb question, but the RADOS GW backend tiering that [inaudible] is planning to implement, right: it's these separate buckets on different storage. So how are they actually implementing that without this infrastructure? [inaudible] different buckets, really.
F
F
F
Otherwise, how [inaudible]? If somebody says, "OK, this object will be accessed only once in a year," then they have to locate it in some storage [inaudible]. So how will they actually be doing that without the help of this [inaudible]?
B
OK, so RADOS GW has the freedom to write to any pool it wants. So, from its point of view, if it has a replicated pool and an erasure-coded pool available to it, there's absolutely no reason it couldn't just choose to write to the erasure-coded pool instead. It doesn't need OSD help for that. Similarly, it could write to S3 on its own. That would be a little bit pointless, but maybe it could.
B
F
B
B
F
So we are saying that we will be returning some "in progress" status to the upstream, right? So why are we actually planning to do that? Why not actually wait until the object is really retrieved and then send the ack? [inaudible]
B
Oh no, the in-progress thing is only about a tape bot or Amazon Glacier, where it simply takes a preposterous amount of time to actually retrieve an object. So yeah, we could just wait for the I/O to complete; maybe that's the best thing to do. But the question was basically: does it make sense to extend the interface to deal with that, to instead return an error and say, "Here's a handle you can use to check on the progress of this operation; come back later"?
B
Yeah, maybe it makes more sense to just accept that it'll take a while. Yes.
C
I mean, it might be perfectly reasonable to have all the ingest come into a RADOS pool, and then, you know, slowly in the background that gets spilled off to something really slow. That's actually the typical model for tape, because it gives you [inaudible] objects, and then they get staged out after an hour or something. So I think you might have, like, a pool policy that does that.
F
F
B
B
F
B
B
C
C
E
Actually, yeah, I think we have talked about part of this in the previous blueprint. Okay. What I want to do is to add [inaudible] tiered storage in Ceph. Currently we have the cache tier and then we have the base tier. [inaudible] we can do many tiers: we have the hot tier, ...
G
E
...and the [inaudible] and the cold tier, maybe as many as we want. And then what we can do on this multi-tier is that we can dynamically relocate data between these tiers, and then maybe you can do it manually or do it automatically. Yeah. [inaudible]
C
E
E
E
E
C
E
C
C
C
...two writeable tiers and then as many cold tiers, sort of the third, as you want. I mean, that's currently what we're describing. I think the question is whether that's sufficiently general to capture all that stuff.
C
I mean, okay, so in your example, having an SSD tier, an HDD tier, and an EC tier, I think that definitely works. Having like a fourth tier of tape, I think it works there too.
C
The only restriction is that what Sam's describing means that, in order to move it from erasure-coded to tape, the base tier would actually read it in and then write it back out again. And it's not clear what would make it decide that. How would it know that it's so cold that it's just going to move off to tape? Maybe some external hint or agent or something would have to come along and say, "This data set is ancient," and...
E
[inaudible] Yeah, we can do some of it automatically. Maybe it's hard to do automatically, but we can manually relocate the data to the cold tier from the client side using some commands. [inaudible] You can add some command [inaudible], as we talked about in the previous blueprint. [inaudible]
E
[inaudible] the SSD pool, right? Yeah, but sometimes we want to pin this hot data. Say we have a video we want to pin in the cache tier, and then later, maybe [inaudible] hot, we want to manually unpin it from the cache tier. Yeah, [inaudible] use case.
C
Yeah, I mean, I think so. Before, the example was like a database that you know is always going to be hot, and in that case you just put it on the other pool. But I think in the video example, that makes a lot of sense: you know something is about to be hot, you pin it, and then you know it's [inaudible]. So I think that makes sense. Having a pin operation seems pretty reasonable, yeah.
E
C
Go ahead. I mean, the other half of this is that I think the hints that we've already added will also give you most of what you need, right? We just want to add an additional pin and unpin.
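To make the pin and unpin idea concrete, here is a small sketch of a cache that evicts the least recently written unpinned object when it is over capacity. It assumes a simple LRU policy and an in-memory store; `PinnableCache` and its methods are hypothetical names, not an existing Ceph or librados interface.

```python
from collections import OrderedDict


class PinnableCache:
    """Hypothetical cache tier: LRU eviction that skips pinned objects."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._objects = OrderedDict()  # name -> data, oldest first
        self._pinned = set()

    def put(self, name, data):
        self._objects[name] = data
        self._objects.move_to_end(name)  # mark as most recently used
        self._evict()

    def pin(self, name):
        self._pinned.add(name)

    def unpin(self, name):
        self._pinned.discard(name)
        self._evict()

    def _evict(self):
        # Drop the oldest unpinned objects until we fit in capacity.
        while len(self._objects) > self.capacity:
            victim = next((n for n in self._objects if n not in self._pinned), None)
            if victim is None:
                break  # everything left is pinned; nothing can be evicted
            del self._objects[victim]

    def __contains__(self, name):
        return name in self._objects
```

In the video example from the discussion, the object would be pinned before it is expected to become hot and unpinned afterwards, at which point normal eviction applies to it again.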
E
E
G
C
Yeah, yeah, yeah. You might invest in a new microphone on that; it's pretty muffled. So the one thing that I probably should have done but didn't is contrast what you wrote up, Sam, with what we sort of dreamt up forever ago, in like the Firefly CDS, with the cold tiering. Did you by chance look at that one?
B
The only real difference is... so, they actually cover kind of disjoint areas. I haven't read it recently, but if I recall, they cover disjoint areas. That one was more about policies, or choices for how to implement policies, which we still need here. So all of that is still applicable here. The only contribution here is that we might not use a RADOS pool as the cold tier. That's the only difference.
C
That is definitely a big difference, but I think the other one is... so, this is the RADOS tier [inaudible]; I'll post the link in the chat. The other difference is, I think in this original proposal you could... [inaudible] because I think you could actually write to the backend.
C
C
B
C
B
C
G
C
...which is right, so it's a redirect, I guess. I'm not sure I see the difference. Oh no, sorry.
B
B
The next question would be: is there value in having the cache tiering system; the Firefly variant of tiering, where you're able to get useful information back from the back tier (incidentally, you could have a RADOS pool in a different data center, which you could talk to like a RADOS client, so there would be a place for that); and yet a third system, where it's an opaque plugin and the primary handles all of it? So do we need all three of those, or maybe just the cache tiering and the third one?
B
That's the next part. We'd want sort of motivating implementations or motivating use cases to make a decision, I would think. Yeah.
C
Something like what you propose, where it's, you know, having a plugin that's some other backend, could be S3. I think looking at something with [inaudible]-type latencies makes sense, because it could be RGW in another data center, or like the geo-distributed whatever-it-is scenario. And another RADOS pool: you could have a cache tier that's SSD, a base tier that's replicated HDD, and like a very widely striped erasure-coded cold tier that's still RADOS. But I mean, if it's...
B
You lose the RADOS specifics. So if we actually do want to have the richer protocol that allows the client to receive a redirect and then go talk to the pool directly, for performance and/or offloading-the-primary's-CPU reasons, then we would need the intermediate, yeah, Firefly-style tiering, which is just about the worst name I could assign to that [inaudible].
B
B
G
B
B
That would allow us to extend the existing online, or asynchronous, cache tiering agent, which again, in this case, would apply to things other than just the cache tiering [inaudible]. That part is just an asynchronous process which runs in the background, scans objects, and does stuff based on that. We would sort of generalize it to that point, I think.
C
Yep. So the one other thing I want to throw out here while we're talking about this is that the other nice thing about this concept, where the base tier has all these pointers, these redirects, is that that model would also allow us to do some sort of deduplication underneath it, where that pointer says, "Oh, this object is composed of this chunk over here and that chunk over there," which are in some colder, reference-counted RADOS pool or something where all the deduplicated chunks are stored.
B
That's true. You don't even lose anything by doing it that way, because you wouldn't want to talk to that pool directly anyway. You'd want to go through the base tier.
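The deduplication idea can be sketched as a manifest of chunk references into a reference-counted chunk pool. This is an assumption-laden illustration: fixed-size chunking, SHA-256 content addressing, and the names `ChunkPool`, `make_manifest`, and `read_via_manifest` are all choices made here for the example and are not described in the discussion.

```python
import hashlib


class ChunkPool:
    """Hypothetical colder pool that stores each distinct chunk once and
    counts how many manifests point at it."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes
        self.refs = {}    # digest -> reference count

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunks:
            self.chunks[digest] = data
        self.refs[digest] = self.refs.get(digest, 0) + 1
        return digest

    def release(self, digest):
        self.refs[digest] -= 1
        if self.refs[digest] == 0:
            # Last reference gone: the chunk can be reclaimed.
            del self.refs[digest]
            del self.chunks[digest]


def make_manifest(pool, payload, chunk_size):
    """Split payload into fixed-size chunks and return the list of chunk
    digests that the base tier would keep as its pointer."""
    return [pool.put(payload[i:i + chunk_size])
            for i in range(0, len(payload), chunk_size)]


def read_via_manifest(pool, manifest):
    # Reads go through the base tier's manifest, never to the pool blindly.
    return b"".join(pool.chunks[digest] for digest in manifest)
```

For example, two objects that share a chunk store it once; releasing one object's manifest only drops chunks that no other manifest still references.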
C
E
E
C
B
C
B
D
B
So there are sort of two ways you can go with that. We can make it so that the plugin actually supports both kinds and it's the OSD's problem: it's just up to the OSD to query the plugin as to whether it supports overwrites or not and act accordingly. But the larger question is, it wasn't clear to me that we gain a lot by doing it that way, because the advantage of doing overwrites is kind of lost if you expect a lot of locality in the object anyway.
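A tiny sketch of the "query the plugin and act accordingly" option, assuming a plugin exposes a boolean capability flag. The attribute name `supports_overwrite`, the two plugin classes, and the `"proxy"` and `"promote"` outcomes are hypothetical labels for illustration only.

```python
class AppendOnlyPlugin:
    """Hypothetical backend that only accepts whole, write-once objects."""
    supports_overwrite = False


class OverwritablePlugin:
    """Hypothetical backend that can modify a stored object in place."""
    supports_overwrite = True


def plan_partial_write(plugin):
    """Decide how the OSD would handle a partial write to a demoted object:
    proxy it to the backend if the plugin allows overwrites, otherwise
    promote the object back into the local tier first."""
    if getattr(plugin, "supports_overwrite", False):
        return "proxy"
    return "promote"
```

A plugin that does not declare the flag at all is treated as append-only, which matches the more conservative reading of the proposal.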
B
So this is a very, very cold tier, right? This is meant to capture the case where ninety-five percent of objects ever written are literally never read, and in that case, as long as those objects never leave the cold tier, you've kind of won, right? Having an extra promote on a small write when you didn't necessarily need it might or might not be that big a deal.
G
B
...complicated protocol where the client would negotiate with the base tier, or with the hot tier OSD, and receive, you know, enough information to allow it to conclude, "I can safely write to the base tier right now." In that case it would be able to do partial writes and also other things that would be faster. Does that make sense? Did you have a use case where it would be useful?
B
C
Yeah, the worry is just that, as the complexity of the old proposal sort of suggests, anything that involves redirects and a sort of non-linear path through the [inaudible] means that it just gets really complicated when you start thinking about all the [inaudible]. In this case, you could proxy it. Yeah, proxying simplifies [inaudible].
B
Simplifies it a lot, actually, yeah. Exactly. So yeah, no, you're right: if it supported partial writes, we could conditionally proxy the write through to the backend, if it was the kind of thing the interface was capable of supporting. So I guess we'd want a use case where that was a win, I suppose. Yeah, and I don't think it's super [inaudible], but I feel...
C
B
...complicated. There are a couple of gotchas. Our omap interface is obnoxiously rich compared to most other systems; most other systems don't give you an arbitrary key-value thing. So we get around that here by just streaming it out as a packetized byte stream, out to some buffer stream in whatever the backend implementation is.
B
But if we allow partial overwrites on an object that has an omap payload, then that gets a little complicated. That's not to say that we couldn't simply have a policy where we record, when we offload it, whether or not it has an omap payload, and always do a promote if it has one. So it's not really that bad. Does that answer your question? [inaudible]
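The omap handling described above can be sketched as a length-prefixed byte stream plus a flag in the redirect record. The wire layout (two little-endian 32-bit lengths per entry), the function names, and the `has_omap` field are assumptions made for this example; the discussion only says the key-value data is streamed out as packetized bytes and that the presence of an omap payload is recorded at offload time.

```python
import struct


def pack_omap(omap):
    """Flatten a key-value map into one byte stream of
    [key length][value length][key bytes][value bytes] records."""
    out = bytearray()
    for key in sorted(omap):
        key_bytes = key.encode("utf-8")
        value = omap[key]
        out += struct.pack("<II", len(key_bytes), len(value))
        out += key_bytes + value
    return bytes(out)


def unpack_omap(blob):
    """Inverse of pack_omap: rebuild the map when the object is promoted."""
    omap, pos = {}, 0
    while pos < len(blob):
        key_len, value_len = struct.unpack_from("<II", blob, pos)
        pos += 8
        key = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        omap[key] = blob[pos:pos + value_len]
        pos += value_len
    return omap


def make_redirect(plugin_name, backend_name, omap):
    # Record at offload time whether the object carried any omap entries.
    return {"plugin": plugin_name, "backend_name": backend_name,
            "has_omap": bool(omap)}


def must_promote_before_overwrite(redirect):
    # Policy from the discussion: always promote if an omap payload exists.
    return redirect["has_omap"]
```

With this policy a partial overwrite is only ever proxied for objects whose offloaded form is plain data, so the backend never has to edit the packed key-value stream in place.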