From YouTube: CDS G/H (Day 1) - Cold Storage Pools
Description
https://wiki.ceph.com/Planning/CDS/CDS_Giant_and_Hammer_(Jun_2014)
24 June 2014
Ceph Developer Summit G/H
Day 1
Cold Storage Pools
B
Yeah, sorry, I'm on a laptop with no headset. So this is a really high-level overview blueprint. What we wanted to get a discussion going on is: what is it going to take for a pool type, or some other construct in Ceph, to exist where something can be written and effectively never rebalanced or rewritten, or where that only has to happen very infrequently? So we're talking about, you know, I make the point in here, we talk about cold storage.
B
Think of this as the level above tape, but not actively accessed data. This isn't something you're spending a lot of time looking at, but it's still data you need to get to a lot faster than you need to get to tape. We could go into the use cases for that, and there are many, but that's the general idea here. Since I'm not a developer, I didn't put anything technical in the detailed description; I was hoping we could get some of that out of this discussion. But that's kind of the 50,000-foot view.
C
Is it possible to adjust... sorry, okay, I'm going to start talking now. Is it possible that what you're looking for is more like an interface, where the OSD will notice extremely cold objects and demote them to something else, something like S3 maybe, or a different interface? It could be plugin-based, to make it backend agnostic.
D
So the thought I had was: right now, when the peering process gets going and it's setting up, it sets pg_temp mappings to avoid moving stuff, just so we can remain available until it has sort of rebalanced the data. But what if that logic was policy based? Instead of doing the normal thing, in the cold storage pool it just sets the pg_temp mapping to be where it was, and it doesn't change it until it drops below the desired redundancy, at which point it...
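A minimal sketch of the policy being floated here, written as a standalone Python toy rather than real Ceph code; the function and parameter names (choose_acting_set, pinned_acting, min_redundancy) are illustrative assumptions, not actual Ceph internals:

# Sketch: "lazy" pg_temp-style policy for a cold-storage pool (hypothetical).
# Keep the PG pinned to its old acting set and only fall back to the fresh
# CRUSH mapping once live redundancy drops below the target.

def choose_acting_set(pinned_acting, up_osds, crush_mapping, min_redundancy):
    """Return the acting set to publish for a cold-storage PG.

    pinned_acting  -- OSD ids the PG was last mapped to (the pg_temp-style pin)
    up_osds        -- set of OSD ids currently alive
    crush_mapping  -- what CRUSH would compute for this PG today
    min_redundancy -- number of live replicas we insist on keeping
    """
    live = [osd for osd in pinned_acting if osd in up_osds]
    if len(live) >= min_redundancy:
        # Drag our feet: stay where we are, no data movement.
        return pinned_acting
    # Redundancy has dropped too far: give in and follow CRUSH,
    # which triggers the usual backfill/recovery.
    return crush_mapping


if __name__ == "__main__":
    pinned = [3, 7, 12]
    alive = {3, 12, 20, 21}          # OSD 7 has failed
    fresh = [3, 12, 20]              # what CRUSH says today
    print(choose_acting_set(pinned, alive, fresh, min_redundancy=2))            # stays pinned
    print(choose_acting_set(pinned, alive - {12}, fresh, min_redundancy=2))     # remaps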
D
Yeah, I mean, yes and no. The OSD map is linear in the number of PGs, not OSDs, right; it's just a small constant, so this is bumping that constant up by something less than 100, hopefully. Maybe that's a price that you pay on your cold storage tier, right. I mean, ultimately, CRUSH is pushing you toward the end of the spectrum where you have very little metadata because you calculate it all on the fly.
D
Kind of. You still want the system to wake up and move things around when either the balance becomes problematic or you drop below your desired redundancy level, right. So there are still reasons when you do want to make things move; you just want to, instead of always going all the way to the sort of moving target of CRUSH, drag your feet as much as possible.
C
If you try to minimize data movement in the face of disk failure, could you employ, say, a pool of some number of hot standbys, and when the monitor detects that a disk has failed, it cycles one of the hot standbys into that CRUSH position, if not the OSD ID, and that new disk gets logically the same data, right? That's what we're talking about, well...
C
Well, yeah, so the goal is... so you're trying to handle disk failure, right. So is it that you think the disk might come back and therefore you don't want to waste the work, or do you just generally not want to risk reshuffling data because you're only interested in maintaining redundancy, even if you have a hot spare?
D
I mean, so if it is a partition that's explicitly specified, which is sort of what I think a lot of systems do, where they say this partition of the hash range maps to this server, or a range of the hash range maps to whatever, then you have metadata for every PG, in which case you're sort of back where we started, where you say this PG is statically mapped to these devices. I think basically the difference is that right now we're partitioning a range of hashes, and sort of what you're suggesting is that you would partition based on something else, like the object name, which conceivably you could do. You would have to have per-PG metadata that specifies the start and end key for every PG, but you could do that. I think the challenge is that...
D
At some boundary you close off that range, and once it's sort of no longer the last one, it's fixed for all time, which I think captures the use case you're talking about and would conceivably, I think, fit into the framework. But it'd be tricky. But again, I think that's sort of a little bit different from what we're talking about here.
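A rough sketch of what range-based placement with per-PG start/end key metadata might look like, as a Python toy rather than anything that exists in Ceph; RangePartitionedPool and its method names are made up for illustration:

# Sketch: range-based object->PG mapping instead of hash-based (hypothetical).
# Each PG carries explicit (start_key, ...) metadata; only the last, open-ended
# PG ever changes, matching the "fixed for all time" idea above.

import bisect

class RangePartitionedPool:
    def __init__(self):
        # List of (start_key, pg_id); ranges are sorted and non-overlapping.
        self.boundaries = [("", 0)]        # PG 0 starts at the empty key
        self.sealed = set()                # PGs whose end key is locked

    def seal_current_and_start_new(self, boundary_key):
        """Close off the open-ended PG at boundary_key and open the next one."""
        last_pg = self.boundaries[-1][1]
        self.sealed.add(last_pg)
        self.boundaries.append((boundary_key, last_pg + 1))

    def pg_for_object(self, object_name):
        keys = [start for start, _ in self.boundaries]
        idx = bisect.bisect_right(keys, object_name) - 1
        return self.boundaries[idx][1]


if __name__ == "__main__":
    pool = RangePartitionedPool()
    pool.seal_current_and_start_new("archive/2014-07")
    print(pool.pg_for_object("archive/2014-03/blob1"))  # -> 0 (sealed range)
    print(pool.pg_for_object("archive/2014-08/blob9"))  # -> 1 (open-ended range)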
D
The general problem with that is, if you have sort of one big pool, then if you ever need to expand the pool, you don't want to not rebalance. You do want to rebalance, because otherwise all the new stuff is empty, and you need at least some separation within the new capacity in order to get the separation between racks for replication and so forth, for the durability and redundancy you're looking for.
D
But if that's actually what you do want, then I think what you actually want to do is just this: when you add, like, your next three racks of cold storage, you just create a new pool that's just those three racks, and you fill it up, and when it's full, then you, you know, turn off the spigot, and then you deploy your next three racks and you fill those up, right.
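A sketch of that fill-then-seal workflow as external orchestration; the ceph_* helpers here are hypothetical stand-ins, not real commands or APIs:

# Sketch of the "fill it up, turn off the spigot" workflow described above.

def ceph_create_pool(name, crush_rule):      # stand-in for pool creation
    print(f"create pool {name} on rule {crush_rule}")

def ceph_pool_used_ratio(name):              # stand-in for a utilization query
    return 0.0

def ceph_mark_read_only(name):               # stand-in for "turning off the spigot"
    print(f"pool {name} sealed (no new writes)")

class ColdPoolManager:
    """Route archive writes to the newest pool; seal it when nearly full."""

    def __init__(self, fill_threshold=0.95):
        self.fill_threshold = fill_threshold
        self.generation = 0
        self.active_pool = None

    def add_racks(self, crush_rule_for_new_racks):
        """Called when the next batch of cold-storage racks is deployed."""
        self.generation += 1
        name = f"cold-gen{self.generation}"
        ceph_create_pool(name, crush_rule_for_new_racks)
        self.active_pool = name
        return name

    def pool_for_write(self):
        if ceph_pool_used_ratio(self.active_pool) >= self.fill_threshold:
            ceph_mark_read_only(self.active_pool)   # spigot off; wait for new racks
            raise RuntimeError("active cold pool is full; deploy the next racks")
        return self.active_pool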
C
But the goal, though, when you're adding stuff to these cold pools, what I've seen is that, because you don't plan on reading back these objects anytime soon, you actually just want to write to wherever you wrote the last thing, right. That's the biggest thing, because you want to wake up whatever is already awake; you don't want to wake up random stuff. And that pretty much completely rules out any kind of hash-based placement, correct? So far, pretty much.
D
I mean, at the end of the day, it comes down to how much metadata you're going to spend to find the data. Because if you have sort of infinite metadata that you're willing to spend to locate stuff, then you just, you know, wake up two or three drives at a time, right, stream stuff to those two or three until they fill up, then go on to the next three and carefully choose them, and you just have this huge database that says where everything is. But that metadata is a scaling problem, right.
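To make the two ends of that spectrum concrete, here is a toy comparison of an explicit per-object location index versus computed, hash-style placement; everything here is illustrative, not Ceph code:

# Sketch: explicit "huge database" placement vs. zero-metadata computed placement.

import hashlib

location_index = {}   # object name -> list of device ids (explicit metadata)

def place_explicitly(obj, awake_devices, copies=3):
    """Write to whichever devices are already awake and remember where."""
    location_index[obj] = awake_devices[:copies]
    return location_index[obj]

def place_by_hash(obj, all_devices, copies=3):
    """Zero per-object metadata: placement is recomputed from the name."""
    h = int(hashlib.md5(obj.encode()).hexdigest(), 16)
    return [all_devices[(h + i) % len(all_devices)] for i in range(copies)]

if __name__ == "__main__":
    devices = list(range(12))
    print(place_explicitly("backup-0001", awake_devices=[4, 5, 6]))
    print(place_by_hash("backup-0001", devices))
    print(len(location_index), "entries of metadata so far")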
C
Ceph does that today, right? If you do have control of the key names, and if you're doing range partitioning it sounds like you do, then you can still use CRUSH to locate the PGs, right; you're just changing the way objects map onto the PGs, and you're making the number of PGs very dynamic, yeah, right. So... but that all fits okay into the OSD map.
D
I think so, yes. And, I mean, I think the simplest thing, and this is sort of shifting topics to, like, spinning down disks, but the simplest way to implement spin-down would be that the OSD stays awake, and the little ARM controller on the drive stays awake, but the platter spins down. So you power off most of the drive, but it's still awake, and it says, I'm alive, I'm just, you know, spun down.
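A small sketch of that spin-down behavior from the host side, assuming a Linux box where hdparm is available; the SleepyDisk wrapper and the device path are illustrative, and real OSD integration would obviously be more involved:

# Sketch: keep the daemon answering heartbeats while the platters are in standby.

import subprocess

class SleepyDisk:
    def __init__(self, device="/dev/sdb"):
        self.device = device

    def spin_down(self):
        # `hdparm -y` asks the drive to enter standby (platters stop spinning).
        subprocess.run(["hdparm", "-y", self.device], check=True)

    def is_standby(self):
        # `hdparm -C` reports the drive's current power state.
        out = subprocess.run(["hdparm", "-C", self.device],
                             capture_output=True, text=True, check=True)
        return "standby" in out.stdout

    def handle_heartbeat(self):
        # The daemon stays awake and keeps saying "I'm alive, just spun down"
        # without touching the platters.
        return {"alive": True, "spun_down": self.is_standby()}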
B
And not to put too fine a point on this, but I also kept this blueprint high level because there are other technologies that we're interested in here that we can't talk about yet, because they haven't been announced. So I wanted to see if there was something that could work across multiple types of cold storage.
D
So, I mean, I think what you're going towards there is sort of the other tiering blueprint, like the cold tier blueprint, where basically the object becomes a symlink that just says "it's over there." The way we contemplated it before was another RADOS pool, but it could be, you know, tape at an offset; you know, tape number 237 or whatever.
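A toy sketch of that pointer-object idea: the demoted object is replaced by a small stub that names where the cold copy went. The field names and JSON encoding are made up for illustration:

# Sketch: a warm-pool object that is really a redirect to a colder backend.

import json

def make_redirect(cold_backend, locator):
    """Build the tiny stub left behind after an object is demoted."""
    return json.dumps({
        "redirect": True,
        "backend": cold_backend,       # e.g. "rados-pool" or "tape"
        "locator": locator,            # e.g. {"pool": "cold-gen3", "name": "obj"}
                                       #  or  {"tape": 237, "offset": 1048576}
    }).encode()

def read_object(stored_bytes, fetch_from_cold):
    """On read, follow the pointer if the stub says the data lives elsewhere."""
    try:
        stub = json.loads(stored_bytes)
    except (ValueError, UnicodeDecodeError):
        return stored_bytes                     # ordinary, in-place object
    if isinstance(stub, dict) and stub.get("redirect"):
        return fetch_from_cold(stub["backend"], stub["locator"])
    return stored_bytes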
B
Because, well, one of the motivations I had writing this was looking at the cache tiering that you guys added recently and wondering, okay, there's this two-layer tiering now, and looking at the architecture, I'm not even sure if it's possible, but could it be made a three-level tier, where you've got basically hot data, warm data, and cold data?
D
That's kind of what, yes, that's what we had originally contemplated. So you have the cache tier, which has the hot stuff, and the base tier, which it flushes to, and then when stuff gets really, really cold, it gets punted off somewhere else, and there's, like, a pointer that says go look over there, right.
D
Again, I think you have these extremes. You have the extreme where you have this huge index that says exactly where the object is, and you get to decide exactly where it goes, and you carefully choose drives that are already powered up, something like that; versus the one where it's like, I don't want to store any metadata, so I'm going to calculate everything on the fly and move things around when I need to. And I think what you're looking for is somewhere in the middle, where you're...
D
I think the first easy thing would be some sort of explicit PG mapping, or a way to limit some of the PG stuff moving around; maybe that'll give you, like, twenty percent or thirty percent, maybe more. Or there's the fill-up thing that we talked about, where when you deploy big chunks of storage, you create separate pools and you sort of fill them up and then fill up the next one, because it's sort of write-once and rarely-delete type archive data.
D
There was a suggestion on the IRC channel just a moment ago where, you know, the problem is that your hash-distributed storage is spinning up random disks as you have this sort of trickle of data coming in. I mean, the obvious answer there is that you just have it buffered somehow, so you have, like, a cache tier, even, like, a RADOS cache tier, that's getting all the ingest stuff, and then every once in a while you flush the whole thing.
D
You spin up a bunch of racks, you do all the writes, and then you spin it all down again, which I think is conceptually simple and architecturally simple, as long as you don't architect things so that it's some stupid thing where you can't power up more than five percent of the drives without blowing your power circuits or something, right.
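A sketch of that buffered-ingest model: absorb the trickle of writes somewhere that is always on, then wake the cold racks, flush in one burst, and spin back down. The class, threshold, and placeholder methods are illustrative only:

# Sketch: batch ingest through an always-on buffer, flush to cold racks in bursts.

class BatchedIngest:
    def __init__(self, flush_threshold_bytes=10 * 2**40):   # e.g. 10 TiB
        self.flush_threshold = flush_threshold_bytes
        self.buffered = []            # stands in for the cache-tier contents
        self.buffered_bytes = 0

    def write(self, name, data):
        self.buffered.append((name, data))
        self.buffered_bytes += len(data)
        if self.buffered_bytes >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.power_up_cold_racks()
        for name, data in self.buffered:
            self.write_to_cold_pool(name, data)
        self.buffered.clear()
        self.buffered_bytes = 0
        self.power_down_cold_racks()

    # The three methods below are placeholders for real orchestration.
    def power_up_cold_racks(self):
        print("spinning up cold racks")

    def write_to_cold_pool(self, name, data):
        print("flush", name)

    def power_down_cold_racks(self):
        print("spinning cold racks back down")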
D
That's, I think, another model that would work here. Okay, I think one sort of last thing that I want to mention, and this is mostly for Sam: I'm back up on the pg_temp topic, but maybe pg_temp isn't the tool, or a policy around how you use pg_temp isn't the tool, but you'll notice...
D
There was a similar query on the email list earlier today about disabling CRUSH and sort of using a different algorithm. I'm wondering if what we want to add in the OSD map is an exception mapping, similar to pg_temp, that's called something like pg_force, where you can, if you want to, just leave the CRUSH map empty and literally enumerate where each PG goes. And so the admin could just say, I force it over there, I want it there, and it would do it, right.
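A toy sketch of what a pg_force-style exception table could look like, layered in front of CRUSH; pg_force is just the name used in this discussion, not an existing Ceph structure, and the code below is purely illustrative:

# Sketch: explicit per-PG exception mapping consulted before the CRUSH result.

def crush_map_pg(pg_id, osds):
    """Stand-in for the normal CRUSH computation."""
    return [osds[(pg_id + i) % len(osds)] for i in range(3)]

class OSDMapWithForce:
    def __init__(self, osds):
        self.osds = osds
        self.pg_force = {}            # pg_id -> explicitly enumerated OSD list

    def force(self, pg_id, target_osds):
        """Admin says: I want this PG exactly there; the map just records it."""
        self.pg_force[pg_id] = list(target_osds)

    def map_pg(self, pg_id):
        # The exception table wins; CRUSH is only the fallback.
        return self.pg_force.get(pg_id) or crush_map_pg(pg_id, self.osds)


if __name__ == "__main__":
    m = OSDMapWithForce(osds=list(range(10)))
    print(m.map_pg(7))        # whatever the CRUSH stand-in computes
    m.force(7, [1, 4, 9])     # external agent pins it
    print(m.map_pg(7))        # -> [1, 4, 9]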
D
Not at all, you just have to manage it. But I think there are cases where you want to prod at the cluster, make it do something for some reason that we aren't contemplating right now, or you literally just want to have, like, a central thing that's deciding; it sees the failure and it says, okay, move that piece of data over there. Like, you really want some external agent that's orchestrating all the movement. You don't want to...
D
Exactly, yeah, right. Or say you are using CRUSH and it gives you some, you know, bell curve of what the utilizations are, and you're literally like, okay, I'm just going to have this agent that's going to move ten percent of the PGs to make my balance perfect, right, instead of handling the sort of exceptional case or whatever. So I think the only caveat there is that those mappings need to...