From YouTube: 2015-MAY-27 -- Ceph Tech Talks: Placement Groups
Description
A detailed look at placement groups in Ceph. What they are, what they do, and how to use them in managing and troubleshooting your cluster.
http://ceph.com/ceph-tech-talks
A: All right, welcome everybody to the next monthly Ceph Tech Talk, where we try to feature some of the inner workings of Ceph at a deep technical level, so that we can enhance the greater technical understanding of Ceph. In the past we've looked at RADOS, the block device and the gateway, as well as the Calamari API and all associated GUI things. This month we have an examination of placement groups by a longtime friend and supporter of the Ceph project, Florian. Would you like to go ahead and introduce yourself?
B: [I picked] placement groups because, as I found out, it's one of those things that kind of confounds people: what are they actually good for, and what can we do with them? So this tech talk is called Placement Groups Inside and Out, and we'll get started straightaway. So let's talk about data placement for a moment, and let's go off on a tangent here for a little bit.
B: What do data placement and conventional data lookup look like in legacy distributed storage systems? The way this typically works in a conventional, legacy system — very much unlike Ceph — is this: you talk to a central lookup server to find out where you need to write your data to, or where you need to read your data from, and then you take that information and go to the storage servers and talk to them directly. That's the standard legacy method of doing things.
B: What Ceph wants to do, however, is actually scale this to potentially thousands of storage locations and petabytes and exabytes of data. So how does a central lookup work in terms of actual horizontal scalability? To what extent does it provide horizontal scalability at that petabyte and exabyte scale? Actually, it doesn't. That sort of thing really doesn't work, because your central lookup facility always becomes a bottleneck and a single point of failure. That's not what Ceph wants to do.
B: Instead, the idea in Ceph is that you don't actually look up where data is; rather, every single component in the system — including clients, including servers — works out for itself where to find its data. So that's a standard feature of Ceph. But how do placement groups actually come in? Why do we need placement groups? What are they good for, and what is their specific role?
B: [Suppose I'm a tennis coach with five students and a] cart of 300 tennis balls. I chuck the first one to my first student and the second one to my second student, all the way to the fifth, and then I start over at the beginning. When I'm done with that, I'm going to have my tennis balls — my data — distributed evenly among my storage locations, my students. Well, but actually my OCD isn't quite happy with that, because what I really want to do is distribute these tennis balls reproducibly.
B: In other words, I want to be able to look at a specific tennis ball and know exactly where it needs to go, and when I see a ball lying around on the ground on the court, I want to be able to pick it up and know exactly which of my students it belongs to. So what can I do to achieve that, and to make this sort of round-robin thing more reproducible? Well, I just take a Sharpie and number all of my tennis balls, from zero to two hundred ninety-nine, and then I apply a very simple algorithm to distribute them — and that simple algorithm is effectively just division, a modulo algorithm. So I can randomly pick up the ball with the number 130: divided by five, the remainder is zero, so it goes to my first student, Alice. I pick up another one, 171: divided by five, the remainder is one, so it goes to Bob. I pick up another ball.
B: Thirteen divided by five: the remainder is three, and it goes to Daisy. So I'm happy — I have found and devised a way of simply and reproducibly distributing my data, all of it. Until disaster strikes: something terrible happens at 9:30 in the morning. I have a sixth student walk onto my court. Frankie slept in, is still a little dazed and confused, and now walks onto my tennis court. Now I have a sixth student — and now, what do I do?
B: Well, I figure that maybe we're not doing that badly, because, as it happens, I pick up a ball randomly and it has the number 31. I divide by five: the remainder is one. Divided by six, the remainder is also one, so it stays with Bob, and I think we're kind of cool. But of course that doesn't last long, because I pick up another one that happens to have the number 36.
B: This time the remainders are different, and now I realize that because of this simple modification — namely, one additional student — I need to shuffle balls around between all of my students, and some of that shuffling doesn't actually involve Frankie, my new student, at all. As you've seen in this example, there's a ball going from Bob to Alice, and so forth. So that naive approach doesn't really work.
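To make the naive scheme concrete, here is a minimal shell sketch of it, plus a count of how much data the five-to-six-student change actually moves. (Carol and Eric are made-up names; the talk only names Alice, Bob, Daisy and Frankie.)

    #!/bin/bash
    # naive placement: ball number modulo the number of students
    students=(Alice Bob Carol Daisy Eric)
    for ball in 130 171 13; do
        echo "ball $ball -> ${students[ball % 5]}"
    done
    # ball 130 -> Alice, ball 171 -> Bob, ball 13 -> Daisy

    # how many of the 300 balls change owners when a sixth student joins?
    moved=0
    for ball in $(seq 0 299); do
        if [ $(( ball % 5 )) -ne $(( ball % 6 )) ]; then
            moved=$(( moved + 1 ))
        fi
    done
    echo "$moved of 300 balls move"   # prints: 250 of 300 balls move

A one-sixth change in capacity moves over 80 percent of the data — exactly the mismatch described above.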
B: Now, what that means is that we get a change in capacity — in this case, the number of students — and the number of tennis balls that I need to shuffle around is totally out of whack with the actual change in capacity. So what I'm creating with this simple and naive approach is something that distributes very evenly, but that creates an insane amount of migration if there is the slightest change to the system. And this is a problem that every distributed storage system has: every distributed storage system wants to distribute data evenly, but it also wants a configuration that causes minimal migration when capacity changes. Just about every contemporary distributed storage system solves it in effectively the same way: in Ceph, they're called placement groups; in Swift, they're called partitions; and so forth. So how does this work? What do we do? Well, it's actually fairly simple: we add an additional management layer, and in the tennis camp analogy, that's just a set of buckets that we can put our tennis balls into.
B: [So suppose I distribute my 300 balls into 60 buckets — say, by ball number modulo 60 — and then assign those buckets to my] students. The students are, in Ceph, the OSDs, and the list of bucket-to-student assignments — which is effectively a set of parameters to this very simple algorithm that we're using — translates, in Ceph, to the OSD map. Now, it's important to understand that this map, the parameters to our algorithm, changes only when our storage topology changes: if we remove a storage location or add a new one, that's when we need to update the map. That map is lazily propagated throughout the system, and on the whole, what we get is something that is much more effective, much more efficient and much more scalable than central data lookups. And that's really the whole story of why we need placement groups in the first place. They are a very, very simple addition to an otherwise very simple algorithm — but it's simple and genius.
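A minimal sketch of that two-step scheme, using the 60-bucket figure from the talk. (The modulo step is an illustrative stand-in; Ceph's real mapping hashes the object name into a placement group and then runs CRUSH against the map to pick OSDs.)

    # step 1: ball -> bucket (in Ceph: object -> placement group);
    # depends only on the ball number, so it never changes
    pg_count=60
    ball=130
    bucket=$(( ball % pg_count ))
    echo "ball $ball -> bucket $bucket"   # ball 130 -> bucket 10

    # step 2: bucket -> student (in Ceph: placement group -> OSDs);
    # only this small assignment table changes when a student joins
    # or leaves, so very little data has to move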
B: One thing that I didn't mention previously is that, typically, what you should shoot for — and again, this is true for Ceph, but it's also true for other distributed storage systems — is that the number of buckets you have in the system, the number of placement groups, is one to two orders of magnitude larger than the number of actual physical storage locations. That's what you also got with the five-to-six students and the 60 buckets. The reason for that is, of course, that you want to be able to redistribute the buckets and, at best, avoid the situation where you actually need to split buckets, because that's an expensive operation.
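As an aside, you can check how many placement groups a pool has like this (the pool name is an example):

    ceph osd pool get rbd pg_num   # e.g. "pg_num: 128"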
B
So
this
is
why
we
need
placement
groups.
This
is
why
we
have
them
in
SEF.
So
what
can
we
do
with
them?
Well,
we
should
look
first
at
how
we
can
actually
examine
placement
groups
and
their
status.
The
way
that
you're,
probably
going
to
do
that
initially
most
of
time
or
very
frequently,
is
with
the
set
health
command.
So
let
me
run
you
through
this
here
real
quickly,
so
what
I'm
doing
here
is
there
we
go
that
is
taking
a
little
longer
than
you
expect
it.
B: If you now do ceph health, you get the health warning, and if you do ceph health detail, what you get back is a list of all of the placement groups in the system and their current status. I'm sorry — not all the placement groups, but only the ones that are not in the active+clean state, the ones that are currently affected by some sort of outage.
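The two commands in question — output omitted here, since it varies by cluster:

    ceph health          # one-line summary: HEALTH_OK, HEALTH_WARN, ...
    ceph health detail   # additionally lists each PG not active+clean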
B: That's my OSD that I'm restarting; my previously degraded placement groups are now going into a recovery state, as you can see here, and then it takes just a little while and we get back to the HEALTH_OK state. So that's the first way of getting a handle on the status of your placement groups: simply the ceph health command. And if you want to get a little more information, you can use ceph pg dump. So let's take a look at what ceph pg dump does for us.
B: ceph pg dump simply gives us a tabular overview of all of our placement groups, and what I've done here at the bottom — because it's a fair screenful of output — is a quick grep for one specific placement group ID. So let me quickly walk you through the output. You get, as I said, a tabular output that starts with the placement group ID, and then it gives us various bits of information, such as the placement group state — in this case, active+clean.
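For instance, with a made-up placement group ID:

    ceph pg dump | grep '^0\.1f'   # one row: PG ID, state, OSD sets, ...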
B: If you want to get a little more information still, you can do that as well: there is the ceph pg query command. Now, I found out that while this command is perfectly well documented in the Ceph documentation, it apparently is not known to that many people, so I want to make a point of mentioning it here.
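Usage is simply (the PG ID is again made up):

    ceph pg 0.1f query   # dumps the full state of that PG as JSON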
B: It tells you, among other things, when the placement group was last active or went stale, when the last scrub or deep scrub ran, and a bunch of other information — and we're going to come back to that in a little bit, because in one specific error state it carries one very crucial bit of information. So that's ceph pg query. And then, finally, every once in a while you may want to find out which placement group a specific object belongs to in the first place, and that you can do very simply with ceph osd map.
B: Let's take a look at the ceph osd map command here real quickly. What I'm doing here is simply enumerating RADOS objects in a specific pool and then mapping one object in that pool. The syntax for that is ceph osd map, followed by the pool name, followed by the object name — and what we get back is the PG that this object belongs to, what the current replica set for this object is, and the primary OSD for it as well.
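A sketch of that demo, with example pool and object names:

    rados -p rbd ls | head -3     # enumerate a few objects in the pool
    ceph osd map rbd some-object  # prints the PG this object maps to,
                                  # plus the up/acting OSD sets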
B: [The next thing to understand] is the placement group state, or placement group status: what's in there, and what are the typical states or statuses that you can find? The normal state for a placement group is called active+clean. So what does this mean? It means that the placement group is currently available to process requests — that is to say, we can read from it and we can write to it.
B: If we have a placement group that is currently in a degraded state, that simply means the placement group currently has fewer replicas available than are mandated by its pool's size. In other words: for example, we have a pool size of three and only two replicas available. So what should you do in this case? Well, most of the time, actually nothing — typically, no action is needed in this state.
B: It will either run through recovery, which will commence when the OSD comes back, or, when its mon osd down out interval (by default, five minutes) expires, Ceph will assign a new OSD to the placement group and recovery will commence from there. So a degraded state is basically nothing to particularly worry about: it simply means that one of your OSDs is currently down, but the placement group and the data in it are still perfectly available.
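To check that interval on a running monitor (the monitor name is an example):

    ceph daemon mon.a config get mon_osd_down_out_interval   # seconds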
B: [Then there's the backfill state.] This, for example, is the case when you add new OSDs and data is being reassigned, or when a previously up and in OSD has gone down and has subsequently been marked out: Ceph has selected a new replica on a different OSD for a specific placement group and is now backfilling — filling the new OSD with the data in that placement group. Then we have the incomplete state. Now, that one is a little more tricky.
B: It's actually a warning state, and it means that in a specific placement group we have fewer replicas available than are mandated by the pool's minimum size — the pool's min_size parameter. So, for example, if you have a pool min_size of two and only one replica available, then that PG would be marked incomplete. So what can you do in order to recover from that? Well, if you have a PG stuck in this state, it basically means that Ceph cannot fulfill the redundancy guarantees that you've defined for the pool, given the current CRUSH map.
B: However — and that's a big "beware" there — you should take into account that a CRUSH map change might trigger some, or potentially a lot of, reshuffling that you actually don't want at this point. Another thing that you can do temporarily is lower the min_size of that specific pool, and then bring it back up to the previous min_size later, in order to get out of the incomplete state. But the best way is to just find another OSD that Ceph can replicate to — by default, at least.
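That temporary workaround might look like this; the pool name and values are examples, and since lowering min_size weakens a redundancy guarantee, treat it as a last resort:

    ceph osd pool set rbd min_size 1   # temporarily accept a single replica
    # ...wait for the affected PGs to go active and recover...
    ceph osd pool set rbd min_size 2   # restore the previous guarantee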
B: If a placement group is inconsistent, that means that you've previously run a scrub or a deep scrub operation on it and it has detected errors. Errors that can be detected immediately from the objects' metadata in the OSD are something that a plain scrub would catch; if you have files in there that have the correct size, the correct metadata and whatnot, but don't have the correct content, [then that's something only a deep scrub would detect].
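Both kinds of check can also be triggered by hand (PG ID made up):

    ceph pg scrub 0.1f        # compares object sizes and metadata
    ceph pg deep-scrub 0.1f   # additionally reads and compares contents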
B: [A placement group is down when activating it would send you back in time. Say] you have one of your PGs with one of its replicas going down; then you modify the other one; then that one goes down as well, and you bring the original one back up. That one now has data that is actually stale — if it were to become active, it would warp you back in time — and so, by default, Ceph will disallow that and mark the PG down, which means that you actually can't do any I/O to that placement group.
B: You can also find out with pg query why a placement group is down: in the JSON that ceph pg query produces, you're going to see an entry called down_osds_we_would_probe, and that's the OSD that you need to be looking for — the one that you then subsequently need to recover, or declare lost. Again, there's a caveat.
B: If you declare an OSD lost — and this is actually very well documented in the Ceph documentation — you will potentially lose some data, because you might lose the last updates that were seen on only that OSD. Okay.
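Put together, a sketch of that procedure, with made-up PG and OSD IDs:

    ceph pg 0.1f query | grep -A 3 down_osds_we_would_probe
    # if OSD 7 turns out to be unrecoverable, declaring it lost lets the
    # PG proceed, at the risk of losing its latest updates as just noted:
    ceph osd lost 7 --yes-i-really-mean-it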
B: So that basically concludes my little three-part talk about placement groups: what they're good for, how we can get their status, and how to interpret that. Before we go to the questions: if you're interested in the slides, you can actually find them right now on my .io site, at /tech-talk-pg.
A: If you have questions, feel free to either type them in the chat or unmute your microphone and ask them as well. The only thing that I might add to Florian's talk is that we also have a handy-dandy placement group calculator on the ceph.com site. If you're looking to build a new Ceph cluster or expand one, you should definitely take a look at it — it will help give you an idea of how many placement groups your cluster might need. I'm pasting it out [in the chat] now.
B: It's very simple, yet elegant. The way this is done is that the OSD map — you may have picked up on this when I previously ran the ceph osd map command — is versioned with an incrementing integer ID called the map epoch, and all client-to-OSD and OSD-to-OSD communications are simply signed — well, actually not signed, they're tagged — with that epoch. So if a client talks to an OSD, or talks to any other component, [the two compare epochs, and whichever side has the older map fetches the newer one — that's how map updates spread lazily].
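For reference, the current map epoch is visible at the top of the OSD map dump:

    ceph osd dump | head -3   # first line reads "epoch <N>"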
B: So the question is: a client request has already reached a primary PG, and at that moment the PG's OSD dies — what happens? Is that a degraded write? That's another really good question. It basically boils down to — let me rephrase that — how synchronous is the replication to an OSD? A client write is only acknowledged by the primary OSD once it has reached the replicas as well.
B: So, in other words, if a write reaches a primary OSD and the primary OSD cannot replicate it on to a replica, then, as far as the client is concerned, that write never completes. And if the primary happens to go down, then the client redirects that write [to the new primary]. If it is the primary detecting that a replica goes down, then it will effectively...
C: We were dealing with recovery from some failures, and basically I was sitting on incomplete PGs. More or less, there was a setting in the CRUSH tunables for how many times to try, and I just cranked that number up to like a hundred, and all of a sudden my incomplete PGs went away. So I was assuming the algorithm was such that I hadn't given it enough choices — I think the standard is 50 tries — but it was like: guess why it was failing. There was no real quick, easy way to see.
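For reference, the CRUSH tunables — including what is presumably the retry setting referred to here, choose_total_tries, with its default of 50 — can be inspected with:

    ceph osd crush show-tunables   # JSON, includes "choose_total_tries"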
B: Okay, I'm afraid that's a question that's a little bit too specific for the tech talk, because I'm not sure it's immediately interesting to the rest of the audience. But like I said, I'll be happy to look into it, and I actually have a couple of follow-on questions as to how that situation came about. So again, if I could ask you to just shoot me an email — florian at hastexo dot com — I'll be happy to look into it.
B: [The checksum information used in] a scrub and in a deep scrub is basically information that's stored in encoded form in the OSD itself. I'm actually not entirely sure whether that is something the OSD still puts into file attributes, because a couple of releases back pretty much everything moved to omap — Patrick, you might actually have more information about that than I do. But the basic idea is that you have a separate set of metadata where you're keeping hashes of the data that you're trying to look for.
B
That's
how
you
detect,
which
one
is
actually
the
good
data
set
and
which
one
is
the
bad
data
set,
and
that
information
you
can
then
simply
use
in
or
that's
that
information
set
can
simply
use
when
you
repair.
So
it's
in
fact
it's
the
scrub
or
the
deep
scrub
that
actually
detects,
which
is
the
data
set,
that's
good
and
which
is
the
data
set,
that's
bad
and
then
sfpd
repair
simply
operates
based
on
that
information.
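And the repair itself (PG ID made up):

    ceph pg repair 0.1f   # fixes the bad copy based on the scrub results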
A: I get many questions on what those are, how to use them and how to plan for them. Once again, if you're looking for more information or more specific ideas on how to use placement groups in your cluster, definitely check out ceph.com/pgcalc — it's a very helpful resource — or you are welcome to ask questions on the mailing list or IRC. So, this was a great talk, and I will see you all next month for the next Tech Talk. Thanks.