From YouTube: Ceph Code Walkthroughs: CRUSH
A
Whatever it may be — look, you've already got a good crowd here, so we can get started. This is going to be a good walkthrough of CRUSH, one of the fundamental building blocks of Ceph, with Sage. I think there are lots of corners of CRUSH that haven't been changed in many, many years, so it'll be quite interesting to see what those look like and where they came from.
B
Okay, so I think I might start with a couple of slides from the original CRUSH presentation that show, at a very high level, how the algorithm works, before diving into the code. But I want to encourage everyone to stop me and ask questions, because it's always hard to tell which parts are most interesting to cover, and what is clear and what is not clear — and also my headset will turn off if there isn't any sound coming through.
B
So if somebody asks a question, that will reset the timer each time. All right, so: CRUSH does data placement. Everything in the cluster, of course, is organized as a hierarchy, and placements are based on the placement rules. This is just a really simple illustration of how that works, where you have your initial command.
B
These are the commands that form the rule, and as you're interpreting the rule there's a working value, which is basically kind of like a register, except it's an array of values that it's working on. So, take root — well, it puts root into the working value (it's all crooked because this is a PowerPoint slide from a million years ago and I'm using LibreOffice). First step: select one row. There's only one, starting with root; there's a bunch of rows beneath it, and it'll pick one randomly and move that into the working value. Then the next rule is select three cabinets — distinct, unique cabinets. We'll do that, and each of these has those parts of the tree underneath them. Then, when we select one disk for each one of these working-value items, it'll do that step: go to cabinet 21, it'll take a disk, and so on, in this case.
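The take → select → emit flow on the slide can be sketched as a tiny interpreter. This is a deliberately hypothetical simplification — the names are made up, and the plain `random.choice` stands in for CRUSH's deterministic, weighted hashing:

```python
import random

def descend(tree, node, want_type):
    # walk down from node until we reach an item of the requested type
    while tree[node]["type"] != want_type:
        node = random.choice(tree[node]["children"])
    return node

def run_rule(tree, rule):
    working = []   # the "working value": the array of items being operated on
    result = []
    for op, arg in rule:
        if op == "take":
            working = [arg]                    # e.g. put "root" into the working value
        elif op == "select":                   # arg = (count, type)
            count, want_type = arg
            out = []
            for node in working:               # for each working-value item...
                chosen = set()
                while len(chosen) < count:     # ...pick `count` distinct descendants
                    chosen.add(descend(tree, node, want_type))
                out.extend(sorted(chosen))
            working = out                      # the result replaces the working value
        elif op == "emit":
            result.extend(working)             # append working value to final output
            working = []
    return result
```

For a rule like take root / select 3 cabinets / select 1 disk / emit, this returns three disks from three distinct cabinets, mirroring the slide.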
B
D
B
Okay, all right. So I'll just start with the header file, to go through some of the types: crush.h.
B
Most of the CRUSH code is written in C, because it's identical code in the kernel and in user space, and then there's a bunch of stuff written in C++ that adds all the modifiers and so on; the kernel code really just needs to be able to do the mapping and not much else. There are a couple of constants up here — there's a constant for an item that is undefined — and then, let's see, there's a structure for each step in the rule.
B
These are all the different op codes for the different rule steps — for choosing items, emitting items, and so on. These let you adjust tunables for a particular rule, although most of the time you don't need to do that; I'll talk about tunables in a second. Each rule is composed of a series of steps. It used to have these parameters.
B
Let's see — the algorithm defines what algorithm we use when we're descending through each node. Originally CRUSH was written to have four of them: uniform, list, tree, and straw. Tree was buggy and never worked, so we dropped that one; and straw, it turns out, was also incorrectly conceived, so there's a straw2 version that we use instead these days. Basically, we always use straw2. I don't think anybody ever uses these other ones, I wouldn't recommend them, and so I'll mostly ignore them.
B
So most of the time Ceph by default always creates straw2 buckets, unless you, like, try really hard not to do that. So that's the new type that everybody uses. Every bucket — that is, a node in the tree — has an identifier, but there are two kinds of identifiers: anything that's greater than or equal to zero is a leaf node, like an OSD in Ceph's case, and everything that's negative is an identifier for a bucket.
B
These go negative one, negative two, and so on. The type of the bucket is the CRUSH type — so that's whether it's, like, a rack or a host or a row or a data center, or whatever those sort of user-defined types are.
B
Those are for users to describe things. The algorithm field tells you what algorithm to use when traversing that particular node — so this is almost always straw2. The hash function is always rjenkins; there's only one hash function defined. And then there are a couple of items here: there are all the items — all the child nodes, basically, for this node — and there's a weight, which is basically the sum of all the weights for all the items.
B
So normally the CRUSH hierarchy always sums up from the leaves up to the top. That's a bucket; weight sets I'm going to come back to later. For the different bucket types there are some additional fields that are type-specific: in uniform buckets everything is the same weight, so there's only one weight, while the other ones have an array of weights. List buckets have this sum thing; tree buckets are totally bonkers, so we can ignore those; but for straw2 buckets, which are the ones we mostly care about—
B
It's really just this sort of parent-bucket thing that has the array of items, and then there's a weight for each item. The only thing that's sort of weird in CRUSH — that is a bit confusing when you're thinking about the code — is the weights. You would normally think about a hierarchy where the weight is a property of each item, and then they sort of sum up the tree. The way that CRUSH does it, the weight is always stored with the pointer — at the reference.
B
So the bucket has a list of all the items that are beneath it, and it also stores the weights of those items, which it uses when traversing down — instead of actually storing the weight with the item itself. That's a little bit weird, but normally it's all identical, just stored in a different place; it can be a little bit confusing, though. And then the crush map has all the different buckets that compose the map, all the different rules, how big those arrays are, and then there are all these definitions that customize the behavior.
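That layout can be pictured with a small sketch — hypothetical, and much simpler than the real struct crush_bucket — where the parent holds both the child ids and the weights used during descent, and the bucket's own weight is the sum:

```python
# Hypothetical sketch: weights live in the *parent's* reference to each child,
# not on the child itself; a bucket's weight is the sum of its children's.
class Bucket:
    def __init__(self, bucket_id, bucket_type, items, weights):
        assert len(items) == len(weights)
        self.id = bucket_id         # negative ids are buckets; ids >= 0 are OSDs
        self.type = bucket_type     # user-defined type: host, rack, row, ...
        self.items = items          # child ids
        self.weights = weights      # per-child weights, stored here in the parent
        self.weight = sum(weights)  # sums up the tree toward the root

host = Bucket(-2, "host", items=[0, 1, 2], weights=[1.0, 1.0, 2.0])
root = Bucket(-1, "root", items=[host.id], weights=[host.weight])
```

Note how `root` duplicates `host.weight` in its own `weights` array — that is the "weight stored with the reference" quirk described above.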
B
Basically, when this was first written 15-odd years ago, there were a bunch of sort of weird behaviors that seemed to make intuitive sense at the time, but time has proven were bad ideas, and so all of these basically let you selectively disable those different annoying behaviors, and in a couple of other cases just make subtle changes to the algorithm that make it work better.
B
So when you create a new map, all of these are set to the latest best practice. But old clusters might have values here that keep doing the old behavior, so that you don't reshuffle all the data — I'll talk about some of those. And that's basically it; that's the crush map itself.
B
These weights are the in/out map — the weight overrides, or whatever, in the OSD map. And then this stuff here I'll talk about later; but this will actually calculate a placement, so it sets up a couple of arrays here. This result, and these a and b arrays, are basically the working set that I mentioned in the slides — all these temporary annoying variables.
B
E
B
Then we just iterate through all the different steps in the rule and we process them. You'll notice that all of these rule steps that customize the tunables for this specific placement just update these local variables, so for one particular rule we can change how the placement is calculated, as opposed to doing it globally.
B
So, assuming that the arguments for this step are valid — a valid device or bucket, or whatever it is — then we basically set a working array to have that one value, with a length of one, and then the next step is usually a choose or chooseleaf, one of these. And they're—
B
Firstn or indep. Firstn is meant for replicated pools, and indep is meant for erasure-coded pools, where you don't want things to move positions in the output. So based on that, we set firstn to one or zero, and recurse_to_leaf to one or zero, depending on whether it's choose or chooseleaf and firstn or indep; and then there's the output for this particular step.
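The difference between the two modes is easiest to see with a toy example. `CRUSH_ITEM_NONE` really is the sentinel defined in crush.h; everything else here is a deliberate oversimplification (real indep placement retries to fill a hole before giving up):

```python
CRUSH_ITEM_NONE = 0x7fffffff  # sentinel from crush.h meaning "no item here"

def place(candidates, n, alive, mode):
    """Take up to n items in order, handling dead ones per the mode's policy."""
    out = []
    for c in candidates:
        if len(out) == n:
            break
        if c in alive:
            out.append(c)
        elif mode == "indep":
            # erasure-coded pools: positions must not shift, so leave a hole
            out.append(CRUSH_ITEM_NONE)
        # mode == "firstn": replicated pools; later picks just shift up
    return out
```

With candidates `[1, 2, 3, 4]` and OSD 2 down, firstn yields `[1, 3, 4]` (everything shifts up), while indep yields `[1, NONE, 3]` (shard positions are preserved).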
B
We iterate over all the items in the working variables — probably starting with just the root — and we go through and process each one.
B
We make sure it's valid. If it's the firstn case — a replicated pool — then we set up some temporary variables and call into this choose-firstn function, which basically does all the work; or, if it's not firstn — if it's indep — then we call a different function for the erasure-coded variant. And we pass in all the different arrays and sizes, all the tunables, and so on.
B
But basically this starts at one particular point in the hierarchy, in our work array, and it generates a certain number of output items for it, and it's going to descend until it reaches a particular type. So maybe you're starting at the root and descending until you hit host — you want three distinct hosts, so type would be whatever the host type is — or maybe you're going all the way down to the leaves and you're getting leaves. But in any case, that does one step of that descent.
B
And there's a whole bunch of nested loops here because, again, the original algorithm was conceived in a more complicated way than it needed to be. But basically we have to find n distinct, unique outputs of a particular type, and so we do this recursive descent. We start out at the particular starting point, which is going to be whatever the item is — I think, actually, it's passed in.
B
I can't quite remember, but in any case we're going to do a recursive descent. Then, based on what bucket we're at, we have to use a different function to actually do that. So usually we use crush_bucket_choose to choose a set of items, but in some cases we fall back to this permutation, which is sort of a degenerate thing, again.
B
This was sort of a bad idea in the original algorithm: it used to try to use the specified algorithm a couple of times, and if it gave up — if there were too many local tries — we would fall back to just shuffling the items in the bucket and using them in order. It turns out that was a bad idea, so in most modern cases this path is never taken and we always do crush_bucket_choose.
B
But in some cases that wasn't the case. So, crush_bucket_choose is just one decision at one node: we have our input — our sort of random values, the placement group id and so on — and it gives us one item. It makes one step; basically, it picks one child for that node. If you look at the function, it's just a switch statement: depending on what bucket it is, it uses that particular algorithm. If you look at the straw2 choose, for example — for the straw2 buckets—
B
This is what every modern crush map will do. We get the array of weights and children for this bucket, and we basically just iterate over all of them. For every potential child — the analogy here, or the metaphor, is drawing straws — we draw a straw of random length that's proportionally sized based on the weight of that child.
B
So if it's a heavily weighted child, its straw is more likely to be long. We calculate all the straws and keep track of which straw was the longest, and that's the one that wins when we return. This is doing an exponential distribution — this is the newly conceived version of the algorithm that the Intel folks helped optimize back in, like, 2013 or something, way back when — and it works quite well. So that's crush_bucket_choose. Let's see where we were.
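The straw2 draw itself fits in a few lines. The real bucket_straw2_choose uses the deterministic rjenkins hash of the input, item, and attempt number, plus a fixed-point log lookup table so kernel and userspace agree bit-for-bit; this sketch substitutes a plain RNG just to show the math — draw `ln(u)/w` per child and keep the longest straw:

```python
import math
import random

def straw2_choose(items, weights, rng):
    """Pick one child. Each child draws a straw ln(u)/w (a negative number);
    larger weight pulls the straw toward zero, i.e. makes it 'longer', so
    child i wins with probability w_i / sum(w)."""
    best, best_straw = None, -math.inf
    for item, w in zip(items, weights):
        if w <= 0:
            continue                        # zero-weight children can never win
        u = max(rng.random(), 1e-12)        # real CRUSH: hash(x, item, r); avoid log(0)
        straw = math.log(u) / w
        if straw > best_straw:
            best, best_straw = item, straw
    return best
```

Drawing many times, a child holding 3 units of weight out of 4 total should win about 75% of the time.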
B
One
here
so
again
we
we're
looping
we're
at
a
particular
point
in
hierarchy.
We
choose
one
child
and
we
see
if
we've
reached
the
desired
type,
we're
trying
to
get
down
all
the
way
to
hosts
or
whatever
it
is,
and
if
we
got
to
the
right
type,
then
we
might
keep
going
and
continue
our
descent.
We
basically
would
loop
again
if
it's
not
the
right
type.
We
repeat
otherwise
can
you
we
want
to
make
sure
that
the
item
that
we
chose
is
unique.
B
So if we already chose that one in this descent, then we have to retry, and we'll increment — basically, this r value keeps getting incremented each time we retry, so that we'll pick a different random value.
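The collision/retry handling can be sketched like this — r feeds into the hash, so bumping it on each attempt deterministically produces a fresh draw (hypothetical helpers; sha256 stands in for the Jenkins hash):

```python
import hashlib

def hash_choice(x, items, r):
    """Deterministic pseudo-random pick: same (x, r) always yields the same item."""
    h = hashlib.sha256(f"{x}:{r}".encode()).digest()
    return items[int.from_bytes(h[:4], "big") % len(items)]

def choose_distinct(x, items, n, total_tries=50):
    """Collect n distinct items; on a collision, retry with an incremented r."""
    out, r = [], 0
    while len(out) < n and r < total_tries:
        item = hash_choice(x, items, r)
        r += 1                  # next attempt gets a different pseudo-random draw
        if item not in out:     # collision: skip, the bumped r re-rolls next time
            out.append(item)
    return out
```

The bounded loop is also why the total-tries tunable matters: too few tries and you can come up short, but the mapping stays fully deterministic for a given input x.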
B
And
other
checks
here,
but
basically
yes,
we
keep.
We
keep
doing
this
descent.
This
code
is
super
hard
to
follow
because
it's
parameterized
based
on
these
tunables,
so
I
apologize
it's
very
hard
to
follow,
but
it's
probably
easiest.
If
you
go
through
and
say
for
a
normal
map,
then
this
is
always
zero.
These
are
all
zero,
for
example,
and
then
you
sort
of
like
prune
out
these
branches.
You
could
probably
make
a
simplified
implementation
of
this.
That
only
uses
the
default
tunables
and
then
it
would
be
more
understandable.
B
But yes, eventually you'll get down there and you'll have chosen another output item; then you have to choose n of them, so it'll repeat, going around the loop, and that's it. Eventually you'll go through all the steps of the crush rule to do the descent, and then, at the end of the crush rule, there will be an emit step that basically takes whatever's in your working value and adds it to the final result that's returned back to the caller.
C
B
Never — so you basically always want to use the defaults on new clusters, which are sort of the current notion of the best way to make this algorithm work. The only time these would not be the defaults is if this was a cluster that was created back in Dumpling or Bobtail, or something really old, and so all the data was placed in a particular way, and switching to the new tunables would just reshuffle the whole cluster — and people don't want to do that.
B
C
B
—ideas; they should basically just never be used — you should never use them anymore — and, I think, all of those are fine. vary_r was something that I believe a user discovered: somewhere in this algorithm, when we hit a collision or a repeat event and it loops, we weren't varying r in a very efficient way.
B
We would just retry a lot, and fail, before we finally got to an r that worked; and it turns out there's a better way to vary r so that it's more efficient and we don't have to retry as many times. So that's what that is. And then stable, similarly — I think that was also a user, I can't remember which of these came from a user — where they basically just tweaked the code a little bit and it was way more efficient.
D
Yeah, I think the only ones I recall are, I think, choose_total_tries or something, in the past, I think, yeah.
C
B
Anyway — if that happens, then you'll basically get lots and lots of retries, and if that were infinity, you could mark all the OSDs out and it would just loop forever, and that would obviously be bad. On the other hand, if it's too small and you have a lot of OSDs marked out, it'll just not get a result, or you'll get too few replicas.
B
50 or 100 seems to be the number that we settled on. Okay — just a couple of other things. So I mentioned that there's this perm thing, bucket_perm_choose.
C
B
Sorry — straw2 choose, because that's the bucket type that we mostly use now. But for different buckets this implementation is different: there's a straw choose that uses the original straw algorithm, which is sort of broken in some subtle ways that we didn't realize in the beginning; there's the tree algorithm, which never actually worked right; there's the list one, which is not recommended; and there's a uniform one. The uniform one was meant—
B
The original idea was that you might deploy arrays of disks where the disks are all the same size and they're always deployed as a unit — you'd never add new disks to an array because it only has 16 slots; you would only have them fail and you'd mark them out — and so it just makes a really cheap calculation for that leaf layer of the hierarchy. And the way it works is, originally—
B
I did some weird modular-arithmetic trickery — discrete math — that ended up not being super effective, and so it was replaced with this perm choose, which basically, the first time you encounter it, calculates a random permutation — a deterministic pseudo-random permutation — of the order of the items in the bucket, and then it just sort of walks through them. So pr here is just the r parameter modulo the size, and we return the permutation item at that position, right?
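A sketch of that fallback, in the spirit of bucket_perm_choose for uniform buckets (simplified — the real code computes the permutation incrementally and caches it inside the bucket struct):

```python
import random

class UniformBucket:
    """All items are equally weighted, so one deterministic pseudo-random
    permutation per input x is enough; successive replicas walk along it."""
    def __init__(self, items):
        self.items = items
        self._perm = {}       # cache: input x -> shuffled order

    def choose(self, x, r):
        if x not in self._perm:
            order = list(self.items)
            random.Random(x).shuffle(order)   # deterministic in x
            self._perm[x] = order
        perm = self._perm[x]
        return perm[r % len(perm)]            # pr = r mod size
```

Because the shuffle is seeded by x, every client computes the same order for the same input — the property the whole algorithm depends on.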
B
So
this
is
sort
of
the
like
the
fallback
just
scrambling
these
equally
weighted
things,
and
it
used
to
be
that
there's
this
thing
where,
if
we
tried
the
regular
algorithm
a
couple
too
many
times
this
fallback
local
tries,
we
tried
it
too
many
times
and
we
just
couldn't
get
anything.
Then
we
would
just
fall
back
to
a
permutation
and
use
that,
but
that
was
just
a
bad
idea,
so
we
don't.
Hopefully
these
these
settings
are
never
are
never
actually
used
that
way
anymore.
C
B
C
A
—that you had in mind, that you might see changing in the future, or thought about adding but hadn't actually implemented?
B
The first one is shadow trees, and I'll show an example of that. It used to be that if you had a crush hierarchy that had, say, hard drives and SSDs, or whatever, you would have to explicitly, manually create multiple parallel hierarchies and have these rules that reference them, and it was just really tedious and annoying to manage. And so around Luminous we added the ability to set a class on an OSD.
B
That's usually something like hdd, ssd, or nvme, but it could be anything, and it would be associated with the item in the hierarchy. Then you could have rules that reference it: I want to place this only on SSD OSDs, or only on NVMe OSDs, or whatever the classes are that you define. It turns out—
B
We figured out a way to do that that was totally backwards compatible, so that when a crush map was actually compiled down and encoded and passed to, for example, the kernel client, and a mapping was placed, it would work in exactly the same way.
B
There was no actual difference, but all the management tools — the way that you manage the hierarchy — would have these types in them. And the way that works is: if you look at the tree of devices, each item has a type associated with it and a weight, and the weights still sum up the hierarchy — so this is the sum of all the items from the hosts, and so on — and so that all looks normal.
B
But if you look behind the scenes at what the hierarchy actually looks like, it has these sort of hidden shadow items, denoted by a tilde and then the class. So you have the main hierarchy, which has everything of all classes, including hard drives and SSDs; and then, hidden from the user, it has another hierarchy that has only the hard drives for the entire tree — it has only those leaves, and the intervening nodes have weights—
B
—that sum only those items. So you have a hard-drive-only root, you have an NVMe-only root, and you have an SSD-only root. And then, in the crush rules, when you write these rules, the take step used to just be take default; but now you have this added optional part that specifies the class, and that basically means that instead of picking the default root — the root for everything — it picks the root that only has the SSD items.
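The shadow-tree construction can be sketched as a filter-and-resum pass over the hierarchy. This is a hypothetical data layout — in Ceph the shadow buckets are real (hidden) buckets with their own negative ids, displayed with the `~class` suffix — but the shape of the computation is the same:

```python
def build_shadow(tree, root, dev_class):
    """Return (shadow_nodes, weight): a copy of the subtree keeping only
    leaves of dev_class, with interior weights re-summed from what's left."""
    node = tree[root]
    if "children" not in node:                  # a leaf, i.e. an OSD
        if node.get("class") == dev_class:
            return {root: dict(node)}, node["weight"]
        return {}, 0.0
    shadow, kept, total = {}, [], 0.0
    for child in node["children"]:
        sub, w = build_shadow(tree, child, dev_class)
        if sub:
            shadow.update(sub)
            kept.append(child)
            total += w
    if kept:
        # shadow buckets are denoted name~class, e.g. "default~ssd"
        shadow[f"{root}~{dev_class}"] = {"children": kept, "weight": total}
    return shadow, total
```

A `take default class ssd` step then simply starts the descent from the `default~ssd` node instead of `default`.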
B
So the user tools let you just add and remove stuff, and you only see the main tree — unless, when you're dumping the tree, you pass --show-shadow, and it shows you the shadow things. But these are all automatically generated, and in fact, when you make a change, they're thrown out and then recalculated and regenerated every time. So you never actually touch these shadow trees manually, but it lets you make those per-class roots automatically.
B
So if you look in the crush directory, there's this CrushWrapper class. This is the C++ class that wraps up all the C code, and it has the nicer, complete interface for managing all this stuff — it's got stuff to deal with all the tunables and all this annoying boilerplate. But in here there are functions — somewhere in here there's a function that basically rebuilds all of the shadow roots: rebuild_roots_with_classes.
B
Let's see — so Garv has a question: if we have two classes and the SSDs are full, will the complete cluster end up in read-only mode? That used to be the case; I believe it's no longer the case, because instead of having a full flag for the whole cluster, we have a full flag on a per-pool basis. So I think when an OSD is full, we only mark full the pools that touch those OSDs — and so the hard disk pools are unaffected.
B
All right, so the other most complicated, annoying thing, I think, is probably the choose_args — and it starts here. This was meant to deal with a case that we didn't notice for many years: an issue with the way that the algorithm is designed. So if you think about—
B
—if you think about a hierarchy as a weighted tree, where you have something at the root, and a bunch of intermediate layers, and then some leaves, all of the weights of the items sum up the tree, so that when you're at the root and you have, say, a bunch of racks below you, the weights of those racks represent the cumulative sum of all the devices nested beneath them.
B
And so you can just proportionately descend the tree based on those relative weights, and you'll end up with the right eventual distribution. If you do that for, you know, a thousand different placement groups, the number of placement groups on each leaf — each actual OSD — ends up being proportional to its weight, and it turns out the math for that works out perfectly—
B
—if that decision is a perfectly independent choice. But the problem is that we aren't placing just one copy of a placement group; we're placing n copies. For a replicated pool we have, like, three copies, usually, or whatever it is; erasure codes are even more. And so after you place the first copy, then you go back to place the second one.
B
In particular, the way that this manifests in the real world is that when you have some hosts or racks that have much smaller weights than the other ones, they tend to get too much data. So you might have, like, five racks that have hundreds of OSDs each in them—
B
—and then you have one rack that you just started to fill up: you only plugged in the very first host, and you set the weights proportionally or whatever, and suddenly those OSDs are going to have way more data than they're supposed to. It turns out there wasn't an efficient way to adjust the algorithm to account for this non-independent probabilistic selection of those leaf devices, and so we ended up sort of working around it in a backwards way.
B
Basically, for the very first replica we place, we use the weights as they originally were; but when we place the second replica, we use a slightly modified set of weights, where those OSDs in the mostly empty rack have a slightly lower weight. So we have a lower probability of picking them with sort of the baseline algorithm — which, because of its flaw, will end up giving them more — and it'll all sort of balance out.
B
So we're basically just trying to compensate for that bias, and then for the third placement we do the same thing even more, and so the idea is that it'll net out in the end. So that's sort of the workaround for that particular thing.
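That bias is easy to reproduce with a toy simulation — weighted draws with retry-on-collision, which has the same shape as placing n distinct replicas (this models the problem, not the CRUSH code itself):

```python
import random

def pick_n_distinct(weights, n, rng):
    """Weighted draws, retrying on collision, like placing n replicas."""
    chosen = []
    while len(chosen) < n:
        i = rng.choices(range(len(weights)), weights=weights)[0]
        if i not in chosen:
            chosen.append(i)
    return chosen

rng = random.Random(0)
weights = [100, 100, 100, 100, 10]    # four big racks and one nearly empty rack
hits = [0] * len(weights)
trials = 20000
for _ in range(trials):
    for i in pick_n_distinct(weights, 3, rng):
        hits[i] += 1

# If each pick were independent, the small rack's share of placements would
# match its share of the total weight; with distinct picks it gets more.
fair_share = 3 * trials * weights[-1] / sum(weights)
```

In a run like this the small rack lands well above its weight-proportional share — exactly the skew that the per-position weight sets are pre-computed to cancel.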
But the question was: how do we implement that in the algorithm? And the way we do it is by using something called weight sets. So normally a bucket has—
B
—choose_args is basically an array of different weight sets: the first weight set would be the one that we use for the first replica placement, the second weight set would be the one for the second replica placement, and so on, and those weights get more and more skewed for subsequent placements. So this is just a way to store what those weights are; they have to be pre-calculated ahead of time, when you generate the map. And then, finally—
B
This choose_args thing would still be for one particular bucket — one node in the tree — and then you have a map that basically has one of these for every bucket in the map. In total, it basically allows you to have alternate sets of weights to use for different steps when you're doing your calculation.
B
So the result, then, is that this crush_choose_arg_map is passed into do_rule. In mapper.c, crush_do_rule — we pass in this choose_args, which is basically the array of all of these for all the buckets, and then all of these functions, you'll notice, get passed the choose_args; it gets parameterized all the way down, so that—
B
—finally, when you're doing, say, the straw2 placement, we have the choose_args and we actually go and look up the particular weights for this particular selection. This will either give us back the bucket weights, or, if this choose_arg thing actually exists and is defined and has a bunch of alternative weights, then we use that instead.
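That lookup amounts to a few lines — hypothetical shapes for the structures, but the same logic: the weight set is indexed by the result position and clamped to its last entry, falling back to the bucket's own weights when no override exists:

```python
def weights_for(bucket, choose_args, position):
    """Weights to use for this bucket when picking the replica at `position`."""
    arg = choose_args.get(bucket["id"]) if choose_args else None
    if arg and arg.get("weight_set"):
        ws = arg["weight_set"]
        return ws[min(position, len(ws) - 1)]  # clamp: reuse the last set
    return bucket["weights"]                   # no override: plain bucket weights

bucket = {"id": -1, "weights": [1.0, 1.0, 1.0]}
choose_args = {-1: {"weight_set": [
    [1.0, 1.0, 1.0],   # first replica: weights as they really are
    [1.0, 1.0, 0.8],   # later replicas: the nearly empty rack skewed down
]}}
```

With no choose_args at all, every position just sees the normal bucket weights, which is why old maps keep working unchanged.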
B
So instead of — normally you think of the tree as having a defined set of children — it could be that when you're at a different step of the way, you want those children to be in a different order, or just different; or maybe you have some totally other reason to pass in different parameters — a totally different tree structure or something, I don't know. And so we also made it so that you could parameterize these identifiers too. This isn't fully — well, this is all implemented in the mapping code.
B
So if you generate a map that does this, it'll spit out a result; but the code that actually manages the map — nothing implements actually creating these, because we don't have a use case for it. The idea was just to try to future-proof it and put it in place in case we needed it later, so that we wouldn't have some annoying compatibility problems.
B
So that's how the algorithm does it. In practice — again, I'm not sure that we actually got all the way to using this; I can't remember to what extent the balancer does it — there are basically two ways that the weight sets can be used. The first way is to create what's called a compat weight set — a compatible weight set — and that's a weight set where there's basically only one set of weights.
B
We have just one alternative set of weights that we use instead, and we use it for every step of the way. The reason the compat weight set exists is that it allows us to run these optimization algorithms that make incremental adjustments to the weights of different items in the tree in order to tweak the placement, so that you get a very even distribution.
B
And so this is one of the modes for the balancer, where you can generate these compat weight sets: it runs this optimization, makes the PG distributions on the OSDs very, very close, and stores the result as a weight set with exactly one position. The benefit is that when you encode the crush map for an old client that doesn't have support for weight sets or any of this stuff, it can just put that one alternative set of weights into the regular, legacy-encoded map as the actual weights.
B
It used to be that that was the default mode for the balancer module; I think we just recently switched that to upmap, which is a different way of optimizing placement, because it works a little bit better. But the benefit of having that legacy — what is it called, compat weight set — mode was that it would work with pre-Luminous clients: older kernels, older VMs, and stuff.
B
It's probably not so much of an issue now, because that was like four years ago or whatever, and everybody's upgraded their clients by now — but that's why it worked that way. But going back to the original goal of these weight sets, which was to handle this case where you had a small rack next to a whole bunch of large racks, and it would get too much data.
B
C
B
A
There, with the weight sets, and being able to vary that by position — that seems like it could be used for some kind of very interesting placement algorithm that might be very different from the existing CRUSH.
B
When you actually decompile the crush map, you can see these choose_args things down here. This is negative one — and, by the way, you can store multiples of these choose_args in the map; I think they're associated with the pool, I can't quite remember how Ceph indexes them. But basically, for any given bucket, you can have these different alternate weight sets. In this particular cluster it's a compat weight set, so there's only one of these.
C
A
I'm trying to think more about what you could do with the weight sets — like, what kind of placement policy might you want there that could be useful?
C
B
So maybe you're doing some weird thing with them — say you have a pool that's IOPS-intensive and another pool that's capacity-intensive, and you're trying to layer the two over the same set of devices and optimize for both at the same time, so you're balancing both IOPS and capacity.
B
C
B
Yeah — I mean, the fact that the weight set is based on which choice you're making... where is it? So this is for one particular selection, and this is for—
B
—position zero, position one, position two: they have all these different arrays. So it might be that this one is the primary, basically — you might have one set of weights for the primary, and then this one is the non-primary weight, and you have that skewed so that the net balances out, even though you're prioritizing primary selection in one particular place. You could do that there. We have sort of another mechanism in the OSD map that tries to let you control the primary probability, but this could shift further down into CRUSH.
A
C
B
Yeah, yeah — I mean, the challenge with CRUSH is that it's kind of a black box: a number goes in and it spits something out, and you have all these knobs to try to control what the distribution of those outputs is, but you don't have very fine-grained control. And so it's worth noting that when Ceph calculates a placement, CRUSH is sort of at the deepest level, but there's a whole bunch of layers of other stuff on top.
B
So maybe it makes sense to look at those really quick. Actually, let's look at where the OSD map calls do_rule — it's pg_to_raw_osds, which is sort of the bit where we finally actually call CRUSH.
B
So
it
generates
an
integer
from
your
pool
id
and
pgid.
That's
that
x,
value
that
gets
fed
in
here
figures
out
what
the
rule
is
and
whatever
it
passes,
in
the
osd
weights
and
so
on.
But
you
know
after
we
run
crash,
we
remove
any
non-existent
osd's.
Maybe
the
crush
map
doesn't
fully
agree
with
the
osd
map.
So
it
filters
that
on
out
that's
these.
There
are
there's
a
primary
osd
that
gets
selected
sort
of
in
the
degenerate
case.
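The flow just described, hash the pool id and placement seed into an x value, run the rule, then drop OSDs the OSD map doesn't know about, could be sketched like this. Everything here is illustrative: the crc32 mixing and the stand-in do_rule are not what Ceph actually computes, and the function names only echo the ones mentioned above.

```python
import zlib

def pg_to_x(pool_id, ps):
    # Ceph stably hashes the placement seed and mixes in the pool id;
    # here we just sketch the idea with crc32 (illustrative only).
    return zlib.crc32(ps.to_bytes(4, "little")) ^ pool_id

def crush_do_rule(x, rule, size):
    # Stand-in for the real CRUSH descent: deterministically derive
    # `size` pretend OSD ids from x.
    return [(x + i * 7919) % 10 for i in range(size)]

def pg_to_raw_osds(pool_id, ps, size, existing_osds):
    x = pg_to_x(pool_id, ps)
    raw = crush_do_rule(x, rule=0, size=size)
    # The crush map may mention OSDs the OSDMap no longer has: drop them.
    return [osd for osd in raw if osd in existing_osds]

print(pg_to_raw_osds(1, 0x2A, 3, existing_osds=set(range(10))))
```

The filtering step at the end is why the result can come back shorter than the requested size when the crush map and the OSD map disagree.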
B
We apply the upmap here: the pg_upmap and pg_upmap_items layering. There's also one other step in here. Maybe it's just this, to remove non-existent... no, there's a step that also reorders by the primary affinity. Yes, okay. Where is that called?
B
So you apply the upmap, you filter out the OSDs that are down, and then you pick a primary, and then you apply the primary affinity, which basically means, for replicated pools at least, that if one OSD is chosen as primary but it has a lower primary affinity, there's some probability it'll just sort of be passed over.
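That "passed over with some probability" behaviour could be sketched like this. Again a toy, not Ceph's implementation: the hash, the function names, and the fallback rule are all invented for the example. The key property is that the decision is deterministic per (pg, OSD) pair, so repeated mapping calls agree.

```python
import hashlib

def stable_unit_interval(pgid, osd):
    # Deterministic pseudo-random value in [0, 1) per (pg, osd) pair.
    h = hashlib.sha256(f"{pgid}:{osd}".encode()).hexdigest()
    return int(h[:8], 16) / 2**32

def pick_primary(pgid, acting, affinity):
    # Walk the acting set in order; an OSD with affinity a keeps the
    # primary role with probability a, otherwise we move on.
    for osd in acting:
        if stable_unit_interval(pgid, osd) < affinity.get(osd, 1.0):
            return osd
    return acting[0]  # everyone declined: fall back to the first

print(pick_primary("1.2a", [0, 1, 2], {0: 0.0, 1: 1.0}))  # osd 0 always declines
```

With affinity 1.0 an OSD always accepts; with 0.0 it never does, so primary duty shifts to later members of the acting set without moving any data.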
B
That compiles and decompiles crush maps to and from that text format that looks vaguely like C. In here there's a grammar for Boost Spirit that defines what the language looks like, and the compiler basically goes through and compiles or decompiles it back and forth.
D
So I had a question: from a user perspective, if they are not using the crush-compat mode of the balancer, how much does the weight set matter at the moment?
B
What else? Yeah, there's still so much legacy cruft here. It feels like eventually there's an opportunity to force people away from these legacy tunables as time goes by, and then eventually drop support for them. Same thing for these legacy bucket types: there are the old straw buckets, and I think those actually should all be gone, because I think we just programmatically went and changed them at some point. But list buckets and uniform buckets, these are all mostly useless.
B
They just have this constraint where you have a fixed-size node in the tree; you can't add or remove items from it, because if you do, everything will get shuffled around. But if that were the case, then eventually we could strip out a whole bunch of the annoying compatibility tunable code and simplify a bunch of this stuff.
D
C
B
So Matthew in the comments notes that it used to be that for the straw to straw2 conversion you had to do it explicitly, manually, during an upgrade. I can't remember if we decided to just do that for users if they hadn't done it yet. Yeah.
A
B
C
B
D
I guess I have a couple of general questions, if other folks don't have any. One of the biggest concerns people have with CRUSH, or with changing anything in CRUSH, is data movement. So what are the absolute no-nos with respect to crush changes, where people should be really careful about whether they need to make the change at all, because it can lead to massive data movement?
B
Right, so the big, big no-no is: don't change the identifier for a bucket.
B
So when you decompile a crush map, the first line there is "id -1" or "id -10" or whatever it is. Don't change that, because that id is one of the inputs to the pseudorandom function that picks an item when you're descending through the bucket. If you have two different but otherwise identical buckets, you'll choose different children for the same placement, so that id has to be part of the input. But if you change it, then it'll just shuffle everything at that point in the hierarchy.
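The point about bucket ids can be demonstrated with a toy straw-style draw. The hash here is sha256, not CRUSH's actual rjenkins hash, and the bucket layout is invented, but the mechanism is the same: because the bucket id is mixed into the hash input, renumbering an otherwise unchanged bucket reshuffles most of its choices.

```python
import hashlib

def draw(x, item, bucket_id):
    # The bucket id is part of the hash input, exactly so that two
    # otherwise-identical buckets make independent choices.
    h = hashlib.sha256(f"{x}:{item}:{bucket_id}".encode()).hexdigest()
    return int(h[:8], 16)

def choose(x, bucket_id, children):
    # Straw-style selection: every child draws a value, highest wins.
    return max(children, key=lambda c: draw(x, c, bucket_id))

children = ["osd.0", "osd.1", "osd.2"]
old = [choose(x, bucket_id=-1, children=children) for x in range(1000)]
new = [choose(x, bucket_id=-9, children=children) for x in range(1000)]
moved = sum(a != b for a, b in zip(old, new))
print(f"{moved} of 1000 inputs map differently after renumbering the bucket")
```

With three equally weighted children, roughly two thirds of the inputs land on a different child after the id change, which is exactly the "shuffle everything at that point in the hierarchy" effect.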
B
D
Okay, I guess my second question was: what kind of debugging tips do you have around the crush code? If something looks like a potential crush issue, how do you go about it, at least?
B
Yeah, so that's a good one. In the mapper code there's this dprintk define that's commented out, so I basically comment this back in, and then I go and compile crushtool. And then with crushtool you can do... there's a test-map-pg, something like that, or maybe that's an osdmaptool command, actually. But with one of these you can basically put in the placement id.
B
I
think
it's
hosting
map
tool,
though
let's
do
it
by
my
iosd
map,
that's
the
specific
pg
that
seems
to
be
problematic
and
then
you'll
get
a
bunch
of
stuff
to
standard
out
that,
like
lets,
you
trace
its
way
through
all
this
code
see
all
these
d
print
case
statements
dotted
all
over
the
place.
B
Its output is really hard to follow, because there are so many nested loops with very subtle differences in the parameters. So it's hard to follow, but that's what there is.
B
Yep. And I guess the last bit of advice would be: if you can get away with it, don't edit the crush map manually; use the CLI commands. There's a whole set of them, like ceph osd crush rule dump for dumping all the rules, and for renaming buckets and adding and adjusting all this stuff.
B
Add and remove items, or whatever. The osdmaptool... or sorry, not the osdmaptool, not the crush map itself: the crushtool lets you add and remove all these things.
B
The crushtool also has this test function, which will calculate placements over a bunch of synthetic inputs and give you a distribution of outputs. You can also do show-mappings, and that'll show you every input it tried and the actual output it got. And one of the things you can do, if you want to test some major change to your crush map...
B
You can pull your crush map out of the cluster, use the crushtool to make some modification or recompile it or whatever, and then basically run tests on the old map and on the new map, compare the placements, and see how much it's changed. So you can do show-mappings on the old one and show-mappings on the new one, then have diff count the number of lines that are different, and that will give you some sense of how much data is going to move.
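That diff-counting step is easy to script. Here the two line lists are synthetic stand-ins for real `crushtool --test --show-mappings` output, just to show the comparison; only the counting logic is meant literally.

```python
def count_changed(old_lines, new_lines):
    # Pair up mappings line by line and count the ones that moved.
    return sum(a != b for a, b in zip(old_lines, new_lines))

# Synthetic stand-ins for show-mappings output from the old and new maps.
old = [
    "CRUSH rule 0 x 0 [1,4,7]",
    "CRUSH rule 0 x 1 [2,5,8]",
    "CRUSH rule 0 x 2 [0,3,6]",
]
new = [
    "CRUSH rule 0 x 0 [1,4,7]",
    "CRUSH rule 0 x 1 [2,5,6]",  # only this placement moved
    "CRUSH rule 0 x 2 [0,3,6]",
]

changed = count_changed(old, new)
print(f"{changed}/{len(old)} mappings changed")
```

The fraction of changed lines is a rough proxy for the fraction of data that will migrate when the new map is injected.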
C
A
Going back to the other question about any other future improvements you see: is there anything at the higher levels, beyond crush itself?
B
It would be nice if there was a way to do that that wasn't so tedious, probably involving some other tool that tries to automate that process. That might be nice.
B
Cases... well, I guess the optimizer, the balancer: there are opportunities there to improve it. There was a thread on the dev list earlier this week, or just recently, about how our current optimizer optimizes on a per-pool basis to get plus or minus the right number of PGs on every OSD, but it does that independently for each pool, and so it might be that across a whole bunch of pools you end up with OSDs that have too many PGs.
B
So we need sort of a global optimization as well; that's something that we can certainly do. The balancer stuff also looks primarily at PGs, and it doesn't look at the actual net usage of space, which might vary from the expected usage; the expected versus the actual usage, I think, still isn't taken into consideration.
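The per-pool pitfall is easy to see with numbers. These figures are made up for illustration, not from a real cluster: each pool is individually within one PG of its own ideal, yet the per-pool remainders pile onto the same OSD.

```python
# PGs per OSD for two pools, each individually balanced to +/- 1 PG
# of its own per-pool ideal (toy numbers, three OSDs).
pool_a = {"osd.0": 11, "osd.1": 10, "osd.2": 10}  # ideal ~10.3 each
pool_b = {"osd.0": 11, "osd.1": 10, "osd.2": 10}  # ideal ~10.3 each

# Aggregate across pools: the +1 remainders both landed on osd.0.
totals = {osd: pool_a[osd] + pool_b[osd] for osd in pool_a}
print(totals)  # {'osd.0': 22, 'osd.1': 20, 'osd.2': 20}
```

A global optimization pass would notice that osd.0 is over-full in aggregate and push one of the two extra PGs onto a different OSD, even though each pool already looks balanced on its own.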
A
D
Yeah, I just added a link to a PR: there was this recent wrapper tool called crushdiff that got added. It's a wrapper around the crushtool and osdmaptool which makes things a little simpler than before.
E
Yeah, it would be good as well if the crushtool understood JSON, because right now it's really hard to manipulate. You can get the output of the crush map in JSON, but if you want to map it to, let's say, a Go type, and then marshal it back, you cannot really import it or manipulate it with the crushtool. So that's...