From YouTube: 2017-AUG-02 :: Ceph Developer Monthly
Description
Monthly developer meeting for the coordination of Ceph project development.
http://tracker.ceph.com/projects/ceph/wiki/Planning
B
Sure. I don't know, there's not a whole lot — oops — not a whole lot yet. Um, I can give a really quick luminous update: I think we're getting really close to release. Josh and I did a bug scrub this morning and marked the last few RADOS bugs that we consider blockers as immediate — about half of them. Some of them already have fixes ready; there are just a handful of other ones.
B
We want to either diagnose and demote or fix those, so I'm hoping for a luminous release beginning next week. It should be good, and then we can head off towards mimic. I guess that's about it. There'll be a few things people have been working on that we will probably keep backporting to luminous; some of the manager modules are sort of trivially separable and will get backported, like the balancer module that I'm working on, which we'll talk about in a little bit. But otherwise, I'm pretty excited about luminous and looking forward to getting it out there and in people's hands. That's all I have, really. The link to the agenda was just posted, and it has quite a few items, so we'll probably want to keep moving and get through it.
C
So hopefully everyone can see that the idea of this is not to dethrone any of the various Ceph management projects that exist, but to provide a sort of basic, simple one that's baked in — so that people who just need a GUI have one, but also so that people who want to implement interesting functionality have a place to do it. So if you have, say, a cool graph that you want to expose, or the status of your feature, you can add a page for it so you can show it off to people.
C
There are a few things that are fetched remotely, live. So if I was to click the detail button next to the clients — well, I don't know how many clients are about at the moment, but if I did, this would be populated via the manager module doing the equivalent of a `ceph tell`, sending a command message out to the daemon to get the list of clients. So you can do that stuff from the dashboard as well.
C
Some of the stuff here is enabled by new stuff under the hood. Notably, you can now see the status of RGW and RBD daemons. That stuff isn't all in the dashboard yet — I think someone will probably create an RGW page pretty soon — but it used to not be possible at all, because the manager didn't know anything about RGW daemons. There's now a new structure in the manager called the service map; that's accessible to modules, but it's also accessible from the command line.
C
That kind of thing. The other thing that's gone in to enable stuff in the dashboard is listing RBD images, which actually used to be pretty hairy, because you didn't know which pools to look at — you didn't know up front which pools were RBD pools, so you didn't know where to go and run the RBD command to list images. So there's a new concept of application tags that apply to pools. This is pretty transparent most of the time, but you can also drive it manually; there are these new commands for it.
C
Let me see, what else? There are lots of improvements to logging. A few people I see on the list have already noticed that it used to be that if you typed `ceph -w`, which follows the cluster log, you would pretty much just see a continuous view of PG map summaries, which wasn't terribly useful.
C
The other big new thing in the logs is that all of the health statuses that you can have — all of the various things that can show up at the top of your `ceph status` — now generate log messages as well. It used to be that if you had something going unhealthy on your system, it would happen, and if you happened to look at `ceph status` while it was bad, you would know about it; but if you didn't, and it went healthy again, there was no indication of that anywhere in your log.
C
So now all of the possible places where we raise health alerts — they're called health checks now; I have to keep getting that right — have a unique code, and when a health check of that type gets raised, you get a log message saying this condition is now failing, and then, when it clears, you get another message saying this is now clear. That hopefully makes it a lot easier for people to interrogate the log to work out what happened when on their system.
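As a hedged illustration of how those paired messages could be consumed, here is a small parser. The exact log-line wording in the regexes is an assumption modeled on the luminous-style "Health check failed:" / "Health check cleared:" messages, not taken verbatim from this talk.

```python
import re

# Pair up "failed"/"cleared" cluster-log lines by their unique health-check
# code, so we can tell which conditions are still active at the end of a log.
FAILED = re.compile(r"Health check failed: (?P<msg>.+) \((?P<code>[A-Z_]+)\)")
CLEARED = re.compile(r"Health check cleared: (?P<code>[A-Z_]+)")

def active_checks(log_lines):
    """Return the set of health-check codes still failing after the log."""
    active = set()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            active.add(m.group("code"))
            continue
        m = CLEARED.search(line)
        if m:
            active.discard(m.group("code"))
    return active
```

Because every check has a unique code, this kind of interrogation becomes a simple set operation rather than guesswork over free-form text.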
C
Well, I've actually only got one recent line here, and it's the message to the log telling me that a health check failed. So this is a generic thing: every time one of these fails, you get a log message that says "Health check failed" and then the text of it. There's a corresponding blob in the status, and if I ask for health with pretty-printed output...
C
This has changed again, ever so slightly, at the last minute before we release luminous: the message now, rather than just being a string, is an object called summary, which has a message attribute. That's to make it extensible, so that we can add more fields post-luminous without breaking backwards compatibility. Again, if you have tools that need the old format of the health output, there is a setting you can set to make the monitors output the old format alongside the new format, so you just get both sets of fields in the JSON output.
C
So there's, like, a PG degraded, a PG availability, and a PG damaged health check, and the various different PG states get assigned to one of those. So you have a much shorter output in your `ceph status`, and you do still have access, obviously, to all the detail if you just, you know, go and use the existing PG commands to look at it. That also matters during initial cluster setup.
C
There are a bunch of health checks which used to trigger during setup of the cluster, complaining about not enough PGs, or OSDs being down, or whatever, and they've all been sort of massaged so that they don't trigger under those circumstances. So there are checks that will now only complain about having a bad number of PGs if there are actually some objects in the pool, and that kind of thing, so that when your system initially comes up you don't see a bunch of scary warning and error messages.
C
So there's a module that's actually always been there, but it wasn't switched on by default before — hopefully it is switched on by default now; yeah, yeah. This is a module that was mainly written as a demonstrator for the manager to begin with, but it's now switched on by default. It's just called the status module, and it has two commands, `ceph fs status` and — is it just called `ceph osd status`? possibly; yeah, there you go — which give you slightly friendlier, colorized summary views.
C
This isn't meant to blow your mind, that we have this couple of commands. It's meant to make you think: hey, I could add something cool there. So when you have, you know, stuff you would like to monitor for the features that you're working on, it's ridiculously easy to go and add these commands. It's all just Python — like, a few tens of lines of Python in one of the existing modules, whether it's the status module or the dashboard module or wherever, or your own module. You can create this stuff really easily now. Yeah.
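As a hedged sketch of what "a few tens of lines of Python" can look like, here is a minimal manager-module shape. The class, command name, and returned text are invented for illustration, and the stand-in `MgrModule` stub exists only so the sketch runs outside ceph-mgr; the real base class and calling conventions are provided by ceph-mgr itself.

```python
# Illustrative sketch of a tiny ceph-mgr module; names are hypothetical.
try:
    from mgr_module import MgrModule  # real base class inside ceph-mgr
except ImportError:
    class MgrModule(object):  # stand-in stub so the sketch runs standalone
        def get(self, data_name):
            return {"osds": [{"osd": 0}, {"osd": 1}]}

class Module(MgrModule):
    # Commands this module would register with the ceph CLI
    COMMANDS = [
        {"cmd": "hello status", "desc": "Show a tiny summary", "perm": "r"},
    ]

    def handle_command(self, command):
        # self.get() fetches cluster state (e.g. the OSD map) from the manager
        osd_map = self.get("osd_map")
        return 0, "%d OSDs in map" % len(osd_map["osds"]), ""
```

The point is the shape: declare a command, implement one handler, read cluster state through the manager, and return your summary.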
C
So the config options work has changed. You may not even notice this: there used to be a file called config.h that had a whole bunch of preprocessor macro calls in it, and that's gone — or rather, it has been renamed to legacy_config_opts.h — in order to have a nice C++ class definition of all the options, which is now in options.cc. That's not just for fun; the reason for making the change is so that we can add a lot more fields to the options, things like minimum and maximum thresholds and human-readable description strings.
C
So all the options now come with a description string and a long description string, which currently aren't being used to build the documentation but ultimately will be, and that stuff's all available at runtime through a new `config help` command. And now, if I do that — there, a complete help — I get this huge list of all of the possible options in the system and all of the metadata about them.
C
You'll notice most of the descriptions are blank at the moment, though; the infrastructure has gone in, but the work of actually going through and typing all the descriptions in is still ongoing. That should be fairly easy to backport to luminous, even if folks don't get around to writing all their doc strings beforehand.
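To illustrate why structured option metadata (rather than bare preprocessor macros) is useful, here is a hedged, Python-flavored sketch of the idea. The option table and `set_option` helper are invented for illustration and are not Ceph's actual C++ implementation; the bounds shown are made-up examples.

```python
# A structured option table: each entry carries type, default, bounds, and a
# human-readable description, so help output and validation come for free.
OPTIONS = {
    "osd_max_backfills": {
        "type": int, "default": 1, "min": 1, "max": 64,
        "desc": "Maximum concurrent backfills per OSD (illustrative bounds)",
    },
}

def set_option(current, name, value):
    """Validate a new value against the option's metadata before applying it."""
    meta = OPTIONS[name]
    value = meta["type"](value)
    if not meta["min"] <= value <= meta["max"]:
        raise ValueError("%s out of range [%d, %d]"
                         % (name, meta["min"], meta["max"]))
    updated = dict(current)
    updated[name] = value
    return updated
```

With metadata like this in one place, a `config help`-style command is just a dump of the table, and min/max checking happens at set time instead of silently misbehaving later.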
B
I haven't actually used that one, Sage — what does it do? This one, instead of the help, just dumps the config settings that are set and that differ from the defaults, so those are the ones that matter, basically. Previously you'd have to go do `config show`, and it would show every config option, which wasn't terribly helpful. This will just tell you the ones that have been modified, which is usually what you want to know.
C
This is also — as well as adding lots of metadata — a precursor to the move, which hopefully will happen in the reasonably near future, to store all the config options centrally on the monitors and then have all of the daemons consume them from there, rather than each loading them locally from a text file.
C
Okay, so you can imagine this is going to be pretty useful if you're in the middle of an upgrade, or you're dealing with a user who might not be 100% sure about which versions they're running, or might not be completely accurate about which version they're running. This is a very quick way to just interrogate that.
B
There's a `features` command that's similar: it pays attention to the feature bits implemented by the clients that are connected. Knowing what the daemons are is less useful; the main use here is the clients, so you can tell if you have Firefly clients connected, or, you know, Jewel clients or whatever — it tries to infer what release each connection is based on the feature bits it says it supports.
C
Oh, it's probably worth mentioning the OSD destroy and purge stuff. That was done a while ago now, but it fits in this general category of commands that are easier to use and make things easier. So there's faster OSD replacement, and OSD removal, without typing as many commands and without doing as much data movement.
B
The main thing is that when you have a failed disk and you want to replace it, it's usually best to preserve the same OSD ID, so that the CRUSH mapping doesn't shuffle data around, and those commands allow you to do that. So you can do `destroy`, which marks the OSD — it sets a flag that says it's been destroyed, and removes its cephx keys and stuff, but leaves it in the CRUSH map. Then you use the prepare step:
B
You can pass in that old OSD ID to reuse via the OSD ID argument, and it will check whether it's something it can reuse — because it's been previously destroyed — and, if so, it'll let you reprovision it with a new set of cephx keys and so forth. So it's useful both for failed disks and also if you're doing any conversion of FileStore OSDs to BlueStore OSDs; you can use that command for that too. And then there's another one, `purge`, that will just remove all trace of the OSD.
B
There's a set of commands that run all the balancing code. There used to be the reweight-by-utilization command built into the monitor, which generally worked, but it was pretty fragile and you had to sort of trust that it was going to do the right thing, and it's still pretty primitive. We have a bunch of new tools now that do a much better job of optimizing the crush weights and layout and distribution and so on, but they're not easy to use.
B
So the idea with the balancer module is that there'll just be a set of commands that make it much, much simpler to use. Okay — am I sharing my screen, my terminal? I'll make it big. So there's a set of commands; the basic idea is that once the module is enabled, you can set the balancer mode to upmap.
B
That's the only mode currently working right now. There's a status command that tells you whether it's working. Eventually you'll just turn the balancer on and it'll, in the background at all times, check your distribution and make small changes as needed, to make sure you're evenly distributing all your data across all your devices — that's the hands-off approach we'll eventually want. That's not there yet, but in the course of implementing it we built the pieces.
B
So you can break it down into a series of steps and actually tell what it's going to do — make sure it does the right thing. The idea is, you run the optimize command and you give it the name of the plan you're going to create, say "foo"; it'll decide what to do, run the optimizer, take a bunch of steps, and actually show you what the plan wants to do. So in this case, since my mode is upmap, it's using the pg-upmap exception list.
B
You know, after that step it should be completely balanced. Yeah — you'll notice they're at exactly 24 PGs each, so it's done; that's why that plan was empty, there's nothing really to do. Then there's a new eval command — it's only half-implemented right now, but it basically does all the statistical analysis to figure out what the distribution of data is.
B
The basic idea in doing this is that all the infrastructure is being built into the manager module interface, so that you can, in a sandbox, figure out what changes you would want to make and then evaluate what the result would be, and it will score it — so that you can build an optimizer that runs on top of that: for example, one that does a gradient descent on the crush weights to get to a better distribution.
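As a hedged sketch of the "score a candidate distribution" idea — these helpers are invented for illustration, not the balancer module's actual code — a plan could be accepted only if its simulated per-OSD utilization scores better than the current state:

```python
# Lower score = more even data distribution across OSDs.
def score(utilizations):
    """Root-mean-square deviation of per-OSD utilization from the mean."""
    mean = sum(utilizations) / len(utilizations)
    return (sum((u - mean) ** 2 for u in utilizations)
            / len(utilizations)) ** 0.5

def better(current, proposed):
    """Execute a plan only if its simulated distribution scores better."""
    return score(proposed) < score(current)
```

This is the sandbox workflow in miniature: simulate the proposed change, score both states, and only execute if the score improves.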
B
Out of that, it would have a proposed set of new weights as part of the plan; you can see whether that scores better than your current cluster, and, if so, you could execute the plan and it would actually adjust those weights. So it's still a work in progress, but we're getting that infrastructure in place and trying to do it in a way that takes into consideration all the complexities that have previously been ignored.
B
You'll notice this breaks things down by root — which is, essentially, the roots in the crush tree; "default" in this case — and then also by pool, and it does all the analyses for both of those. So, for example, if you use the new crush device class feature, where you have some devices tagged as hard disks and some tagged as SSDs, and you have crush rules that distribute just across those devices, then when it does its analysis of the data distribution it'll take that into consideration.
B
It'll move these PGs around — it knows how big each PG is and how much data is going to move, so it can predict what the usage is going to be afterwards, which is all good. It'll have to be a little bit careful, because the omap stuff isn't totally accounted for.
B
So the idea is that you set it at, like, three percent or something by default, and it'll only make small changes, so that no more than 3% of the PGs are rebalancing and moving data around at any time — you can make it go slow, basically. For the other mode — the one people will actually use, the crush-compat mode, which optimizes crush weights — it'll inherit the ability to ramp crush weights up from zero to whatever the actual weight should be, by starting them at zero.
B
And all the pieces are there to do that, but that mode isn't really implemented yet. Eventually, yes, the idea is that when you have a new OSD, you'll basically say that the target weight for it is the size of the device — you know, like four terabytes or whatever — but the effective weight that it's actually using will start out at zero, and then it'll slowly ramp that up, trying to stay under that three percent threshold or whatever it is, so that it fills slowly over time.
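The gradual-fill behavior described above could look something like this hedged sketch; the function and its per-iteration "budget" are illustrative stand-ins for "no more than ~3% of the data moving at once," not the module's real code.

```python
# Step each under-weighted OSD's effective crush weight toward its target,
# capping the total weight change per iteration at max_step of total weight.
def ramp_weights(effective, target, max_step=0.03):
    total = sum(target.values())
    budget = max_step * total  # how much weight may change this iteration
    new = dict(effective)
    for osd, tgt in target.items():
        cur = new.get(osd, 0.0)
        delta = min(tgt - cur, budget)
        if delta <= 0:
            continue  # already at target, or no budget left
        new[osd] = cur + delta
        budget -= delta
    return new
```

Called repeatedly (e.g. by a background balancer loop), this fills a freshly added OSD over many small steps instead of triggering one huge data migration.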
B
I'm planning on making it so that, for people who have been using the old method — the OSD reweight-by-utilization thing — when they switch to the new balancer module and do the crush weight optimization, it'll back off the old corrections as it starts applying the new ones, so there will be a smooth transition between those two mechanisms.
B
I guess the only thing I'm not sure that would take into consideration is: if you have a cluster with, like, a thousand OSDs and you put in one OSD, that's like 0.1% of the data, and so a three percent threshold basically means I'm going to be hammering on that one OSD. I wonder if we will also want, like, a per-OSD threshold or something. I don't know if it matters that much, really, because all the existing recovery scheduling stuff is still there.
B
Yep. So the current code is basically mirroring the calculations for PG count, for the number of objects, and for the number of bytes. It could be that in some cases — if the pool is full of omap or something — we would just balance objects instead of bytes. The problem is that you can't then equate the two if you're having to mix them, and when you're doing the crush-compat thing you have a single set of crush weights.
B
With a single set of crush weights, you're actually optimizing based on the crush roots instead of on a per-pool basis. So my hope — my plan — is to eventually have it build a model of average object cost on a per-pool basis, where we would basically try to infer what the omap size is by solving for the unknowns: I can see that the OSDs are using up this much metadata space, and I think I can infer from that where it's coming from.
B
That is, which pools that usage maps to, or something like that. And at least, if the obvious model just disagrees with reality — where my model says that this host should be at 2% and really it's at, like, 40% — then I know that I don't understand where the usage is coming from, and then I can stop.
B
So this came up in one of the discussions on the usability call a couple weeks ago, and the question was basically about what we can do given that we can't merge PGs. The motivation is that eventually we want users to not have to think about PG counts at all — the system would just automatically adjust them up or down as needed. The problem with that right now is that we can't merge PGs, so the question was: if we can't do that, and we overshoot the PG count and want to scale it back down,
B
can we just adjust the pgp_num — which is the placement — back down? The PGs are then still separate, but stored next to each other. Does that get us almost as good a result? And I think the answer is: almost. They're still separate PGs, so you still have twice as many peering messages going back and forth, but as far as the data placement goes,
B
it's all the same, so the reliability implications don't change. And if we change the way the OSD allocates memory to PG logs, then the memory footprint won't change that much either, because the per-PG metadata on the OSD is pretty small; it's really the PG log that makes each PG consume a lot of memory.
B
But if we are smart about that, so that it doesn't use so much — it just keeps fewer PG log entries per PG, for example — then we could probably get much closer. So I think that's not ideal, but it's still probably better than forcing all users to think very carefully about PG counts and then getting it wrong. Then the question is: how do you make a set of policies and heuristics that automatically choose a PG value dynamically and adjust it over time, so that users don't have to think about it?
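One starting point for such a heuristic — hedged, and not a formula Ceph ships; the ~100-PGs-per-OSD figure is the common community rule of thumb — could look like:

```python
# Suggest a pool's pg_num from cluster size, replication factor, and the
# pool's expected share of the cluster's data.
def suggest_pg_num(num_osds, replica_count, pool_share, pgs_per_osd=100):
    target = num_osds * pgs_per_osd * pool_share / replica_count
    # Round up to a power of two, with a small floor for tiny pools.
    pg_num = 1
    while pg_num < max(target, 16):
        pg_num *= 2
    return pg_num
```

The `pool_share` input is exactly the kind of user-provided hint discussed below: the expected fraction of the cluster this pool will use.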
B
The proportion of the total PG count that you want — something like that, at least as a starting point. But I think John had some more specific ideas about how he wanted this to look to a user. Am I remembering right that you wanted it to be sort of tied in with an application-level hint?
C
My thinking was that we would want to not just respond to the size of the data in the pool, but get input from the user about how big they expect the pool to be — or at least what percentage of their cluster they expect the pool to use — so that in the relatively simple, probably quite common cases, where someone has one or two applications using their whole cluster, we're not running around adjusting pg_num when we could have just originally been told by the user: hey, this is my RGW, I'm going to use half my cluster for it.
C
"This is going to be X percent of my cluster" — and then have some kind of rule that says how big the various pools should be; like, for CephFS, how big the metadata pool should be as a proportion of the data pool, as an initial guess, and then the automatic adjustment patterns itself on that input. So it would be like: if they said they wanted 100 terabytes of CephFS, and we have a rule that says 1% of that should be allocated to the metadata pool.
B
If you have the user input about what they expect, and then you have what you actually measure in the cluster, those are the two inputs for deciding how to adjust up or down. In the absence of the user input, you could also — if you have enough confidence; say there's enough utilization in the cluster, like you're at 30% capacity or something, and you have a pretty good idea what's happening — make sort of conservative decisions about what to do. Also, yeah.
B
Yeah, well, I think that the split itself is pretty cheap now. I think it flushes some queues, but in BlueStore, at least, the only work you actually do is splitting the PG log, which is just in memory — you know, it's a few thousand key-value pairs or whatever.
C
Yeah — a health warning, or some kind of polite telephone call, where we, you know, ask them — because this should not be particularly frequent. And if they've set up their system and said "I would like to provision, you know, a petabyte of RBD" and they exceed their petabyte, I think it would actually line up with expectations that we would complain — but at the same time offer them the solution, which is: we would like to adjust your pg_nums.
B
Yeah — a pool already has properties like target_max_bytes and target_max_objects that we use for the cache tiering stuff, but if there are, basically, user-input sizes that they set — which are like soft quotas, basically — then whenever that diverges from actual usage, we can just tell them, at least in the one direction.
G
Okay. Oh — some background: I work for Flipkart, here in India. We have deployed Ceph as an RGW — rather, as an S3 object storage cluster. We use it for a couple of business workflows: we have a CDN front-end to show a bunch of images that are stored as objects in the cluster, so that's one of our primary use cases, and we also use it to store, let's say, customer invoices when we do our e-commerce shipments. We have about a petabyte of data right now.
G
We sort of figured out that, okay, maybe we need to add a bunch of throttles on a per-account basis, so that we have some sort of multi-tenancy — some sort of QoS that we can guarantee — or rather, so that we put a limit on the amount of I/O that a particular account could hit the cluster with.
G
So our initial thoughts were mostly to put limits around the amount of data people could write, in terms of PUTs on the cluster, because we found that writing data or deleting data from the cluster was quite heavy compared to, let's say, just fetching the objects with GETs. So we did this, and we ended up adding multiple throttles around all the HTTP REST operations.
G
Okay — maybe thirty GETs per second, and maybe ten object PUTs per second. We count the ops in the RGW process_request path, and whenever the count crosses the limit, we immediately respond with a 503 Slow Down error code. We just sort of lifted that from what AWS S3 does today, primarily so that the clients users are currently using don't really need any changes to handle it — versus, say, the 429 Too Many Requests code, which would be more appropriate.
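The per-account fixed-window throttle described here can be sketched as follows. This is a hedged illustration — the class and method names are invented, not Flipkart's or RGW's actual code — counting ops per account and answering 503 (S3's "Slow Down") once the per-window limit is hit.

```python
# Fixed-window throttle: a timer resets the counts each window (e.g. 1s),
# and requests over the limit inside a window are rejected with 503.
class WindowThrottle:
    def __init__(self, limit_per_window):
        self.limit = limit_per_window
        self.counts = {}

    def reset(self):
        """Called by a timer at the start of every window."""
        self.counts.clear()

    def admit(self, account):
        """Return an HTTP status: 200 to proceed, 503 ('Slow Down') to reject."""
        n = self.counts.get(account, 0) + 1
        self.counts[account] = n
        return 200 if n <= self.limit else 503
```

As the speaker notes below, the weakness of this scheme is the hard reset: a client can burn the whole window's quota instantly and then sit idle until the timer fires.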
G
So if we give a user, let's say, a limit of 30 GETs per second, we end up dividing it across our RGWs, and we have a load balancer in front that does a pretty decent job of distributing the requests, so we are relying on it not to target a disproportionate number of requests at a single RGW. So far it's working okay. We would probably extend this to put in a global limit — another, global, counter.
G
What period do we use? We let users specify that. For example, when we onboard a user, we discuss their use case with them, and if we see that, okay, some folks want, let's say, 10 PUTs per second — okay, we have a timer that runs every second and resets the counts. Some other accounts are okay with, let's say, a timer that gets reset every minute. I agree it's not really a great way to do that.
G
That's why a leaky bucket might make more sense: with the timer, for example, you could burn all your quota in the first couple of seconds and then just idle, waiting for the limit to be reset at the end of the minute, whereas with a leaky bucket you would slowly get some more capacity — or rather, some more requests — back into your quota. Yeah.
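A hedged sketch of that leaky-/token-bucket alternative (illustrative code, not anyone's production implementation): capacity refills continuously instead of snapping back at window boundaries, so a client that burns its quota early regains requests gradually.

```python
# Token bucket: tokens refill at `rate` per second up to `burst`; each
# admitted request spends one token.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = float(rate)    # tokens added per second
        self.burst = float(burst)  # maximum bucket size
        self.tokens = float(burst)
        self.last = 0.0

    def admit(self, now):
        """Try to spend one token at time `now` (seconds); True = allow."""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Compared with the fixed window, a burst still gets rejected once the bucket is drained, but capacity trickles back immediately rather than all at once at the next reset.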
G
It's quite minimal right now. The changes are pretty much restricted to the RGW process_request function: we invoke a function to actually check the current count against the limit, and then we have another bunch of functions that read the limits from a config file, for now. Ideally, we would like to define the limits through the radosgw-admin command, like we define object quotas.
G
We did this in a bit of a hurry, so we decided to just dump all the limits into a config file and then parse it whenever the RGW starts up. So if we were to take this to completion, yeah, we would change that — we'd add a radosgw-admin command to specify and modify the limits.
I
And as well, it would be wonderful for you to bring it to the RGW standup, if you have an interest in joining our upstream collaboration — we actually have daily standups, you know; we have constant upstream communication where you can discuss things, yeah.
I
It feels to me like, if AWS S3 did this, and it was intended to be programmatically updated, it would end up being a bucket or user policy — or at least I would expect so. But there are also an increasing number of, I guess you'd call them APIs, that are just in the control panel, and it's not always obvious how and where they're materialized.
I
So we've expected that, you know, for new things like this, we go out and try to come up with — if at all possible — a policy-based extension, a grammar extension, that allows us to stay in line as much as possible with the way things are done in AWS, even if we're moving a little bit outside what it does.
B
And this maybe goes without saying, but if you post a pull request with the current code that you're using, that will definitely generate some commentary; we can look at what you've done and what we'd want to change before merging upstream — whether we want to drop in the leaky bucket or not, or whatever. — Sure, we can do that pretty much immediately. — Great, yep. This is great, and this is definitely something I've heard from other service providers — that they hit similar issues.
B
I think the last time it came up, Robin from DreamHost was talking about what they're doing: they're using an HAProxy in front of radosgw, and whenever they identify, like, a single bucket that's getting hammered or something like that, they use it to install exceptions that direct that bucket to a specific gateway, so that it doesn't affect other workloads — or maybe they do some other things too; I don't know all of what they're doing, but yeah.
G
So the way I described the idea in an internal post was that we wanted to implement something that looked like a circuit breaker: okay, if an OSD does go latent, then we would stop targeting I/Os at that OSD and fail our RGW requests much earlier, and maybe once the circuit breaker timer expires, you would check its health again to see whether it was able to serve I/Os normally. So this looks pretty much like the circuit breaker design pattern.
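The circuit-breaker pattern as described can be sketched like this (a hedged illustration with invented names, not the proposed RGW code): after enough failures the breaker opens and requests fail fast, and once the timer expires a probe is allowed through to re-check the OSD's health.

```python
# Per-OSD circuit breaker: closed = send normally; open = fail fast until
# reset_timeout elapses, then let a probe request through.
class CircuitBreaker:
    def __init__(self, failure_threshold, reset_timeout):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None = closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Open: fail fast until the timer expires, then allow a health probe.
        return now - self.opened_at >= self.reset_timeout

    def record(self, ok, now):
        if ok:
            self.failures = 0
            self.opened_at = None  # probe succeeded: close the breaker
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
```

Failing fast like this keeps RGW worker threads from piling up behind one latent OSD, which is exactly the failure mode discussed later in this conversation.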
B
That said, there are probably still going to be situations in the future where there is a problem with one or a small number of OSDs, and it would be nice to have that not eventually make too many requests pile up on that one OSD and then DoS the RGW as well. But when that happens, it's not necessarily the whole OSD — it's usually a placement group that's actually problematic, and not the whole OSD. Well, I guess it can be both.
B
So it's a little bit tricky. There is a new RADOS backoff mechanism for the PG case: when a PG is blocked — say, peering doesn't complete or something like that — there is feedback in the protocol to the client, in librados, so that it knows not to send requests. It might be possible to take that information and surface it up to radosgw in a way that lets it see that a request is going to be blocked for the foreseeable future, and behave accordingly.
B
So it's in luminous — the backoff code went in, I guess, in, like, February or something — in the librados client. Okay. It's not exposed through the librados API; it's internal to the Objecter code inside of librados, but it knows which PGs not to send requests to, because they're blocked, effectively. Okay.
B
Probably me — you can reach out to the others as well — but I would probably focus your initial efforts on validating and moving to luminous and resharding those large index objects, because that's going to make a whole group of problems go away, not just this one problem. Once you have that resolved, it might be that this is less of a pressing issue, and there are things that are sort of higher-value to spend your time on. Yeah.
I
I think there might be interest — yeah, CC the librados folks, since this could have an impact on the RADOS backoff. As we were just discussing in the back channel, we will be talking about exposing more throttle information; this is related, as part of the work to get a unified event loop for the top half and the RADOS bottom half of RGW.
B
Okay — the thing to keep in mind with the backoff is that it's only used in certain situations by the OSD. By default, the OSD will only send backoffs when, like, peering is stuck. It also has an option to trigger a backoff whenever an object is undergoing recovery, but it mostly does that just so that we get very heavy exercise of that code during QA; I don't think in a real situation you'd ever actually want to do that.
B
B
D
B
I think so. There's a specific root cause in this case, the large RGW index objects, and that just needs to get fixed. But in general, it's possible that something goes wrong with RADOS and you end up stuck. So the scenario you could imagine is, you know, one PG gets stuck peering for some reason, there's a broken something, I don't know, something happened in RADOS so you've lost availability, and RGW will happily keep going.
B
But every, you know, hundredth request will happen to touch that PG and block indefinitely, and eventually those will pile up and consume all of the threads in your worker pool, so eventually RADOS Gateway will get stuck, even though it's only one percent of the data that's unavailable. So having a way to detect that situation and mitigate it is still a good thing, I guess.
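The arithmetic behind that starvation scenario can be put in rough numbers. This is a back-of-the-envelope illustration, not Ceph code:

```python
# Illustrative only: with a fixed pool of worker threads and some fraction of
# requests blocking indefinitely, every thread is eventually consumed even
# though most of the data is fine.

def requests_until_starved(num_threads, blocked_fraction):
    """Expected number of requests served before all threads are stuck,
    assuming each blocked request permanently occupies one worker thread.
    On average 1/blocked_fraction requests pass per thread lost."""
    return round(num_threads / blocked_fraction)

# With 100 worker threads and 1% of the data unavailable, roughly 10,000
# requests are served before the gateway stops responding entirely.
```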
G
B
H
Before I start talking, I just want to say that my English is not very good, so please forgive me if I can't express myself very efficiently. Okay. During the past year, we experienced some kinds of disasters, like man-made operational errors and network problems, so we think that we need a real-time cross-cluster replication mechanism, so that when one cluster goes down we can quickly switch to another one and keep the upper-level systems running smoothly.
H
Now we are using RBD, and we are planning to expand in the near future. And not only do the RBD clients need this replication mechanism; CephFS clients also demand this data replication. So we think maybe we can implement some kind of RADOS-level replication and just do this once and for all, so that the upper-level systems don't need to do the job separately.
H
At our cluster scale, we think that this may be a little difficult, because we can just replicate the repops to another cluster, but in the presence of OSD failures, how do we make sure that this replication still works correctly? And there is another issue.
H
Like with our RBD use, some RBD operations may cross multiple objects. So how do we ensure consistency? For example, one operation involves object A and object B, and we know that RADOS can make sure that operations from the same client targeting the same object are done in the sequence they arrived; but when they come from different clients, or they are targeting different objects, this sequence is not guaranteed. So how do we maintain this consistency for cross-object operations?
B
H
Our approach is like this. For the first one, the OSD failure case, we think that we can make the replication go right in the presence of OSD failure if we can make sure the journal, the OSD journal, is only deleted, removed, or overwritten when the corresponding repops have been replicated to the other cluster. And during the recovery phase, the recovery source does something before it starts pushing a missing object.
H
It makes sure that all repops related to this missing object are replicated first, and then it starts pushing this object. If this can be ensured, we think the first problem can be resolved this way. I don't know if we are going the right way, I think.
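The trimming rule being described, where a journal entry may only be dropped once its op has reached the backup cluster, could be sketched as follows. This is a hypothetical illustration; it is not the actual OSD journal interface, and all names are invented:

```python
# Hypothetical sketch of "trim only after replication": journal entries are
# pinned until the backup cluster has acknowledged the corresponding op.

class ReplicationGatedJournal:
    def __init__(self):
        self.entries = {}          # seq -> op
        self.replicated = set()    # seqs confirmed on the backup cluster

    def append(self, seq, op):
        self.entries[seq] = op

    def ack_replicated(self, seq):
        self.replicated.add(seq)

    def trim(self):
        """Drop only entries whose ops are safely on the other cluster;
        return the seqs still pinned by pending replication."""
        for seq in [s for s in self.entries if s in self.replicated]:
            del self.entries[seq]
        return sorted(self.entries)
```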
B
I think that, at a high level, that makes sense. As long as you only replicate something to the secondary cluster after you're sure that the first cluster has it fully durable, that makes sense, yeah. I would be careful about assuming that it's the same journal that FileStore currently implements, because that changes when you start looking at different OSD back ends.
H
H
H
To talk about the second question again: first, we think that we should forward the object ops within the same object-set operation to the same intermediate node, and the intermediate node then replays these ops to the backup cluster only when two conditions hold. The first is that all object-set operations before this one are replicated to the other cluster; the second is that all repops within the same object-set operation have arrived at the intermediate node. When these two conditions are true, the intermediate node starts forwarding the ops to the backup cluster. We think that if this is ensured, the second question is also resolved.
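That two-condition forwarding rule could be sketched like this. The class and method names are invented for illustration, not any real Ceph interface:

```python
# Hypothetical sketch of the intermediate node: buffer the ops of each
# object-set operation and forward to the backup cluster only when
# (1) all prior operations have already been forwarded (replication order)
# and (2) every op of this operation has arrived.

class Sequencer:
    def __init__(self):
        self.pending = {}          # op_id -> {"expected": n, "ops": [...]}
        self.next_to_forward = 0   # enforces forwarding in operation order

    def receive(self, op_id, expected, op):
        entry = self.pending.setdefault(op_id,
                                        {"expected": expected, "ops": []})
        entry["ops"].append(op)

    def try_forward(self):
        """Forward complete operations, strictly in order."""
        forwarded = []
        while self.next_to_forward in self.pending:
            entry = self.pending[self.next_to_forward]
            if len(entry["ops"]) < entry["expected"]:
                break              # condition 2 not met yet
            forwarded.append((self.next_to_forward, entry["ops"]))
            del self.pending[self.next_to_forward]
            self.next_to_forward += 1   # satisfies condition 1 for the next op
        return forwarded
```

Note how a later, complete operation still waits behind an earlier, incomplete one; that is what preserves cross-object ordering at the backup cluster.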
B
If I'm understanding right, what you're essentially describing is transactions, where the RADOS clients would say "this group of operations needs to all be done together before they get replicated": some sort of transaction concept. Yeah. So going down that path to have some sort of transaction, that's one direction we can go, but a couple of things about it worry me.
B
One: all the current RADOS clients don't operate in terms of transactions, so they would have to be rewritten or modified to do that. The second thing, though, is that it's not really atomicity that usually matters; it's ordering.
B
Say there's some application, let's say it's Oracle, that has two RBD block devices; it will write something to the journal block device, and once that commits, then it'll make some update to its other block device. And it's the ordering that matters: the journal change has to be sent to the remote cluster before the other. They don't have to be sent together; they just have to be applied in the correct order.
B
You know, and I don't know that this mechanism would capture that; it might be able to be modified so that it would. But I want to take a minute, because when we've talked about this in the past we've had a different approach, and we spent a fair bit of time thinking about how this could work. I don't know if we ever wrote this stuff down somewhere, but the basic idea was to have a series of checkpoints in time that are essentially consistency points across the source cluster. Say you had a checkpoint every, you know, 10 seconds or something, just for the sake of argument; then on the remote cluster, at that particular point in time, everything in the source cluster was captured as a point-in-time checkpoint.
B
J
B
J
K
So Ricardo and I had a conversation with Sage not too long ago. Well, it feels like ages now. We have a pad, an Etherpad, describing a sort of similar approach to some extent, in which we have a daemon that will basically act as a proxy and as a sort of sequencer for the operations. I'm sorry.
Sorry.
K
K
Ricardo is on vacation now, and he would likely be the best person to argue the case. But one of the things we considered was having maybe a quorum of these daemons that would take the operations: the OSDs would push the operations into these sequencers, and maybe have a sort of snapshot that would basically be decided by these proxies.
K
H
B
That's the problem; I think it's a complicated topic, and it's hard to cover in a short period of time. Maybe we can just share a couple of links. I pasted one into the chat; that's the output of the released code from the clinic project. This is a summer, or no, year-long project that some grad students did looking at the time-synchronization problem. So I think you can broadly divide this whole piece into two parts.
B
If you have better clocks, if you have, you know, atomic clocks or something like Google does, then great: it takes basically zero time to do that. And if you have really bad clocks, you have a longer stop-the-world. But the idea is that it has to work in any environment. Okay, there was actually a written report that they delivered as well. I don't know, Greg, do you know if that's actually posted somewhere?
E
B
K
B
H
L
L
Here are some updates on the shared read-only cache for RBD. Let me try to give you some background. Initially we were working on the write-back SSD cache for RBD, and at the June CDM, Jason and Sage said that write-back is too difficult with respect to consistency, so we might look at the read-only cache first, which should be much easier. So we switched to the read-only cache, and the design choice was a standalone caching library that can be reused between RBD and RGW.
L
Currently we are focusing on two use cases here. The first one is the librbd shared read-only cache: basically, if you have a parent image and lots of clone images, those clone images can read from the parent image cache before copy-on-write happens. Currently we have some code that generally works, but I'm still trying to make it more robust.
L
The second case is the RADOS Gateway data cache, for which there is an existing pull request, but that pull request is against Jewel and it needs some work to clean up. Okay, so that's the current status. Let me try to give you some details of the design; this is the general architecture.
L
Basically, there will be three parts. The first one is that there will be a cache file library, a common library that does read/write on the SSD. For now we are using a file-based approach, so there will be something like a FileStore design: many small 4 MB objects among those SSDs.
L
L
L
Okay, so this is the overview of the shared read-only cache for RBD. Basically, on each host there will be a shared cache file, which is actually the contents of the protected snapshot, and for each clone image, if no copy-on-write has happened, the reads can actually be serviced from that shared cache file.
L
Yeah, this is the cache metadata design. Currently we are actually using a uint64, 8 bytes, for the metadata: there are two bits indicating whether a block is in the shared image cache or in its own cache, two state bits, and the block ID.
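A packing scheme along those lines could look like the following sketch. The exact field layout (which bits hold what) is an assumption for illustration, not the actual cache code:

```python
# Illustrative packing of per-block cache metadata into a single uint64:
# two location bits, two state bits, and the block ID in the low 60 bits.
# The field positions are assumed, not taken from the real implementation.

LOC_SHIFT, LOC_MASK = 62, 0x3      # shared image cache vs. own cache
STATE_SHIFT, STATE_MASK = 60, 0x3  # per-block state
ID_MASK = (1 << 60) - 1            # block ID

def pack(location, state, block_id):
    return ((location & LOC_MASK) << LOC_SHIFT) \
         | ((state & STATE_MASK) << STATE_SHIFT) \
         | (block_id & ID_MASK)

def unpack(word):
    return ((word >> LOC_SHIFT) & LOC_MASK,
            (word >> STATE_SHIFT) & STATE_MASK,
            word & ID_MASK)
```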
B
L
B
L
Yeah, so all right, let me try to present this page. This is the read flow: basically, for each read there will be a cache lookup first, checking if it's in the cache.
L
L
L
B
L
L
Okay, so here are some initial results. This was tested on one node with two OSDs, and for the baseline it is tested without the SSD cache. By the way, this is an SSD cluster, so the baseline here is not that bad. For the second row we used the read-only caching, and we can see the IOPS increased a lot, and also the tail latency and the average latency reduced a lot.
L
L
M
M
L
L
L
L
B
B
That might make sense. I haven't thought about this too deeply, but the way that I was originally hoping this could be done was that this would effectively be two different caches. So you would have a shared image cache, a shared cache that's on the parent image, and you'd have multiple processes in memory that have their own sort of view of that.
B
That cache is just for the immutable image, and whatever code wrapped that up would be able to be reused in the RADOS Gateway. And then there'd be a separate cache, which probably worked very differently, for the write-back or whatever we eventually do for the per-image copy-on-write data cache, rather than having the two combined.
B
So, for example, you wouldn't have those two state bits; you wouldn't have a single lookup table that would track state in both caches. Instead, the read path would just say: this block doesn't exist in the child image, so I'm going to fall back to reading from the parent image; the parent image has a cache on it, so let me go look it up in the cache and see if it's there, and if that misses, go to the cluster.
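That two-cache read path could be sketched as below; this is a toy illustration of the lookup order, with dictionaries standing in for the real caches and a callback standing in for the RADOS read:

```python
# Toy sketch of the layered read path: child image first (copy-on-write data),
# then the parent image's shared read-only cache, then the cluster.

def read_block(block, child_blocks, parent_cache, fetch_from_cluster):
    if block in child_blocks:          # copy-on-write already happened
        return child_blocks[block]
    if block in parent_cache:          # shared read-only cache hit
        return parent_cache[block]
    data = fetch_from_cluster(block)   # miss: go to RADOS
    parent_cache[block] = data         # populate the immutable cache
    return data
```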
B
L
B
I don't want to... we should probably look at what you've written first, because there are like a million different implementation details that come in as soon as you actually start trying to write this. This is just sort of what we were thinking, I think, what we talked about at some point in the past. So let's look at the implementation and see what you have right now.
B
The way I remember it, I think we would use a socket, so that somebody has to manage the LRU to retire things from the immutable cache; there'd be some socket with minimal coordination just to manage the LRU and evict things, but the getting and putting would be able to just read directly from, like, a filesystem on the SSD. I think that was the original idea. But again, things may change once you start implementing it.
B
You're going to come up with all sorts of reasons to do things differently, so sure, I don't want to tell you how to do this, because you're the one who's actually doing the work. So let's look at it once you have that pull request. I mean, I wouldn't necessarily wait until you have every unit test passing or whatever to publish it; I would just publish it, and we can review the design and approach. That would be good, before we get too far along.
B
L
B
B
B
F
B
N
B
I think, probably, the next step is just to review this. This is adding a new set of functions that includes sub-chunks. Is it also changing the... I think the way that we were hoping to make this transition was to introduce the new set of calls that pass in a list of sub-chunks, and then we implement the old calls in terms of the new calls, so they just pass in a single sub-chunk that covers the whole thing. Yes.
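That transition strategy, implementing the old whole-chunk call on top of the new sub-chunk call, can be sketched generically. These function names are illustrative, not the Ceph erasure-code plugin API:

```python
# Generic sketch of the API transition: the new-style call takes a list of
# (offset, length) sub-chunks, and the old-style whole-chunk call is
# implemented on top of it by passing a single sub-chunk spanning everything.

def decode_subchunks(chunk, subchunks):
    """New-style call: return only the requested (offset, length) ranges."""
    return [chunk[off:off + length] for off, length in subchunks]

def decode(chunk):
    """Old-style call, expressed in terms of the new one."""
    (whole,) = decode_subchunks(chunk, [(0, len(chunk))])
    return whole
```

The design point is that only one code path does the real work; callers of the old API silently go through the new one.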
N
N
B
Okay, I think that's fine. I'm not super familiar with this code, I'll be honest; I think Josh and Mike are more familiar.
B
But I think, yeah, the next step is just to review this, and we will eventually want to rebase it so that it doesn't have the merge commits in there. We'll want to review carefully, get it tested, and merge it, and then I think the next step is that you'll be able to do your follow-on changes, adding the new code on top of it. We've just been busy with the luminous release, so we haven't been paying much attention to these.
B
N
Yeah, we would like to submit it to a particular conference, and we think that it would be nice to have this as part of Ceph at that time. That's one reason we are thinking of pushing it. So if you want us to do any changes, maybe after your luminous release you can talk to us, and we will be glad to do this.
B
O
C
C
So I guess that part's pretty simple and uncontroversial, and I think the big unknown at the moment is whether we want to implement Prometheus endpoints on individual Ceph services, or whether we want to just say to people: if they want to get their stats directly from daemons, then they do it with an agent.
C
So they run something like collectd or Diamond or whatever on their Ceph servers, and that then influences how we expose the service-discovery stuff from ceph-mgr to Prometheus. Because if they're using an agent, we just need to give Prometheus the list of host names; whereas if we're talking to individual Ceph services to get the Prometheus stats, then we need to tell Prometheus about all the individual daemons. So I throw that out to the room.
C
You're muted. Sorry, I always do that. I haven't thought about this too much, because I had kind of optimistically assumed that we could give Prometheus the addresses, usually the three addresses, of wherever the managers are running, and that it would just succeed in talking to whichever one happened to be up. But we should test that theory, because I guess there's no guarantee that's actually how it works; I'm just kind of assuming it does what seems sensible.
O
B
C
C
If you have a manager server that is actually offline, in a situation where you've got, like, two standbys and an active one, it means that if Prometheus tries to connect to a standby, it will get redirected to the active one, which could actually be a bad thing, as we just discussed, if it doesn't like getting duplicate metrics. So maybe we actually wouldn't want to enable that for Prometheus itself.
C
C
So the service-discovery stuff would obviously be getting served from the manager, so something has to have told Prometheus how to talk to the manager to begin with. But if the service discovery is happening via a script on the Prometheus node that is fetching from the manager and then writing out to a local file, it could be that Prometheus itself learns about the managers from that mechanism, and the initial input goes to that script rather than to Prometheus originally.
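That script-based flow could be sketched as below. The manager addresses here are a hard-coded stand-in for whatever fetch step is actually used, while the output follows the JSON target-group format that Prometheus's file-based service discovery (file_sd_configs) reads:

```python
# Sketch of a discovery script: fetch the list of ceph-mgr addresses
# (stubbed out here) and write them in Prometheus file_sd JSON format.
# Prometheus watches the output file and picks up changes automatically.

import json

def write_file_sd(mgr_addrs, path):
    groups = [{"targets": mgr_addrs, "labels": {"job": "ceph-mgr"}}]
    with open(path, "w") as f:
        json.dump(groups, f)
    return groups

# Usage: write_file_sd(["10.0.0.1:9283"], "/etc/prometheus/ceph_sd.json")
```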
B
So the question I had: there's another question you threw out to the room earlier, about whether the Prometheus piece would be scraping from daemon targets, or whether you'd just give it host names and it would talk to an agent. Presumably there's going to be an agent of some form, because you also want to be collecting CPU and disk and host metrics, all those other metrics, also.
O
B
Yeah, I mean, I wonder if, if it's node_exporter, then it probably only knows how to export host information, right? It wouldn't know how to also fetch the daemon information for you. If it were collectd, that's a more general, pluggable thing, so you could get both the CPU information and ask it for the Ceph daemon information, right?
O
Well, there is a textfile exporter that's included in the node exporter, so you can basically just create a text file in a certain directory, and the node exporter will expose that too. Actually, we have a few things set up through that mechanism, for RBD and SMART data, for example, but I would guess that's not too performant in the long run, if you basically run a script and pipe the output into a file.
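The textfile collector mechanism mentioned here works by periodically writing metrics in the Prometheus text exposition format into a .prom file in the collector directory, which node_exporter then serves alongside its own metrics. A minimal rendering helper might look like this; the metric names are invented for illustration:

```python
# Render metrics in the Prometheus text exposition format, suitable for a
# .prom file consumed by node_exporter's textfile collector.

def render_textfile(metrics):
    """metrics: list of (name, labels_dict, value) tuples."""
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join('%s="%s"' % (k, v)
                             for k, v in sorted(labels.items()))
        lines.append("%s{%s} %s" % (name, label_str, value))
    return "\n".join(lines) + "\n"

# A cron job would write render_textfile(...) to something like
# /var/lib/node_exporter/textfile_collector/ceph.prom (path is an example).
```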
B
Yeah, okay, I guess I'm not too worried about scraping metrics directly from Ceph daemons. It seems like that's always going to be annoying, because even if it's something like collectd, you have to go configure collectd and tell it: these are the daemons, these are the admin socket locations, and all that. Right? Not really.
C
C
O
C
If we assume that the configuration of the agents is uniform, then we could make it so that the Prometheus module has an option, a boolean, to say "I'm running an agent", and then an int to say what port the agent listens on. That way, the Prometheus module would be able to take the list of Ceph server host names that it already knows, tag on the port that the user has configured, and then tell Prometheus how to go talk to the agents based on that.
C
B
I mean, I think when Dan was setting the default port for Prometheus, there's some wiki where you just assign your own port so they're all unique, so we have a Ceph port assigned. We could just pick a second port that's the Ceph host agent; if you happen to be running that, you just install the package, and it would just serve up everything it knows about Ceph, yeah.
B
C
I haven't been proposing writing our own little agent, but it's not a crazy idea. I don't know; I think some people get kind of a warmer, fuzzier feeling if we're plugging into something generic. But then, if node_exporter is enough for the generic stuff, we're not throwing anything out if we just write our own agent that does the Ceph stuff as well. Exactly.
B
A
B
That would also direct the service-discovery thing to tell Prometheus to scrape the host-level information from the normal host agent, because I think in most configurations you just have the default Prometheus host thing, but you'd want Ceph to make sure that all its hosts are being scraped, right?
B
C
Yeah, yeah; we need to go and check the service-discovery format and see. I guess you'd say, like, host and then the node exporter port, comma, the Ceph port; presumably you can tell it multiple ports. Wait a second: the only caveat to that is that I haven't checked how good the disk stats are that you get out of node_exporter.
C
P
P
P
P
Yeah, there's only one thing I'd like to add, but it hasn't been merged, and that's the rbd-ggate daemon, because that is actually the thing that got me started porting these daemons to FreeBSD, so that I can run a device that actually has an RBD image and run, like, virtual machines on it. So do you think I should just submit a PR for the release notes and just squeeze it in there?
B
Yeah, I mean, you can definitely submit a PR for the release notes. I'm not sure whether ggate makes sense to merge before we release; I haven't looked at it, so I don't know if it's something new that can break something else, or if it's totally separate.