From YouTube: Ceph Developer Monthly 2020-11-04
A
All right, let's get started. Welcome to the Ceph Developer Monthly for November 2020. Today we've got three topics on the agenda. First, we'll start off with discussing manager scalability, and specifically how to measure and profile the problems that we're seeing, so here's the etherpad link for that.

A
Then I'd like to hear what folks' thoughts are, what kinds of ideas folks have for how we could measure which parts of the manager are causing bottlenecks or being slow.

A
Maybe we should start with some context around some of the problems we've seen.

A
I guess we've seen a few different symptoms. One is simply high CPU usage, with the manager backing up in the finisher queue, where it runs completions or runs calls into the different modules when there's a new map update. That can get backed up with millions and millions of completions and therefore cause a bunch of latency until all of those are processed.

A
Another symptom we've seen is some calls in the manager taking a very long time in a particular module, either in C++ code or in Python code, but either way holding the global interpreter lock or holding some manager lock at the time, which prevents other manager commands from being responsive or making progress.

A
Those are kind of the few symptoms we've seen. I'm sure we'll see more as we continue to see larger-scale deployments, with the manager trying to support ever more OSDs or other daemons that report information to it.
A
We talked about some simple things we could do, listed in the etherpad, like automatically tuning the reporting intervals, so that we scale back how many reports we're processing from OSDs if we're too busy. That's kind of the workaround that we have for now, just increasing those reporting intervals, so at least if we could do that automatically we could avoid some of the issues in the short term.

A
There have also been some improvements to the different modules to make them more efficient, in the way that they get the data they need from the maps more directly, or operate more in C++ instead of going through a conversion to a JSON string first and then converting that into Python objects, which can be kind of expensive if you're doing it many hundreds of thousands of times a second.
C
Yeah, and one of the biggest challenges that we've seen with the manager and manager modules is debuggability. It's a big challenge at this point; it's difficult to say which module is causing a problem, since generally it's just the manager as a whole that's backed up. A couple of items on that etherpad are short-term items that will help us add things like, you know, metrics as to which module is using how much of the queue depth, or other things like that.

C
Currently there's no way to figure it out; it's mostly hit and trial. We try to turn modules off and on and then see which one is problematic. In the past we've had the balancer module causing issues, and recently we also got to know about the progress module. The core of the problem remains the same, but there are some short-term things that we can do to make it easier to debug things and also to control things and act upon it.

C
Like, you know, controlling what stats we are reporting from the manager and the period at which we are doing it; we can always auto-tune those things. So in my mind those are short-term things that we can do and get a lot of benefit out of. Longer term, I think there are other items on that etherpad that have been listed that we can also go through.
A
Yes, you mentioned one that wasn't down there yet, which was trying to categorize, and maybe have perf counters for, the different types of things that are in the finisher queue, or to try to track which module they came from.

A
The flip side of that would be tracking how much time we're spending for a given completion or a given notify into a particular module.

A
Then you can see whether it's a case of particular calls being very expensive, or whether it's just the sheer number of them.
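A rough sketch of that idea follows (illustrative only, not actual ceph-mgr code; the NotifyStats class and module names are made up for the example): wrapping each module's notify call in a timer makes it possible to tell a few very expensive calls apart from a large volume of cheap ones.

    import time
    from collections import defaultdict

    class NotifyStats:
        """Accumulates per-module call counts and latencies (hypothetical helper)."""
        def __init__(self):
            # module name -> [calls, total_seconds, worst_seconds]
            self.stats = defaultdict(lambda: [0, 0.0, 0.0])

        def timed_notify(self, module_name, notify_fn, notify_type, notify_id):
            start = time.monotonic()
            try:
                return notify_fn(notify_type, notify_id)
            finally:
                elapsed = time.monotonic() - start
                entry = self.stats[module_name]
                entry[0] += 1
                entry[1] += elapsed
                entry[2] = max(entry[2], elapsed)

        def dump(self):
            # The kind of summary that could be exposed as perf counters.
            for name, (calls, total, worst) in sorted(self.stats.items()):
                avg_ms = total / calls * 1000.0 if calls else 0.0
                print(f"{name}: {calls} notifies, avg {avg_ms:.2f} ms, "
                      f"worst {worst * 1000.0:.2f} ms")

    if __name__ == "__main__":
        stats = NotifyStats()

        def slow_module_notify(notify_type, notify_id):
            time.sleep(0.002)      # stand-in for an expensive per-map-update handler

        def cheap_module_notify(notify_type, notify_id):
            pass                   # stand-in for a cheap handler

        for i in range(50):
            stats.timed_notify("slow_module", slow_module_notify, "osd_map", str(i))
            stats.timed_notify("cheap_module", cheap_module_notify, "osd_map", str(i))
        stats.dump()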
D
Do we have a particular threshold where, you know, problems like this start to kick in? Is there a recipe, so that we can talk to the user base and say, look, you guys are in the danger zone? They could be the first people to help us, hopefully, understand any kind of changes and their impact.

A
We could probably make these happen artificially as well, by creating more PGs than we would for a much smaller cluster, or by decreasing the reporting intervals so that we have many more updates happening.
B
Yeah, I mean, personally it seems like getting some metrics around things is the first step to knowing where the issue is. And when you talk about serializing data structures like your PG map updates or whatever between C++ and Python, last time I looked I thought those were done pretty trivially, because they just straight up convert everything, you know, not on demand.

B
They just straight up do it, as opposed to building out their own sub data structures that are more like view data structures that know how to get the data on demand, as opposed to right now, where we build it all out first and then we pass it down for everyone to look at. But right.
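To illustrate the distinction being made here (a sketch under assumptions, with made-up names, not the real C++-to-Python binding): an eager converter pays the full conversion cost for every consumer, while a view-style wrapper only fetches the fields a module actually asks for.

    import json

    class EagerMap:
        """Stand-in for converting the whole native map up front (JSON round trip)."""
        def __init__(self, native_map):
            self.data = json.loads(json.dumps(native_map))

        def get(self, key):
            return self.data[key]

    class MapView:
        """Stand-in for a lazy view: fields are fetched only when requested."""
        def __init__(self, fetch_field):
            self._fetch_field = fetch_field   # would be a targeted native accessor in practice
            self._cache = {}

        def get(self, key):
            if key not in self._cache:
                self._cache[key] = self._fetch_field(key)
            return self._cache[key]

    if __name__ == "__main__":
        native_map = {"epoch": 42, "num_osds": 5000, "osds": list(range(5000))}
        eager = EagerMap(native_map)                  # converts all 5000 entries even if unused
        view = MapView(lambda key: native_map[key])   # converts only what is asked for
        print(eager.get("epoch"), view.get("epoch"))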
B
You have to have the metrics to know where things are spending the time; you know, as it launches into the interpreter to invoke a function on something, start collecting time in and time out, calculate latency metrics, and expose them somewhere. It says here, you know, the top tool or whatever, but you could probably expose it even faster right now with some perf stats or something like that, without having to get too fancy with it, because it is kind of a debug-level tool as opposed to something at the user level. If we were doing our job right, a user should never have to care about it, right?

E
It doesn't seem too invasive to me, and it's not like sensitive data, so yeah, maybe we can collect this as well.

E
No, well, we collect them daily for clusters that have opted in, so we receive them daily. So I don't know how, I mean whether, it makes sense to collect this specific data, or which channel we would want to add it to.
D
So, instead of looking at the whole data set, you can zero in on the ones that have that flag set, that have got a problem, experiencing, you know, the issues that we're seeing with the manager. You know what I mean? So there's something specific that we can look at. I don't know whether that exists in the telemetry now, or whether it would just be, you know, not being able to see the wood for the trees in the data.

A
Yeah, we don't have that kind of warning right now, but I guess one of the symptoms that I can see is commands to the manager being slow. So that could maybe fit into, like, a manager slow-ops warning, perhaps.

F
Also, adding the module name that is causing that issue would be useful.

A
Yeah, definitely. If we can look at these paths internally and report kind of what the potential blocker is, that would be fantastic.
B
Well, I mean, personally I think it'd be kind of interesting, at the end of the day, to be able to do an active/active manager, where it's not that a given manager is active for every module; it's more that individual modules can be voted to be active in a given manager. Or maybe the first step is not even voting; it's just saying, you know, instance X, you're responsible for this module, this module, and this module.

B
Because when it comes to the dashboard or Prometheus or something like that, the ability to scale them out... I mean, the dashboard right now is active/passive, but being able to even do that, where you put it behind a load balancer and say I've got three instances running or something like that for HA, would be great.

B
But I guess it all comes down to, at the end of the day, that we treat the manager as something we just feed a fire hose of information, and if you scale the problem out that doesn't necessarily help, because you're still feeding the exact same fire hose: instead of all the OSDs reporting to one manager instance, they're now reporting into X number of manager instances, and each of those X manager instances needs to process and pre-process the exact same amount of data.
A
Yeah, longer term we might need to, you know, scale out the modules themselves, potentially, if some of them could be easily parallelizable, like collecting things for Prometheus all at once, which is embarrassingly parallel, right.

D
So are we still suffering from Prometheus scale issues? Because the last I sort of heard from Patrick was that he'd quote-unquote fixed that.
D
I think the fix was to put the data gathering in a separate thread, and then when the request comes in to do the scrape, it just basically takes it from the cache. So instead of it being an instant thing that has to go and collect and do all the calculations and so on, he's offloaded that: he's just populating the cache at an interval, and then every scrape that comes in just refers to the current contents of the cache.
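A minimal sketch of that caching pattern, assuming nothing about the actual mgr/prometheus code beyond what was just described: collection runs on its own thread at a fixed interval, and each incoming scrape simply returns the latest cached payload.

    import threading
    import time

    class CachedCollector:
        """Gathers metrics in the background; scrapes read the cached result."""
        def __init__(self, collect_fn, interval=15.0):
            self._collect_fn = collect_fn
            self._interval = interval
            self._lock = threading.Lock()
            self._payload = ""
            thread = threading.Thread(target=self._loop, daemon=True)
            thread.start()

        def _loop(self):
            while True:
                payload = self._collect_fn()     # the expensive walk happens off the request path
                with self._lock:
                    self._payload = payload
                time.sleep(self._interval)

        def scrape(self):
            with self._lock:                     # called per scrape; just a cheap copy
                return self._payload

    if __name__ == "__main__":
        def gather_metrics():
            time.sleep(0.5)                      # pretend this walks PG/OSD stats
            return f"fake_metric {time.time()}\n"

        collector = CachedCollector(gather_metrics, interval=1.0)
        time.sleep(1.0)                          # give the first collection time to land
        print(collector.scrape(), end="")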
A
Okay, yeah. If that's enough to get it working well at larger scale, that would be fantastic, because that would be a lot simpler than trying to scale things out further. But that's a good example of offloading things and making things not require the interpreter lock to process.

A
Yeah, it sounds pretty interesting, so I think we'll probably know more once we can look at this telemetry data and do this profiling.

C
Yeah, talking of telemetry data, I'd be really curious to find out which modules, although we have them always on and enable different kinds of modules, which modules are actually being used by users, versus ones we turn always on and people just turn off. I'd be curious to know, and that may be one of the reasons we're not finding out about some issues from upstream users.

E
I think some of this data we already have, but I'll need to double-check that.
E
Yeah, we have a public dashboard that everybody can access; I just pasted it in the chat. And we also have a private dashboard that has some more information, like more drill-downs, where you can see the actual raw reports that we received. Here the data is aggregated in order to have the most privacy for the users.

E
Let me double-check that, Josh. I think we have it, but we haven't had any panels with that, but I'll get back to you on that.

E
Yeah, that's a very good question. So most of the data is anonymized, and there is no strict policy right now about sharing the data.

E
So there is cluster data and also device data, which we mostly want to have in order to build better disk failure prediction models, and that data can be anonymized, so it's easier to share. But about the clusters themselves: even though we do not collect anything which can be considered sensitive information, it's not open in the sense that you can just download it, but you can have access as a developer, of course.
A
Wasn't there a blog post made a while back that tried to analyze some of the things that were in the wider data set?

E
Yes, but the screenshots were taken from... oh, you mean whether he had access to that? Yes, he didn't have direct access to that, but he was working with pandas, and I think eventually he was using the dashboard itself, like screenshots from the dashboard.

A
All right, well, it seems like we have some data to collect here and a good plan for the short term. Anything else on this topic?

F
Yeah, on the dashboard, in the daemons column, I don't see the manager.

E
Yes, we do not yet collect the data about all the daemons. I'm working on fixing that as well; we have several missing.
H
Okay, hello Josh, hello everyone. This is Nisa from Intel. Okay, so today I'm going to talk about the replicated write-back cache. Okay, so let me present the slides first; is that okay?

H
Okay, okay, so today I'm going to talk about the replicated write-back cache in librbd. Okay, go to page one. So this is a write-back cache based on the image; that means it is a log-based, ordered write-back cache, and the cached data is stored on persistent devices, like persistent memory and SSDs.

H
Currently we have milestones: the first phase is to cache the data on persistent memory, and we can also cache data on SSD; that patch is ongoing. So in the first phase we cache the data as a single copy, and the second phase is to replicate the cached data across different devices in different servers; this is to guarantee redundancy. Okay, and in the second phase we will use

H
the PMEM device as a cache device, and to replicate the data we will use the remote PMEM device over an RDMA protocol. Okay, so this is an overview. Let me give more detail; please go to page three. This is the overview of our components.
H
Okay, so on the compute node, inside librbd, we will provide three components. The first component is the write log; this is to manage the cached data in the persistent memory device. The second part is the flusher; this part will flush the cached data to the OSDs.

H
Okay, and that is on the compute node. Meanwhile, we need to replicate the cached data across remote servers. For example, we can start a replica daemon service on a storage node or on other servers. So when librbd starts and enables the write-back cache, at that moment it can allocate the cache data locally, and meanwhile it can replicate the cached data to the remote server.

H
So let's go to, sorry, the page on the data layout on persistent memory. Okay, so for the cached data we have three parts. The first part is the root.

H
Okay, and the third part contains the customer data. So, for example, if a write request comes in, it will allocate space in the third part and copy the write data into that space, then it will insert a log entry in the second part, and after that it will update the tail in the root.
H
Okay, yeah, so this is the data layout on the PMEM device. And then this part is about the replicated write-back cache; before that it was mostly a single copy of the data on the librbd server. Okay, so how do we replicate the data? It includes three kinds of services. The first one is the librbd process.

H
We call it the master librbd. That means on the compute node the application needs to open the RBD image and do the reads and writes. Okay, and the second kind of service is the replica daemon.

H
The replica daemon services manage the PMEM device in their server and provide the cache replication for the master librbd. And the third service is the controller. Here I initially use the Ceph monitor; in fact we can create a controller. This controller will manage the status and the information of the replica daemons, so that later the master librbd can query the controller and ask for information about the replica daemons, so that it can find out where it can replicate its cache to.

H
Okay, so for the replicated write-back cache we will use active/standby mode.
H
Okay. Here I list the main functionalities for the replicated write-back cache. I've split it into three kinds of scenarios. The first is the normal I/O flow; this is the normal case. In this case, the cached data can be replicated across the local PMEM device and the remote PMEM devices.

H
Okay, and the second part is about the handling of failures. If something goes wrong in the master librbd, that means the master librbd may crash. If it crashes, the corresponding replica daemons, which track the status of the master librbd, will find that it has failed.

H
One replica daemon will acquire the exclusive lock first and then start to flush the cached data to the OSDs. Here I want to emphasize that when librbd wants to enable the write-back cache, it first needs to acquire the exclusive lock, and only once it gets the lock successfully does it start to enable the write-back cache.
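As a sketch of that failover rule (hypothetical names only; the actual exclusive lock lives in librbd/RADOS, not in this toy class): the replica daemon may only flush the cached writes to the OSDs after it has won the image's exclusive lock, the same lock the client had to hold to enable the cache.

    class ExclusiveLock:
        """Toy stand-in for the image's exclusive lock."""
        def __init__(self):
            self._owner = None

        def try_acquire(self, owner):
            if self._owner is None:
                self._owner = owner
                return True
            return False

        def release(self, owner):
            if self._owner == owner:
                self._owner = None

    def replica_failover(lock, daemon_id, cached_entries, flush_to_osds):
        """Run by a replica daemon once it decides the master librbd is gone."""
        if not lock.try_acquire(daemon_id):
            return False            # another replica, or a restarted client, owns the image
        try:
            for entry in cached_entries:
                flush_to_osds(entry)   # replay the replicated write log to the OSDs
            return True
        finally:
            lock.release(daemon_id)

    if __name__ == "__main__":
        lock = ExclusiveLock()
        flushed = replica_failover(lock, "replica-daemon-1",
                                   [b"entry-0", b"entry-1"],
                                   lambda entry: print("flushed", entry))
        print("cache flushed:", flushed)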
H
And the next case is failures in the replica daemons. In this scenario the I/O can continue while the master librbd allocates another copy in other replica daemons, recovers the replication, and then continues the I/O flow. I think this behavior can be made configurable in the future.

H
Okay, so let's go to the main techniques. One moment. Okay, so this part is the main discussion point I want to discuss today. To implement the above functionalities, we have three points to discuss. The first part is about the management of the replica daemons, and at first we would like to use the monitor.
H
Okay, so it means that the replica daemons report their status to the Ceph monitor, and the Ceph monitor maintains that information. Then, when a master librbd starts and the write-back cache is enabled, it needs to query the information about the available replica daemons from the monitor, and then it can find out where it can replicate the cached data to.

H
Yeah, okay, so any suggestions about this part?

B
The original idea was that there was going to be a fixed host, like host A in rack A and host B in rack B, and they were just going to be tied together at the hip and call it a day. Now you're talking about a system where librbd and these caches, these daemons, need to understand the entire topology of your network: which hosts have RDMA connections, potentially, between which other hosts, and which hosts have capacity. To me this is a huge creep of, you know, responsibility.
H
Yeah, I see. So do you mean that the Ceph monitor needs to include other information about the replica daemons, like whether they support RDMA connections?

H
Yeah, right, we also have the same worries. In fact, we considered putting the replica daemon on the same nodes as the OSDs; I mean, for the OSDs which can support RDMA connections, maybe we can put a replica daemon on the same node.

H
Further, the monitor needs to contain the information about the replica daemons: the status, the connection information, and the capacity information; that kind of information, right.

B
How do those images that were previously replicating their data to host A, how do they get those other sites, all those other nodes, up to date so that, you know, they can actually fail over? Because if you ever have a reallocation like that, you only have a partial view of the world for the write-back log, right? You can't just pick it up randomly and say go from here; you're missing a chunk of data from pre-failure, pre-reallocation.
H
I mean, for example, at first librbd allocates the cache copies in different daemons, so at first it needs to do initialization, and after initialization the RDMA connections are created. After that it will store such image cache information in its metadata. And once any failover happens, for example one replica copy fails,

H
the master librbd needs to find another replica copy and do recovery. I mean, after the failure it needs to do recovery, right, do the reallocation.
A
Yeah, that definitely needs a lot of the same things that the OSD does, like, you know, failure detection and backfill, that kind of thing. I guess maybe some of the main differences from what the OSD does today would be around the replication path being RDMA-based, avoiding the CPU, and probably doing something a bit different with respect to placement: not spreading the data around across different hosts, but doing something more like mirroring.

H
Oh, sorry, here I forgot to mention one point. Because we use persistent memory over fabrics to do the data replication, that means every cache copy has exactly the same data layout, so we will use RDMA verbs like RDMA read and write to do the data replication. That means the operation doesn't need the involvement of the remote server's CPU,

H
I mean, in most cases, most of the time, yeah.
B
Oh yeah, right, yeah. I understand that. It's all the corner cases, when you then start to say we want to be able to have this arbitrary failover daemon, whereas what this is talking about is the fact that you could use these PMEM I/O replicated offloading functionalities. You know, we're just trying to guard against the one case where you have a single failure, as opposed to now, where you're potentially

A
I get a little confused about the mirroring part. If you're maintaining the same layout, would you be, say, reserving a third of the PMEM on each node to be mirrors of, like, a set of three?

B
I mean, yeah, so they would essentially be 100% mirrors. If you're saying here's my primary host and here's my replica host, so I just have one backup, then yeah, whatever is currently in the log on host A would also be in the log on host B.
H
I mean, one one-gig cache on the remote PMEM device, and the other one-gig local PMEM device, will be mapped into the application, and then the data in the two memories will be exactly the same.

H
For example, we allocate the customer data in the third part, say at LBA 1000, and write the data to that location. In the remote memory it works the same: it stores it at the same LBA 1000, and then, yeah.
B
It's just the management part, you know, bullet point one; that's where I start losing it: the fact that you have to build this entire system, almost, to manage all the corner cases for failure and assignment and things like that, versus when we originally talked about it, it was just like,

B
this VM is allowed to migrate from this host to this host, because its data will be either here or there, so these are the two spots where that VM is allowed to land, because that's where its data is going to be regardless. And then you don't even need a daemon at that point, because it just starts up and its data is there, right? You don't need to have this OSD-like process of, you know,

B
trying to write back the data to the OSDs on failure. But what if the process, your VM, migrates somewhere else, to a host that doesn't have the data? Then you can't even start up until this flusher process finishes flushing, you know, on this third host, right?
B
Like, you know, your data is not corrupt in the RBD image; it's just that you go back in time, right, because it hasn't replayed the log to bring you back up to the last known good state. But to get to the last known good, committed state when it restarts your VM, it either needs to start it on node A, where you were originally located and where the cache is, or needs to start it on node B, where it was configured to replicate the cache.

H
You mean that at that moment, when the VM restarts, it will read data from the RBD image, but at that moment the data is not correct.

B
Then your QEMU process just needs to wait for the flushing process to do its work, mark the cache as clean, and release the exclusive lock, so that the other node can acquire the exclusive lock and say: hey, my cache is clean; oh, I don't have a cache here, so I've got to create a local cache and start all over again, and now somehow I've got to go clean the cache on the other host, the original host

B
that was the backup host, so that it can start fresh. Or maybe, you know, this new monitor process assigns a new location to be the new replica. I would say there are a lot of corner cases there, right.
A
Yeah, but with the management aspects, I mean, when you're talking about replaying all these things that the OSD does, it might almost make sense, if you wanted to go down that path, to make that kind of a new pool type that would do replication in this particular way and have its own kind of recovery semantics, but that would be able to share the existing failure detection, and perhaps

A
lead to a new version of, like, tiering for the purposes of writing back to the other OSDs' slower disks.

A
That's kind of a much larger project, but I think it might make more sense than running a parallel set of the same kinds of services just for this cache daemon.
B
Yeah, I mean, at the end of the day our goal is to get to the point where the OSDs are a lot faster, right? And then our goal should also be to get to the point where the OSDs are able to do some of the offloading of, you know, transferring messages between their peers for replication. Regardless of how we get there, I think we can all agree on that.

B
We want the OSDs to get faster, and if the OSDs can offload some work in terms of sending data from OSD A to its peer OSDs B and C, that would be great, as opposed to having the OSD actually have to read the data, put it on the network itself, and on the other side pull it off the network and process it.

B
You know, I know Sage, years ago, had talked about the whole idea of trying to use NVMe over Fabrics to get the OSDs to actually directly put the data where it needs to go on their peer OSDs, as opposed to having all this message passing of data so that the other OSD can process the data and figure out where it needs to go. So I don't know if that's still the plan, but I know historically that was, yeah.
A
Right, that's the dream, kind of. That's where we've talked more about that for the future of Crimson, basically: once we have that basic fast data path, it makes more sense to start looking into NVMe-oF and perhaps RDMA or other transports to be able to do a lot of this offload.

B
You have a user story of saying: if we're using this write-back cache, we need to make sure that the failure of the host that has the write-back cache on it, the failure of the Optane device in that host, doesn't mean the loss of data. And that's where, historically, we had talked about just this

B
fixed concept, just, you know, for the people that are, number one, opting into using this write-back cache, and, number two, the people saying that their workloads are so sensitive and so important that they need to make sure there's at least, like, n-level redundancy.

B
I think the simpler path, at least to start off with, instead of trying to start off with this whole other scheme, is just to fix it and hard-code it: say, hey, when I'm turning this feature on, send a copy to host B and send a copy to host C and send a copy to host D,
B
whatever your replication factor is. We're just punting on the problem for now. I think that's a way more achievable bite of a project than trying to build another Ceph in parallel to Ceph. And then it might just turn out that, you know, if our dreams come true and Ceph becomes a lot faster,

B
you know, there's work going on, and there are projects I know of going on, in terms of even just having the RBD client be more abstract for block, and having it use RDMA from the hypervisor host directly to an OSD host and things like that. So at the end of the day, how much do we want to invest in this,

B
if its lifespan is, in a perfect world, limited, and it's going to be a lot of code and a lot of untested corner cases? Because the odds are there aren't going to be a lot of people willing to set this up with the hardware you need to throw at it, initially, right? So I'm just trying to limit the scope, get us to the point where it's something that solves the one user-story issue of

B
"I don't want to lose my data", but try not to reinvent the wheel of backfill, recovery, and all these things about data placement. That's a harder problem to solve, and I think it's not a problem that's going to get solved, certainly not for Pacific, and it's questionable for, potentially, the future Q release, because it's going to be a lot of work, right.
H
Yeah, and I want to mention that, compared with the OSD, the replication of the write-back cache will be much simpler, because... oh yeah.

B
You end up with all the same cases of, like, hey, the image is running fine right now on node A, but your mirror host died, your replica is gone. How do I bootstrap up host C with a good copy of the log while node A is still running and that image is still actively attempting to replicate data to node B? Like, what's the failure detection path?

B
You have to wait for the monitor to detect, like 30 seconds later, that the host is dead.
H
And, so currently my suggestion is to, I mean, depend on the exclusive lock to keep the

H
Yeah, so for this case you mean, if the local cache device fails.

H
Right, yeah, you mean the failover time may be a little long, and so the I/O will be affected.

B
Yeah, right, right, until it's stably written to both hosts. But I can't write to host B, or node B, because the node is dead or the Optane device died, or what have you. So now we're getting to the point where, well, this is the equivalent of the OSD heartbeat interval or something like that: it's getting marked dead and it's reallocating PGs,
B
So
that
it
can
tell
the
library,
client,
with
the
persistent
right
back
cache
on
node
a
hey,
never
mind
about
writing
to
node
b
now
you
gotta,
you
know,
write
to
node,
see
that's
your
new
host,
but
also
also
before
you
let
ios
continue.
You
need
to
copy
the
current
state
of
the
of
the
log
on
node,
a
to
node
c
and
then
yeah.
H
H
H
I
mean
so
about
this
case.
Do
you
have
any
suggestions
I
mean
about
the
next
step?
Do
do
you
think
it
is
a
versi
that
we
do
some
tests
to
check
the
time
to
replicate
the
data.
B
Well, yeah, I mean, yes, it's definitely interesting to know the overhead of the replication; it's there somewhere, right, because there's a latency involved.

B
But it would also be interesting to know how quickly and deterministically you can detect a failure in that path, and not a transient failure.

B
You need to know deterministically that node B is dead, and I don't have a good answer for you about what you would do once you do detect it.

B
I mean, do you go down to replication factor one for the log, and then once it comes back online you catch it up? Because at a certain point in time, right, if the whole point of this persistent write log is to speed up I/Os, now that you involve two hosts in the system you're actually increasing the likelihood of a failure and decreasing

B
So yeah, I mean, I think it'd be good to know what the PMEM I/O library has for its replication, to hopefully quickly detect such a case, so that you can then make a determination to say, I'm just going to continue on without replication, and then if it ever comes back up you have to figure out how to backfill it with anything that's been missed.
A
Maybe one thing you could do in that kind of situation is go write-through instead of write-back, because then you don't have the risk of data loss from the cache going away.

B
Flush the existing cache and then continue write-through for all future I/Os; yeah, that's one way. But then it gets down to the question of how fast you can detect a failure, or what's there in the PMEM replication library to handle such a case. But yeah, I'm just trying to
H
Yeah
raj
yeah,
I'm
recording
yeah
yeah,
so
about
the
case.
Just
just
just
talked
so
mean
that
when
the
when
the
red,
when
the
graphic
of
copy
field,
we
can
flash
the
data
in
the
master
libra
bd
and
to
cite
the
cache
as
a
restroom
mode
is.
This
is.
B
H
B
Yeah,
so
if
you
have
this,
if
you
have
this
2x
replica,
you
know
factor
enabled
yeah.
This
is
an
optional
thing
to
enable.
So
let's
say
you
enable
it,
then
I
want
to
back
up
with
my
cop,
you
know
of
my
data
on
and
number
of
hosts.
B
If you get to the point where you detect that your peer host, you can't write to it anymore, what Josh is saying is the thing you could do. Normally, if you don't do anything, you're just frozen, right? You can't do anything.

B
So what you can do is basically freeze temporarily while you replay your entire log to empty it, just flush back everything that's in the log, and then for any future I/Os, including the one that caused the pause, you just do a write-through mode where you write directly. You basically just disable the cache.
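A small sketch of that fallback (an assumption-laden illustration, not librbd code): on replica failure the cache drains its dirty log to the OSDs and then services later writes in write-through mode, so nothing new depends on the unreplicated local cache.

    class WriteBackCache:
        """Toy cache that can degrade from write-back to write-through."""
        def __init__(self, write_to_osds):
            self._write_to_osds = write_to_osds
            self._log = []               # pending dirty entries: (offset, data)
            self.write_through = False

        def handle_replica_failure(self):
            # drain everything currently dirty, then stop caching new writes
            for offset, data in self._log:
                self._write_to_osds(offset, data)
            self._log.clear()
            self.write_through = True

        def write(self, offset, data):
            if self.write_through:
                self._write_to_osds(offset, data)   # no longer buffered locally
            else:
                self._log.append((offset, data))    # normal write-back path

    if __name__ == "__main__":
        cache = WriteBackCache(lambda off, d: print(f"OSD write at {off}: {d!r}"))
        cache.write(0, b"buffered while healthy")
        cache.handle_replica_failure()              # flush, then switch modes
        cache.write(4096, b"written through now")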
B
And this is again for the case of keeping it real simple: when I started my image, I told it exactly what its peer node is. Again, it's trying to constrain the problem down to a simpler thing to start off with, because if the peer node comes back, you can periodically ping it or whatever, through the PMEM I/O replication library, to say, hey, can I talk to you, can I talk to you,

B
you know, whatever health check, and then once it comes back online you can in theory ensure it gets reset back to an empty log and then proceed from there, right.
H
Yeah, yeah. So about the persistent write-back cache: I mean, we don't use PG logs, like the PG logs in the OSD, so if the master librbd, I mean, reallocates a new copy, right...

B
It's not just metadata about what I/Os it's seen, is it; you actually have the full I/O picture and, you know, what's going on.

H
So could you please give us any suggestions about the next steps for the replicated write-back cache? I mean, yeah, I need to consider what the corner cases are and how to handle them, and after that, is it best for me to show such a detailed design in the CDM, or...
H
Yeah, you mean I can list the corner cases, and then we can find out how to handle them, and then we can discuss the cases one by one in the CDM meeting, right? Yeah.

B
CDM, or just bring it back through, like, start a new email thread on the dev@ceph.io mailing list, right? Like: hey, I went back, I investigated points A, B, and C about what we talked about,

B
instead of having to wait for another Developer Monthly to come around again.
B
But I think we should just aim small to start: a minimal viable product of what we need to add to solve this initial user-story corner case of,

B
if I'm using this, I don't lose my data; like, I want to have at least one copy of it somewhere, or something like that. And then once that's in, we can see where to go, because it's not like it would be throwing away work, right?

A
Yeah, I think it would definitely be good to get minimal steps going first and get those merged before considering the large-scale management aspects that we were talking about before.

A
I just said it, and I agree with you: it's best to start with smaller steps and get those reviewed and merged before trying to address the management aspect, since that's much more complex.
B
Fixed configuration pairs; like, you don't need the monitor, or any number of these other processes, to manage assignments of replica daemons, you don't need a replica daemon, and you don't need control paths for distributing all that data via the Ceph manager. So I think it punts a lot of the work when you just say: this process can live here.

B
So we don't have to worry about a daemon to do the write-back, because once QEMU restarts on that host it'll just do it. In theory you could add an rbd CLI command to effectively do it as well, right; there's already, like, the cache invalidate one, and you could do a cache flush one to basically force the flush of the cache on a given host. But I'm just trying to simplify the problem down to: what's the minimum amount of work that you need to bite off and work on to just add support for making a replica of the Optane log?
B
Some way to say that this workload can run on node A or can run on node B, because that's where I've configured the replicated write log to replicate between. Then, when OpenStack restarts the workload after the crash, well, after the node dies, it starts it on node B and the cache is right there, because it's already been replicated. So it basically starts up and says: oh, here's my cache, it's dirty, I've got a lot of entries, I'm going to have to flush them back, and I'm going to keep working on that.
B
I mean, imagine, at least also, maybe, just because I don't want the, I mean, the orchestrator is not going to be starting up QEMU.

B
Maybe the interesting thing would be if you could have, like, a mon config key or something like that to basically define the pairs, or whatever, once, so that it can look that information up in a single place. I think that would be cool, as opposed to having to inject it at image creation time or something like that, because in theory it can move around.

B
You know, A, B, and C, or whatever, and that defines my replication set; just some way to define it and get the data from within librbd, so we don't need yet another API or whatever to inject it in, because OpenStack's not going to change, realistically speaking, or you know...
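For example, something along these lines could work (a sketch only: the config-key name below is invented, and this shells out to the ceph CLI rather than reading the value from inside librbd): the pair assignments live in one mon-managed key that any client on any host can look up.

    import json
    import socket
    import subprocess

    # Hypothetical key; the JSON value might look like:
    #   {"host-a": ["host-b"], "host-b": ["host-a"]}
    REPLICA_PAIRS_KEY = "rbd/pwl_replica_pairs"

    def get_replica_peers(local_host):
        out = subprocess.run(
            ["ceph", "config-key", "get", REPLICA_PAIRS_KEY],
            capture_output=True, text=True, check=True,
        ).stdout
        pairs = json.loads(out)
        return pairs.get(local_host, [])

    if __name__ == "__main__":
        print(get_replica_peers(socket.gethostname()))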
B
Anything we can do to implicitly get that knowledge from within librbd, and not have to force-pass it in, simplifies the problem yet again, because we don't have to modify a higher-level tool to inject that knowledge back down into Ceph. I mean, yeah, you might have Ansible or something like that; as it's setting up OpenStack it labels nodes and then injects the config keys or whatever.
D
Yeah, this is just a simple thing compared to the previous discussion. We've had a lot of conversations on the dashboard side and elsewhere about ways in which we can improve collaboration from a design perspective, because some of us, especially me because I'm time-zone challenged, can't make a lot of the meetings. So what we were looking at doing, what we started doing, was to put up design pull requests, and we started putting them into doc/dev under whatever the component was. And the consensus, talking to the rest of the guys, was just to sort of bring that to, you know, this

D
month's developer session, just to ask: are there any problems with us doing that? Is there a better way to do that, or is that the right approach to take?
A
You were cutting out a little bit, but I think that sounds right.