►
From YouTube: 2018-NOV-07 :: Ceph Developer Monthly
Description
Monthly developer meeting for the coordination of Ceph project development.
http://tracker.ceph.com/projects/ceph/wiki/Planning
B
With Juan Miguel Olmo Martinez, whose name I have probably totally butchered — yeah, yeah. So it looks like what I'm trying to do with the DeepSea orchestrator module and what's going on on the Ansible side are more or less following the same or a similar approach and a similar trajectory. Okay.
B
Vaguely — I saw some things about using libstoragemgmt, and then ultimately, probably, a decision that we need to call this out through the orchestrator somehow, because we need to be able to blink lights if, say, libstoragemgmt itself is busted or is not installed. That's what I thought the outcome was, but I may have missed pieces. Yeah.
A
There aren't too many real decisions to make. So in Nautilus we have all the device tracking now: the monitors and the OSDs are both reporting what underlying hardware device they're consuming, as a vendor_model_serial ID. That's good. So at the user level, I think there are two commands that we want.
A
To do it, I think. And under the covers it's just an lsmcli call with the path name of the device, and then you tell it whether it's the fault LED or the ident LED, and whether it's turned on or off. The problem is that all the state is, like, in the hardware, so I'm trying to figure out what the high-level tracking is — whether that state is stuff we want to keep track of.
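A minimal sketch of the under-the-covers call being described, assuming libstoragemgmt's lsmcli is installed on the host; the wrapper function and the two user-level command spellings in the comment are illustrative assumptions, not the final interface.

    import subprocess

    def set_device_light(dev_path, led="ident", on=True):
        """Turn a drive LED on or off by shelling out to lsmcli (libstoragemgmt).

        led is "ident" or "fault"; dev_path is the local block device path,
        e.g. "/dev/sdf".  The sub-command names follow the libstoragemgmt CLI;
        the surrounding function is purely illustrative.
        """
        action = "on" if on else "off"
        subprocess.run(
            ["lsmcli", "local-disk-%s-led-%s" % (led, action), "--path", dev_path],
            check=True,
        )

    # The two hypothetical user-level commands discussed above might look like:
    #   ceph device light on  <devid> [ident|fault]
    #   ceph device light off <devid> [ident|fault]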
A
I think that's maybe okay, because it's better to have Ceph think the light is on and have the light not be on than the other way around. You don't want to be in a situation where I turned on one light, Ceph says there's one light on, you go and you find one light on, and therefore you pull the wrong disk — the wrong disk. So I think...
A
In the "on" case, you would turn on the light and then record that the light is on. Then, when you turn it off, you would record that the light is off... no, I guess I got this backwards: you'd record that the light is on and then you turn it on, and, conversely, you'd turn it off and then record that it's off.
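A tiny sketch of the ordering being described, so the recorded state only ever errs toward "light believed on"; `recorded` (a set of device IDs) and `hw` are hypothetical stand-ins, not real Ceph interfaces.

    def turn_light_on(devid, recorded, hw):
        # Record first, then act: if we fail in between, Ceph believes a light
        # is on that actually is not -- the safe direction described above.
        recorded.add(devid)
        hw.light_on(devid)

    def turn_light_off(devid, recorded, hw):
        # Act first, then clear the record: a failure in between again leaves
        # Ceph believing the light is on, never the reverse.
        hw.light_off(devid)
        recorded.discard(devid)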
B
Yeah, I have a nasty feeling this is always going to be slightly fuzzy, because of, you know, that scenario you just mentioned, and also — I don't know — I'm twitchy about certain BIOSes and things exposing the lights in the right way. But yeah.
A
At some level, yeah, you just have to trust it, and if it doesn't work you've got to go upgrade libstoragemgmt. So it seems to me like this is pretty easy to implement. We need the layer — the calls in the orchestrator module — to run lsmcli or whatever other command is appropriate. Yeah, and I think we need flags or...
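A hypothetical sketch of what that orchestrator-module layer might look like; the method name, arguments, and the idea of just running the command locally are assumptions — a real backend (Ansible, DeepSea, Rook, ...) would run it on the target host.

    import subprocess

    class DeviceLightMixin(object):
        """Illustrative only: the orchestrator-side hook discussed above."""

        def blink_device_light(self, host, dev_path, led="ident", on=True):
            action = "on" if on else "off"
            cmd = ["lsmcli", "local-disk-%s-led-%s" % (led, action),
                   "--path", dev_path]
            # A real orchestrator backend would dispatch this to 'host';
            # here we just run it locally for illustration.
            return subprocess.run(cmd, check=True)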
B
Agreed. And it's sane for the mon or the manager to record at least what lights are on — it probably doesn't need a whole set of what's on or off, just what's on is probably enough — and, yeah, a health warning when lights are on, because then you know that you've lit something up and you need to do something about it. You won't forget.
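A rough sketch of how a mgr module could record which lights are on and surface a health warning, as suggested above; set_health_checks() is the existing mgr-module hook, but the module name, check name, and storage are made up.

    from mgr_module import MgrModule

    class DeviceLights(MgrModule):
        """Sketch only: track devices whose LEDs we believe are lit and raise
        a health warning while any are on, so nobody forgets them."""

        def __init__(self, *args, **kwargs):
            super(DeviceLights, self).__init__(*args, **kwargs)
            self.lit = set()

        def record_light(self, devid, on):
            if on:
                self.lit.add(devid)
            else:
                self.lit.discard(devid)
            self._update_health()

        def _update_health(self):
            checks = {}
            if self.lit:
                checks['DEVICE_LIGHT_ON'] = {     # hypothetical check name
                    'severity': 'warning',
                    'summary': '%d device light(s) enabled' % len(self.lit),
                    'detail': sorted(self.lit),
                }
            self.set_health_checks(checks)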
A
All right, what else? We have a list here... let's just jump to the CLAY erasure code stuff, I mean, since you're here — and do a status check there. You ready to go?
A
Yeah, okay, I think it's basically done. I also expanded the teuthology suite definition, or whatever, to cover it, so the QA runs are exercising it, I think. Maybe the only thing that we want to do is figure out how to mark it as experimental for the next release, just because it's new and it's sort of critical to, like, not losing your data. But otherwise I think all the pieces are there.
A
Yeah, I think what probably actually needs to happen — the next step — is for somebody to get motivated to do some performance tests: actually deploy it on a large system, put a lot of data in it, see how it works and compare it to the other codes, that sort of thing, with the final implementation — probably repeating some of the experiments that you did when you wrote your paper originally.
A
Well, we should give it a try. Normally I would sort of sign Mark up for that, but he's pretty busy with a bunch of other stuff right now, so I'm not sure he's going to get to it right away — but maybe it would be worth sending an email. Well, here are two things that we could do. One would be to send an email to the devel list, just announcing it — and actually the users list too — just saying this has merged, and this is why it's cool.
A
This is what it does, and we're interested in feedback on, like, performance and so on. The other thing, if you're interested: it would be awesome if you wanted to write a blog post about the whole thing on the Ceph blog. You can talk about the code, you can talk about the process of getting it upstream, you could talk about the paper.
A
All right, maybe we'll do that. Let's do the RBD shared read-only cache updates — it's your item, you're on, you're here, right?
E
Yep, okay, so I actually have slides for this. First, I'll try to give you some basic idea of the shared cache. Basically it has three components. The first one is a cache daemon running on each compute node, which can do promotion on each lookup and also do some eviction when the cache space is under high pressure. The second component is a local caching store for this, which actually has a layout very similar to FileStore.
E
It also supports subdirectories, to deal with the problem of lots of small files. And the third one is a new librbd hook that lives inside librbd, which can redirect read requests on a cloned image to our shared cache. So this is the basic architecture, or the components, of the current project.
E
This is the general I/O flow for the shared cache. When there's a read request, it flows to the object dispatcher. The object dispatcher is a new component designed by Jason Dillaman, added in Mimic; it works at the object level, and with this dispatcher we can do a very easy integration with our shared cache. We have a shared object cache dispatcher here, which can redirect the read request to our cache daemon.
E
Basically, it just looks up in the cache daemon first, and if there's a cache hit it will just read from the locally cached object. If it's not a hit, then the RBD caching daemon will kick off an async promotion request, which reads the whole object from RADOS and then stores the object on the local SSD.
E
The most important part here is the second step. librbd actually has an internal structure called the object map, and with the object map we can easily tell whether it is okay to read from the parent image or not. So if librbd finds that this object should be read from the parent image, it will go to the cache daemon first to do a lookup there.
E
Where
look
look
up
in
our
policy,
which
is
actually
a
very
simple
area
based
policy
right
now
on
the
first.
Just
it
were
trying
to
it,
worked
right,
then
kick
off
a
synchronized
promotion
from
the
reddest
here,
and
it
will
not.
It
will
not
block
current
really
request,
since
we
are
we
adjusted
region
returning
a
warm
here.
E
So the object dispatcher will just let the request flow to the next layer, which is basically the core librbd layer, so the read request can still be serviced quickly. And then, after the promotion is done, the cache daemon will update the local metadata — which is actually a hash map stored in memory — so that future lookups on this object will be a hit.
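A compact sketch of the lookup/promotion flow just described, assuming the shape of the interfaces; all names are illustrative, and the real code is split between the cache daemon and the librbd object dispatcher rather than living in one class like this.

    import threading

    class SharedObjectCache(object):
        """Sketch of the lookup/promotion flow described above."""

        def __init__(self, rados_read, store_write):
            self.meta = {}                  # in-memory hash map: object name -> local path
            self.rados_read = rados_read    # callable: fetch the whole 4 MB object from RADOS
            self.store_write = store_write  # callable: write the object into the local SSD store

        def lookup(self, obj_name):
            path = self.meta.get(obj_name)
            if path is not None:
                return path                 # cache hit: caller reads the local file
            # Cache miss: never block the read.  Kick off an async promotion and
            # let the caller fall through to the normal RADOS/parent-image path.
            threading.Thread(target=self._promote, args=(obj_name,), daemon=True).start()
            return None

        def _promote(self, obj_name):
            data = self.rados_read(obj_name)
            path = self.store_write(obj_name, data)
            self.meta[obj_name] = path      # future lookups on this object will hit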
E
We do have a policy interface here, so you can build your own policy. Currently we have a very simple one: we just promote on the first possible hit. But we can dig further and extend this to some more complicated form — like doing promotion only when an object has been hit 10 times, or something like that. Yeah.
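A sketch of the kind of alternative promotion policy mentioned above — promote only after N hits instead of on the first lookup. Purely illustrative; this is not an interface from the PR.

    class PromoteAfterNHits(object):
        """Only promote an object once it has been looked up N times."""

        def __init__(self, threshold=10):
            self.threshold = threshold
            self.hits = {}

        def should_promote(self, obj_name):
            n = self.hits.get(obj_name, 0) + 1
            self.hits[obj_name] = n
            return n >= self.threshold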
E
That was the promotion flow; next is eviction. The current eviction implementation is: we just check the cache space, and if it is above some watermark we simply remove some objects from the caching store based on their LRU status — we just remove the cold objects. This cache is for the read path, so I'll skip the write case, since librbd already has the mechanism to detect which objects are able to be read from the parent.
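A minimal sketch of the watermark-based eviction described above; the two-watermark shape and the helper names are assumptions, not the PR's exact logic.

    from collections import OrderedDict

    def maybe_evict(lru, used_bytes, high_watermark, low_watermark, remove_file):
        """'lru' is an OrderedDict of object name -> size, most recently used
        at the end.  Once usage crosses the high watermark, drop the coldest
        objects until we are back under the low watermark."""
        if used_bytes < high_watermark:
            return used_bytes
        while used_bytes > low_watermark and lru:
            obj_name, size = lru.popitem(last=False)   # coldest entry first
            remove_file(obj_name)                      # delete it from the caching store
            used_bytes -= size
        return used_bytes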
E
So we actually have a PR here, and in this PR we have those general components implemented. The first one is a file-based caching store, which promotes objects at the 4 MB level and then stores those objects into the caching store. It's very similar to FileStore: we have subdirectories there, and the default number is 10. For example, if we have a 128 GB RBD image, then that's something like 32,768 objects.
E
So that's where the actual data lives. And then the third piece is a configurable policy. Right now we just have a very simple policy based on LRU: for each read — actually for each lookup in the cache controller — we just update this LRU, and then when the cache space hits some watermark we remove the cold objects based on this LRU. There are some missing points: like I said, we're still using synchronous reads.
E
Actually the better way here is to use libaio, but we're still working on that. And the second missing piece is that we currently store the metadata as an in-memory hash map, so if we restart the cache daemon, the cache is effectively cleaned up, yeah. So if we can persist this cache metadata file, then we can actually save some time warming up the cache.
E
Okay, so I'm going to show some initial results here. This is a warm-cache test case. The setup is three OSD nodes, each with seven HDDs with BlueStore, the DB and WAL co-located on the same HDD; and then on the compute node we have an NVMe disk — a PCIe SSD — as the client-side cache. Here we actually use 16 cloned images on the one compute node, with a 4K random-read fio workload.
E
If,
if
we
we,
we
use
no
caching
here,
then
the
IOPS
is
like
235.
But
if
we
have
a
warm
cache
here,
the
IOPS
is
going
to
improve
like
7-7
K,
10.7
K.
Here
there's
a
performance
improvement
is
quite
huge
and
also
yeah,
just
because
it's
actually
just
remaining
yeah
from
from
local
and
we
Amida,
but
also
the
table
it'll
say,
has
been
saved
a
lot
because
we
are
HDD
device,
so
telogen
say,
is
always
a
big
problem
for
HDD.
But
if
we
have
a
class
I'll
caching,
the
latency
can
actually
be
improved.
A
lot.
E
The x-axis here is IOPS, and the y-axis is the time in seconds elapsed. So, as time goes by, the cache is being promoted more and more. In the first stage the IOPS is like 200 — the same level as the IOPS without caching — but as more and more of the cache gets promoted, at about 200 seconds later the cache has been fully promoted, and the IOPS can actually go to 40 K.
D
It might be nice to see, superimposed on that graph, the without-cache case — right, without cache it starts off at like five thousand, like that, whereas with caching you, like, kill your performance until it actually populates the cache.
D
If you're doing, like, 4K reads, you're already saturated, and in the background you're basically also trying to saturate your hard drives with full four-megabyte object promotions. So it's pulling performance away from your cluster to do a full four-megabyte object promotion for every single 4K random read; it's just going to... Oh, I see, I see.
D
The Objecter, at least, will throttle you down at like 1000 in-flight ops — so you're not going to get above that — or like a hundred megabytes, on the defaults. Okay, okay. They might be hitting that right at the peak: if you send out enough, you start hitting that built-in Objecter throttle.
A
I have a couple of other questions. I think it makes sense to rename this so that it's a little bit more general, so that it could be used by something else — I'm not sure exactly when that's going to happen. But the other question is: there's going to be some configuration option that, like, controls how big the cache is, how much space it's going to consume? Oh, yes, yes.
E
We have a few options to do this. We currently use a cache-entries setting, similar to the cache limit in the existing object cache, where the entries stand for the number of cached objects. So if you have one thousand entries, then you're going to have like a 4 GB cache. Oh...
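The arithmetic behind that sizing, as a sketch; the option name is hypothetical and the entry count is whatever the setting is configured to.

    RBD_OBJECT_SIZE = 4 * 1024 * 1024   # default RBD object size: 4 MiB

    def cache_size_bytes(cache_entries):
        """Each cached entry corresponds to one (up to) 4 MiB object, so the
        entry count bounds the space used."""
        return cache_entries * RBD_OBJECT_SIZE

    # 1000 entries * 4 MiB -> roughly a 4 GB cache, as discussed above.
    assert cache_size_bytes(1000) == 1000 * 4 * 1024 * 1024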
A
That's exactly right, yeah, that makes sense. Okay, so in that case we could call them all something like client.cache-daemon or whatever, so that the client name is distinct, and then with ceph config you can do "ceph config set" on that client cache-daemon name — and you could set that on a per-host basis: in the little selector thing for the configs you could say host equals foo, so each host gets a different amount of disk space.
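A sketch of what that per-host setting could look like through the centralized config; the client name, the option name, and the exact mask spelling are assumptions, not the agreed interface.

    import subprocess

    def set_cache_size_for_host(host, size_bytes,
                                who="client.rbd-cache-daemon",
                                option="rbd_shared_cache_size"):
        """Illustrative only: push a per-host cache size with a host mask."""
        subprocess.run(
            ["ceph", "config", "set", "%s/host:%s" % (who, host),
             option, str(size_bytes)],
            check=True,
        )

    # e.g. set_cache_size_for_host("compute-01", 32 * 1024**3)   # 32 GiB on that host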
E
Okay,
so
I
actually
have
a
rough
question
for
the
demo,
so
actually
in
the
current
implementation,
if
the
demon
is
not
started,
then
the
B
body
can
actually
fall
back
to
the
original
way
the
pass.
So
how
do
we
know,
for
example,
if
a
B
D
is
actually
reading
normally
and
then
a
demon
starts
up?
How
can
we
inject
this
message
to
the
currently
what
we
need
to
redirect
those
greens
to
the
demon
well.
D
It's still in the cache layer: if it says my state is "connected", it will actually send the IPC message, but if my state is not connected, it just forwards it to the next layer in the dispatch, automagically. Oh, I see, I see — pretty transparent. So if I configure my librbd to say, hey, enable this cache plug-in layer, then on startup of a cloned image — you know, for the parent image layers —
D
— it would instantiate that dispatch layer, and the dispatch layer would be responsible for establishing the IPC to the daemon, retrying it if it fails, or any number of things. But basically: if it's connected, try to forward it, and maybe even, like, time out if you don't hear a response back from the daemon in, you know, whatever.
A
Yeah, and if it times out for any reason, then it'll just not try to talk to the daemon for some period of time and then try again a minute or two later, yeah.
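A sketch of the connected/not-connected fallback with a back-off after a timeout, as just described; the class shape and names are assumptions, not the real librbd dispatcher API.

    import time

    def read_local(path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    class CacheDispatchLayer(object):
        """Ask the cache daemon first while it is reachable; on a miss, an
        error, or a timeout, fall through to the next dispatch layer, and
        after a timeout leave the daemon alone for a while."""

        def __init__(self, daemon, next_layer, backoff=60.0, timeout=0.5):
            self.daemon = daemon          # IPC client for the cache daemon
            self.next_layer = next_layer  # the normal RADOS-backed read path
            self.backoff = backoff
            self.timeout = timeout
            self.retry_at = 0.0

        def read(self, obj_name, offset, length):
            if time.time() >= self.retry_at:
                try:
                    path = self.daemon.lookup(obj_name, timeout=self.timeout)
                    if path is not None:
                        return read_local(path, offset, length)
                except TimeoutError:
                    # Daemon not answering: don't bother it again for a while.
                    self.retry_at = time.time() + self.backoff
            return self.next_layer.read(obj_name, offset, length)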
A
Okay, so if I want to just pick a location — presumably you only run one of these per host, because the cache, whatever it is, has to have access to at least as much of RADOS as the RBD client does for it to work, right? Yes, yes. But yeah, you'd probably just put it by default at something like /var/run/ceph/... object-cache-something.sock, whatever, I don't know.
G
Oops — it's all screens or nothing, you can't just do one. Apparently I'm showing my screen.
G
Slide template, right, yeah — you get these slides done, and then it's like: let's make some more with an Intel template. So there's the layout of the moving parts, which probably isn't very surprising. We're interested in having the write-back cache be persistent and have it be replicated, so you can survive a failure of a node.
G
...path — so that obviously keeps the latency nice and predictable. But as we talked to some customers that run Ceph, we discovered that their OSD nodes are often out of CPU cycles, so they couldn't even consider using a write-back cache if it meant running more stuff on their OSD nodes. People ask, well, gee, why can't you put the replicas in the client? And, well, we could. When I started this, I made the assumption that, you know, we know whether an OSD node is coming back.
G
Is everyone still there? Because it's strangely silent now — yeah, right, BlueJeans helpfully turns the screen black so that I can't think. All right. Some of this is probably not very surprising to you: as it turns out, as I learn a little more about Ceph, it seems like maybe it's not such a big deal to put it in the client. But in the interest of actually getting something done, I haven't gone back and worked on that — let's concentrate on getting something working here.
G
If
the
client
note
dies
you'll
flush
from
from
the
replica
we
explored
early
on
the
question,
wouldn't
it
make
more
sense,
maybe
to
flush
it
from
the
replica.
If
you're
gonna
put
the
replica
an
OSD
node
and
the
OSD
notes
are
on
dual
networks,
but
maybe
your
client
isn't
it's,
maybe
only
on
it
on
a
application
traffic
type
network.
Maybe
it
would
make
more
sense
to
flush
from
there
on
your
cluster,
but
we
have
we
had
this
restriction
that
the
current
limitation
of
of
the
MDK
doesn't
allow
replicas
to
be
to
be
read
now.
G
We really, really don't want that in Ceph or in the application, ever. As Jason can tell you, one of the struggles is that I'd really like the RWL to use the installed PMDK on the node to get the benefit of that, but this led to: gee, you need to make your image cache into a plugin, because we don't want to have to have PMDK installed to be able to install Ceph clients — which is perfectly reasonable.
G
Yeah, you guys know about all that. So the thing that might be surprising about how this stuff is laid out, versus if you were doing this on an SSD, is that you've got pointers, and you can touch tiny regions of the DIMM — it isn't painful like it would be on an SSD. And so, considering the performance target here, the primary use case...
G
We're
after
first
is,
is
this
tail
latency
thing
so
that
you
can
have,
for
instance,
the
CSP
with
a
with
a
with
the
the
image
backed
and
RBD,
and
they
can
get
tail
latency
low
enough
that
you
can
do
a
bunch
of
rights
to
your
file
system
for
your
customers,
VM
in
in
a
nice
predictable
amount
of
time
and
the
data
that
we've
measured,
I'm
sure
everybody
knows
it.
It
depends
on
what
you
know
how
your
SEF
cluster
is
deployed
and
what
hardware
it
runs
on.
G
But even if you use flash everywhere, there's still variable write latency. Obviously there are a lot of moving parts, and the whole point of Ceph, obviously, is that it keeps working even when the parts aren't healthy — you get a big enough cluster and something's always broken all the time, right? So that means tail latency is basically sort of a way of life in an operating cluster. So this is, you know, the small end of the performance envelope: you...
G
...don't need very much cache; you're really only after making it nice and predictable, and for workloads that are kind of modest. So when we talked about this, we targeted what a low-end VM from a CSP — your Amazons and the like — would do, which is a few thousand 4K writes a second. They just want to be able to get 99.99 percent of those done in less than three milliseconds, so that if you've got to do eight or ten of them, it can be done in a reasonable amount of time.
G
...maybe work reasonably well on an SSD without modification — and I've got data now that shows that it just doesn't. It does work, you can tell it that way, but, oh my god: it actually helps average write latency, at least on my toy cluster in the lab, but the tail latencies are just dramatically worse, because you're making all these flush calls. So, all right — a couple of things. You don't care what's going to happen to a write that was in...
G
...flight when something died: when the power comes back on in the building, it's erased from history, which is, you know, the result that you want. And here's the other case — other cases, like the entire replica goes away, or just the NVDIMM in there goes away — which, as far as we the client are concerned, and the rest of the cluster that has to replace it, are the same thing.
G
Let's
see,
what
do
they
do
here?
There's
this
other
thing
where
so
it's
kind
of
an
unfortunate
fact
of
persistent
memory
now
that
if,
if
that
ended
in
fails,
your
note
probably
goes
with
it.
You'll
get
a
check
and
it's
bad
news.
So
if
that
might
not
be
true
forever,
but
but
right
now
it
is
it's.
So
the
thing
is:
if
local
media
failure
is
going
to
kill
you
well,
then
you
don't
really
have
to
handle
that
case,
so
the
bottom
line
is
in
the
are
WL.
That
means
that
we
can.
G
Anyway — so actually I've backed off from that position, because this is the kind of thing that could happen; it's another thing that's guaranteed to happen eventually. So if you could survive it, then you'll have to handle it. Trying to skip forward: this is what we had in mind for the performance target — you know, for this tail-latency case, really modest — and then we built the thing and got the data back for it, and it does it, so we can hit that goal.
G
Obviously you have to be able to flush out of order, which is not going to be acceptable to some users — that's just the way it is, right; if they're doing mirroring, there's no other way. I talked to Jason about this earlier, saying, well, gee, what about a series of snapshots made on the volume and then deleted — we could just opportunistically create these — but then he, you know, gave me the bad news that deleting all those snapshots is incredibly expensive. So unless that's going to change, that's...
G
I thought, okay, fine, I'll give you guys the ordered write-back flushing — that should be easy — and then I can get on to my optimized flusher and see how awesome it is. And then I discovered that a naive approach to ordered write-back flushing turns whatever queue depth you had into one, and that's just awful. So...
G
...I'm recording clues about observed write concurrency, and so I do that now; it remains to be seen whether that needs to be more aggressive. So one of my questions kind of hanging out there is about the nature of the user-mode timer mechanism — whether we have sub-millisecond timers that are reliable, or not. Again...
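A sketch of why the naive ordered flush collapses to a queue depth of one, and of one way to keep concurrency while still completing strictly in order at sync-point granularity. The batching structure and names here are illustrative assumptions, not the RWL code.

    from concurrent.futures import ThreadPoolExecutor, wait

    def flush_in_order_naive(entries, write_to_rados):
        # Strict ordering done the obvious way: one write in flight at a time,
        # i.e. an effective queue depth of one -- the problem described above.
        for entry in entries:
            write_to_rados(entry)

    def flush_by_sync_point(batches, write_to_rados, max_inflight=32):
        """'batches' is a list of lists of log entries, one list per sync point.
        Entries within a batch were issued between the same pair of flushes,
        so (assuming they don't overlap) they can be written out concurrently,
        while the batches themselves still complete strictly in order."""
        with ThreadPoolExecutor(max_workers=max_inflight) as pool:
            for batch in batches:
                futures = [pool.submit(write_to_rados, entry) for entry in batch]
                wait(futures)   # barrier at each sync point before moving on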
G
...to deliver ordered write-back, that's what we would have to do; we can't make it any more parallel than that. If the user relaxes the ordered write-back requirement, then we're free to do other things, but so far we don't have a mechanism to relax that. And your position is that no one will ever want to do that?
G
I mean, if you believed that the replicated write-back log was reliable — that the replicas would save you if something died — then where would the demand come from? Unless you're doing something like remote mirroring, then I think you're stuck; and there may be other use models I don't understand where it would also mean that you're stuck. But for the usage models that I can think of... so, anyway. Yeah, right — ordered write-back is obviously the only thing that's implemented right now, because that turned out to be enough.
A
My
my
gut
says
that
that's
that's
the
safe
pass,
there's
like
a
tangent
where
maybe
they'll
want
to
like
disable
that
but
I'm
not
sure
about
that,
but
there's
also
this
other
sort
of
possibility
that
they
don't
have
3d,
crosspoint,
dims
and
multiple
machines
and
an
RDM
a
fabric
attaching
their
clients
to
their
servers,
which
I
think
is
gonna.
It's
like
everyone
in
the
world,
except
you
right
now.
A
Then they would get that big boost. It's not going to be as big, obviously, as if they were doing write-back into 3D XPoint, but it'll be pretty good, yeah. And...
A
But if it fails over or whatever, it's okay to warp it back in time a few seconds and fail over with a point-in-time consistent copy from a couple of seconds back on another node, because you've only lost like two seconds of data and you didn't corrupt the whole volume. That won't work, obviously, for a database and, like, a whole big stack or whatever, but for all the VMs that are, like, this random whatever scribbling all over their drive in your cloud platform, I'd...
G
But that's... it's all experimental anyway; you know, faster is better, yeah. We understand that. We're not trying to, you know, take away ordered write-back — we're definitely going to support that — but I mean, the reality is that everybody's going to want more performance than your Ceph cluster can deliver, and the only way that we see — that I see — to do that is basically with a cache like this. Yeah. In a previous life we saw VDI instances that, you know, were running the proprietary OS that aggressively trimmed.
G
It would, you know, write a bunch of files, and then, when it was done with those, it would actually send down the trims. So if you could afford to have a large enough write-back buffer, there's a bunch of stuff you'd never have to flush, because eventually it gets discarded. That's, you know...
G
One
of
the
hopes
is
you
can
you
know
quote-unquote
improve
your
cluster
performance
by
not
writing
stuff
to
it,
because
you'll
wait
long
enough
to
find
out
which
things
they
don't
need,
but
yes,
that
you
definitely
need
replication
and
replication.
Pretty
much
needs.
Rdma,
otherwise
it's
too
expensive
or
it
doesn't
help
because
it's
too
unpredictable
on
the
other
end,
we
so
write
and
we
understand
the
need
for
SSD
support
and
I'm
I'm
trying
to
get
that
into
the
schedule
here
and
so
here's
the
first
thing
I
ever
wrote
in
C++
and
it
turns
out
I.
A
My overarching question, I guess, is: is the interface into the cache abstractable or modularizable or whatever, such that you can have two implementations — one of them that does, you know, the path you just talked about, and the other one that's, like, writing to a local NVMe device? I don't know what that path was you just talked about, sorry, but...
G
There are some things, maybe, that should be common — like, for instance, the fact that you really want to discover which writes overlap before you decide... you know, before you start trying to persist them. If we're going to have multiple versions of image caches, maybe that should be factored out. I would certainly do that inside the RWL if I was going to make, say, the backend just a separate pluggable thing — that wouldn't be the image cache interface.
G
Another weird thing is that all these interfaces give you extent vectors or extent lists, and there's no guarantee that they're contiguous, or in order, or even not overlapping. I've always got this to-do item to basically massage that vector that comes in at the top of the RWL, because, as you just pointed out, writes between flushes — in theory I'm free to reorder those, unless some of them overlap. I'm kind of assuming right now that they don't overlap, usually. Well, but they can — yeah, they usually won't, I...
G
When a write is deferred because of the block guard, then we will create a synthetic sync point for it when it comes out. So I/O flushes — the RBD I/O flushes — always produce a log entry called a sync point, but we can produce sync points for other reasons, like, say, you crashed or you're closing the image.
G
This
overlap.
We
also
don't
want
because
sync
points
are:
are
our
granularity
for
retiring
entries
they're
going
to
be
our
granularity
for
synchronizing
with
the
replicas?
If
I
didn't
have
PMD
K
pooled
replication,
they
were
also
the
mechanism
for
resynchronizing
the
replicas
when
they,
when
they
came
back
up
but
I'm
happy
to
report
that
I.
Don't
don't
have
to
do
any
of
that?
None
of
that
complexities
in
this
because
it's
handled
for
us
that's
pretty
nice,
but
but
anyway.
So
there's
a
lot
of
uses
for
sync
points
and
that's
one
of
them
and.
G
...going to say — so that's kind of a question. The plan of record is to just do the pluggable backend for the RWL thing, so we can get the SSD support out for the community at some point here, hopefully soon. Architecturally it might actually not be the most beautiful solution, so if you'd like to do it a different way, let's talk about that.
G
But then I hit some nasty stuff with the unit tests: there were suddenly 10,000 unit tests that needed to test for the existence of an image cache, because those operations just really didn't make sense with an image cache that's not coherent shared access. Anyway, so now the image cache is enabled and disabled during the exclusive-lock acquire and release phase.
G
And so I had to go and create all the logic to persist the cache configuration. So this notion of a stackable image cache is sort of accreting stuff around it — and so now would be the time to say so, if you thought we didn't want that to be the case.
G
The things in the parent image are immutable, and so anything you've changed in your image — and if you're a database, that's everything, right — is not cached; so all reads of that miss. And it obviously goes without saying that the RWL has to be readable, right, because we've told you the write is done but we haven't flushed it yet, and you might want to come back and read that. How likely is that? I don't know, but it has to be supported. So I just think we would do this by stacking image caches.
G
Now, right — that is the plan, but whether I can actually execute that plan remains to be seen. All right.
G
So I have now added the stuff to persist the fact that there is a stack of image caches, and there are some bits in there that tell you whether anything exists on those nodes: did they actually make any files there, do those files have any writes in them, and are any of those writes unflushed. You can tell all of those things, but you open it and you can decide at open time: oh, it's dirty, I'm not going to let you open it — which is what it does now.
G
You
maybe
don't
want
to
open
it,
even
if
it
exists,
because
if
you
don't
clean
it
up
now,
then
who's
gonna,
clean
it
up
ever
really.
So
what's
missing
is
I.
Haven't
done
any
of
the
CLI
stuff.
You
know
to
basically
abandon
the
cache,
though
not
quite
sure
what
that
should
look
like
if
they're
there
at
that
RBD
yeah,
it's
gonna,
be
an
army
thing
right.
The.
G
That, yeah. So here's a situation: you might, you know, say I want to disable the image cache, but it's got a dirty cache on some other node. Well, it'd be pretty weird — you wouldn't really expect that tool to try and access that remote node and flush that cache for you, right; you'd probably rather it just said, go to that node and do this again.
A
There should be a command that will, like, just flush the local cache, and then when the flushing is completed, sort of detach — or record that it's flushed — whatever that looks like.
A
All right, well, I think there were a couple of items on here to just give an update on — the orchestrator and the messenger stuff. The orchestrator stuff — I think Tim is probably the only one who cares, and you know more than I do anyway, because you're joining the calls and I'm not, so we'll skip that.
A
The next pull request to merge has a sort of straw-man, placeholder implementation of protocol v2 that doesn't really do anything different from the first version except exchange multiple addresses, and then there's a whole pile of patches that just fix all the OSDMap structures, the MonMap, and the related commands to deal with having multiple address endpoints for a single entity. So each monitor will have a v1 and a v2 address.
A
Each OSD will have a v1 and a v2 address, and all the machinery just, like, puts all that stuff in the maps. So that's the next piece to get through. In parallel, Ricardo is working on the actual protocol v2 implementation based on the one that's currently specced out in the document, and then also in parallel Daniel is working on some of the initial Kerberos pieces. And then the last piece is...
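A small sketch of the "multiple address endpoints per entity" idea just described — one entity advertising both a legacy v1 endpoint and a msgr2 v2 endpoint. The Python types are purely illustrative; the real structures are the C++ entity_addr/addrvec types in the maps.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class EntityAddr:
        addr_type: str   # "v1" (legacy protocol) or "v2" (msgr2)
        ip: str
        port: int

    @dataclass
    class EntityAddrVec:
        """A vector of addresses per entity instead of a single address."""
        addrs: List[EntityAddr]

    mon_a = EntityAddrVec([
        EntityAddr("v2", "10.0.0.1", 3300),   # msgr2 default port
        EntityAddr("v1", "10.0.0.1", 6789),   # legacy v1 default port
    ])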
A
We're creeping forward; we need to hopefully accelerate that a little bit, because once protocol v2 is sort of there, we need to add the encryption piece — that's sort of the main high-level, user-visible feature deliverable that we need to get in — and then glue in the Kerberos stuff in a way that lets it coexist with cephx. That's going to require some refactoring in the auth code to make it sort of authenticate with Kerberos and then give you cephx tickets to talk to the cluster. A little bit tricky, but...