From YouTube: Ceph Developer Summit Quincy: Crimson
Description
00:00 - boost::asio-based code adapted to use the Seastar reactor
3:12 - Manage RWL Replication Info
21:51 - m-n mapping
1:08:51 - Seastore
Etherpad: https://pad.ceph.com/p/cds-quincy
B: For instance, wrap it and just run it directly in the reactor with appropriate wrappers for wiring up the callbacks. I think this is just an investigation of whether it needs to be done, or librbd to start with. I think that's the important first case. I haven't looked into this myself yet, though, but that's my...
D: About this as well: Jason had already rewritten librbd to use the neorados interface, which is based on boost::asio, and his thinking was that it might be possible to template out the implementation-specific pieces (futures and coroutines/reactor) and leave the messenger, like the Crimson messenger for example, with only the protocol logic.
E: Right, okay. I'm a software engineer from Intel working on Ceph, and my peer Chunmei is also online today; we will talk about the design of the replica daemon and the replica monitor. Is it okay for me to share the slides?

E: It's loading. Previously, I think Lisa has talked about the RWL replication work, and today Chunmei and I will focus on how to design the replica daemon and the replica monitor to manage the replication information. For the RWL work, we have previously finished the design of the single copy for the NVMe and SSD write-back cache, and those features have been merged. For the next steps that we are going to do:
E: We are going to replicate the data across the media over the RDMA protocol. For the first part, we focus on how to manage the client cache information in order to do the further replication work; the other part we haven't started yet.

E: Today, this slide shows the basic framework of how we design the replica daemon and the replica monitor. Firstly, on the left side of the slide is the replica daemon; it will report the replica daemon information to the replica monitor. This information includes items such as:

E: the RDMA IP address and the port that is used to listen for incoming connections; the other information also includes the free size of the NVMe. After the replica monitor receives this information, it will store it into the replica daemon map, and it will also go through the Paxos service to keep the replica daemon map information consistent across all the replica monitors. And for the client side:
E: On the right side of this slide, the librbd client will use the get-replica-daemon-map message to request that the replica monitor choose the proper replica daemons and aggregate the information, and then the replica monitor will feed back

E: the replica daemon map information to librbd, and then librbd will start the connection to the replica daemons for the further replication work. Currently we have written the code and requested further comments from the community. On this slide we show the basic framework of the replica daemon and the replica monitor.

E: Because of that, we have also defined three kinds of messages, and we have designed these three message classes to be used between the replica daemon and the replica monitor. They also cover how librbd sends the request to the replica monitor, and how the replica monitor then uses another message to feed the information back to librbd. I will open the framework design document.
E
Okay
and
on
this
screen
and
for
the
for
the
replica
damage
information,
it
included
that
the
demon
id
and
it
also
included
the
rdma's
port
and
the
idm
is
ip
at
the
address,
and
there
is
another
there
is.
Another
memory
field
is
to
store
the
free
size
and
for
them
and
for
the
replica
map
enter
in.
Because
of
that,
the
replica
map
will
be
stored
in
the
replica
monitor
and
it
includes
all
the
replica
damage
information.
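A minimal sketch of the replica daemon entry and map just described, using illustrative C++ types and field names rather than the actual classes from the proposal:

    // Sketch only; field names are assumptions, not the real Ceph types.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct ReplicaDaemonInfo {
      int64_t     daemon_id;   // replica daemon ID
      std::string rdma_ip;     // RDMA IP address the daemon listens on
      uint16_t    rdma_port;   // RDMA port for incoming connections
      uint64_t    free_size;   // free size of the daemon's NVMe cache
    };

    // Stored by the replica monitor and kept consistent across monitors
    // through its Paxos service; one entry per reporting daemon.
    struct ReplicaDaemonMap {
      std::vector<ReplicaDaemonInfo> daemons;
    };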
E: Currently we only use an STL vector to store all the replica daemon information. Further, for the client side, when librbd...

E: When librbd sends the request to the replica monitor, it will include the number of replicas (here we define a replica-count field) and it will also include the requested replica size. For these three kinds of metadata, we have also defined the proper messages to be used between the replica daemon and the replica monitor, and also between librbd and them.

E: librbd needs to send the get-replica-daemon-map message to the replica monitor, and then the replica monitor will choose the proper replica daemons, aggregate all the information into another replica map, and send that replica daemon map information back to librbd.
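A rough sketch of the request/reply pair described above; these structs are only illustrative (a real implementation would be Ceph Message subclasses with encode/decode) and assume the ReplicaDaemonMap type from the earlier sketch:

    // Sketch only: the librbd -> replica monitor exchange.
    #include <cstdint>

    struct GetReplicaDaemonMapRequest {
      uint32_t replica_count;   // how many replica daemons librbd wants
      uint64_t replica_size;    // requested size of each replica cache
    };

    struct GetReplicaDaemonMapReply {
      // The monitor picks suitable daemons and aggregates their info into a
      // smaller map that librbd then connects to directly over RDMA.
      ReplicaDaemonMap chosen_daemons;
    };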
B: So there are other ways to maintain persistent mappings in RADOS that don't involve modifying the monitor. RGW, for instance, maintains information about caches using watch/notify in RADOS, and RBD does the same thing. Why does this need to use a new Paxos service?

E: Apparently the current thinking is that, for this kind of information, I have looked through the OSD monitor, and because most of the Paxos services are included in the monitor, I added another replica monitor which is inherited from the Paxos service. Yeah.
E: For the design of the RADOS or RGW approach, currently I haven't looked through it yet, because I just took some other Paxos service as a reference, yeah.

B: I'm saying RADOS itself, like the interface, the library interface, offers primitives for performing operations on regular RADOS objects, with the ability to create locking and notify mechanisms. I don't think you need to embed this in the monitor's Paxos service: it'll be less code to write, it will work better, and it'll scale better if you implement this in RADOS.

E: So, for this design talk: is there some design document about how to implement this kind of design based on librados?
B: All I see here is an RBD name to set-of-daemons mapping; that's how you locate the replicas for an RBD image, right. Secondly, there's a registry of available replica daemons. You could put each of those two things in a RADOS object with some kind of watch/notify to maintain authoritative ownership and ensure atomic mutation. librbd already gives you ownership of a particular RBD object, so that part's easy: you simply ensure that only that client is allowed to modify that mapping.

B: The second part you can maintain with just a registry: just an object containing a binary representation or whatever, an omap entry for each one of these daemons with a last-seen lifetime if you're trying to detect failure.

B: Right, you already have a cluster that you agree on, so all you need to do is write an object with a well-known name, or a family of well-known names; there are a lot of ways to do it. I suggest that you study the way librbd manages ownership of the RBD head object and the way RGW manages its cache semantics.
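To make the suggested alternative concrete, here is a hedged librados C++ sketch of a daemon registering itself as an omap entry on a well-known object and notifying watchers; the object name, key layout, and function are assumptions for illustration, not part of the presented design:

    // Sketch only: registry object + watch/notify instead of a new Paxos service.
    #include <map>
    #include <string>
    #include <rados/librados.hpp>

    int register_daemon(librados::IoCtx& ioctx,
                        const std::string& daemon_id,
                        const std::string& daemon_info_blob)
    {
      librados::bufferlist bl;
      bl.append(daemon_info_blob);   // encoded daemon info plus a last-seen timestamp

      std::map<std::string, librados::bufferlist> entries;
      entries[daemon_id] = bl;

      librados::ObjectWriteOperation op;
      op.omap_set(entries);          // one omap entry per replica daemon
      int r = ioctx.operate("replica_daemon_registry", &op);
      if (r < 0)
        return r;

      // Wake up anyone watching the registry object (e.g. librbd clients).
      librados::bufferlist notify_bl, reply_bl;
      return ioctx.notify2("replica_daemon_registry", notify_bl,
                           10000 /* ms timeout */, &reply_bl);
    }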
E: All right, yes; today we are talking about whether this way is the proper method, yeah, yeah.

D: I think when we had discussed this previously at a CDM, it started turning into almost implementing an entire OSD on the client side, and at that point...

B: That's why I was asking, for instance: is it actually necessary that the librbd instance have a local replica daemon? Does this system require that? I would recommend that you do not make it require that, to enable the possibility that all of the replica daemons are actually remote.

F: Mapping, yeah.
B: To run Crimson on multiple cores, I'm just going to begin with the assumption that it doesn't make sense to run one process per core, or just flat out that it doesn't. So, coarsely, we have three different resources we need to deal with: there are incoming connections from clients, which will come in on whatever port the messenger handles, and there are state bundles for each PG.

B: So the short-term goal, I think, is to implement that middle piece where the PGs are allowed to be on multiple cores. We create multiple SeaStore... Seastar, what do you call it, whatever their thing for multiple... we permit multiple reactors, and we create a partition service or whatever that spreads the PGs across them, and then we allow the messenger to farm out messages to the relevant core based on which PG. I think that's the first piece of work after that.
H: I was thinking about... what is currently in Crimson, what is currently...

H: So I'm curious whether we really need to pass connections, pass messages, across the cores. I understood that we are thinking about... we need to bring more, let's say, front-end processing capabilities to be able to saturate fast devices.

H: One core is not enough; we need to have more front-end processing power to saturate a fast device. But I'm not sure whether, in order to do that, we need to shard across PGs. Maybe we don't need to worry about sharing resources at all and just have multiple OSDs in the same process, in the same address space, sharing just a single instance of...
H: ...of course, and this would... I'm not saying it's extremely straightforward; it would require some extension, like...

H: Okay, that's something new I was unaware of. Keep in mind, I pictured... I literally had the 10,000-OSD cluster testing in mind.

H: Okay, but still, if this happens in production, then we could tackle it just in time; not sure whether...
B: ...a severe space wastage problem, from the smaller or less-used OSDs in that group sitting on space that's allocated but that they can't actually make use of. If they were all one big bucket, we'd get much better disk utilization.

H: Okay, I need to think about that, because I don't have an answer just now; I will rethink it, and we could think about it more.
B: ...way is what Radek is describing: we set things up so we just run multiple OSDs, whatever. There's planning for Quincy, and then there's deciding what to do next. I think this is an important thing to do next. It's not that it needs to be there for production users, but it's important; if I were placing resources, this is one of the places I would put them.

H: And actually, the implementation-related thing I wanted to ask about during this meeting: I understood that Chunmei has started working on the m-n mapping.
B: Like I said, this isn't a single feature; this is a collection of improvements that will need to be made. So no one's going to work on "m-n mapping": someone's going to work on modifying Crimson so that PGs get split up among several cores, and that's probably the first step. There's work to be done in Seastar... SeaStore, rather, that I'll probably work on, and then there's work to do in the messenger, both in allowing...

B: I'm not really worried about it. That's already running in its own process, in its own threads, so it'll scale independently, and any core that needs to do I/O simply sends it to the common queue. We can be clever about improving parallelism there by partitioning the queue, but I'm just not worried about it; it's not a big deal. It'll be much more relevant for Sea...

H: No, sorry, SeaStore.
B: That's what I'm saying: that isn't necessarily true. Once we've fully implemented all of the protocol work to inform clients of which port is mapped on which core and which PGs are on which core, yeah, that'll mostly be true. But even then the client... until we choose to do that work, the message will come in on whatever messenger it comes in on, and the messenger will need to look up which core it's going to.

K: No, no, I mean the OSD map lookup, to evaluate which message belongs to which PG, I think.
B: That may not be necessary; Josh is saying that...

B: ...look at that part of the message, figure out where it's supposed to go, and then look it up in the map. Not the OSD map, my dude, that's a whole separate thing; that's just reserved words and stuff. Sorry, I meant the map hosted by the OSD service inside of Crimson that maps a placement group to a core, whatever we choose to call it: the PG-core mapping.
D: ...in the fastest fashion, immediately queues the message into the appropriate shard in the sharded work queue, based on the PG that the message references; and then, if that PG doesn't exist, or is being split, or is being merged, or something, when it gets dequeued...

D: ...the classic OSD notices that this PG needs to wait for some reason, maybe for the split to finish or for an OSD map to be processed by the PG, and puts that operation on a waitlist.
B: Again, all the messenger will do is check the current cross-core mapping from PG to reactor. If the PG is there, it just immediately queues the message on the relevant reactor; if it's not there, it sticks it on a waiting list for that core, or for that PG, and returns. That's it. When the PG is created, the reactor will go "oh, I have this PG now", it'll claim that queue, and it'll move on with its life. There will be some subtlety in exactly how those operations happen, just to ensure correct sequencing, but not very much.
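A hedged Seastar-flavored sketch of that dispatch path; the types, the global maps, and enqueue_for_pg are invented stand-ins for the real Crimson machinery, not the actual code:

    // Illustrative sketch only; not the actual Crimson messenger.
    #include <cstdint>
    #include <deque>
    #include <map>
    #include <memory>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <seastar/core/coroutine.hh>

    struct Message { uint64_t pgid; };           // stand-in for the real message type
    using MessageRef = std::unique_ptr<Message>;

    // Published PG -> reactor mapping and per-PG waiting lists (sketch only).
    std::map<uint64_t, unsigned> pg_to_reactor;
    std::map<uint64_t, std::deque<MessageRef>> wait_lists;

    // Hypothetical per-PG operation queue on the owning reactor.
    seastar::future<> enqueue_for_pg(MessageRef m) {
      return seastar::make_ready_future<>();
    }

    seastar::future<> dispatch_message(MessageRef m) {
      auto it = pg_to_reactor.find(m->pgid);
      if (it == pg_to_reactor.end()) {
        // PG not instantiated yet: park the message; the reactor that later
        // creates the PG will claim this queue and drain it.
        wait_lists[m->pgid].push_back(std::move(m));
        co_return;
      }
      // Hand the message off to the reactor that owns this PG.
      co_await seastar::smp::submit_to(it->second,
          [m = std::move(m)]() mutable { return enqueue_for_pg(std::move(m)); });
    }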
B: If you're worried about the memory sequencing problem, there are techniques for getting around that. These mappings will have some monotonicity properties that let us do RCU tricks, so likely what's really happening under the hood is that there's a published pointer to a current, consistent mapping from PG to reactor, and any messenger wishing to do a read from these things does so without barriers.
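A minimal sketch of the RCU-style trick being described, under the stated assumption that the mapping changes monotonically: readers load a published pointer to an immutable snapshot, writers install a new snapshot and retire the old one after a grace period. Names are illustrative only.

    // Sketch only: publish an immutable PG -> reactor mapping snapshot.
    #include <atomic>
    #include <cstdint>
    #include <map>
    #include <memory>

    struct PGCoreMap {
      std::map<uint64_t, unsigned> pg_to_reactor;   // immutable once published
    };

    std::atomic<const PGCoreMap*> current_map{nullptr};

    void retire_later(const PGCoreMap* old);        // hypothetical deferred reclamation

    // Reader side (e.g. a messenger): a single pointer load, no locks.
    const PGCoreMap* read_map() {
      return current_map.load(std::memory_order_acquire);
    }

    // Writer side: build a new snapshot, publish it, defer freeing the old one
    // until any reader that might still hold it has finished.
    void publish_map(std::unique_ptr<PGCoreMap> next) {
      const PGCoreMap* old =
          current_map.exchange(next.release(), std::memory_order_acq_rel);
      retire_later(old);
    }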
H: Yep, but if I understood correctly, we would need, at least initially, without the extension to the RADOS protocol, to somehow pass data between cores again.

H: Yeah, but it will be a bit complex, because we need to implement... even after finding the proper reactor for handling traffic to a particular PG, when we are responding we cannot directly send, no...

H: Well, I'm perfectly fine with that. I would just love to ensure that the assumption that we really need to worry about the OSD map entries, the infrastructure traffic, is correct. Yeah, I think we do.
H: Maybe two years ago there was a discussion on ceph-devel; we were iterating over that, and I bear in mind that we have plenty of resources there. Maybe... well, maybe I'm just misunderstanding.

H: I imagined... I was thinking about the ten-thousand-OSD cluster testing.

B: Yeah, I mean, we did this. I mean, Sage worked with, oh god, CERN, there we go, with their extremely large cluster, and they're constantly hitting monitor scaling...

H: Problems, yeah. If so, then I'm afraid there would be no other viable option than to initially do the PG sharding and then make an extension to the protocol.
H: Okay, I agree, it's far from being elegant, but it's stupidly simple.

H: Sorry! Well, I believe it's the one OSD per core; it's very, very simple.

H: The situation where you are sharing solely single objects, for instance...

H: Just simplicity: no need to worry about protocol extension at all, no need to worry about passing...
H: At the price of putting extra burden on deployment engineers and deployment tooling, yeah. It actually moves the complexity completely somewhere else.

H: Shared memory; not the entire address space, just a region, and...

K: I have a second question. We have a heartbeat messenger, which is not PG related; it is like a host-level service. So we can leave it as it is now, right? Because if we need to make the heartbeat messenger work across cores, it will be really difficult to manage that metadata on different cores, and we don't... I...
B: Yeah, I'm not talking about the messages, I mean. What the heartbeat messenger actually does is maintain a bunch of OSD-wide state, some of which are minimums and maximums over the PGs on that OSD. So all of the PGs, when they're doing their own thing, are updating state that the heartbeat messenger then condenses into a message that gets sent; that is the heartbeat, right. So there is information that will need to make its way from the other reactors to the heartbeat messenger.
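A hedged sketch of how that per-PG state could flow to the heartbeat reactor under a Seastar multi-reactor design; PeerStat, its fields, and local_pg_stats() are invented for illustration:

    // Sketch only: the heartbeat reactor pulls per-reactor min/max PG state
    // before composing the heartbeat message.
    #include <algorithm>
    #include <cstdint>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <seastar/core/coroutine.hh>

    struct PeerStat {
      uint64_t min_last_epoch_clean = UINT64_MAX;  // minimum over local PGs
      uint64_t max_pg_epoch = 0;                   // maximum over local PGs
    };

    PeerStat local_pg_stats();                     // hypothetical per-reactor accessor

    seastar::future<PeerStat> collect_osd_wide_stat() {
      PeerStat acc;
      for (unsigned shard = 0; shard < seastar::smp::count; ++shard) {
        PeerStat s = co_await seastar::smp::submit_to(shard, [] {
          return local_pg_stats();
        });
        acc.min_last_epoch_clean = std::min(acc.min_last_epoch_clean,
                                            s.min_last_epoch_clean);
        acc.max_pg_epoch = std::max(acc.max_pg_epoch, s.max_pg_epoch);
      }
      co_return acc;
    }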
K: Okay, a third question: how do those events, like the connected event, the reset event, the remote-reset event, work if we...

B: What are these for, an OSD heartbeat failure or an OSD connection failure? That's entirely the messenger. Well, okay, so there are the parts for stateful connections...

B: Like, client connections don't generate events, so those don't matter for OSD-to-OSD information.

B: When we get a reset, it triggers the behavior where we send a message to the monitor, so whatever handler exists for that probably just gets pinned to reactor zero for now. I don't know, but let's look at that code in some more detail.
B: I think that's correct, yes. I think messengers will, in general, be single core. So actually, I guess this is something to ask you about: if you're serving on a single port, is there any advantage in the messenger being able to run on multiple cores at all, keeping in mind that the messenger...

B: ...a reactor accepting on a single port, where each connection that comes in ends up local to a single core, that might be worthwhile. But even in that case, the core the connection ends up on won't necessarily have any relationship to the messages it's serving, without the future work we haven't talked about yet; so in that case you're still going to have to hand off the messages to whichever reactor they're supposed to go to.

B: So my suggestion to you is that, indeed, we just let the messenger be one core for now, and work on creating the multi-core infrastructure in the OSD first.
J: Hey Sam, one question here. There's still some information shared between the cores, for example configuration and the OSD map collection. So how do we share that information: does just one core hold all the data and the other cores request the information?

B: Not necessarily. We can do the same RCU trick I was describing before: configuration information could be shared as a sort of widely available memory map, or we can investigate some other implementation if we need to; I'm open to suggestions on that. In other words, I don't think we want every config query to involve a remote message to another core, that would get expensive.
H: And when it comes to watch/notify, I believe this shouldn't be a problem in Crimson. Everything related to watch/notify in Crimson is actually encapsulated inside the boundaries of the PG, except...

B: ...for the fact that the sessions involved actually interact with multiple PGs. Watch/notify is going to be a problem: not an unsolvable one, but one that will require actual architecture.

B: I don't want to get too far into this because, honestly, I don't remember the interface well enough even to remember what the requirements are, but we're going to want to write some little boxes on Google Docs or something until we're all satisfied with what it looks like.
B: I don't... I mean, it's not a big problem. Again, the classic OSD already has pretty much the solution we want here; it's really a matter of translating it into libraries and primitives that are more convenient for Seastar watch/notify.

I: In classic OSD, there is also something like collocation of OSDs on the same host, something, no?
D: So, to summarize, you're discussing going forward with m-n mapping, where the OSD operates much the same as it does with the classical OSD, having essentially a single OSD instance per object store and per device.

B: Yeah; for deployment purposes you could still choose to partition a device and create two of them, same as with the classic OSD, but I think there are a couple of core questions. One: do we want the use of multiple cores to require different processes? I would argue no, because we want the backing object store implementation to be able to share memory.

B: So that leaves us with this version, which is that an OSD runs multiple reactors: a subset of the reactors host messengers, a subset of the reactors host object stores, and a subset of the reactors host PG data.
B: It would also support a situation where we don't want 128 SeaStore instances, so instead we have 128 messengers, 128 PGs, but only, whatever, 32 of the store instances, because for whatever reason that's the more efficient way to do it; so we would be able to independently scale these things. Does that make sense?

B: ...OSDs is bad; like, that's not a good thing. So by the time we've chosen to go to the work of supporting multiple reactors, at least at the object store level, we also don't want multiple OSDs, because having multiple OSDs in the same process when they don't have independent failure domains means we're artificially imposing extra work on the monitors.
H: And our discussion here was whether this is actually a problem or not.

H: Okay, the reason would be, could be, actually, to not worry at all about extending the protocol, and not worry at all about passing messages across cores.

H: I'm just arguing that in that scenario the cross-core communication would be extremely limited, just the object-store-related crossbar, actually.

H: Conceptually, yep, I can agree with that; at the level of implementation it will be two different things.

I: So, in other words, we still need to worry about how to shard a given set of cores to different OSD instances, right?
D: We can talk more about that part in the orchestrator session; that's tomorrow, worst case.

D: Yeah, I don't think it's a big deal if, like, cephadm has to pass in a list of cores to use, for example.

B: ...be embedded; like, there's metadata even with classic OSD that we write right down into the little BlueStore config folder. This can be one of those things, and it's soft state too, so it doesn't even need to be consistent boot to boot.
B: No, I think the hard part is, as you guys have put your fingers on it: there are a number of pieces of state, like the watch/notify state, the OSD map state itself, and this mapping from PGs to cores, that are fundamentally cross-core pieces of state. We need to write code for all that stuff, we need to define the messenger-to-OSD interface that appropriately deals with this localization of PG to reactor, and we need to test the hell out of it.

B: And if we do want to defer this, and we are happy with statically partitioning disks and running multiple OSDs, that's fine too; we should work on something else in that case and use what we already have, if we want to defer this work to later. But I don't think we should do an intermediate thing: we should either start working on the final version or use this version for a while longer.
D: It's going to essentially make it have the same deployment model as the classical OSD, which means that the integration with deployment tools is going to be quite a bit simpler. There's no reason to implement, like, a crazy deployment method that we're just going to throw away in six months.

D: So it looks like the last thing here is SeaStore.
B: ...an onode index, an omap, and garbage collection, and a little tool for running fio workloads against the lower-level interfaces. We're just now wrapping up the work of wiring up the higher-level interfaces so that we can actually run an OSD on it. We've got the omap portions wired up as well as the onode stuff itself; I'm finishing up the extent stuff in the next week or two, and I believe Xuehan is working on xattr and the metadata machinery.

B: Neither of those should be a huge deal. After that, there are two sort of big device-integration stories to consider. One is that SeaStore currently is designed for ZNS devices, because that seemed like the primary design-limiting use case, so I wanted to make sure the transaction internals were capable of dealing with it; but there are devices for which direct mutation of the storage is actually fine, so there's a document at the bottom detailing the initial ideas of how we're going to support that.
B: ...here for integrating into the cache itself. The notion here will be that the currently ephemeral cache, when there is persistent memory, will just have its extents located directly there, and we will journal enough state to reconstruct the mapping from those extents back to the physical addresses they're meant to represent, which should layer nicely on top of the other support.

D: ...when we're consulting these pieces of metadata, they're in the same form they're represented in memory, so there's no decoding.
B: So by cache here I mean a block cache: 4k blocks, it's just 4k extents. The thing that maps those extents back to where they came from is an ephemeral mapping that we will periodically dump into the journal, with delta updates as we go through. So every time we put something into the cache, the transaction that represents it will also have a special delta updating the in-memory mapping, so that...

D: The question is more on the cache lookup side. So you're looking up an onode; what does that involve?
B: The onodes are in a B-tree, so you descend the B-tree looking at each extent in turn. One presumes the upper levels of the B-tree will be appropriately kept in cache, so those will be in-memory or persistent-memory lookups. Eventually you either fall out of cache or you find what you need.
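A very rough sketch of that lookup, with made-up node and helper names (not the actual SeaStore onode-tree code): descend from the root, serving each level from the extent cache where possible and reading the extent from the device otherwise.

    // Sketch only; illustrative stand-ins, not SeaStore's real types.
    #include <cstdint>
    #include <memory>
    #include <optional>

    struct BtreeNode {
      bool leaf = false;
      uint64_t child_addr(uint64_t key) const;              // hypothetical helpers
      std::optional<uint64_t> onode_addr(uint64_t key) const;
    };

    std::shared_ptr<BtreeNode> cache_lookup(uint64_t addr);  // hit: already in memory
    std::shared_ptr<BtreeNode> read_extent(uint64_t addr);   // miss: read from device

    std::optional<uint64_t> lookup_onode(uint64_t root_addr, uint64_t key) {
      uint64_t addr = root_addr;
      for (;;) {
        auto node = cache_lookup(addr);
        if (!node)
          node = read_extent(addr);        // fell out of cache at this level
        if (node->leaf)
          return node->onode_addr(key);    // found (or not present)
        addr = node->child_addr(key);      // keep descending
      }
    }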
D: ...but what I'm getting out of it, yes, is that because there's no extra translation needed, that's why using a block cache for everything makes sense. Like in BlueStore, for example, there's a separate block caching layer versus a cache that is dedicated to things that were already decoded, since of course accessing things that are already decoded is much quicker; but in this case there's no extra step there, so it's not necessary.
B: No; I do not think the encoding will show up at any point here right now. All we have are actual direct integers and, like, uninterpreted buffers, so clearly you don't decode those. The object info...
B: Presumably you still do, because that's the OSD's problem, but that's a problem to be solved later. But for the rest of the onode now, for instance if you're trying to read out the extent information, what you'll read out is a couple of integers representing the logical address map, or logical address range, corresponding to the object's own range, and then you just directly look those up; you don't decode it, to the extent that those extent blobs exist at all.

D: I guess maybe further in the future... whenever you hit the object info piece, but you said that's later.

D: There's also the memory management aspect, but that's probably less of an issue with Seastar.
B: To answer the larger question: yeah, as this starts to become usable and we start messing with that, we may want to move object info out of xattrs into its own special-purpose thing.

B: Let's see, what other SeaStore things... then there's just, like, a ton of performance work to be done. It's also a little bit crashy right now, as Chad may have been discovering, so debugging would also be good.

B: There is essentially no way that any of the data structures I used in the transaction manager are appropriate, so those will need to be profiled and replaced. We need a lot of work, like I've listed.
B: So, if anyone's interested: basically many of these little subheadings are things pretty much anyone could work on, and SeaStore has the benefit that it's a much smaller code base than Crimson as a whole. Right now it literally can't run in Crimson, so you can only run this little tester, which is much less code than Crimson as a whole. So I would suggest that, if anyone's interested in getting comfortable with Seastar...
B: I don't know; so, for instance, I can just predict what I'm going to want to know in the next couple of months. I'm going to want to be able to know how many extents were mutated, how many extents were retired, how many...

B: ...bytes were written, how many bytes were released. I want to track the bytes written and released because I want to be able to track the number of disk bytes written versus the number of logical bytes written, to track write amplification. I'll also want to be able to track space amplification, the way the garbage collector is working, as well as performance indicators.
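A minimal sketch of the kind of counters being asked for (names invented for illustration), including the derived write-amplification ratio:

    // Sketch only: illustrative SeaStore-style counters.
    #include <cstdint>

    struct SeastoreStats {
      uint64_t extents_mutated = 0;
      uint64_t extents_retired = 0;
      uint64_t logical_bytes_written = 0;   // what the caller asked to write
      uint64_t disk_bytes_written = 0;      // what actually hit the device
      uint64_t bytes_released = 0;          // freed by cleaning / garbage collection

      double write_amplification() const {
        return logical_bytes_written
                 ? double(disk_bytes_written) / double(logical_bytes_written)
                 : 0.0;
      }
    };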
H: Yeah, it actually sounds like a combination of a perf counter with an event tracker.

H: I believe it's already done, at least; Abner was working on that, on dumping the Seastar-provided counters with the tooling we already have in the OSD admin commands.

H: I see; well, I'm not surprised, the kernel drivers...
B: Yeah, I think that's right; until those things just become available, what we have now is fine. It's not our goal right now; I think there's not an immediate need to transition to more efficient interfaces, but even when we do, it's more likely to be io_uring than DPDK, so I sort of doubly don't anticipate DPDK being a big deal.

H: Okay, but those things are actually very different.

H: ...about kernel-provided io_uring.

H: It's not a replacement for DPDK; it's rather a replacement for the syscall interface.
B: They're both... io_uring and DPDK are trying to do similar things: they're trying to make it so that, when you're accessing certain kinds of underlying devices, you can do so with low latency and low overhead. They do it in wildly different ways. Admittedly, io_uring does technically still involve the kernel, but it also allows the device to still behave properly for all of the other tooling, which is a pretty big deal. So I agree that they aren't the same thing; I don't agree that they're, like, unrelated.

B: Anyhow, mainly Josh's point still stands: this isn't actually the thing to worry about now. Okay.
H: Well, there's still a lot to do anyway. Maybe five hours ago the watch timeout passed the unit test, so... oh nice.

H: Getting back to the topology testing, just after polishing the code, which I believe will take maybe a few...

G: Yeah, pretty sure there are a lot of them.

D: Yeah, yeah, I think Gabi will be able to work on the snapshots once he finishes up his current project on the BlueStore allocation metadata; that's getting pretty close now.