From YouTube: CDS Jewel -- Peering Speed Improvements
B: This is another one which showed up at the last session and which hasn't seen much movement yet. The basic observation is that, while we're stuck with a certain amount of latency when map changes happen due to our consistency model, we're probably stuck with somewhat less than we currently have, so let's talk a little bit about the actual peering process.
B: So when a new primary discovers that it is primary at the beginning of peering, it sends a request for the info from every OSD in the acting set and the up set, and then from every acting-set OSD which is up (that is, not marked as down) in every acting set going back to the last time it knows the placement group went active. This ensures that we see at least one info from every interval which could have seen a write, and that allows us to be sure about the most recent write version.
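As a rough illustration only (not the actual Ceph code), the set of OSDs queried for infos might be built up along these lines; Interval, maybe_went_rw, and build_info_targets are simplified stand-ins for the real structures:

```cpp
// Illustrative sketch: build the set of OSDs the new primary asks for infos,
// covering every past interval, back to the last one known to have gone
// active, that could have seen a write.
#include <functional>
#include <set>
#include <vector>

struct Interval {
  std::vector<int> acting;   // acting set during this past interval
  bool maybe_went_rw;        // could a write have been served here?
};

std::set<int> build_info_targets(const std::vector<int>& up,
                                 const std::vector<int>& acting,
                                 const std::vector<Interval>& past_intervals,
                                 const std::function<bool(int)>& osd_is_up) {
  std::set<int> targets(up.begin(), up.end());
  targets.insert(acting.begin(), acting.end());
  for (const Interval& i : past_intervals) {
    if (!i.maybe_went_rw)
      continue;                         // interval cannot hold a missed write
    for (int osd : i.acting)
      if (osd_is_up(osd))               // skip OSDs currently marked down
        targets.insert(osd);
  }
  return targets;
}

int main() {
  std::vector<Interval> past = {{{0, 1, 2}, true}, {{0, 3, 4}, false}};
  auto up = [](int osd) { return osd != 4; };
  std::set<int> t = build_info_targets({0, 1, 5}, {0, 1, 5}, past, up);
  return t.count(2) && !t.count(3) ? 0 : 1;   // 2 queried, 3 skipped (no rw)
}
```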
B: After that we take those infos and figure out which OSD we're going to deem as having the authoritative log, and then we send it a log request to get the authoritative log. We use that log to update our own missing set: if it has new entries that we don't have, we take those entries and mark the corresponding objects missing.
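Roughly speaking, the authoritative log holder is the OSD whose info shows the most complete history. A minimal sketch of that selection follows, with PGInfo, last_update, and log_tail as simplified stand-ins rather than the real Ceph types (the real selection also weighs other history fields):

```cpp
// Illustrative only: pick the info with the newest last_update,
// breaking ties in favour of the longer log (older log_tail).
#include <cassert>
#include <map>
#include <utility>

struct PGInfo {
  std::pair<unsigned, unsigned> last_update; // (epoch, version)
  std::pair<unsigned, unsigned> log_tail;    // oldest entry still in the log
};

int choose_authoritative(const std::map<int, PGInfo>& infos) {
  int best = -1;
  for (const auto& [osd, info] : infos) {
    if (best < 0 ||
        info.last_update > infos.at(best).last_update ||
        (info.last_update == infos.at(best).last_update &&
         info.log_tail < infos.at(best).log_tail))
      best = osd;
  }
  return best;  // OSD to send the log request to
}

int main() {
  std::map<int, PGInfo> infos = {
    {0, {{10, 4000}, {8, 100}}},
    {2, {{10, 3990}, {7, 50}}},
  };
  assert(choose_authoritative(infos) == 0);
  return 0;
}
```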
B: So if we're counting round trips, that's one, two, three before we even get to start I/O on the PG, and some of these are not totally necessary. For one thing, we may have been primary before, but the way it's currently structured, when a PG finds out at the beginning of its interval that it is primary, it's very stateless: we don't remember that we were primary before, we wipe all of our in-memory state and start over, because it's much simpler.
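For reference, the three round trips being counted are roughly the following (the phase names match the peering steps described above, but the structure here is a simplified sketch):

```cpp
// Sketch of the three peering round trips before the PG can accept I/O.
#include <cstdio>

enum class Phase { GetInfo = 1, GetLog, GetMissing };

const char* describe(Phase p) {
  switch (p) {
    case Phase::GetInfo:    return "query infos from every relevant OSD";
    case Phase::GetLog:     return "fetch the authoritative log";
    case Phase::GetMissing: return "fetch logs/missing from the other replicas";
  }
  return "";
}

int main() {
  for (Phase p : {Phase::GetInfo, Phase::GetLog, Phase::GetMissing})
    std::printf("round trip %d: %s\n", static_cast<int>(p), describe(p));
  return 0;  // only after these does the PG activate and accept I/O
}
```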
B: But if we happened to be primary before, then we threw away a perfectly good set of infos that was totally up to date for the previous acting set's OSDs. Furthermore, we know that our own log is authoritative, so in that case we could easily skip the first and second steps and move right to GetMissing, in the fairly common case that an OSD went down (one of the two replicas, that is) or a down OSD came back up and we remain primary.
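A minimal sketch of that shortcut, assuming hypothetical fields (was_primary, cached_infos, and the boolean arguments are made up for illustration, not the real Ceph state machine):

```cpp
// Illustrative sketch: a primary that stayed primary across the interval
// change could reuse its cached infos and its own (authoritative) log,
// skipping the GetInfo and GetLog round trips entirely.
#include <map>

struct PGInfo { unsigned last_update = 0; };

struct PeeringState {
  bool was_primary = false;            // were we primary in the last interval?
  std::map<int, PGInfo> cached_infos;  // infos gathered in the last interval
};

enum class NextStep { GetInfo, GetMissing };

// Decide whether the new interval can skip straight to GetMissing.
NextStep start_peering(const PeeringState& prev, bool still_primary,
                       bool acting_set_compatible) {
  if (prev.was_primary && still_primary && acting_set_compatible &&
      !prev.cached_infos.empty())
    return NextStep::GetMissing;   // our own log is already authoritative
  return NextStep::GetInfo;        // fall back to the full three-step peering
}

int main() {
  PeeringState prev{true, {{1, PGInfo{4000}}}};
  return start_peering(prev, true, true) == NextStep::GetMissing ? 0 : 1;
}
```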
B: So that's probably the smallest change we can make, and it shouldn't be too difficult. The second thing we can do is this: we have a lot of information about the previous history, and there's a decent chance that, based on that history, we can guess that anyone in the previous interval probably has a sufficiently up-to-date log. In that case, we could preemptively request the log and the missing set from the OSDs in the most recent interval.
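One way to picture that optimization: when the history suggests the previous interval's members are current, the initial query could ask for info, log, and missing in one message instead of info alone. A sketch under that assumption (the message fields here are made up for illustration):

```cpp
// Illustrative sketch: fold the log/missing request into the initial query
// when the previous interval's OSDs are likely to be up to date.
struct PeerQuery {
  bool want_info = true;
  bool want_log = false;      // speculative: also send your log...
  bool want_missing = false;  // ...and your missing set in the first reply
};

PeerQuery make_initial_query(bool prev_interval_probably_current) {
  PeerQuery q;
  if (prev_interval_probably_current) {
    q.want_log = true;
    q.want_missing = true;    // saves one round trip when the guess is right
  }
  return q;
}

int main() {
  PeerQuery q = make_initial_query(true);
  return (q.want_log && q.want_missing) ? 0 : 1;
}
```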
B: That would allow us, even in the case where the primary was the one that went down, to have the new primary, on the first round trip, get the authoritative log, or rather get the log and missing from its replicas, and be able to immediately send back the activation message, bringing us down to two round trips in that case. Let's see.
B: Another wrinkle is that, before we're able to send the info back to the primary as a replica, we need to flush all of our in-memory state to disk, because we can't report back that we have version four thousand until we've actually committed the operations that result in that version. So we can't send back our volatile in-memory knowledge that the most recent outstanding I/O is version four thousand until we flush it.
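In other words, the version a replica reports has to be bounded by what is durably on disk, something like the following toy model (in_memory_version, on_disk_version, and reportable_version are hypothetical names):

```cpp
// Toy model of the flush requirement: a replica may only report a version it
// has actually committed, so the volatile in-memory state must be flushed
// before the info can go back to the primary.
#include <algorithm>
#include <cassert>

struct ReplicaState {
  unsigned in_memory_version = 0;  // newest version applied in memory
  unsigned on_disk_version = 0;    // newest version committed to disk

  void flush() { on_disk_version = in_memory_version; }

  // The version we are allowed to claim in the info we send back.
  unsigned reportable_version() const {
    return std::min(in_memory_version, on_disk_version);
  }
};

int main() {
  ReplicaState r{4000, 3980};
  assert(r.reportable_version() == 3980);  // can't claim 4000 yet
  r.flush();
  assert(r.reportable_version() == 4000);  // safe after the commit
  return 0;
}
```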
B: We can't do anything about that, but we can probably relax the restriction to a journal commit rather than requiring it to be applied to the file system, which, on XFS at least, happens after that. Doing that would allow us to respond sooner, and to do it all we need to do is track objects that are unstable on a particular replica across interval boundaries, which we need to do anyway for replicas.
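A rough picture of what "tracking unstable objects" could look like, under the assumption that the journal commit happens before the filestore apply; all of the names here are hypothetical:

```cpp
// Hypothetical sketch: remember which objects are journaled but not yet
// applied, so that state can be carried across an interval boundary instead
// of forcing a full apply before peering.
#include <set>
#include <string>

struct UnstableTracker {
  std::set<std::string> journaled_not_applied;

  void on_journal_commit(const std::string& oid) {
    journaled_not_applied.insert(oid);   // durable, but not in the filesystem
  }
  void on_apply(const std::string& oid) {
    journaled_not_applied.erase(oid);    // now stable in the filestore
  }
  // Carried into the next interval so peering knows what is still in flight.
  std::set<std::string> carry_over() const { return journaled_not_applied; }
};

int main() {
  UnstableTracker t;
  t.on_journal_commit("obj_a");
  t.on_journal_commit("obj_b");
  t.on_apply("obj_a");
  return t.carry_over().count("obj_b") == 1 ? 0 : 1;
}
```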
B: So that may not be too bad. I'm not sure how big a win that one will be, since you can only have a certain number of uncommitted FileStore transactions before the journal starts to throttle, but I think the one where the primary simply doesn't forget its cached infos across intervals is probably the biggest one. All right, any comments? If you look at the blueprint, I generated a PNG with the current PG state diagram, if you'd like a little more information.
B: Any map update that changes the acting set, the up set, or the primary for a particular placement group. So all of the OSDs paying attention to a particular placement group will go through this process, for that particular placement group, when the acting set, the up set, or the primary changes. Okay.
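So the trigger is an interval change, which could be modeled roughly like this (a sketch; the real check in Ceph considers more inputs than these three):

```cpp
// Illustrative check: a new map starts a new interval for a PG whenever the
// up set, the acting set, or the primary changed for that PG.
#include <vector>

struct PGMapping {
  std::vector<int> up;       // OSDs the map says should hold the PG
  std::vector<int> acting;   // OSDs actually serving the PG right now
  int primary = -1;          // primary OSD id
};

bool is_new_interval(const PGMapping& old_map, const PGMapping& new_map) {
  return old_map.up != new_map.up ||
         old_map.acting != new_map.acting ||
         old_map.primary != new_map.primary;
}

int main() {
  PGMapping a{{0, 1, 2}, {0, 1, 2}, 0};
  PGMapping b{{0, 1, 3}, {0, 1, 2}, 0};   // up set changed: new interval
  return is_new_interval(a, b) ? 0 : 1;
}
```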
B: So there is work to be done on the monitor, specifically in a slower-path version of this. What I just described can be, perhaps derisively, referred to as the fast path. There's a slower path, where the current primary or current acting set is not appropriate, perhaps because the new primary has no data whatsoever for this placement group. In that case, the primary will send a request to the monitors requesting a different acting set.
B: This is the pg_temp mapping, which will cause the placement group to be temporarily remapped, and then, because that changes the acting set, we go through another interval change. But the new primary, when it looks at the temporary acting set, will come to the conclusion that it's fine and proceed with peering.
B: That one is going to be harder to optimize, although I think there actually is already logic in the monitor now; there are heuristics in the monitor that will cause it to preemptively create those temporary mappings in a lot of cases. I think that went in before Hammer, so that does help that case somewhat.
B: Somebody would know more about it than me, but yeah, I think that actually did get added: when you do certain administrative actions, the monitor will spend a bounded amount of time trying to figure out what the implications are and building up a partial temporary mapping, which should reduce the amount of effort the OSDs have to go through, and then whatever it doesn't get to, okay, the OSDs send it a temporary mapping update request anyway. Is that right?
B: Basically, that part's actually parallel, so that's a separate optimization that we can do, and I think we're already at least partially good there. But the part we need to do on the OSD side is simply to make it more clever about this process, especially leveraging information that it already has and that we currently throw out.
B: I think getting two thirds of a size-three pool to peer in one round trip would be pretty good. It needs to be at least one round trip, because we need to notify the replicas about some state information. Specifically, I mentioned before that we go back to the last interval where we know we went active, and the reason we know that is that every placement group records a monotonically increasing epoch number that records the most recent interval in which it went active.
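That epoch number is what bounds how far back the info requests have to go; something like the following toy bookkeeping (last_epoch_started is the real Ceph concept, but the surrounding structure here is simplified):

```cpp
// Toy bookkeeping for the "last interval we know we went active" marker.
// Peering only needs infos from intervals at or after this epoch.
#include <cassert>

struct PGHistory {
  unsigned last_epoch_started = 0;  // monotonically increasing

  // Called when the PG successfully goes active in some epoch.
  void mark_active(unsigned epoch) {
    if (epoch > last_epoch_started)
      last_epoch_started = epoch;   // never moves backwards
  }

  // Intervals that ended before this epoch cannot hold writes we are missing.
  bool interval_matters(unsigned interval_last_epoch) const {
    return interval_last_epoch >= last_epoch_started;
  }
};

int main() {
  PGHistory h;
  h.mark_active(120);
  h.mark_active(110);                      // stale update: ignored
  assert(h.last_epoch_started == 120);
  assert(!h.interval_matters(100));        // older interval: no need to query
  assert(h.interval_matters(125));
  return 0;
}
```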
B: The primary version, I don't think, is going to be all that hard. What bugs me about this is that the current design is, for some version of the word, simple; it's already hard to read, and this is going to make it harder to read, because there will be ways to bypass some steps that you currently always see in the debug logs, which is going to be annoying, and it'll increase the number of ways bugs can propagate from interval to interval, because right now we throw out a lot of information.
B: If the relevant logs are actually four intervals back, because we're in some situation where some rack was flapping up and down and caused a few hundred maps of just complete insanity to happen, then we actually do have to talk to all of the OSDs involved, because we really have no clue which one has the right information. The idea here is to take the simple, common cases, where nothing interesting or exciting happened.
B: It's unclear to me how much of that is the actual peering message round trips and how much of it is the two commits that are needed to complete this process. There's a commit at the beginning, before you can talk to the primary again, because you can't communicate any of your in-memory state until it's been committed, so you have to complete that before you can respond to the first message, and then there's another before you can begin accepting I/O.
B: You have to commit the second number, that is, the last_epoch_started number, and both of those seem to be pretty much necessary. So if those are the overriding requirement, or if those are the part that's actually causing most of the problems, then something else will be needed, but we can do this at least. We'll see. Okay.
B: I think this is kind of it, like I said. We could always just require, I mean, okay: if we wanted to be sure that we never had to do the GetLog step, we could always request the log and the missing from all of the replicas every time, and that would work; it would just increase the amount that we had to send over the wire. So I don't know; for all I know, wasting that bandwidth is well worth it to avoid the round trip.
B: That would get us, in the worst case, down to two round trips, I believe. But the logs are pretty big. In the worst case, the case where it would actually help, it's probably the case that we've been unhealthy for a while, so we're probably closer to the 10,000 log entry bound than to the 3,000 minimum log entry bound, and at, you know, some number of kilobytes per log entry, let's say one or two, that's whatever one or two kilobytes times 10,000 is.
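For a rough sense of scale, that back-of-the-envelope number works out as follows (the per-entry sizes are the speaker's guesses, not measured values):

```cpp
// Back-of-the-envelope: 10,000 log entries at 1-2 KiB each, per replica.
#include <cstdio>

int main() {
  const long entries = 10000;
  const long sizes_kib[] = {1, 2};           // guessed per-entry sizes
  for (long kib : sizes_kib) {
    long total_kib = entries * kib;
    std::printf("%ld entries x %ld KiB = %ld KiB (~%ld MiB)\n",
                entries, kib, total_kib, total_kib / 1024);
  }
  return 0;
}
```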
B: That's quite a lot; it may not be worth it. We'll have to see. I'm not sure that, with this peering algorithm, there's a lot you can do to get better than that. It seems like you definitely need to...
B: Oh, because peering starts as soon as... The replicas don't actually wait for that initial request; that initial get-info usually goes out twice. It gets sent preemptively by the replica as soon as it sees that the interval changed, and then the replica receives a request from the primary saying please send it. The reason is that the primary doesn't really know that the replica has a copy of the placement group, so it needs to get a null answer if the OSD doesn't have the placement group.
B: But it occurs to me that I fibbed a little bit. It's not necessarily the case that we need to commit the previous interval's writes before we can send the response to get-info, as long as the primary is really careful not to remember any of that information if the OSD goes down, but that would be hard and error-prone, though possible, because we do do another commit before we actually begin accepting I/O.
B: Actually, no, the primary accepts reads before that point; we don't block reads on... yeah. It may not be worth much, that one; we'll have to see. The problem is that you really do have to have the previous interval's writes committed before you can accept reads, because otherwise you might allow a client to read state that later turns out to be divergent.
C: Outside of this peering improvement in particular, have you measured the general latency of peering overall, and whether it's impacted more by this algorithm being slow, or just by being serialized a bit, based on individual PGs doing more work?
B: So I don't think that's the main problem; that part's already pretty well parallelized, and the peering process has a much higher priority than client I/O. That happens in its own thread pool; it's different. The closest thing there is to a bottleneck is that the messages come in through the regular dispatch process, not the fast dispatch process.
B: So they are serialized on the OSD lock, but there are far fewer peering messages than there are client messages, and I've never seen any evidence that we're serialized there. I mean, you only have a hundred PGs per OSD or something; you don't have thousands. So I don't think that's likely to be a problem, although I don't know; erasure-coded pools mean you handle a lot more messages per PG than with a replicated pool. So maybe.