From YouTube: CDS Infernalis (Day 2.2) -- OSD: Peering / Latency
Description
Videos from Ceph Developer Summit: Infernalis (Day 2.2)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
B
In the blueprint I put in the peering state just so we get a feel for which steps are necessary, but there are three particular steps that are the main contributors to round-trip latency in peering. When peering first starts, the first thing we do is compile a list of everyone we could conceivably need to talk to, and then we ask them for a PG notify: we send a PG query, a PG info query or whatever it is, to all of them.
B
They send back a PG notify message, and we wait for all of them before we proceed. We don't send these queries, of course, to down OSDs. That means that if an OSD is not going to send us a notify because it's dead, this is going to stall until it gets marked down and we stop waiting for it; that affects the prior set and causes us to restart peering. So we'll assume from here forward that the OSDs don't die during this part of the process.
B
So first we wait for everyone to send us notifies. Those notifies contain the PG info for each OSD, and we use that to make a decision about which OSD has the authoritative PG info, the one we are going to go forward with. We then ask that OSD for its info and log, because we are going to use that to adjust our own log and missing set, and then to adjust
B
everyone else's missing sets once we get their logs. So once we have that, we adjust our own log and then ask everyone in the acting set, and everyone we could conceivably need to pull an object from, for their log and missing set, so that we know which objects they actually have on disk, and, for our own acting set, which objects we need to recover over to them. That's log-based recovery, for those following along at home.
B
So if we're keeping track, that's three round trips. At the beginning we have to flush all of the IOs from the previous interval before we can send anything to the primary, or before the primary can send anything to anyone else; and at the end, before we can accept writes, we have to persist the update to our own info and log, lest we tell the clients things we're not allowed to. So that's two distinct flushes and three round trips. So the question is how much of that is necessary, and can we make it faster?
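(For anyone following along, a rough sketch in Python of the sequence just described, from the primary's point of view; the message names and helpers here, send_query, merge_authoritative_log and so on, are made-up stand-ins for illustration, not the actual Ceph interfaces.)

    def peer(pg, prior_set, store, net):
        # Flush the previous interval's IOs before any peering traffic.
        store.flush_previous_interval(pg)

        # Round trip 1: ask everyone we could conceivably need to talk to
        # for a PG notify; this stalls if one of them is dead and unmarked.
        notifies = {osd: net.send_query(osd, "pg_notify", pg) for osd in prior_set}

        # Decide which OSD holds the authoritative PG info.
        best = max(notifies, key=lambda o: (notifies[o].info.last_update,
                                            notifies[o].info.last_epoch_started))

        # Round trip 2: fetch that OSD's info and log, and adjust our own
        # log and missing set from it.
        auth = net.send_query(best, "pg_log", pg)
        pg.merge_authoritative_log(auth.info, auth.log)

        # Round trip 3: get log and missing from the acting set and from
        # anyone we might pull objects from, so we know what is on disk.
        for osd in set(pg.acting) | set(pg.recovery_sources):
            reply = net.send_query(osd, "pg_log_and_missing", pg)
            pg.record_peer_missing(osd, reply.log, reply.missing)

        # Second "flush": persist our updated info and log before we are
        # allowed to accept writes.
        store.persist(pg.info, pg.log)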
B
So when we send out the first round of requests, if we're in a situation where we can heuristically guess who that is, where we feel like we have a good shot of being right, we can also ask for the log and info, or the full log, or the log and missing set, from that OSD. That saves us the get-log step if we happen to be right, and we can check this once we get back all of the notifies: we can verify that we asked the correct one, and if it happens that we have it, then, yay.
B
Similarly, for the get-missing: we already know which OSDs we're going to need missing sets and logs from. From the first batch we know everyone we need to go active. We only actually need the missing sets and logs from the acting and up sets; we don't actually need them from the people we're going to recover from. We can do that part subsequently.
B
Well, we can ask for the logs and missing in the get-info step as well. So in the best case, we might be able to get this down to one flush: do one flush and one round trip and then one additional commit.
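(A sketch of that best case, again with made-up names: the speculative get-log for the guessed-at authoritative OSD and the get-missing for the acting set are folded into the first round, so a correct guess leaves one flush, one round trip, and one commit.)

    def peer_fast(pg, prior_set, guess, store, net):
        store.flush_previous_interval(pg)

        # One combined round: every query is a notify, plus the log for the
        # OSD we guess is authoritative, plus log+missing for acting members.
        replies = {}
        for osd in prior_set:
            replies[osd] = net.send_query(
                osd, "pg_notify", pg,
                include_log=(osd == guess),
                include_missing=(osd in pg.acting))

        best = max(replies, key=lambda o: (replies[o].info.last_update,
                                           replies[o].info.last_epoch_started))
        if best != guess:
            # Guessed wrong: fall back to the normal extra round trip.
            replies[best] = net.send_query(best, "pg_log", pg, include_log=True)

        pg.merge_authoritative_log(replies[best].info, replies[best].log)
        for osd in pg.acting:
            pg.record_peer_missing(osd, replies[osd].log, replies[osd].missing)

        store.persist(pg.info, pg.log)   # the one additional commit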
B
A related piece is that we currently need to flush the stuff from the previous interval, where ideally we would actually like to only commit it, that is, only make sure that it's in the journal. The reason we need to flush it is that, when we go ahead later and serve reads, we don't know which objects are dirty, we don't know which objects have pending IOs. So we could track object contexts across intervals, that is, remember from the previous interval which objects had in-flight IOs. We have an in-memory structure for this.
B
We only keep it around while we're the primary, though. If we extend that to keep that structure around when we're a replica as well, then instead of not going active until all of the IOs have finished flushing, we would not have to wait for them to apply, only until they commit, which might save us a little bit more time. Okay.
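(A minimal sketch of that structure, assuming nothing about the real ObjectContext code: remember per object which IOs are still in flight across the interval change, so the new interval only needs the commit, and reads only block on the dirty objects.)

    class InflightObjects:
        """Tracks which objects still have unapplied IOs; kept across intervals."""

        def __init__(self):
            self.pending = {}                  # object -> set of in-flight txn ids

        def start(self, obj, txn):
            self.pending.setdefault(obj, set()).add(txn)

        def applied(self, obj, txn):
            txns = self.pending.get(obj)
            if txns is not None:
                txns.discard(txn)
                if not txns:
                    del self.pending[obj]

        def dirty_objects(self):
            # Kept on an interval change instead of forcing a flush:
            # these are the only objects a read would have to wait on.
            return set(self.pending)

        def readable(self, obj):
            return obj not in self.pending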
C
It seems like in many cases there actually wasn't anybody going down or coming up, but we have a forced interval change anyway, like because the pg_temp record was set, for example, but it actually set it to what we were before. In that case, everything we already know is still correct, and in fact a lot of times we already have the peers' info. We don't need to request it, because we know they didn't go down or come back up again.
B
Well, it's more like when we go through, what is it, start_peering_interval? Maybe in start_peering_interval we observe that the acting set didn't change, and that therefore we should not flush our missing, info, and log sets, and we go through a truncated peering process, because all of our information is still authoritative.
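(A sketch of that shortcut, with illustrative names only: when the interval change did not actually move anything, skip the flush and take a truncated path.)

    def on_new_interval(pg, old_acting, new_acting, old_up, new_up):
        unchanged = (old_acting == new_acting and old_up == new_up)
        if unchanged and pg.was_active and pg.is_primary:
            # e.g. a pg_temp record that re-asserted the same set: our own
            # info is still authoritative and we already have the peers'
            # info, so no flush and no full info/log/missing exchange.
            pg.truncated_peering()
        else:
            pg.full_peering()   # the normal path sketched earlier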
B
Yes, I think so. Basically we would have to remember, we would have to know, that the previous interval actually went active. I guess, okay, if we know that the previous interval went active, because we can look at our own state, and we were the primary when it went active, then we know that our own info has to be authoritative, yep.
B
We do something like bound the amount of time the monitor spends on it, and then whatever we come up with goes into the next map, and hopefully that cuts down on the work. Nate doesn't want me to go over that more, because there's that, okay, cool. One other thing we can do: there is a thing where, if you take the primary of the placement group down and bring it right back up, it'll be missing a few writes, but it'll still be primary, and some IO will tend to kind of hang on those objects, because we have to recover them before we can serve reads or writes on them.
C
So this is actually a problem with the pg_temp repopulating too. It's also going to, wait, I'm confused. Okay, let me see if I can remember. So if your map says, say, [1, 2, 3] and CRUSH changes it to be like [3, 1, 2], you set pg_temp to be the old thing, and it goes through peering and it's like, well, I could be [1, 2, 3], that's good enough. But it will sit there and block, waiting for yet another OSD map update cycle, instead of continuing.
C
Right, so what I'm saying is that that's a totally generic thing. Basically, when we get to the wait-for-up_thru decision, if we could go active with our current acting set, even though we want it to be something different, we should continue and go active, and handle it asynchronously, yeah.
C
Okay.
D
Yeah, the first one is the peering one, and another one, yeah, another one is ungraceful shutdown. As I mentioned in the blueprint, the down state of the OSD can only be noticed by the cluster or by its peers via heartbeat, and that could take up to 20 seconds to cause the map change so that the client could retry. In that regard, I'm wondering if there are any plans to make that better.
C
Yeah, I wonder if a more general thing would be, I mean, basically: if there's any situation where we know for certain that the OSD is down, one that has no false positives, then great, we can immediately mark it down. So the things that might work would be: if you are another process on the same host, and you know that the OSD was a specific PID and that that PID disappears, then that would be the trigger. Yeah, I mean.
C
Exactly right. So maybe the Calamari agent could do it. Maybe other OSD processes on the same host could monitor each other's PIDs that way.
B
I think that it would need to be wired through upstart or the other thing, systemd, so that it actually is the thing that does the starting and is the parent. Yes.
C
Maybe it was a list, but okay, does that make sense? Like, if we can figure out a way, yeah, if we can figure out a way with the systemd hook to identify that it's that specific process, so that when we send the OSD down command it only marks it down if the one that's currently up in the OSD map is that same one, then we're golden.
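(A sketch of what such a hook might look like, purely illustrative: something the service manager runs when the ceph-osd process it started dies, which immediately asks the cluster to mark that OSD down instead of waiting on heartbeats. The "is the daemon currently up in the map the same instance that just died" check is only a placeholder here.)

    import subprocess
    import sys

    def same_instance_still_up(osd_id: int) -> bool:
        # Placeholder: compare some identity of the dead daemon (boot epoch,
        # run uuid, PID recorded at start) against what the OSD map records
        # for this OSD, so a freshly restarted OSD is never marked down.
        return True

    def report_osd_dead(osd_id: int) -> None:
        if same_instance_still_up(osd_id):
            # No false positives here: the parent knows for certain the
            # process exited, so marking it down immediately is safe.
            subprocess.run(["ceph", "osd", "down", str(osd_id)], check=False)

    if __name__ == "__main__":
        report_osd_dead(int(sys.argv[1]))   # e.g. invoked by an ExecStopPost= hook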
D
Previously I sort of had two options: one is that, outside of the OSD process, we have another watchdog process or some other thing which detects the failure of the OSD and reports it immediately, as soon as the failure is detected. Another one is, as I mentioned in the blueprint.
D
Right, okay, yeah. The second one is that there are some slow OSDs. Actually, this includes an OSD going down for some reason, and currently we have a patch, I think it's already in and has tests, where the idea is that we read all the chunks, both data and coding chunks, for erasure coding, and use only the first chunks that return OK to serve the request.
D
That definitely avoids it: if there is an OSD that is slow or even stuck, the request can still be successful and at low latency. But the problem is that that doesn't work for the scenario where the slow OSD is the primary one, all right, yeah. In order to address that problem, it seems like we need to shift the responsibility from the primary OSD to the client side, yep.
C
The other sort of hurdle, though, is that the librados client then needs to be able to link in all the erasure code plugins, mm-hmm, and currently it doesn't. So we have to change the way that they're packaged, right; it's another part of the Ceph packaging, I guess. We'd have to change it, and we have to switch around the way that the package dependencies work, basically, so that, yeah, we can do that at least in some cases; if it fails, I can always fall back to reading from the primary, but yeah.
B
So another thing is: if it's the storage that's slow, and not the request handling on the primary, then you can still go ahead and try to reconstruct when you have the first K come in, and you don't have to wait for that one other one to finish, yeah.
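(A sketch of that read path as it might look on the client side, with stand-in read_chunk/decode callables rather than the real erasure-code plugin interface the speakers say librados would need to link in: fan the chunk reads out to every shard and reconstruct from whichever k chunks come back first, so one slow shard, even the primary's storage, does not add to read latency.)

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def ec_read(shards, read_chunk, decode, k):
        """read_chunk(shard) -> (shard, bytes); decode() needs any k chunks."""
        got = {}
        pool = ThreadPoolExecutor(max_workers=len(shards))
        try:
            futures = [pool.submit(read_chunk, s) for s in shards]
            for fut in as_completed(futures):
                shard, data = fut.result()        # error handling omitted
                got[shard] = data
                if len(got) >= k:                 # first k to arrive are enough
                    return decode(got)
            raise IOError("fewer than k shards responded")
        finally:
            pool.shutdown(wait=False)             # don't block on stragglers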
B
Similarly, for writes, at least on replicated pools, it seems to me that the analogous thing would be waiting for min_size replies. I don't know what anyone would think about that, because we already have a min_size parameter, which sort of defines how many writes we require to be persisted before we accept reads and writes, so we actually could possibly wait until we have min_size replies on a particular write.
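(A sketch of that ack rule and nothing more: the primary acknowledges the client once min_size replicas have committed the write, rather than waiting for all of them. The peering assumption raised next is exactly what this glosses over.)

    def submit_write(write, replicas, min_size, send_to_replica, ack_client):
        state = {"committed": 0, "acked": False}

        def on_commit(replica):
            state["committed"] += 1
            if not state["acked"] and state["committed"] >= min_size:
                state["acked"] = True
                ack_client(write)       # ack before the slow replicas finish

        for r in replicas:
            send_to_replica(r, write, on_commit)
        return on_commit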
C
Hmm, I think it might work, but I'm a little bit worried, because we assume that we touch all replicas. Because of the way that we do peering, if we have at least one member from the previous interval, then we assume that any write we didn't see was not acked. Yeah.
D
Okay, that is pretty much what I have to offer on this. Thanks a lot. I will go ahead and provide more information for the first item, the ungraceful shutdown. Okay.
C
So I think, I mean, just to throw this out there for the slow OSDs, for the EC thing: I think the thing that's going to best solve the problem for the EC case is going to be the client doing the reads. In latency-sensitive environments, I think that's the way to go, because you don't care about racing writes and so forth. So yeah, okay, okay, unless someone disagrees, I don't know if you'd like, oh.