From YouTube: Ceph Code Walkthroughs: OSD Read Lease
A: All right, I'm going to start with this doc, because it's actually a pretty good overview of why all this works. The underlying problem is that with RADOS, when you write, you have to touch all the replicas, so writes always ensure that everybody is, not technically in quorum, but whatever, everybody agrees; there's no split brain. But on reads, you're only reading from the primary. That's normally fine, because usually either the primary is still the primary, or the primary has seen an OSD map that says it's not the primary, so it knows it's not the primary and it'll ignore the read request.
But it's possible that you could have a network partition that cuts off the primary, and it doesn't realize that the cluster has moved on and somebody else is now primary and doing something else. The OSD doesn't know that yet, and a client could still be reading from it. It's pretty hard to trigger that case, because you have to partition that OSD from the rest of the cluster, and the client also has to be reading from it.

But it can happen, and there's a sort of unit/functional test that does it, basically by blacklisting connections between certain OSDs and clients. With two RADOS clients it induces a situation where you have a stale read: you've written to the remaining replicas, but this other client can still read the old value from the stale replica.

So this has been around for a long time, and it was finally fixed. I think we discovered it around Firefly or something like that, but it took a long time to actually fix it.
So basically we fix this by adding read leases. The key thing is that there's a lease interval property. It's an optional pool property that you can specify on a per-pool basis, so you can define how long this is, but by default it's not set on a pool.

Instead, we use a global config option, the lease ratio, and we just multiply that by the heartbeat grace. The idea is that the leases will always be shorter than the heartbeat grace, so if we ever have an OSD go down, it fails a bunch of heartbeats, and by the time we've actually decided to mark that OSD down, the leases will have already expired.
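A minimal sketch of that lease-length rule, with made-up names for the pool property and config knobs (the real Ceph option names may differ):

```cpp
#include <chrono>
#include <optional>

using namespace std::chrono;

// Hypothetical per-pool setting; in Ceph this is an optional pool property.
struct PoolOpts {
    std::optional<duration<double>> read_lease_interval;
};

duration<double> effective_read_lease(const PoolOpts& pool,
                                      duration<double> heartbeat_grace,
                                      double lease_ratio /* e.g. 0.8 */) {
    if (pool.read_lease_interval)          // explicit per-pool setting wins
        return *pool.read_lease_interval;
    // Default: a fraction of the heartbeat grace, so the lease always
    // expires before the cluster decides the OSD is down.
    return heartbeat_grace * lease_ratio;  // e.g. 10s grace * 0.8 = 8s lease
}
```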
So we won't actually notice that this lease mechanism is happening; it'll be hidden by the delays already in the system, and for the most part that works.

The main way this works is by tracking two values for every placement group, both on the primary and the replicas. One is readable_until. That's the value on the primary, well, actually on the replicas too, because you can do reads from a replica, but it's basically how far into the future we're allowed to service reads on data in this PG locally. And then there's the ub value, the upper bound, which is an upper bound on the readable_until value for all the other replicas: how far into the future any of our peers might be servicing reads.

We maintain both of these values through these lease and lease-ack messages. First the primary sends a message, basically extending the lease into the future, and that updates everybody's readable_until. I can't remember exactly what the sequence is, but: a message to the replicas, they acknowledge back to the primary, and then the primary sends another message to the replicas. Basically, once all the acting OSDs acknowledge that they saw the new upper bound, the primary increases its own readable_until. So there's this invariant that's basically always true: our local readable_until is always less than the upper bound on all the peers, or, stated the other way, the upper bound that we have is always greater than or equal to the readable_until on any of the acting OSDs.
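A minimal sketch of that exchange and invariant; the struct and handler names here are illustrative, not the actual Ceph types:

```cpp
#include <algorithm>

// Monotonic timestamps, seconds for simplicity.
using mono_time = double;

struct PGLeaseState {
    mono_time readable_until    = 0;  // how far ahead *we* may serve reads
    mono_time readable_until_ub = 0;  // upper bound across all acting OSDs
};

// Replica side: the primary's lease message extends our readable_until
// (and therefore the upper bound we advertise for ourselves).
void handle_lease(PGLeaseState& s, mono_time new_readable_until) {
    s.readable_until    = std::max(s.readable_until, new_readable_until);
    s.readable_until_ub = std::max(s.readable_until_ub, new_readable_until);
    // ...then send a lease-ack back to the primary.
}

// Primary side: only once *every* acting replica has acked the new upper
// bound may the primary raise its own readable_until. This preserves the
// invariant: local readable_until <= everyone's readable_until_ub.
void handle_all_lease_acks(PGLeaseState& s, mono_time acked_bound) {
    s.readable_until = std::max(s.readable_until, acked_bound);
}
```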
That tells us when only we, and none of the other peers, are allowed to read, and that's the thing that lets us implement everything else.

The trick is that we have to deal with clock skew. We're talking about timestamps in the future, and we have to deal with the fact that clocks aren't all aligned, and also that they might be changing: you might be adjusting the time, because NTP might be doing some annoying thing or an admin did something. So all of this code uses a monotonic clock for everything. The monotonic clock starts, I think, when the system boots or whatever, and it increases monotonically; it never goes backwards. Even if you change the time, the monotonic clock doesn't change. It's sort of a monotonic local time, but it is meaningless on any other host, because it's basically the time since, I can't remember, I think it's implementation-dependent, but it's only meaningful locally. So we use monotonic clocks throughout, and all these timestamps are local to the current host. The way we share them between hosts is that we maintain upper and lower bounds on the deltas between any pair of OSDs that are communicating.
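For reference, this is the standard C++ way to get such a clock; Ceph has its own wrappers, but the semantics match std::chrono::steady_clock:

```cpp
#include <chrono>
#include <iostream>

int main() {
    // steady_clock is monotonic: unaffected by NTP or admin clock changes,
    // and its epoch is implementation-defined (often system boot), so its
    // raw values are only meaningful on the local host.
    auto t0 = std::chrono::steady_clock::now();
    // ... do some work ...
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = t1 - t0;  // never negative
    std::cout << "elapsed: " << elapsed.count() << "s\n";
}
```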
This is the handle_osd_ping code here, for when we get a ping. This `s` is the session, a stateful object that we have between any two OSDs that are sharing pings. Importantly, any time you have a peering relationship, any time you have a PG that shares data with another OSD, you are sharing pings with those OSDs, so those are exactly the OSDs you care about.

The pings carry monotonic clock timestamps, and any time you send a message, you know that it was sent in the past, or at this instant, but not in the future, so you can calculate a sort of upper bound, and likewise any time you receive one. I can't remember exactly how it works, but you basically use this exchange, both receiving the ping and getting the ack, to calculate these bounds, and the lower the network latency, the tighter the bounds are.
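A sketch of that bound calculation, assuming a simple ping/ack scheme rather than the exact Ceph code: each side stamps messages with its own monotonic time, and because delivery takes nonnegative time, each direction of the exchange brackets the unknown clock delta from one side.

```cpp
#include <algorithm>

using mono_time = double;  // local monotonic clock, seconds

// Bounds on delta = our_clock - peer_clock.
struct ClockDeltaBounds {
    mono_time lb = -1e18, ub = 1e18;

    // We received a message the peer stamped with its mono time peer_stamp,
    // at our mono time now. The send happened at or before our "now":
    //   peer_stamp + delta <= now   =>   delta <= now - peer_stamp.
    void on_receive(mono_time peer_stamp, mono_time now) {
        ub = std::min(ub, now - peer_stamp);
    }

    // The peer acked a message we sent at our mono time sent_at, stamping
    // its receipt with its mono time peer_stamp. It received at or after
    // we sent:
    //   peer_stamp + delta >= sent_at   =>   delta >= sent_at - peer_stamp.
    void on_ack(mono_time sent_at, mono_time peer_stamp) {
        lb = std::max(lb, sent_at - peer_stamp);
    }
    // The lower the network latency, the closer lb and ub get.
};
```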
In practice they're, you know, sub-millisecond, but if you look closely at the logs, you can see these deltas get printed out as an interval in brackets. Anyway, that's how we translate timestamps between OSDs that are within the acting set, and when we're sharing these lease messages we can use those deltas to adjust.

That mostly works, but there are also cases where we have to share timestamps with OSDs that we weren't already pinging, and that's when there's an interval change: the OSD map changes and suddenly we have a totally new set of OSDs or whatever. All this information goes into the pg_history_t. Remember, pg_history_t is the structure that has all this historical information, like the last interval that was started, last epoch clean, all this stuff, same up since or whatever. It has some new fields now, expressed as a duration. That's obviously not super accurate, because who knows how long it takes to get to the destination, and that two seconds is going to be more than two seconds into the future by the time they add it to their local time. But remember it's an upper bound, so it doesn't have to be perfect; as long as it doesn't travel backwards in time, it's still correct.
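A sketch of that duration trick, with illustrative names: the sender turns its local-only monotonic deadline into a time-remaining duration, and the receiver re-anchors it on its own clock. Any delivery delay only pushes the reconstructed bound further into the future, which keeps it a valid upper bound.

```cpp
#include <algorithm>

using mono_time = double;  // each host's own monotonic clock, seconds

// Sender: convert an absolute local deadline into "seconds remaining".
// If the bound already expired, clamp to zero: the old lease is over.
double encode_readable_until(mono_time readable_until_ub, mono_time now) {
    return std::max(0.0, readable_until_ub - now);
}

// Receiver: re-anchor on the local clock. Transit time makes the result
// later than the true deadline, which is safe for an *upper* bound.
mono_time decode_readable_until(double seconds_remaining, mono_time now) {
    return now + seconds_remaining;
}
```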
One sort of annoying thing, and there are a couple of helper functions down here for it: this is called before we send a message, when we have the local OSD's current monotonic time and its upper bound, and we either calculate what the new duration is, or, if the upper bound is already expired, we zero it out, because the old lease has expired, and so we know the previous interval is no longer readable and we can move on. And similarly when we're reading it back in.

So those are the timestamps, which are mostly annoying, but they only crop up in a couple of places, so we can mostly ignore them; they're expressed in terms of this duration instead of an actual timestamp.

Then there are two new PG states that we introduce. The laggy state happens if the PG is active, but for some reason the leases aren't getting renewed fast enough: we're active, we're servicing reads and writes, but a read comes in and our readable_until value is in the past, which means we're not allowed to service that read.
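A sketch of that read gate, with illustrative names; in the real code this check sits in the PG's op-handling path:

```cpp
using mono_time = double;

enum class ReadGate { serve, go_laggy };

// Before serving a read, check the lease. If our readable_until has
// slipped into the past, we must not serve: mark the PG laggy and queue
// the op until a lease renewal arrives and the waiters are requeued.
ReadGate check_read(mono_time readable_until, mono_time now) {
    if (now < readable_until)
        return ReadGate::serve;  // lease still valid locally
    return ReadGate::go_laggy;   // block reads; wait for lease renewal
}
```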
Most of the time this doesn't happen unless the cluster is stressed, or you have a partitioned OSD that doesn't know it's not talking to the primary anymore, so its leases aren't getting renewed, or it's the primary and it's not able to get lease acks back from the replicas because it's been partitioned. In that case it will eventually go laggy and stop servicing reads, and who knows what will happen; I guess either it'll eventually get killed, or maybe it stays permanently partitioned.

Say that OSD actually was partitioned: the reads would basically queue up until eventually the OSD finds out that the OSD map changed, and then it would drop them on the floor. Or if there's a temporary, short-term partition, maybe its network cable goes down, the connection drops for five seconds and then comes back up: it'll go laggy, the network will reconnect, all the messages will go through, the leases will get renewed, and then all those read requests will get requeued.

Whether it's leases or peering or whatever, all these different things, there's some tricky ordering, and the leases are either the lowest priority or the highest priority, I can't quite remember, they're in there somewhere, but it's all handled within that same function.
Right, so that's laggy: an active PG that goes laggy and then comes back to life. The wait state is a little bit different. That's the case where the PG is peering and the prior interval's upper bound is still in the future, so we just have to wait. Peering is complete, we're all ready to go, but we have to wait before we do any reads or writes until the past intervals have timed out, for the old leases to expire, and then we continue. This one is probably a little bit more common, although in practice, ideally, we don't see either of these. Again, if you're in the wait state, you block the requests until the timer expires, and then you just proceed.

The trick is basically to hide all this so that it never happens and users don't actually notice. With laggy, the primary is sending lease messages if the PG is idle, or possibly, I forget if I implemented it or not, it can piggyback lease messages on existing messages, so rather than having additional network overhead, there's sort of a free lease renewal as other stuff happens. During peering is the main thing: if an OSD goes down, the lease interval is always shorter than the heartbeat timeout, so if you have to sit there and ping an OSD for ten seconds before you decide it's really dead, the lease would have timed out at eight seconds, and so the lease will have already expired by the time you move on, and you don't wait for it.
The only problem is that OSDs don't always fail by timing out. Often the process dies, or you stop it, and in those cases we don't have to wait eight seconds or whatever that heartbeat interval is. For that reason the OSD map has new fields in it, starting in Octopus: this dead_epoch field, per OSD, basically says the last epoch at which the OSD was confirmed to be completely dead, like the process is gone; not just that we can't reach it, but that it's known to be dead, because we killed it or something. We know it's dead.

There are two ways we conclude that. One is that the OSD can send the mark-me-dead message to the monitor and explicitly say: I stopped responding to requests, go mark me dead. Currently no codepath exercises that, although that's the thing we need to fix from that discussion we had the other day: the fast shutdown needs to notify the mon, or whatever, it needs to send the dead message, and do that only after it stops responding to requests.

That's one way to do it. The other way is if you're sending a message to the other OSD and the messenger disconnects, tries to reconnect, and gets ECONNREFUSED. There's handle_refused, which is triggered if you get ECONNREFUSED, and the underlying assumption, theory, whatever, is that if we try to connect to a process and we get connection refused, that means the process is dead. It's not that the network's disconnected and it's not responding; in that case we'd have a timeout or something. If we really get connection refused, that means we can reach the remote host, and that process is not bound to that port anymore.
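A sketch of that inference; the types here are illustrative, though the messenger callback it describes (ms_handle_refused in Ceph) is real:

```cpp
// Why ECONNREFUSED implies "dead": the TCP connection attempt reached the
// remote host and the kernel answered "nothing is bound to that port", so
// the process is gone, unlike a timeout, which only means "unreachable".
enum class PeerFailure { unreachable, process_dead };

// Illustrative: shape of a failure report sent to the monitor.
struct FailureReport {
    int  osd_id;
    bool immediate;  // process_dead: mark down now, record dead_epoch
};

FailureReport make_report(int osd_id, PeerFailure how) {
    // A refused connection lets peering skip waiting for the read lease
    // to expire; a mere timeout does not, so no immediate flag there.
    return FailureReport{osd_id, how == PeerFailure::process_dead};
}
```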
Then the failure message that we send to the monitor has a flag that says: immediately mark this OSD down and mark it failed. That triggers that dead_epoch to be set, so we can immediately conclude that the OSD is truly dead, and then later, in all the peering code, when we're calculating that prior readable_until upper bound, we assume that OSD is dead and don't wait for it, which I can look at here in a second.

So readable_until is sort of the keyword for all this code; you can pretty easily find it just by grepping for readable_until. This is the history type, pg_history_t, and you'll find this prior readable_until in it: the duration for any known prior intervals, how long we need to wait.

There's a case here where we have one history and we get another history, and there's this merge function that in most cases just takes the max of the two, but the logic for readable_until is a little bit subtle. If the other history is from a newer interval, then our old readable_until doesn't make any sense, because it's for an earlier interval and we're learning about the interval that came after it, so we just take the newer value. Or, if we're getting a history from the same interval, then we can take the upper bound of whatever those two readable_untils are.
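A sketch of that merge rule for the readable bound, with illustrative field names (the real pg_history_t carries much more state):

```cpp
#include <algorithm>
#include <cstdint>

struct HistorySketch {
    uint64_t same_interval_since;      // epoch this interval started
    double   prior_readable_until_ub;  // duration we must wait, seconds
};

// Merge another copy of the history into ours. For most fields a plain
// max suffices; for the readable bound, the interval matters.
void merge(HistorySketch& ours, const HistorySketch& other) {
    if (other.same_interval_since > ours.same_interval_since) {
        // Theirs is from a newer interval: our old bound applies to an
        // earlier interval and is meaningless now. Take theirs outright.
        ours = other;
    } else if (other.same_interval_since == ours.same_interval_since) {
        // Same interval: keep the more conservative (larger) upper bound.
        ours.prior_readable_until_ub =
            std::max(ours.prior_readable_until_ub,
                     other.prior_readable_until_ub);
    }
    // Theirs is from an older interval: ignore their bound.
}
```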
This abbreviation is used in a lot of the debug output: "rub", for readable_until upper bound. You'll see it in a few places, as in prior_readable_until_ub. It's a mouthful to say, but you can find it that way.

These are the lease messages that go back and forth; they carry these values, and there's the lease ack that comes back.

Then the states. In here these are sort of all the peering values. readable_until is the local notion of how far into the future the current OSD is allowed to service reads, and the upper bound covers anybody else in the acting set. That gets fed into the history: if there's a peering process, we want to pass around the upper bound and make sure we wait for whoever the furthest-future reader is, but locally we only need our own value, the one from the primary's lease. And there's also this set here, the prior readable down OSDs. If you go into on_new_interval, this happens just after we've calculated the prior set and all this stuff, and we reset all of it.
Okay, so when we first start peering, we do the scan info, we do the build_prior, and then we set the prior readable down OSDs to all of the OSDs that were in the prior set that are now down. The way the prior set is built, it has both a list of alive OSDs that we need to query, and also a list of down OSDs that might have useful information, that were in those prior intervals but are now down. If those are already known dead we don't need to worry about them, but if they might still be up, we put them in a set, and those are the ones we need to wait for.

Then there's this check_prior_readable_down_osds, which happens when we have an OSD map progressing, and this is where we look for these dead OSDs. We look at all the ones where we're waiting for the leases to expire, and if the map says they're dead, then we erase them from the set. If we erased any, and we've erased all of them so we aren't waiting for any more, then we know we can clear the previous interval's bound, because all the previous ones that we were waiting for are now dead; we don't need to wait for them. When this prior readable down set is empty, we're good to go.
Here's probably the thing that will requeue old requests: if we're not in the wait or laggy state, then it's fine, nothing happens, but if we're in one of those two states, then we have to go requeue. We clear the wait or laggy PG state and requeue the waiting-for-readable list.

So when we're activating, when the primary is going active, it's about to send the PG info to all the replicas, so it refreshes the duration in that pg_history_t, because remember it's a duration based on the current monotonic timestamp and the prior readable bound. It does that before it shares it, and the same thing any time it shares PG info: we always refresh that duration before we send it out. In any case, in that function we update our own history so that it has an accurate relative timestamp, and then we merge the other history with ours, which, as I showed you earlier, basically takes the upper bound, unless one of them is from a newer interval or it's expired or whatever, in which case we can just reset it, but sometimes we have to advance that history.

This here is unrelated to the leases, but basically, if we've updated our history, then we refresh our prior readable_until upper bound by reading that value out, and we print out a nice message to say if it expired, just in case, so the debug messages are something you can actually understand.

I think those are the main pieces. I was thinking it might make sense to look at that recent pull request that's open, just because it touches one subtle piece of this whole process.

I showed you this bit of code here where, when we start peering, we do the build_prior, and if there are no readable down OSDs, then we clear the readable upper bound. I actually just showed you that in my code, because my checkout here is a little bit out of date.
That wasn't quite right, because it relies on this inference. The idea is that if there are no down OSDs from the prior interval, then during peering we're going to talk to them all, we're going to exchange all these info messages and gather logs and all that stuff, and as a consequence of doing that we're guaranteed that they have the latest OSD map and they know that they're not still readable for prior intervals. They know about the new interval, so they won't service old reads before or after the new primary goes active. But that's an inference about what's going to happen in the future, and so we don't want to clear the prior readable_until yet.

If some new peering process starts, and maybe now there is a down OSD or whatever, we'd basically have shared information that wasn't really true, because we haven't talked to those other OSDs yet, and we don't know that they have seen the latest OSD maps and that they're no longer servicing reads. So this was a premature conclusion that could, in some weird convoluted way, lead to a wrong conclusion.

So that was the fix. I'm trying to remember exactly what the scenario was where this happened. It came out of that thread on the mailing list where people were starting an OSD, and when the OSD came up it was going into the PG wait state, waiting for the prior interval to time out, and it was kind of subtle: that was a secondary effect of this underlying bug, because of the way the code in that update-history function was behaving. It's probably not worth explaining exactly what the weird convoluted path through the code was, but basically we were clearing it too early, and that led to some other confusing effect.
This is an accessor; the new version is actually slightly different to this, because you can set a pool property, but it basically tells you what the lease length is going to be. So either the pool property is set and we use that as the lease length, or otherwise we multiply the heartbeat grace times that lease ratio. Then there's some stuff to set up the monotonic clock that we use extensively, and the lease messages; I think these are used when we're sharing leases.

So that's the infrastructure for that. There's some stuff so that we regularly send out these PG lease messages, and to process those messages. We send them out regularly unless we're piggybacking them on something else, which happens sometimes, but not always.

And then what else? This is a test. There's a unit test in here, a "ceph test osd stale read" that's been around for a long time; it basically reproduces the bug, the original issue, and this adds it to the QA suite. This stuff adds the dead_epoch information, and basically it has that optimization so that when we know an OSD is dead, we don't have to wait for leases to time out. And then there's the document that sort of summarizes the high-level stuff.
B: I have a general question. When the PG goes into active+clean+laggy or active+clean+wait, essentially you're not able to really do I/O, right? But we do have "active" in the PG state. So the assumption behind this, I'm guessing, is that this is such a momentary state that users will not even notice it, so that essentially the PG is active from their perspective.
A: Yeah, yeah, I mean, it should be short; it'll generally only be a couple of seconds, if that. And I think "active" already meant so many other things within the context, whereas these laggy and wait states are only pausing client I/O, but everything else proceeds. It seemed simpler to do that than to make it a sort of intermediate state before you go into active. I don't remember exactly why I decided that, but that seemed to work.
B: So I guess in general, when users do see these two states, the idea would be to look at the readable_until, make sure that it's really the laggy or the wait state, and the first thing should be to check whether there is really a laggy network, like you described earlier, right? Yeah. Okay.
A: Yeah, I mean, I'm trying to think. The "we can't start doing I/O yet" state, that's the one they were seeing, and that was due to this bug. I think in general you would see these if somehow your OSDs aren't getting marked dead when they actually die. So if an OSD is timing out, although that shouldn't happen because the lease is shorter than the timeout interval, or if you did `ceph osd down 1 2 3` or whatever, where you just update the map and say this OSD is now down, but the OSD is still alive, right.
D: So what I'm asking is, let's say a PG somewhere, for some reason, is active+clean+wait and is getting stuck in that state. Most of the time, say when a PG gets stuck in peering, what we do is go ahead and mark the OSD down, and it comes out of the stuck peering. So here too, instead of waiting, would it be better to restart it, instead of marking it down, right?
A: The wait should usually be less than the full lease, because peering took up some of that time, and the leases probably weren't renewed the instant before the interval changed. But if you see any PG that is actually stuck in the wait state, that probably means one of the OSDs, the primary, is stuck for some other reason, just like any other peering issue: something's making it not respond.

Let me share this really quick, this one last thing. This is the test that reproduces the issue. Basically, we have two RADOS clients. The first one connects, we create a pool and create an I/O context for it, and then we figure out who the primary OSD is. Then we have this fence_osd function, which basically does an injectargs on that OSD to set these flags in the messenger that make all OSD and monitor messages get dropped on the floor, so that OSD is no longer talking to any other OSDs or monitors. So basically it's not going to get the OSD map update.

The second client hasn't gotten that OSD map update either, because it hasn't been doing any I/O; nothing's happening. It should still be able to go and read that same object from the old primary, right, because the old primary is still accepting client messages; it doesn't know it's down because it's black-holed. And so client two goes and reads the object. Meanwhile, with client one, we go and write an updated version of that same object to the new OSDs, and then we go back to client two, talking to the old primary, read it again, and make sure that we see the new data. Previously, before all this infrastructure was in place, that old primary would continue to service reads, and this read would get the old value, while this one would in fact be reading from the new OSDs.
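A sketch of that test's shape using the librados C API; the fencing step is the part specific to the test harness, so it's stubbed here, and the pool and object names are made up:

```cpp
#include <rados/librados.h>
#include <cassert>
#include <cstring>

// Stub: in the real test this uses injectargs to make the old primary OSD
// drop all OSD/monitor traffic, so it never sees the new OSD map.
void fence_primary_osd();

int main() {
    rados_t c1, c2;
    rados_ioctx_t io1, io2;
    char buf[4];

    // Two independent RADOS clients, as in the test.
    rados_create(&c1, nullptr); rados_conf_read_file(c1, nullptr); rados_connect(c1);
    rados_create(&c2, nullptr); rados_conf_read_file(c2, nullptr); rados_connect(c2);
    rados_pool_create(c1, "test-pool");
    rados_ioctx_create(c1, "test-pool", &io1);
    rados_ioctx_create(c2, "test-pool", &io2);

    rados_write_full(io1, "obj", "old", 3);  // initial value
    rados_read(io2, "obj", buf, 3, 0);       // client 2 reads via the primary

    fence_primary_osd();                     // old primary stops getting maps

    rados_write_full(io1, "obj", "new", 3);  // write lands on the new acting set

    // Without read leases this could return "old" from the fenced primary;
    // with leases it blocks until the lease expires, then reads "new".
    rados_read(io2, "obj", buf, 3, 0);
    assert(memcmp(buf, "new", 3) == 0);
    return 0;
}
```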
With all the lease stuff in place, this read will actually block for however many seconds, and then it will read from the new primary; previously it would read from the old primary. In any case, this used to fail: basically we would get the old value here and we would get a crash, but now it all works.
B: Yeah, I was just reading through the email thread on ceph-users, so that laggy state that they saw was not because of the real bug; it was just that they didn't have the fast shutdown thing. Yeah, two completely different things going on.