From YouTube: CephFS Code Walkthrough: MDSMonitoring
Description
Schedule: https://tracker.ceph.com/projects/ceph/wiki/CephFS_Code_Walkthroughs
A: So today we're going to talk about the MDS monitor. I believe I've done this at least once before, in a watercooler session.

B: For one of our first walkthroughs, I thought I'd go through this. It's often a little mystifying.
A: So what is the MDS monitor? The MDS monitor is part of the Ceph monitors; it's one of the Paxos services that we have. The monitors make changes to the MDS cluster by making modifications to the FSMap or the MDSMaps, and then distribute those MDSMaps to all the clients and MDSs according to what those entities have requested.

Some entities, like the MDSs, only care about the MDSMap for the file system that they're serving; some clients may want to subscribe to the entire FSMap to be able to see all file systems; and some clients will, like the MDSs, only care about an MDSMap for a particular file system, which is generally the case.

The other function of the MDS monitor is to monitor the health of the MDSs. For example, failing an MDS is really just removing it from the FSMap; creating a new file system is creating a new file system and MDSMap in the FSMap; things like that. So that's everything the MDS monitor is for, as far as purpose.
B: So, within the monitor there are these components, primarily.

A: The MDS monitor refers to two given FSMaps, and each FSMap refers to one or more MDSMaps; well, actually, it could be zero or more, if there are no file systems. The MDS monitor remembers the current official FSMap, which I've marked as epoch e-1 (FSMap e-1), and that is set in stone: it can't be changed once created, once the monitors have agreed on what its current state is. So it's immutable. And then there's FSMap e, which is the next FSMap that the MDS monitor may be making changes to. Here I removed one of the MDSMaps, just to indicate that maybe a file system got deleted, and that would be the delta between the two.

And then there's this FSCommands module within the monitor, which is just an abstraction for performing a number of file system commands on the FSMaps: for example, creating a new file system, failing all the MDSs on a file system, or changing fs settings on a file system. Those are all routed through FSCommands, which also modifies the current FSMap that's pending. You'll often see, within the MDS monitor, the next epoch referred to as "pending". The pending one is never shown to clients until it's finally distributed among all the monitors and they agree on the changes as part of their Paxos algorithm. So any changes to the current pending FSMap are never revealed to clients or MDSs; only when consensus is reached by the monitors do we finally effect those changes, which may be telling an MDS that it's been removed from the MDSMap, or updating clients about the new MDSMap.
A: So here's a look at the MDSs, the monitors, and the clients, and what kind of messages we see between these entities. The big one is going to be from the MDSs to the monitors: they periodically send what's an MDSBeacon message, which is really just a heartbeat message for the most part, telling the monitors that the MDS is still alive. That is sent, I believe, every mds beacon interval seconds, usually about two seconds, from the MDSs to the monitors.

When the monitors receive that, let's say the MDS sends it to one of the peon monitors; it could be any of the monitors that it sends the message to, but generally, once it picks a monitor, it stays that way for the duration of the MDS's lifetime, unless it loses its connection to that monitor. The peon (assuming the MDS sent it to a peon) would forward that message, via an MForward wrapper, to the monitor leader, and the leader is actually the one that's going to make any necessary notations about the beacon: for example, is the MDS no longer laggy, or is there a state change that needs to be performed? Then it sends the response back to the peon monitor, which forwards it, and finally the MDS gets the MDSBeacon ack.

The beacon messages we're going to get into more later, but one of the reasons they're very important when looking at the MDS monitor is that they drive state changes for the MDSs. If, for example, an MDS is in the replay state, so it just took over for a failed rank, only when it's done replaying is it going to tell the monitors: okay, I'm ready to move on to the next state.
A: And it'll mark itself as (I always get them all jumbled in my head) something like up:resolve, and it will request that the monitors change its state to resolve. Then the monitors, or rather the leader, will drive the change in the FSMap and MDSMap to the updated state, and only once all the monitors agree on the new FSMap epoch will it finally respond to the MDS that, okay, you can go to up:resolve; and that would be included in the MDSBeacon ack.

Also, the monitors distribute MDSMaps periodically via an MMDSMap message. Those get sent to the MDSs whenever there's an MDSMap change, and the same thing happens for the clients of the file system, which I have in the top right. And then on the bottom left we have this MCommand message, which would carry something like an "fs new" or "fs set" or "mds fail", any of those kinds of commands; it gets sent to the mon, and if it corresponds to an MDS command that's handled by the MDS monitor, it would get processed there.
C: I have one, Patrick: is it always the leader that distributes the MDSMap, or can a peon also do it, once all the monitors have agreed upon the Paxos state?
A: Yeah, they can all distribute the MDSMap once they've agreed. Usually the MDS will pick a monitor at random, and it gets everything from that monitor, including the MDSMap. The leader doesn't need to get involved in distributing the MDSMap to the MDSs, and that's one of the nice things about having several monitors: you can actually distribute that load out. So yeah, the leader is not involved in that.
A: Alright, so we're going to look at some code, and I will need to change what I'm sharing.

D: You need to open it up, I think, permissions-wise.
A: All right, so the first bit of code we're going to look at is called the PaxosFSMap, and this is fairly new code in the monitor; I think I added this about two years ago. The reason we're starting with this is that it's kind of unique to the MDS monitor, and I think, in retrospect, if we were starting the monitor from scratch, we'd probably use something like this throughout all the monitor services, because this is the thing that gets screwed up the most in the monitor and causes all sorts of problems. That is, as I mentioned in the diagram, the current epoch: the one that has been distributed to MDSs and clients and is immutable, and is stored in the Paxos service, like the MDS monitor. Well, sometimes people write code and they accidentally modify the immutable FSMap, and that causes all sorts of problems: it doesn't do what they expect, and sometimes changes get lost, etc.
So what I did was add this class, which just protects the current FSMap and then what we call the pending FSMap, the next epoch. The PaxosFSMap is inherited by the MDSMonitor; take a quick look at the MDSMonitor. Let me get back to that code: so this is private; these members, the fsmap and the pending fsmap, can't even be accessed by the MDSMonitor class.
The only way to get at these members is through the methods, and we have getters for the fsmap and also the pending fsmap. These are public, so they can be called by, for example, the OSDMonitor, which does happen in at least one place: within the OSDMonitor, it actually needs to look at the FSMap, but when it does use these methods, it only gets a const version of it. So we can be fairly certain that it's not changing it without us being aware of it.

The protected methods are designed to be used by the MDSMonitor itself. So here is how it actually gets a writable version of the map, and it can only get a writable version of the pending FSMap; and we have a check here to make sure that it's the leader monitor that's doing the changes, so that we don't have a code path where somehow a peon monitor is changing the pending FSMap, because that's never what you want to have happen. And then we have two methods here which actually create the next pending FSMap; so the leader, any time it's finished...
And then we have this decode method, and this will primarily be called by the peons. Whenever a peon gets a new FSMap from the leader, it's going to decode the buffer list, update its current FSMap, and then also null out the pending FSMap, to make sure that there are no invalid accesses to it; because, again, the pending FSMap should only be read and written by the leader.
So we have these protection methods in, and there were a number of changes that had to be made to the MDS monitor to use this. But overall, this actually turned out to be a very positive change, I feel, because it caught a lot of bugs and is preventing, through code, any number of new bugs. But again, this is a protection unique to the MDS monitor; you're not going to see this in other services.
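The protection just described can be sketched as a small class. This is an illustrative model with hypothetical names, not the actual Ceph code: the committed map is only handed out const, the writable handle to the pending map asserts leadership, and the peon decode path drops the pending map so stale references cannot be used.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Stand-in for the real FSMap: an epoch plus a field we can mutate.
struct FSMap {
  unsigned epoch = 0;
  int num_filesystems = 0;
};

// Minimal sketch of the PaxosFSMap idea (hypothetical names).
class PaxosFSMapSketch {
public:
  // Public const getter: safe for other services to call; they cannot
  // mutate the committed epoch through it.
  const FSMap& get_fsmap() const { return fsmap_; }

protected:
  // Only the owning monitor service calls this, and only on the leader.
  FSMap& get_pending_fsmap_writeable(bool is_leader) {
    if (!is_leader)
      throw std::logic_error("peon must not modify the pending FSMap");
    return *pending_;
  }

  // Leader: start the next epoch from the committed map.
  void create_pending() {
    pending_ = std::make_unique<FSMap>(fsmap_);
    pending_->epoch = fsmap_.epoch + 1;
  }

  // Peon path: a freshly committed map arrives; adopt it and drop the
  // pending map so there are no invalid accesses to it.
  void decode_committed(const FSMap& committed) {
    fsmap_ = committed;
    pending_.reset();
  }

private:
  FSMap fsmap_;                     // committed, immutable epoch e-1
  std::unique_ptr<FSMap> pending_;  // next epoch e, leader-only
};

// Expose the protected API for demonstration purposes.
struct DemoMonitor : PaxosFSMapSketch {
  using PaxosFSMapSketch::create_pending;
  using PaxosFSMapSketch::get_pending_fsmap_writeable;
  using PaxosFSMapSketch::decode_committed;
};
```

The point of the design is that a write to the pending map cannot leak into the committed map, and a peon touching the writable handle fails loudly instead of silently corrupting state.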
A: All right, and then just to look at the MDSMonitor side of this. So again, the MDSMonitor is a Paxos service, and there are two methods we should look at for the MDSMonitor, as far as this PaxosFSMap is concerned. The first one is update from paxos. Even on clusters with all the debugging set to the lowest defaults for the mons, you'll still end up seeing these new FSMaps get printed out in the monitor log, so you can rely on that being there. And then here's a call to check subs: any time the monitors update the FSMap, they also need to notify all the clients who have subscriptions to the MDSMap, and they will go through all those and update them; we'll take another look at this later.
This would be called by... it is just one of the abstract methods in the PaxosService class. Whenever it's time to create the next map that the Paxos service is distributing, it calls this create pending method, and that would happen after all the monitors have consensus on what the current version of the FSMap is; then the leader is going to create the next pending one. So that happens here, and again, this is wrapped up in this PaxosFSMap precisely to prevent the MDS monitor code from creating a new pending map anywhere except here.
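The lifecycle just described can be modeled as a toy. This is a sketch under stated assumptions (the names are illustrative, not the real PaxosService interface): after consensus, the framework calls an update-from-paxos hook on every monitor, which logs the new map and notifies subscribers, and the leader then opens the next pending epoch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy sketch of the service lifecycle the talk describes.
struct ToyService {
  unsigned committed_epoch = 0;
  unsigned pending_epoch = 0;
  std::vector<std::string> log;  // stands in for the monitor log

  void update_from_paxos(unsigned new_epoch) {
    committed_epoch = new_epoch;
    // The real MDSMonitor prints the new FSMap even at low debug levels.
    log.push_back("new fsmap epoch " + std::to_string(new_epoch));
    check_subs();
  }
  void create_pending() { pending_epoch = committed_epoch + 1; }
  void check_subs() { log.push_back("notify subscribers"); }
};

// Drive one consensus round the way the framework would on the leader.
inline void commit_round(ToyService& svc) {
  svc.update_from_paxos(svc.pending_epoch);
  svc.create_pending();
}
```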
A: So the next code to look at: as I mentioned earlier, the beacons are what drive most of the state changes that are automatic in the MDS monitor, and these come from the MDSs. The first method that's going to be called whenever there's a beacon is this preprocess beacon method, and that happens both on the peons and on the leader.
So, as the name of the method suggests, we're just doing some basic pre-processing on the beacon prior to actually making any changes that would be a result of that beacon. For example, here's a simple permissions check that even the peons can do: if it's not an MDS with the right cap, then the beacon's got insufficient privileges and we're just going to ignore it. And here we're going to check that the fsid for the Ceph cluster matches what the message has: is this an MDS that's actually supposed to be part of a different Ceph cluster, things like that. And then some compat checks. For the most part, these result in just ignoring the message.
The peons don't reject messages, and this is actually sometimes a problem within the mons: if something goes wrong, the monitors just ignore it, and so what you'll end up seeing is a client that hangs, because it's expecting a response from the monitors that it's never going to receive. This can be either a good thing or a bad thing, depending on who you ask; confusing hangs, in my opinion, are kind of a bad thing. But that's something that's kind of prevalent in the monitors: a lot of the default behavior for handling a problem is to just ignore it.
So you'll see that. And then here's the key bit in this preprocess method: if it's not the leader, if this monitor is a peon, then we're just going to return false, and that just indicates to the caller, the PaxosService... Let me see if I still have this code.
Yeah, so preprocess query is one of the MDSMonitor's methods, and you can see it just calls preprocess beacon below if it's an MDSBeacon message. When the PaxosService calls this preprocess query, if the return value is true, that means the message was processed and there is no further work necessary as far as pre-processing for the message. But if it returns false, which for the peons will generally always be the case, then the beacon needs to be forwarded to the leader, and that is handled here in the PaxosService code of the monitor: if it's not the leader, then it's just going to forward that request to the leader, so further processing will happen on the leader.
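The true/false contract just described can be sketched in a few lines. This is a simplified model with hypothetical types, not the real signatures: a return of true means "fully handled during pre-processing" (including silently dropping a bad message), while false means the message needs the prepare phase, which a peon satisfies by forwarding to the leader.

```cpp
#include <cassert>
#include <string>

// Sketch of the preprocess/prepare split (hypothetical types).
struct Message { std::string type; bool valid_fsid = true; };

enum class Outcome { Dropped, Answered, ForwardToLeader };

inline bool preprocess(const Message& m, bool is_leader) {
  // Cheap checks every monitor can do: wrong cluster fsid => ignore.
  if (!m.valid_fsid) return true;  // "handled" by silently dropping it
  // A peon cannot make state changes, so it reports "not handled".
  if (!is_leader) return false;
  return false;  // leader: fall through to the prepare phase for writes
}

inline Outcome dispatch(const Message& m, bool is_leader) {
  if (preprocess(m, is_leader))
    return m.valid_fsid ? Outcome::Answered : Outcome::Dropped;
  return is_leader ? Outcome::Answered  // leader runs the prepare phase
                   : Outcome::ForwardToLeader;
}
```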
So if you're trying to debug issues with changes to the MDS monitor, you generally always need to be looking at the leader and not any of the peons, because all the beacon processing and all the file system commands are eventually going to be processed on the leader.
A: So finally, when the leader does get the message, it's also going to go through this preprocess business and then do some basic checks on the beacon. For example: if the gid of the MDS does not even exist in the current FSMap, and it's not asking for the boot state (that is, it's not a new MDS)... That could happen periodically, for example, if the MDS gets kicked out for being laggy because there was a network partition, and then it comes back 20 minutes later thinking it's still in the MDSMap; the monitor says, no you're not, and it sends it an empty MDSMap to cause it to reboot.
That is this method; we're just getting a const version, a const reference, to the current FSMap, so it can't be modified, and we're not calling get pending fsmap. We only call the writable getter where needed, and you can actually see exactly where in the MDSMonitor we get this reference: it's not in preprocess beacon. We're only using the existing FSMap, so we're not making any changes to the FSMap in this method; we're just doing basic pre-processing on the beacon. For example, if the current version of the MDS is laggy, we're definitely going to note the beacon.
And you'll see this called in a few places in MDSMonitor; it's basically just doing some internal bookkeeping, the monitor noting that the MDS has been seen recently and should not be considered laggy as of this time. The interesting bit about that being in the pre-processing is that pre-processing is done, I believe, with fast dispatch, so it's done basically as soon as the message is right off the wire. That's important to prevent an MDS from being marked laggy and removed from the cluster just because the monitor is under load: maybe it's doing a lot of work of some kind, it's getting a lot of messages, something is slowing down the monitor. You don't want your MDSs to get kicked out just because the beacon messages are not being processed quickly enough.
So that's one of the functions of pre-processing, and you'll notice, especially if you look back in the history of the MDS monitor code, that there have been numerous changes to try to avoid that particular situation of the MDS monitor falsely believing that the MDSs are laggy or disconnected and then removing them.
So that's been a recurring issue in the MDS monitor code. And then there's various code here to check if there's going to be a state change; eventually, if there is, we're going to actually send an ack message back and then finally do some further processing on it, and the further processing is going to be done in prepare beacon.
So this is where we're actually going to make state changes to the FSMap, potentially, and because of that, we're going to get a writable reference to the next FSMap; that's what this pending reference will be.
A: Right, and so there are a few things prepare beacon is doing. It's going to record the health checks from the beacon: every time the MDS sends a new beacon, it will have all of its health checks in it. For example, if it's got an oversized cache, or a client is not releasing its caps fast enough, you will see those messages in these, what we call, MDS health metrics; and the monitor is going to look for differences and actually note those, for example, in the cluster log: for example, this MDS health message was cleared. All that work happens here early on.
And then, finally, the MDS monitor is going to look for state changes. So, is this going to be a brand new MDSMap? When an MDS first boots, it's going to send the state boot to the monitor, and here we're checking one condition, mds enforce unique name, which is by default true.
So if there's another MDS with the same name as this one that's claiming the boot, then we're just going to go and find the old MDS instance with that name, and we're going to kill it. As part of killing it, you have to blocklist it, so here we're waiting to see if the OSD monitor's OSDMap is writable.
If it is, we're going to blocklist it; that happens in fail mds gid. But if it's not writable (and that just corresponds to machinery in the monitors about whether or not you can do a write at the moment to the pending OSDMap), then we're going to wait for it to be writable and then retry this beacon message. But here we actually fail the instance with that name.
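The "wait for writable, then retry" pattern just described can be sketched as a callback queue. This is a minimal model with hypothetical names, not the real OSDMonitor interface: when the pending OSDMap cannot take a blocklist entry right now, the handler parks a retry callback that replays the whole operation once the map opens up.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the wait-for-writeable machinery (hypothetical names).
struct OsdMonSketch {
  bool writeable = false;
  int blocklist_entries = 0;
  std::vector<std::function<void()>> waiting_for_writeable;

  void wait_for_writeable(std::function<void()> retry) {
    waiting_for_writeable.push_back(std::move(retry));
  }
  // Called when a proposal completes and the pending map opens up again.
  void on_writeable() {
    writeable = true;
    auto cbs = std::move(waiting_for_writeable);
    for (auto& cb : cbs) cb();
  }
};

// Try to blocklist a daemon; park a retry if the OSD map is busy.
inline bool blocklist_or_wait(OsdMonSketch& osdmon, int addr) {
  if (!osdmon.writeable) {
    osdmon.wait_for_writeable([&osdmon, addr] {
      blocklist_or_wait(osdmon, addr);  // replay the whole operation
    });
    return false;
  }
  ++osdmon.blocklist_entries;
  return true;
}
```

Replaying the whole operation, rather than resuming mid-way, keeps the retry path identical to the first attempt, which is why the monitor re-runs the beacon message instead of saving partial progress.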
So there's various bookkeeping here for the beacon. Here's the else branch for whether it's a booting MDS, and that means that there's some kind of state update eventually. So here, again, we're checking if the gid exists in the pending FSMap. Maybe, for example, an mds fail command was issued and the MDS got removed from the next FSMap, and at the same time as that's occurring, the monitor is also looking at a beacon from that MDS.
So if, in the pending FSMap, that MDS has also been removed, then it's going to hit this code path, and it's going to need to wait for the current FSMap to be distributed amongst all the monitors, for it to reach consensus; and then it's going to execute this lambda context to send an empty MDSMap to the MDS that sent that beacon.
That would be the situation where we would hit this code path. And then, finally, some of the interesting bits. So here we're going to clear the laggy state if it was laggy, and, moving on, here's where we're handling some of the various states. So, if the MDS is stopped, that means that the rank is down: the MDS was in the stopping state and then it says it's stopped.
It's finished; so here we're going to call this FSMap stop method on this gid, and the FSMap will do some bookkeeping to record that the rank is stopped; and then we just remove the stopped gid from various bookkeeping structures in the MDS monitor. And here is an indication that the rank is damaged.
So again, we're also going to blocklist the MDS: we're going to check to make sure that the OSD monitor's OSDMap is writable, and then finally we're going to mark that the rank is damaged, here in the pending FSMap. And I'm looking for where we blocklist them... and maybe we don't, because it's presumably going to shut down on its own. So I'm not sure why this check was here to begin with; that might just be unnecessary.
Here we go: we're calling the OSD monitor's blocklist method directly on the addresses of the MDS that sent us the damaged notification, and that's just to handle the case where the MDS sends the damaged message but then somehow continues operating; the monitors want to make sure that it's dead and does not exist. We would get this state if, for example, the MDS was failed, or rather, terminated.
So it got some signal, like SIGTERM, and the MDS is just going to tell the monitors: hey, I'm going away and I'm not coming back. That allows the monitors to immediately do a replacement, rather than the MDS just going away and then the monitors having to wait for the full MDS heartbeat grace time period, which I believe is by default 15 seconds.
Excuse me. And then, at that time, it would do the replacement. So instead, the MDS immediately just sends off a beacon to the monitor saying, I no longer exist; the monitors are going to blocklist it and then let the MDS know that it got the message, so it sends a beacon back saying that it got it, and then the MDS would finish shutting down. The next big one, just scrolling down a bit, is this one.
A: So we're going to check that the MDS is not currently in the standby state, and that the state the MDS is requesting is not equal to its current state; and then we're going to make sure that the state transition is valid. If it's not valid, then we're going to make a note of it.
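The validity check just mentioned amounts to a lookup in a table of allowed transitions. The table below is a hypothetical, much-simplified version for illustration; the real set of MDS states and allowed transitions lives in the Ceph source and is larger.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified table of allowed MDS state transitions.
static const std::map<std::string, std::set<std::string>> kAllowed = {
  {"up:boot",      {"up:standby"}},
  {"up:standby",   {"up:replay"}},
  {"up:replay",    {"up:resolve", "up:reconnect"}},
  {"up:resolve",   {"up:reconnect"}},
  {"up:reconnect", {"up:rejoin"}},
  {"up:rejoin",    {"up:active"}},
  {"up:active",    {"up:stopping"}},
  {"up:stopping",  {"up:stopped"}},
};

// Mirror the guard described above: ignore no-op requests, and reject
// transitions that are not in the table.
inline bool transition_ok(const std::string& cur, const std::string& req) {
  if (cur == req) return false;  // no change requested
  auto it = kAllowed.find(cur);
  return it != kAllowed.end() && it->second.count(req) > 0;
}
```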
And then not do anything. And then here we're going to modify its state according to whatever it requested; and then, finally, after making any changes to the pending FSMap, we have to wait for the finished proposal: you can't tell the MDS about its change to the state of the FSMap until it's actually complete.
The reason for that is that all of the blocklisting and such is done by adding to the blocklist of the OSDMap. So if I want to blocklist an MDS so it can't access the OSDs anymore, I need to update the OSDMap by adding its addresses to the OSDMap blocklist section; and so, whenever I need to blocklist an MDS, I have to wait for the OSD monitor to be writable.
D: Okay, okay; so because MDSs are, yeah, also OSD clients, you want to blocklist them. Okay, right.
A: All right, the next one is prepare command; this corresponds to prepare beacon. So here we've got a MonCommand, and this is basically a message wrapping up some API request to the monitors.
This would be the message that carries, for example, an mds fail command, or fs new, or fs set; any of those commands get wrapped in this MonCommand message. Here we're getting the command map from the JSON, getting the prefix of the command (which would correspond to, say, mds fail), getting the session, and ensuring it has sufficient access; and then, here again, we're getting the pending FSMap, because as part of doing these commands, we're going to have to make changes to the FSMap. And here we go through a number of handlers to see if we can process those messages there.
These are going to correspond to what's in the FSCommands class, and we'll get to that in a moment; this is just some common handling, an abstraction for handling those commands. And if we can't, then we go to this poorly named method, filesystem command, which has really just become an MDS command method.
If we go to that, you'll see most of these commands that we're handling; if the prefix is, for example, mds set state... some of these are just development commands that aren't worth talking about. Let's talk about mds fail, because that's something people actually run. So here we're getting the role-or-gid argument, and then getting the gid from that argument by applying this gid-from-arg helper method in the MDS monitor; and if the gid doesn't exist, we complain that it does not exist; various checks; and then, finally, we fail the MDS, which is something we call throughout the MDS monitor.
And fail mds, I believe, will return EAGAIN if the OSD monitor is not writable; and if that occurs, then we wait for it to be writable and retry this message. So the check for the writable OSDMap happens in fail mds in this case. All of these just make changes to the pending FSMap. All right.
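The command handling just walked through is, in essence, a prefix-keyed dispatch table. This is a generic sketch with hypothetical names, not the real monitor code: the prefix parsed from the command JSON selects a handler, and an unknown prefix returns an error.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Sketch of prefix-based command dispatch (hypothetical registry).
using Handler = std::function<int(const std::string& arg)>;

struct CommandTable {
  std::map<std::string, Handler> handlers;

  int dispatch(const std::string& prefix, const std::string& arg) const {
    auto it = handlers.find(prefix);
    if (it == handlers.end()) return -22;  // -EINVAL: unknown command
    return it->second(arg);               // 0 on success
  }
};
```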
A: So this is a method in the MDS monitor that's just periodically called, approximately every five seconds, I believe, by the Paxos service. It just lets you do basic upkeep on the FSMap. So here, we're again getting the pending FSMap writable handle, and we're going to do a number of checks on it. For example, here we're going to check health; this is just going to make sure that the MDSs are healthy.
Make sure it's fully populated. So you'll see again, as I said before with the beacon: noting when to remove an MDS because it's timed out is actually fairly complicated, and there have been a number of changes over the last decade to the monitor to handle various corner cases of MDSs going laggy.
So that's all handled here. If one of them is laggy, then we're just going to remove it; so we have this vector of gids to remove, and you'll see the various notes: for example, this one's being marked laggy, this one is going to be removed. And then here we actually go through the to-remove vector, find replacements for those MDSs, and drop them.
A: The current MDSMap is from here: we get the FSMap, get the file system, and get the MDSMap. We want to actually check if the current epoch of the FSMap is resizeable, and then the MDSMap corresponds to the pending FSMap. So, you see, this fsmap is the writable handle here, and then this method also wants to look at the current epoch as well, so it gets a handle to the current MDSMap. So this is the pending fs, and then this is the pending MDSMap.
The names could perhaps be improved. So we're just making sure that either the current FSMap is resizeable or the pending FSMap is resizeable; if neither is true, then we're going to say that the MDSMap is not currently resizeable, and we're not going to make any adjustments to the number of MDSs in the cluster based off of max mds.
Otherwise, we look at, for example: if the number of MDSs is less than max mds, then we're going to try to grow the cluster, and here we would find a replacement by asking it to fill in this rank. And if the FSMap is able to do that, then we're going to promote it: find a replacement for the rank, promote it to that given rank, and then we're done. The MDS monitor, at one point, actually allowed you to promote several MDSs.
So if I set max mds to 10, it would try to promote, like, nine MDSs all at once, to all those ranks, if it had sufficient standbys available. Behavior like that turned out to be a source of bugs, so at one point we changed it so that things happen in a sequential fashion. Now the MDS monitor will only promote one rank, and then it's going to wait for all the ranks to be active, and only then is it going to add one more MDS to the cluster. And that's also true of stopping.
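The sequential scaling rule just described can be sketched in a few lines. This is a hypothetical model, not the real FSMap logic: on each tick, promote at most one standby, and only when every existing rank is already active.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A rank's state: "active", "creating", etc. (illustrative model).
struct Rank { std::string state; };

// Promote at most ONE standby per tick, and only when every existing
// rank is already active; returns the number of ranks added (0 or 1).
inline int maybe_grow(std::vector<Rank>& ranks, int max_mds, int standbys) {
  if ((int)ranks.size() >= max_mds || standbys <= 0) return 0;
  for (const auto& r : ranks)
    if (r.state != "active") return 0;  // wait for in-flight promotions
  ranks.push_back({"creating"});        // promote exactly one standby
  return 1;
}
```

Growing one rank at a time means a single misbehaving promotion stalls the process visibly rather than leaving nine half-promoted ranks, which is the class of bug the sequential change was meant to eliminate.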
A: The MDS monitor will only stop the largest rank and wait for it to fully stop, and then it moves on to the next highest rank and stops that rank. So that happens in this method, maybe.
A: And then, finally, we go through again and look to maybe promote any standbys for failed ranks. So if a file system has a failed rank and it's waiting for a standby to promote, that would happen in this method; and if, as part of doing this, we also have to propose a new OSDMap (that would happen, for example, if we failed an MDS, since we then need to also do a blocklist and update the blocklist), we also have to request the proposal to the OSD monitor.
So here, this code is looking to see if we can... I think this was done in, like, 2011. It's going to update the clients of the monitors with the new FSMap; so any time there's a change to the MDSMap, the peons and the leader of the mons will send out new MDSMaps to the clients. The code path that's going to get hit most often here is this one.
We have a sub that's requesting the MDSMap. So: is it a client? Has it requested a particular namespace? If you recall, in the client we referred to file systems as namespaces, so that terminology leaked into the MDS monitor at the time all that code was written. And here it's going to look up the file system and then find the fs id, the FSCID; that's what this code path is...
These code paths are really looking for the FSCID, and then, once it has it, it's going to send off the MDSMap corresponding to the FSCID. So here is the lookup where it gets the MDSMap, and then, finally, that gets sent in this MMDSMap message and shipped off to whoever is asking for this subscription.
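The subscription path just described boils down to a two-step lookup. This is a toy model with hypothetical types, not the real FSMap API: resolve the file system name to its FSCID, then fetch the MDSMap stored under that FSCID, and the result is what gets wrapped in the MMDSMap message.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy model of serving an MDSMap subscription (hypothetical types).
struct MDSMapStub { unsigned epoch = 0; };

struct FSMapStub {
  std::map<std::string, int> name_to_fscid;  // fs name -> FSCID
  std::map<int, MDSMapStub> mdsmaps;         // FSCID -> MDSMap
};

inline std::optional<MDSMapStub>
lookup_sub(const FSMapStub& fsmap, const std::string& fs_name) {
  auto it = fsmap.name_to_fscid.find(fs_name);
  if (it == fsmap.name_to_fscid.end()) return std::nullopt;
  auto mit = fsmap.mdsmaps.find(it->second);
  if (mit == fsmap.mdsmaps.end()) return std::nullopt;
  return mit->second;  // this is what gets wrapped in an MMDSMap
}
```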
A: All right, so we've talked about beacons a lot, so I thought we'd finally look at what a beacon is, and there's not a lot to say; it's a fairly simple message. Probably the most complicated part about this message is the number of health warnings that we've got in here, these, what we call, MDS metrics; these get shipped off to the monitors.
The main bits that are interesting here are the name of the daemon (every MDS has a unique name) and the daemon state. This would be what the MDS is saying it wants its next state to be; in the steady-state general case, it's going to be asking for the active state repeatedly and nothing will change. And here are the MDS health metrics that it includes.
And then also the file system for the MDS, or the file system name; and then one last thing, the sequence number, which is going to get bumped every time the MDS sends a new beacon to the monitors. It sets the sequence number so that it can keep track of what the monitor has seen so far, and that's one of the ways you can tell if it's got a laggy connection with the monitors.
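The beacon payload just described can be sketched as a struct. The field names here are illustrative, not the real MMDSBeacon wire format: a unique daemon name, a desired next state, the file system name, the health metrics, and a sequence number bumped on every send so the daemon can compare what it sent against what the monitor has acknowledged.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Rough shape of the beacon payload (illustrative field names).
struct BeaconSketch {
  std::string name;        // unique daemon name
  std::string want_state;  // e.g. "up:active" in steady state
  std::string fs_name;     // file system this MDS serves
  std::map<std::string, std::string> health_metrics;
  uint64_t seq = 0;        // bumped on every send
};

struct BeaconSender {
  uint64_t last_seq = 0;
  uint64_t last_acked_seq = 0;

  BeaconSketch next(const std::string& name, const std::string& state) {
    BeaconSketch b;
    b.name = name;
    b.want_state = state;
    b.seq = ++last_seq;
    return b;
  }
  void handle_ack(uint64_t seq) { last_acked_seq = seq; }
  // A growing gap between sent and acked sequence numbers is one sign
  // of a laggy connection to the monitors.
  uint64_t unacked() const { return last_seq - last_acked_seq; }
};
```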
A: And then there's also the Beacon class. This is a class that operates outside of the MDS lock, for the most part, and it's just going to be doing things like handle mds beacon: whenever it gets an acknowledgement from the monitors for its beacon message, it gets an MMDSBeacon back, and here it's just going to note things like the sequence timestamp the monitor sent back, and detect whether it's no longer laggy. For the most part, beacons are sent through this send method.
This send method; and predominantly these are sent by this sender thread that's initialized in the Beacon init, and so every interval seconds it's going to call this send method and then just wait to send the next one. So again, this operates outside of the MDS locks; you'll see regularly in the MDS logs that the MDS's Beacon class is just sending off a new beacon to the monitors, and here it notes the sequence it's sending, updating the last sequence number, setting health metrics on the beacon.
A
And again, as I was telling you, we're trying to avoid delays: if the MDS or the monitors are under load, we don't want that to delay beacon processing. So you'll notice (let me find it, all right) here's fast dispatch: whenever we get a beacon response from the monitors, it's going to fast-dispatch this handling. And that's one of the reasons this operates outside of the MDS lock, because you should not acquire mutexes as part of fast dispatch.
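The constraint can be illustrated with a minimal sketch (Python, illustrative only; lock names are assumptions): the fast-dispatch path only touches the beacon's own short-lived lock, never the big MDS lock, so the messenger thread can never block behind a long-running MDS operation.

```python
import threading

class MiniMDS:
    """Illustrates the fast-dispatch rule: beacon-ack handling must not
    take the big mds_lock, only the beacon's own briefly held lock."""

    def __init__(self):
        self.mds_lock = threading.Lock()      # may be held for a long time
        self.beacon_lock = threading.Lock()   # only ever held briefly
        self.last_acked_seq = 0

    def fast_dispatch_beacon_ack(self, seq):
        # Runs on the messenger thread. Taking mds_lock here could stall
        # the messenger behind a busy MDS, so we only take beacon_lock.
        with self.beacon_lock:
            if seq > self.last_acked_seq:
                self.last_acked_seq = seq

def demo():
    mds = MiniMDS()
    with mds.mds_lock:  # simulate the MDS being busy under its big lock
        # The ack is still processed because fast dispatch avoids mds_lock.
        mds.fast_dispatch_beacon_ack(7)
    return mds.last_acked_seq
```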
A
So again, when I was going through the MDS fail handling, I noticed there were a bunch of handlers that could handle a given MCommand message.
A
A lot of the file system handlers were moved to this FSCommands class, and that was just to simplify a lot of the repetitive code, like batching proposals and actually requesting a new proposal. All that code was abstracted out, and we have these classes now.
A
They can say whether they handle a given message, or whether an op is even allowed. That's perhaps the interesting one here, is_op_allowed: this was based off of Rishabh's recent work to add
A
authorization to the cephx caps that say which file systems a client has access to, and can even see the MDS map for. And that's here: we're actually getting a copy of the current FSMap and we're filtering it based off of what the session is allowed to see, so it can only see a given file system name.
A
This filter method filters out all the other file systems, and then after that we check whether it has access to that file system by trying to get it. If the file system doesn't exist, perhaps because it was filtered out, then we return an error message that the file system is not found, with some exceptions, for example if it's an fs rm command.
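The filtering logic described here can be sketched like so (a Python illustration of the idea; the real code is the FSMap filter method and the FSCommands access checks, and the function names below are made up):

```python
def filter_fsmap(fsmap, allowed_fs_names):
    """Return a copy of the fsmap containing only the file systems
    the session's caps allow it to see."""
    return {name: fs for name, fs in fsmap.items() if name in allowed_fs_names}

def get_fs_or_error(fsmap, allowed_fs_names, fs_name, command="fs status"):
    """Filter first, then try to get the fs; a filtered-out fs is
    indistinguishable from a nonexistent one, so caps don't leak names."""
    filtered = filter_fsmap(fsmap, allowed_fs_names)
    fs = filtered.get(fs_name)
    if fs is None and command != "fs rm":
        return None, "Error ENOENT: filesystem '%s' not found" % fs_name
    return fs, None
```

Note the ordering: the copy is filtered before any lookup, so the handler below never even sees file systems the session cannot access.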
A
So that's one of the examples of abstracted handling there. And just to look at a given file system handler: usually, when writing a new one of these, it's mostly just an exercise in copy and paste; they're templated pretty well. So here we're checking if the OSD monitor is writable, because of what this fs fail command is going to do, if you're not familiar with it
A
already: it's going to take every rank in the file system, both the rank itself and any standby-replay daemons for that rank, and it's going to remove them all from the MDS map, effectively failing them.
A
So as part of doing that, you have to blocklist them, so it's going to check that the OSD monitor is writable. If it's not, then we have to wait before we execute this command.
A
If it is writable, then we're going to get the file system associated with this fs name, and then we're going to mark it not joinable, so no new MDSs can join the file system. That would prevent, for example, the MDS monitor tick method, which goes through the file systems looking for failed ranks, from promoting any standbys to a given rank, because the file system is marked not joinable.
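Putting those steps together, the shape of the fs fail flow is roughly the following. This is Python pseudocode of the flow just described, under my own assumed names, not the actual C++ handler:

```python
def handle_fs_fail(osdmon_writable, fsmap, fs_name, blocklist, retry_later):
    """Sketch of the fs fail flow: wait for a writable OSD monitor,
    mark the fs not joinable, then fail every rank and standby-replay."""
    if not osdmon_writable:
        # Failing daemons requires blocklisting them via the OSD monitor,
        # so if it isn't writable we must retry the command later.
        return retry_later()

    fs = fsmap[fs_name]
    fs["joinable"] = False  # stop tick() from promoting standbys

    # Snapshot the GIDs first, then remove, so we never mutate the
    # mds_info map while traversing it.
    gids = list(fs["mds_info"].keys())
    for gid in gids:
        blocklist(gid)            # blocklist the daemon's session
        del fs["mds_info"][gid]   # remove it from the MDS map
    return "failed %d daemons" % len(gids)
```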
A
The interesting thing here, and this is really a C++ thing: here we're getting a vector of the GIDs we're going to fail, pushing them back into the vector. The reason for doing that is that you don't want to modify the map that's returned by get_mds_info by removing MDSs during the traversal.
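The same caveat applies in any language: erasing from a container while iterating over it is undefined behavior in C++ and a runtime error in Python. A minimal illustration (Python, my own example, not the Ceph code):

```python
def fail_all(mds_info, should_fail):
    """Remove matching entries from mds_info safely: snapshot the keys
    first (the analogue of copying GIDs into a vector), then erase."""
    to_fail = [gid for gid, info in mds_info.items() if should_fail(info)]
    for gid in to_fail:
        del mds_info[gid]  # safe: we iterate the snapshot, not the dict
    return to_fail

# The naive version fails at runtime:
#   for gid in mds_info:
#       del mds_info[gid]  # RuntimeError: dict changed size during iteration
```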
A
So, just some of the interesting history that happened in the last 10-ish years. In 2020 we added MDS affinity, the mds_join_fs config that MDSs get set with. That's being used both in cephadm and Rook to indicate which file system an MDS has been created for, and that's just to simplify the previous situation we had: cephadm was creating MDSs with very specific names, and the names also included the file system that the MDS was created for, and it was really weird to have
A
those MDSs join other file systems; it just looked like a mix, and it was really hard to track whether or not you had a stray MDS and all that. And you also may want to have MDSs with different hardware for a given file system.
A
That was to simplify some config settings we had, like where you had to specify that a given MDS was allowed to do standby replay, and that was really confusing because generally you wanted to set that on all of them, and really it made more sense as a file system flag. So this file system flag, allow_standby_replay, was added. In 2018 we added the paxos FSMap, so the get_writable_fsmap methods and all that; that's when those protections were built in.
A
In 2018 we added (Doug actually did a lot of this work) the incremental max_mds-controlled deactivation, so we don't stop a bunch of ranks at once. If you reduce max_mds to one from, say, five, you don't try to stop ranks one through four all at once; you do it in an incremental fashion.
A
That proved to be an extremely big stability win for multi-MDS. From 2017 to 2018 you'll see a lot of changes in the MDS monitor trying to fix some issues we kept seeing in upstream QA, with beacons being lost or not being recorded properly.
A
So we had tons of messages and weird states about MDSs being replaced because they're laggy, so there was a lot of code churn there, and I think there are still maybe a few latent bugs, especially when the MDS is talking to a peon monitor: some of the beacons are not being processed quickly enough for some workloads, and that's causing the MDSs to be removed falsely. So I think there are some latent bugs that need fixing. 2017: last-rank deactivation.
A
Oh, I may have these mixed up: 2018 was the incremental max_mds-controlled activation, I should say. So if you increase max_mds, it only increases MDSs incrementally; last-rank, I think, was the stopping change. Anyway, it's a mix of code there. And then in 2017 the FSCommands class was added, and that corresponded roughly with the 2016 changes that John did to add multiple file system support. So all that was done fairly recently.
A
In 2013, allow_new_snaps. This was in response to a number of bugs that were suspected with snapshotting in file systems, so Greg Farnum added the allow_new_snaps setting on file systems to indicate that new snapshots were allowed to be created on a file system. And we also used that setting to detect if snapshots were ever allowed on a file system, because we needed to do certain upgrade checks.
A
There were certain upgrade sequences required to upgrade a file system that had had snapshots at some point. In 2011, standby replay was added; it was fairly long ago that this feature came to CephFS, and only recently, I think, did it become more usable: with this allow_standby_replay setting it became much easier to set up. And then in 2009, subscriptions: Ceph has always had some kind of subscription to the MDS map by necessity, but a lot of the code to create subscriptions dates back to then. That's about
A
as far back as I went. And if you look at the code of the MDS monitor, you'll see at least half of the commits are trying to fix weird consensus bugs, or beacons being lost in certain code paths or not being recorded properly, so the monitors falsely believe an MDS is gone.
A
Okay, all right, that's it for the slides. Any questions?
A
Yes, absolutely. All the monitors have their own store where they keep the MDS maps, with some lookback; eventually things get garbage collected after a while, so it doesn't remember every MDS map it ever created.
A
The actual machinery for storing the MDS map is handled at a higher level than the MDS monitor, though; it doesn't need to worry about the details of saving the MDS map to persistent storage, which is handled by other code. And yeah, that's all I'll say about that. Does that make sense, or are there any other follow-up questions?
D
And the other question I had is: when you make any changes to the FSMap and you get a new version of the FSMap, the pending one, are the MDSs automatically subscribed to it, and is that how they know about the change? Because they keep subscribing, or the monitor sends those messages to the MDSs saying that the map has been updated?
A
So there are different kinds of subscriptions. One is that you might just want the next epoch and you don't care about any follow-ups, and then there are other subscriptions where you want to be continually updated for all versions, and the MDSs will do the latter.
A
I haven't looked at that code very hard, so I may be wrong on the details, but I'm pretty sure that's how it works. It may be that every time you get an epoch, a new MDS map, you have to ask for the next one immediately, but I think, in order to reduce the traffic on the mons, that was probably changed to just be automatic.
A
Sorry, Craig is not very happy right now. Any last questions?
A
All right, thanks everybody for attending the walkthrough. I'll see you all tomorrow. Bye.
Thanks, Patrick, nice talk.