From YouTube: CephFS Code Walkthrough: MDS Journal Machinery
All right, so that's it. So the MDLog — the metadata journal for the MDS — is the primary holder of all metadata mutations that the MDS makes as it goes. Any change that needs to be durable that the MDS makes will end up in the MDS journal, with a little asterisk next to that; we'll get into it later.
Some of the things the journal contains are the recent metadata mutations. That's the obvious stuff: if I do a chmod, or if I create a new directory or move a directory, any of those events will always end up in the MDS journal.
The journal also tracks which clients were connected: the MDS needs to know all the different clients that were connected so that it can make sure those clients reacquire their caps and reestablish state with the MDS, and the MDS will also move on from the state of waiting for clients to reconnect once it knows that all of them have connected.
It's also important, to improve MDS failure-recovery time, that the MDS knows when all the clients have actually reconnected. You can compare that to, say, NFS, where it has to wait for the full grace period whenever it does a failover, because it doesn't know which clients were connected.
We also have the purge-state synchronization bits: the purge queue operates somewhat independently from the MDS journal, but periodically there are certain events that need to be journaled so that the two stay in sync. Then, finally — and this is the big one, especially recently — there's the subtree map.
And then, formerly, it also logged which files were opened by clients. That was actually something that was changed by Zheng two or three years ago: he added the new open file table, and that was to improve MDS failover times and to reduce the size of the MDS journal.
If you had a large number of clients and they had a large number of files opened, the MDS would spend a lot of time logging to its journal which files were being opened by clients, so that was all moved out into a separate set of objects to reduce the load on the journal.
Okay, so the MDS journal as a data structure is important for the MDS because it's a way to do serial writes to a conceptually large object on top of the distributed data store that is Ceph. Obviously in Ceph you can have billions of objects, and they can be scattered all over the OSDs.
But actually knowing which objects to read and write is something you have to code up, and the journal represents that. The journal is, in a lot of ways, a circular buffer.
It occupies a finite set of potential objects, and the reads and writes go to a certain subset of those objects. The journal is constantly growing towards the end, and the trimming is constantly catching up with it, and all of that is managed in a separate bit of code in the osdc library — we'll get into it in a moment — which handles all the details of that data structure.
The MDS journal — especially if you read some of the academic literature associated with Ceph — behaves very similarly to, or I should say the genesis of the idea of the journal began with, the log-structured file system. So if you want to learn about journaling file systems, you would look back to that old paper.
That evolved into the idea of keeping all the file system state in regular data structures like B-trees or whatever, and then having a separate journal for making sure data is durable and also to allow for recovery. The MDS follows that philosophy. The directories are stored in individual objects on RADOS, and we store all the directory hierarchy state in those objects, but we also have the MDS journal, which provides the fault tolerance and the ability to do failover. It also provides a kind of hot cache for the MDS that is taking over, so that it can load the metadata most recently used by the clients and get its cache hot by the time it finally turns active.
Okay, so let's move on to how it's structured. Again, I said that the MDS journal goes out to a series of objects; you can kind of think of it as a circular buffer, but only part of the buffer is ever allocated on disk.
We call them journal segments, and the MDS only ever writes out an entire segment. Each segment is a set of log events that have been grouped together, and the MDS must flush the journal any time, say, a client requested a sync, someone executed the 'flush journal' command on the MDS, or some locking logic requires it. There's no minimum number of log events that need to be written out; the MDS always tries to delay writing out the journal segment as long as possible. There is a periodic five-second sync in the MDS, so eventually metadata does become durable, even if the load on it is extremely light. And once the log object becomes full of journal segments — yeah, that's all there is to say about that.
All right. And then, as far as where the journal is laid out, we have this concept of what's called a journal pointer. It's actually a fairly new concept in CephFS — it was added in about 2014 by John Spray.
For those of you who don't know John Spray, he was the previous team lead for CephFS, from back in 2015 to 2017, and he had been working on Ceph and CephFS for a while at the time. The journal pointer was added for — well, I'll get into that later, but the journal pointer is just a pointer to where the current MDS journal is, and the reason it exists is to facilitate doing things like disaster recovery on the journal.
If you need to write out a new journal — you want to set up a new series of objects that the MDS should write its journal out to — the journal pointer exists to atomically update where the MDS journal is. You can kind of think of it as a double pointer.
And in the xattr information for that object you can find the current journal, so it will point to the current journal head. For most Ceph installs, especially new installs, the journal head will be at 0x200.
However, it can also be put in a different location, and we'll see that for the long-running cluster it's actually at the 0x300 location — so this would be 0x300. If you're doing some kind of recovery event, or reformatting the journal (which has happened in the past), the journal pointer would be updated.
And that's all you ever see. As far as the actual on-disk information stored with the journal pointer, it's extremely simple. You can use a rados command-line invocation to read the journal pointer, and it's just these two fields, the front and the back; here the front corresponds to 0x300. This output is actually coming from the long-running cluster.
I said the xattr — it's actually stored in the data blob of the 0x400 object. Here we're importing it as the JournalPointer type and decoding it as JSON, and again we can see that it's at 0x300. The long-running cluster has, as its name would suggest, been around for a very long time.
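
For reference, the kind of invocation being described here looks roughly like this — a sketch assuming a metadata pool named cephfs_metadata, MDS rank 0 (so the pointer object is 400.00000000), and that your build's ceph-dencoder has the JournalPointer type registered:

    # Fetch the rank-0 journal pointer object (0x400 + rank) from the metadata pool
    rados -p cephfs_metadata get 400.00000000 journal_pointer.bin

    # Decode it: "front" is the current journal inode (0x200 on a fresh cluster,
    # 0x300 on the long-running cluster), "back" is normally unused
    ceph-dencoder type JournalPointer import journal_pointer.bin decode dump_json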
We've been constantly upgrading it, and it has served as a nice way to catch bugs that may occur with the legacy data structures that can exist anywhere in Ceph — in the OSD maps, in the monitor stores, on the OSDs themselves, or legacy MDS structures which may exist somewhere in some directory in CephFS. So this cluster has been around for a very long time; it has survived journal reformattings and whatnot.
You can see that, unlike a newer cluster created with vstart, its journal head is actually at 0x300. So if you went looking for the journal on the long-running cluster where you'd expect it — in the 0x200 range — you'd be surprised: it's not there.
All right, and then the journal header. I said that the journal pointer acts like a double pointer; if you look back here, we have the 0x200 objects and potentially 0x300 — I put question marks there; it could be somewhere else.
In particular, where the journal objects are: the journal is logically a large circular buffer, but only a subset of the objects of that buffer are actually physically present. There are millions of different objects that could be in that set, and only some of them are used, so you need to know where they are. That's in this head object, which we can get and then pipe into ceph-dencoder to get the journal header.
This is using the same logic that the ceph-fuse client library uses when it uses the osdc library — the Filer logic concerning where to actually store the data objects for a file: what's the stripe size, what's the object size, which pool ID, which pool namespace. All of that is stored in the layout, and the MDS journal uses the same information to lay out its journal.
Similarly, the expire position notes where the MDS has — literally — expired journal segments. What's an expired journal segment? It's a segment that the MDS has made durable by writing all of the metadata mutations for that segment out to the physical directory objects.
So, for example, if I create a file in a directory and that event is noted in one of these log objects that I want to expire, I have to go out to the directory object containing that file and actually create the file inode in that directory object's omap. Once that's complete, the log segment is eligible to be expired, and once all the log segments in a log object have been expired, I can advance the expire position of the MDS journal.
Now, just because a log object becomes expired doesn't mean you necessarily want to get rid of it. Because remember, I said that the MDS uses the journal to facilitate recovery, but it also uses the journal to maintain a hot cache of recent metadata, so you don't necessarily want to expire and delete everything up to the current write position.
If you're expiring things as fast as you're writing them, you want to keep a certain buffer of recent events so that a new MDS can still get a hot cache.
So the journal keeps track of where the oldest expired object is, and it also keeps this trimmed position — that's what we use to keep track of what has actually been deleted. So there will be a number of objects which are expired and trimmable, a number of objects that are expired but that we don't want to trim yet, and then the current object that we're writing to; and there will actually also be a group of objects which have not been expired
(I didn't show them here) but are complete and fully written. So the trim position is constantly being updated whenever the MDS journal data structure — the osdc Journaler — is trimming old objects, which is really just deleting them; and then there's the expire position, which is constantly being updated as the MDS fully writes out all the changes in the journal to the backing directory objects.
You'll also notice that if you just look at an MDS that's been running for quite a while — you've made a few directories or files, and you wait for those directories to show up in RADOS — you'll be waiting a very long time, because the expiring of the log object, and of the log segments in that object, occurs only as needed as the MDS journal grows.
And so those directory objects may not be created until a lot of load is placed on the MDS. If you want to actually see all of those changes flushed to disk, you have to issue a 'flush journal' command to the MDS, through the admin socket interface — or, with Pacific and onwards, through the ceph tell interface — and that will force
all of the current log segments to be expired, meaning that all the changes corresponding to those log segments will actually be written out to the backing directory objects; and then, through some commands to the journal interface, all those segments will also be trimmed.
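
The flush being described can be issued roughly like this (a sketch; the file system name 'cephfs', rank 0, and daemon name 'a' are placeholders):

    # Pacific and later: through ceph tell
    ceph tell mds.cephfs:0 flush journal

    # Older releases: the same command through the daemon's admin socket
    ceph daemon mds.a flush journal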
All right. Another way to look at the journal header — without actually running rados commands and then manually running the decoder — is to just run the cephfs-journal-tool. I said this was new; it's relatively new — it was done in 2014, again by John Spray. He added this tool to allow us to do some kinds of disaster recovery on the journal, but also just to inspect its contents, and so this is a good tool to become familiar with as a CephFS developer,
just to learn how the MDS journal works, and also to learn more about certain types of recovery situations. It's fairly common, especially in community clusters, that something really goes wrong, and if it's really bad they may end up doing a full journal reset, which is just wiping out all of the MDS's journal and creating a new one — so whatever has already been expired from the journal onto the metadata pool is what they get.
And yeah — again, this is from a vstart cluster, and we can see the write position is just a little bit ahead of the expire and trim positions. Those two actually have not been updated yet, because nothing has been expired or flushed from the vstart cluster's journal.
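
The header inspection being shown would look something like this (assuming a file system named 'cephfs' and rank 0):

    # Decode the journal header: write/expire/trim positions, layout, stream format
    cephfs-journal-tool --rank=cephfs:0 header get

    # Overall consistency check of the journal
    cephfs-journal-tool --rank=cephfs:0 journal inspect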
You can also use cephfs-journal-tool to look through the different types of events that are in the journal. Here we do an 'event get list' — I'm just looking at the latest 15 journal entries — and you can see we've written out a subtree map. The subtree map test is something you usually only see in vstart clusters; then new session information markers,
cap updates, those types of things. In the next slide you can see there's a lot of fancy filtering you can do. If you only want to look at the events associated with a particular inode, you can do that with a selector; you can look at particular paths, or fragments (the directory fragments); or, if you just want to see everything that a client's been doing, there's a selector for that as well.
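
A few hedged examples of the listing and filtering being described (the file system name, inode number, path, and client id below are placeholders):

    # Dump a human-readable list of journal events
    cephfs-journal-tool --rank=cephfs:0 event get list

    # Filter to events touching a particular inode, path, or client session
    cephfs-journal-tool --rank=cephfs:0 event get --inode=1099511627776 list
    cephfs-journal-tool --rank=cephfs:0 event get --path=/home/alice list
    cephfs-journal-tool --rank=cephfs:0 event get --client=4361 list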
Here is the reset command I was talking about earlier. This will allow you to do a full wipe of the MDS journal, and it's only done during certain disaster-recovery situations — the journal has gotten corrupted for some reason, an upgrade went horribly wrong; there can be all sorts of reasons. Generally it really should not ever be done, but you'll also notice that community members, when they become desperate, often run this reset command.
So unfortunately it is fairly common to see in mailing-list posts. And then there are these import and export commands, so you can actually export the full journal out to a file and import it back.
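
Roughly, the export, reset, and import operations look like this (sketch only; reset is destructive and loses anything not yet expired to the metadata pool):

    # Save a copy of the journal to a local file before touching anything
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

    # Last-resort disaster recovery: wipe the journal and start a fresh one
    cephfs-journal-tool --rank=cephfs:0 journal reset

    # A previously exported journal can be written back
    cephfs-journal-tool --rank=cephfs:0 journal import backup.bin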
All right. I have, in the slides — if you're following along on the Google Drive, I think I sent out which code we're going to look at — but I can't display both at the same time, so we'll just look at the code in the screen share. Okay, so starting with osdc/Journaler: this is all the code that handles the data-structure logic of the journal.
There's nothing MDS-specific in this library, and again it's used by both the ceph-fuse client library and the MDS, which should make sense, because both actually need to understand how to manipulate the data objects for files.
Now, if you're just in the top-level source tree, you might be looking around for a journal, and you find this directory and think that's where the MDS journal is — but that's actually not where it is. That is code used by RBD; nothing in CephFS uses it, so just be aware of that.
Now, this is actually a really important comment. If you ever have to dig into this Journaler code, I strongly suggest you fully understand this comment before doing anything. The main point — just to summarize what it talks about — is that the current trim position, the current expire position, and the current write position are not updated synchronously with wherever the current write object or expire object actually is; they're written out periodically by the Journaler.
The Journaler just assumes that these are moderately out of date, and it will continue looking forward a few objects until it finds whatever the latest written object is, whatever the latest trimmed object is, and so on. So whatever the head object says is the current write position — if you were doing manual debugging, it may not actually be that object; it could be a few objects later. That's just something to be aware of.
I'm not going to go through this code much; it's just a data structure. Again, it's a logical circular buffer with only a subset of the objects actually allocated, and there's not a lot to say about it. The code for the Journaler reuses a lot of data-structure code from elsewhere, notably the Filer.
A
Oh,
is
the
font
large
enough
and
near
any
complaining,
so
I
just
assume
it
is.
Yeah, it looks good. Okay, all right — so here's the JournalPointer object. Again, it's basically a double pointer; there's not really a lot to say about it. There are the front and back links for which journal we're accessing, and during a recovery situation there may be two, but generally, on a stable cluster, the back link will be null and only the front one is the current journal.
So here is the MDLog, which is the set of code that manages writing out all of the journal events for the MDS and interfaces with the osdc Journaler.
Great. So here we're just creating an empty log. This is done when the MDS is starting up for the first time, opening the MDS log, and performing any kind of replay.
This is the PendingEvent — a structure we're going to see later on. Whenever the MDS needs to write out a log event, it creates a PendingEvent structure and puts it on a queue, because part of the design of the MDLog is that there's one thread for actually doing the writes to the Journaler, and then there are the other threads that come in and create events.
That's all I'll say about the header; let's move on to the code. So here's creating an empty log. This is the MDS inode log offset — the inode number for the Journaler — which for a new cluster will be 0x200.
So that's what this log offset is, and then you add the node ID, which is really the rank of the MDS. For rank zero it's just adding zero, so that's why the rank-zero MDS journal is at 0x200; if you were rank one, it would be 0x201, and so on. And there's actually a maximum number of MDS ranks — 256, which is not something we often see in the wild — and there's no real reason
we have to have that limit, other than that certain on-disk metadata structures assume there are only 256 ranks: once you add 256 (0x100) to 0x200, you get 0x300, and that could be, like, a recovery journal or something.
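
As an illustration of that arithmetic (the pool name below is a placeholder): the journal inode is 0x200 plus the rank, and its objects in the metadata pool are named with the inode number in hex plus an object index:

    printf '%x.%08x\n' $((0x200 + 0)) 0    # rank 0 -> 200.00000000
    printf '%x.%08x\n' $((0x200 + 1)) 0    # rank 1 -> 201.00000000

    # Listing the metadata pool shows the journal objects laid out this way
    rados -p cephfs_metadata ls | grep '^2' | sort | head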
So that's one of the reasons that limit exists — a bit of a diversion. Here we're actually creating the Journaler in the metadata pool, for example with the 0x200 file object, and we're also creating a journal pointer that points to that inode and saving it out, and then we just wait for all of that to complete.
All right, and then here's writing the head object. All right — submit_entry.
This is the main method that most of the other MDS code will be calling.
This is when we are going to submit a log event to the current journal segment — and actually, the journal segment has already been associated with the log event by this point, for certain structural reasons of associating state with the log events. Here we'll be creating one of those PendingEvents I talked about earlier, and that's really just the log event itself plus the continuation here.
So this is a context — an MDS context — which is how we do our continuations within CephFS, or within the MDS. When this actually becomes durable, this context completion will get completed, and that would result in, for example, telling the client that the request is now fully durable: we would get a safe reply from the MDS indicating that the request is now durable.
So here we have, for example, the maximum number of log events per segment, or the check for whether the log segment's size is starting to exceed the size of the backing objects — in which case we're going to start a new segment and end up writing out the current one. And here is that subtree map test log that I mentioned earlier.
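
The thresholds being referred to are ordinary MDS config options; a hedged way to check them (option names as in current releases, defaults may vary):

    ceph config get mds mds_log_events_per_segment   # max events before a segment is rolled
    ceph config get mds mds_log_segment_size         # 0 means "use the journal object size"
    ceph config get mds mds_log_max_segments         # how many segments before trimming kicks in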
This is only done when we have the mds_debug_subtrees config turned on; then we'll actually write out the subtree map test. As the comment indicates, it's just there to catch replay bugs — it's not an event you normally see.
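
That debug knob is just a config option; enabling it (a sketch) makes the MDS journal these extra consistency-check events:

    ceph config set mds mds_debug_subtrees true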
So that gets put in a pending queue — this is a helper method, but actually, if we looked at it,
you can see there's a submit_entry lock, the mutex that is guarding this queue, and here we're adding — this is the important bit — adding that to the pending-events list. So let's move on.
So this is the thread that's just sitting in the MDS, constantly reading pending events from the queue and writing out the log segments as they become complete.
So, let's see — while the MDS is not stopping, we're going to continually run this thread, grabbing an event off the pending-events queue; if it's empty, we wait for an event to arrive. Here we read the event data, and then we check whether there's a log event associated with it — there isn't always going to be one; sometimes you're just waiting for a flush.
And here we're going to get the current write position of the Journaler and associate that with the log segment — which I believe is mostly for debugging — and then, finally, here is where we're actually appending the buffer list containing the log segment out to the Journaler.
Anyway, as this actually gets flushed —
this is a continuation you'll often see in the MDS logs; if you look at a debug log, you'll see lots of these. It's really just a wrapper for completing another context, which would come from another part of the MDS that wants to know that something has become journaled — durable — but it also records the current write position.
There will be a number of these for every log event. You can see this one was created with data.fin — again, the data object is just the pending event that we read off the queue, and the fin is the finalizer associated with it: the continuation that the MDS is going to run when the log event becomes durable.
Starting a new segment is something we do when we want to force certain events prior to that segment to be flushed or expired. This is something you might see, for example, when we're telling the MDS to flush its journal: in order to fully flush its journal, we need to tell it to start a new log segment, so that all the previous events in the prior segment can actually become eligible for expiration.
So that's one of the circumstances where you might have this starting of a new log segment, and generally it's not something you need to think about when you're interacting with the MDLog. But the important thing here is preparing a new segment.
So here is where we actually allocate a brand-new log segment with the current event sequence number, which is just keeping track of the number of events that have gone through the MDS, and —
oh, here it is. So we actually create the new segment, and the first thing the MDLog does whenever it creates a new segment is create a new subtree map event. Every time the MDS starts a new journal segment, it writes out the subtree map — that's done here — so it's going to ask the metadata cache to create the subtree map, which is just a special event in the MDS journal that records the entire subtree map
as of that current point in time and writes it as the first entry associated with that journal segment. This is done to simplify recovery during MDS failover. The reason I bring this up — especially if you're looking into performance issues right now — is that one quirk of the MDS is that if you have hundreds of subtrees, that subtree map can get extremely large, and that means every time the MDS is preparing a journal segment, it will have this gigantic subtree map taking up space in the journal.
The MDS ends up spending a ton of time writing out the subtree map, and also acquiring the big MDS lock in order to do that, so it's very expensive. This is actually the motivation for one of Zheng's recent changes, which he had done before he left — and he actually just renewed the pull request a week or two ago — to pull the subtree map out of the MDS journal, and that is all centered around fixing this bit of code writing out the subtree map at the start of each segment.
Let's move on to the replay thread; this is called when the MDS is doing replay. This might have to be the last thing we talk about, because it's already ten minutes till the top of the hour and I wanted to leave some time for Q&A. So here we have the recovery thread of the MDLog, and this is really just going through the different events of the MDS journal and reading them off the metadata pool.
So here we're loading the journal pointer, and then the current journal. Here's the back journal: if we're doing some kind of reformatting of the journal, or there was a journal reset, this might be non-null, but generally it will be null, and if it is null, then we just load the Journaler with the front object of the journal pointer, and here we're going to recover it. The recovery is this call —
this is the osdc Journaler class method to recover the journal. Again, remember that the journal head may be out of date, so we're asking the Journaler class to load the current write position and expire position and figure out where the actual latest write position is — that's what the Journaler recovery is doing — and once that's done, we've figured out what the actual write, trim, and expire positions are.
So here's the error handling.
And then here, in the replay thread — this is called if we're in standby-replay or if the MDS is in the replay state — is where we actually read the events off the journal and decode each event, and
then here is where we do the replay of the event. That's just taking the updates out of the log event — creating a directory or whatever — and applying them to the metadata cache, and that allows the MDS to rebuild the metadata cache whenever it's replaying events off the journal.
Okay, so I think I've run out of time — I do want to leave a little bit of time for Q&A — and I'll talk about the other parts of the metadata journal sometime in the future. Again, I'll open the floor to any questions or comments about what I've gone through so far.
You might do it if you need to do some kind of recovery and you want to make sure that everything currently in the journal is written out, but generally you would probably be doing that recovery manually with the recover_dentries command of the cephfs-journal-tool, prior to perhaps deleting the journal through a reset command.
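
The manual-recovery sequence being alluded to is roughly this (sketch only, destructive; the fs name is a placeholder — consult the disaster-recovery documentation before running any of it):

    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:0 journal reset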
I do a lot of journal flushes, just because I want to manipulate the directory objects or I want to ensure that everything is synced out — but yeah, as an admin, there are not really a lot of reasons to do it.
Oh — if you want to drop the MDS cache, you probably want to flush the journal first. In fact, the 'drop cache' code in the MDS actually flushes the journal too, and that's necessary because you need to expire everything in the MDS journal so that a lot of the metadata in the MDS cache becomes unpinned and eligible to be dropped. If you don't flush it, the MDS may not drop anything — you'd have, you know,
hundreds of megabytes of cache that just isn't actually dropped. A little bit of a diversion, but having the MDS drop its cache was an interesting idea as part of doing performance tests; however, due to how slowly you have to drop the cache in order to prevent destabilizing the MDS,
it's not very useful in practice, I find. At least from a performance-testing perspective, if you're doing tests, you're better off just recreating the file system, because it just takes too long to do it in a safe way.
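
The cache-drop command being discussed — which, as described, flushes the journal first — is invoked roughly like this; the rank spec and timeout are placeholders:

    ceph tell mds.cephfs:0 cache drop 300   # timeout in seconds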
But I don't think dropping the cache has become very common for community admins — you don't see it there very often; I mean, I don't hear it talked about on the mailing list, and that's really the only way I would know.
So, a question from chat: what's the average size of the CephFS journal in comparison to the data stored? I'd have to go read the numbers for this, but the MDS journal, I think, can get to a few hundred megabytes in size; it tries to cap itself around that size by forcing trimming.
And: Patrick, can you talk a bit more about the optimization regarding the subtree map? I wasn't able to follow that — what's the issue, and what is being done to solve it, with regard to the subtree map being stored in the journal?
So that's not a scaling problem by itself, at that level — but once we added ephemeral pinning, where you can have a number of subtrees created according to the size of the directory, and especially with distributed ephemeral pinning — which, at least in the first iteration of the design, gave you a subtree for every subdirectory under a directory —
think of the canonical case I always mention: home directories. We want every user's home directory to be pinned to a particular MDS through ephemeral pinning, so you would have a subtree for every home directory, which obviously could be hundreds or thousands, and that did not scale. We had scaling issues: rank zero, which was authoritative for the home directory, would be overloaded with journaling, slow to respond to lock requests; it just was —
The initial idea was to somehow figure out how to stop writing out the subtree map every time we do a journal segment — is there some clever way we can reconstruct the subtree map while reading the journal? Zheng ultimately didn't elect to do that; he pulled the subtree map out of the MDS journal completely.
I have not actually dug into that part of his PR yet, as far as where he's storing the subtree map now — it might just be a separate journal. Which reminds me: I mentioned I was going to talk about the purge queue. Purging was originally part of the MDS journal — since I have a little bit of time left, I'll just mention that the MDS would, when an unlink event was going to be expired, actually delete the corresponding entry from whatever directory object held
the link. That was low-hanging fruit for improving the performance of the MDS, because deleting subdirectory trees is fairly common and it doesn't necessarily need to be done immediately during event expiration. So the purge queue within the MDS exists to handle the logic of actually deleting files, truncating files, and also stray reintegration, outside of the journal machinery of the MDS.
So the purge queue actually maintains its own separate journal — using the same Journaler class — for all the files that need to be deleted, and it does that out of band from the MDS journal. This is kind of continuing a trend of splitting things out of the MDS journal, because it has become a large data structure that doesn't scale very well for all the different things the MDS actually needs to do.
So the subtree map is now the new victim to be removed from the MDS journal.
As far as whether or not we will merge Zheng's PR, I still have to go through it, and I'm not exactly sure whether it'll be the right approach, but I generally trust Zheng to make good choices in that regard, so it'll probably be merged — especially since, apparently, he's still working on CephFS. I wasn't sure if he'd be around to clean up any bugs that may surface as a result of merging his PR, so we'll see about that.
So, just to close this topic out: I think what we will find going forward is that there will probably be other things we want to pull out of the MDS journal as far as making the MDS faster.
Anything we can pull out of the journal into a separate thread and data structure will probably end up being a large performance improvement for the MDS, so that's something to consider in future work for CephFS.
One example of that is the recursive-unlink RPC that we want to implement, where we would no longer journal, or have a number of RPCs with the client and a number of journal events associated with, deleting a directory tree. That could be something we potentially pull out into the purge queue.
And then, as another example: again, I mentioned that every time we have a new client, we write that out to the journal. If we ever get to a situation where we have thousands or tens of thousands of clients, then, yes, we have no idea what scaling issues exist at that point; we may need to rethink how we journal client sessions in the MDS in that particular situation. But I don't think we're anywhere close to that yet.