From YouTube: CephFS Code Walkthrough: Kernel Client Overview
Description
Join us every Monday: https://tracker.ceph.com/projects/ceph/wiki/CephFS_Code_Walkthroughs
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
Well, welcome everybody. I'm going to do an overview of the kernel client — the CephFS kernel client. So, first thing: I also have a kind of syllabus here for what I'm going to cover in this talk. It doesn't look like a lot, but that last little bit — walking through opening, reading, and closing a file — will probably take most of the hour.
So in any case, the first thing we should realize is that the Ceph driver is a VFS driver first and foremost. It basically implements a file system, and the magic of Linux really is the fact that it has a unified VFS: you can generally access a file in the same way regardless of what sort of backing it has. That is in contrast to most older operating systems like DOS or even Windows, which often needed really specialized libraries and such to access files that might exist on a different sort of back-end environment.
The other thing you should realize is that the VFS itself is object-oriented. That's maybe a bit of a surprise for people coming from C++, but we do implement object-oriented patterns in C, and the way we do it is a little clunkier than it is in C++, but it does work for us; that's the way we operate. Allocations, frees, and stuff like that, and function pointers instead of actual methods — that sort of thing — but overall, it is an object-oriented system.
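That function-pointer pattern can be sketched in plain C outside the kernel. This is a toy illustration of the technique, not the real kernel types — the names `fs_ops`, `ramfs_ops`, and `vfs_new_inode` are invented for this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* An "operations" struct: a table of function pointers, the C
 * equivalent of a C++ vtable. Each file system fills in its own. */
struct fs_ops {
    int  (*alloc_inode)(void);
    void (*destroy_inode)(int ino);
};

/* One hypothetical file system's implementation of those methods. */
static int next_ino = 1;
static int  ramfs_alloc_inode(void)     { return next_ino++; }
static void ramfs_destroy_inode(int ino) { (void)ino; }

static const struct fs_ops ramfs_ops = {
    .alloc_inode   = ramfs_alloc_inode,
    .destroy_inode = ramfs_destroy_inode,
};

/* Generic "VFS" code dispatches through the table without knowing
 * which file system it is talking to. */
static int vfs_new_inode(const struct fs_ops *ops)
{
    return ops->alloc_inode();
}
```

The generic layer only ever sees the ops table, which is how one VFS can drive ext4, NFS, and Ceph with the same code.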
First of all, if we look at how it's laid out, what you find is that we have this sort of network of structs, and the structs have operations structs in them. So, for instance, the first stop we can look at here is the super block struct, and the super block sort of describes a mounted file system.
So in any case, here's the super_block struct. The file include/linux/fs.h is where most of the file system definitions are.
If you look through it, there's a whole bunch of fields and such, but the one we're primarily interested in is the super_operations, here in the superblock. The superblock in particular has a bunch of different operations structs: it's got ones for quotas, and for exports if you want to export via NFS, and if you're doing fs encryption, that sort of thing, there are separate operations for those as well. But if we look at the super_operations struct, you see we've got a whole bunch of function pointers, and those effectively allow us to do different operations. For example, if we want to allocate inodes, the VFS will call into this alloc_inode hook to get a new inode for a particular file system; same with destroying and freeing — destroy_inode and free_inode.
I know a lot of these things are not named very well, and some of them have morphed over time; some of them had different functions at one point, and calling conventions and other things have changed. That's the other thing you should realize about Linux: the internals are very fluid, and so things change all the time.
Okay, but in any case, we can look here at the super block, and the super block really represents the mounted file system. So whenever we mount a CephFS file system, the kernel will allocate the super block structure and do a bunch of work to fill it out. I'm not going to go into the mounting process, but basically what will happen is we'll get this, and then we will also allocate a ceph_fs_client struct, and a pointer to that will end up in this s_fs_info field. So if we look here.
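The s_fs_info hookup can be sketched in a few lines. These are trimmed-down stand-ins, not the real structures — `toy_fs_client` is an invented stand-in for ceph_fs_client, and the real super_block lives in include/linux/fs.h:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's superblock. */
struct super_block {
    void *s_fs_info;        /* per-filesystem private data */
};

/* Stand-in for the fs-specific client struct (e.g. ceph_fs_client). */
struct toy_fs_client {
    int blocklisted;
};

/* At mount time, the fs allocates its client struct and stashes a
 * pointer to it in the superblock. */
static void toy_mount(struct super_block *sb, struct toy_fs_client *fsc)
{
    fsc->blocklisted = 0;
    sb->s_fs_info = fsc;
}

/* Later, fs code recovers its private data from any superblock it
 * is handed by the VFS. */
static struct toy_fs_client *toy_sb_to_client(struct super_block *sb)
{
    return (struct toy_fs_client *)sb->s_fs_info;
}
```

The void pointer is what lets the generic superblock carry arbitrary per-filesystem state without the VFS knowing its type.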
That hangs off the super block struct, and it has a bunch of other fields as well. Here's one — just a boolean for being blocklisted, for instance. There's a boolean for have_copy_from2, which is a particular operation that the OSDs can do, and so on. We've got a bunch of others, and a bunch of stuff for debugging — debugfs entries as well.
There are workqueue structs in here too — we have an inode workqueue and a cap workqueue; I'll talk about workqueues a little later. But in any case, we've got one of these objects per mounted file system. Beyond that, there's struct inode.
So here's the struct inode. A struct inode, as most of you know, represents a file on disk, basically — a file or a directory on disk. Other things are backed by them as well: symlinks, named pipes, and all those sort of exotic inode types also have a struct inode. Anyway, we've got a bunch of these — the running kernel will have tons of them at any given time; they're cached and whatnot. But in any case, the struct inode also has an associated Ceph struct.
We can find the ceph_inode_info here, so this is what a real Ceph inode actually looks like, and you'll notice that this object has a VFS inode embedded inside it. In some cases, some of these structures in the kernel will instead allocate an auxiliary structure and just keep a pointer to it.
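The embed-and-convert trick works because the embedded struct sits at a known offset, so `container_of()` can recover the outer struct from a pointer to the inner one. A minimal sketch, with `toy_inode_info` as an invented stand-in for ceph_inode_info and the macro re-derived here rather than taken from the kernel headers:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the generic VFS inode. */
struct inode {
    unsigned long i_ino;
};

/* Fs-specific inode embedding the generic one at a fixed offset. */
struct toy_inode_info {
    unsigned long snap_id;
    struct inode vfs_inode;     /* embedded, not pointed-to */
};

/* Same idea as the kernel's container_of(): subtract the member's
 * offset to get back to the start of the enclosing struct. */
#define my_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Convert a generic inode pointer back to the fs-specific one. */
static struct toy_inode_info *toy_inode(struct inode *inode)
{
    return my_container_of(inode, struct toy_inode_info, vfs_inode);
}
```

This is why the layout has to be predictable: ceph_inode_info, the VFS inode, and the netfs context must all sit at known offsets from one another for these conversions to work.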
This is fairly new, but the netfs layer that we're starting to use in Ceph has an associated struct per inode where it tracks its state, so we embed one of those inside there as well. And we lay these out in a particular way, because occasionally we have to go back and forth between the ceph_inode_info, the VFS inode, and the netfs context, so these have to be in predictable locations.
Beyond that, we've got a bunch of other fields. Here's the vino, which tracks both the snap id and the inode number; there's the i_ceph_lock, which is a special spin lock that we use to protect most of the Ceph-specific stuff; and so on and so forth. We've got a bunch of other fields too that I won't go into now. After the inodes:
The struct dentry. A dentry is what represents a pathname component. When you go to do a lookup or something like that — we pass in path names all the time — this is one of those things in the kernel that is hugely performance-critical. We have to deal with pathname walking all the time: most of the system calls that we do — many of them take a path name, and so we have to be able to walk that path and do it efficiently, and the dentry is what makes that work. Linux has the most advanced method on the planet for doing all this, much better than any other commercial operating system.
So we track each pathname component as an individual object called a dentry, and it keeps track of things like who the parent of this dentry is, and then every dentry has an inode associated with it — or not.
In any case, the dentry cache in particular is highly tuned, and there's a great document in the kernel sources on pathname lookup and how that all works — quite complex. But in any case, the dentry object is what's of interest here, and you see the operations struct: the dentry_operations.
Okay, so here's the dentry_operations struct; there are a bunch of different ones depending on how dentries get hashed and revalidated, and so on and so forth. And then we also have this field called d_fsdata, and so we'll have a ceph_dentry_info — that's what it's called. Every time we create a dentry, a Ceph dentry, we allocate one of these structures and set the pointer to it, and it also has a back pointer to the dentry as well. In here there are things like whether we have a lease for it, what sort of generation it has, a flags field, and so forth.
All right, and beyond that, finally, the last object I'll cover is struct file, which represents an open file description. So if you go to open a file, we have a record of that open file with all sorts of info we have to track: where the current position inside the file is, how the file was opened — whether it was read-only or read-write, that sort of thing. All that stuff is tracked here inside this thing we call struct file, which technically we call a file description. Most of us are familiar with file descriptors.
The file descriptor is just a number, but it is a number that is an index to a particular structure in the kernel, and that structure is the struct file. And of course it's got its own operations struct too.
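The descriptor-versus-description split can be modeled in a few lines. This is a toy model of the concept only — the real per-process table is more elaborate — and `install_fd`, `fd_to_file`, and `MAX_FDS` are invented names for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* A file *description*: the open-file state itself. */
struct file {
    long f_pos;     /* current position within the file */
    int  f_mode;    /* how it was opened: 0 = read-only here */
};

/* A file *descriptor* is just an index into a table of pointers
 * to descriptions, one table per process. */
#define MAX_FDS 16
static struct file *fd_table[MAX_FDS];

/* "Open" installs a description and returns the lowest free slot. */
static int install_fd(struct file *filp)
{
    for (int fd = 0; fd < MAX_FDS; fd++) {
        if (!fd_table[fd]) {
            fd_table[fd] = filp;
            return fd;
        }
    }
    return -1;      /* table full, analogous to EMFILE */
}

/* Every read/write syscall starts by resolving fd -> struct file. */
static struct file *fd_to_file(int fd)
{
    return (fd >= 0 && fd < MAX_FDS) ? fd_table[fd] : NULL;
}
```

This is why two descriptors can share one position (after dup) while two separate opens of the same file do not: sharing happens at the description, not the number.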
And so there's a whole bunch of file operations here. Once you've opened the file, you can read from it and you can write to it. We also have read_iter and write_iter, which were added later, and this is another thing you'll notice in the kernel: because we have so many file systems, it's really hard to change these operations all at once. So what we find a lot of the time is that people will add new ones, patch some of the file systems to use them, and then leave the rest to be done later. We have that here: there are some file systems that were never converted to use read_iter or write_iter, so they use the old-style read/write construct, and we have handling for old-style read and write operations.
So we handle both in some places — we'll see that point later. And then we have this iterate and iterate_shared here; a lot of these names are not very descriptive, unfortunately. iterate_shared is all about readdir.
All right, I'll take that as a note. So now what I'm going to do is cover a real simple thing: let's suppose we want to open a file on a Ceph file system — let's open it read-only — then read the first 4k of that file, or whatever the first page is, and then close that file. Let's walk through how that actually works.
A
So
the
first
thing
we
do
is
whenever
you
call
into
the
kernel
or
whenever
you
call
a
system,
call
right
what
we
usually
call
it
through
libsy,
so
libc
will
then
go
and
stuff
the
right
arguments
into
all
the
registers
or
into
onto
the
stack
dispatch
into
the
kernel.
You
know
on
your
architecture,
and
you
know
it:
does
it
a
different
way,
but
anyway
it
will
do
that,
and
so
the
first
thing
we're
going
to
do
is.
A
So
when
we
go
to,
you
know.
B
A
B
A
And
we
do
you
know,
but
basically
you
know
the
kernel
needs
to
handle
legacy
syscalls
as
well.
So
in
this
case
we're
gonna
call
the
legacy
open
syscall,
and
it's
going
to
do
this.
Call
this
deuces
open
function.
If.
Anyway, in any case, if you look at all these SYSCALL_DEFINE macros, you'll notice — this one's called DEFINE4, this one DEFINE3 — there are different definitions depending on the number of arguments.
And okay, here we go: so now, for openat2, we've got a directory file descriptor, a file name, and the open_how struct, which gives us some information about what it's supposed to do when it opens — not just the open mode, but also some things like whether it needs to follow links.
That kind of stuff. So the first thing we do is call this getname. You'll notice up here this `const char __user *filename`, and this __user annotation lets us know that this is not a kernel pointer; this is a userland pointer. The address space inside your process is different from the one inside the kernel, and so when we get a pointer from userland, we have to interpret it in the context of the address space of that process.
So what we do is call this getname on the file name here, and what that does is go and allocate a big chunk of memory, and then the getname function will go and copy that string in there very carefully — it's going to be strncpy_from_user, basically — copying it into a buffer that the kernel can work with. Because when we are grabbing stuff from userland, we can't use the memory directly that was pointed at, because it could change. If you pass in a path name and the kernel starts processing it, and then some other thread comes in and overwrites that path name with some garbage, or with something that would cause a buffer overrun, that could be real problematic later. So we make a copy first, and then we vet that copy very carefully.
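The copy-then-validate discipline can be shown in a userspace sketch. This mimics the shape of what getname()/strncpy_from_user() do, not their actual implementation — `copy_name` and `NAME_MAX_LEN` are invented for the illustration:

```c
#include <assert.h>
#include <string.h>

#define NAME_MAX_LEN 64

/* Copy the untrusted string into a private buffer FIRST, then
 * validate only the copy. Once copied, a concurrent writer to the
 * source buffer can no longer change what we are about to use. */
static int copy_name(char *dst, const char *untrusted_src)
{
    size_t i;

    /* Bounded copy; the kernel uses a faulting user-space copy,
     * here a plain byte loop stands in for it. */
    for (i = 0; i < NAME_MAX_LEN - 1; i++) {
        dst[i] = untrusted_src[i];
        if (dst[i] == '\0')
            break;
    }
    dst[NAME_MAX_LEN - 1] = '\0';   /* always NUL-terminated */

    /* Validate the private copy, never the source. */
    if (dst[0] == '\0')
        return -1;                  /* empty name, like -ENOENT */
    return 0;
}
```

Checking the source buffer and then copying it later would reintroduce exactly the time-of-check/time-of-use race the kernel is avoiding.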
So anyway, from there we look at the file descriptor flags — the fd flags that get passed in — and then we call this function called do_filp_open. Sometimes you'll see a file in the kernel called filp, for "file pointer"; it's just a naming convention that's sometimes used. But in any case, we call this do_filp_open.
So we have this do_filp_open, and this is essentially where the path walk happens. Again, we have to walk a path name, which is just a string, and this is a hugely complicated process, but basically we call into set_nameidata here, which creates the structure called struct nameidata. So this thing here is sort of a state-tracking structure for a path walk: the nameidata is basically the thing that keeps track of where we are in the path-walking process and how that path walking needs to be done. Again, I'm not going to go into this in great detail, because it's just way too complicated and there are better articles for it.
In any case, we do all this, we go down into do_filp_open, and then we do a path_openat, and you'll see here first it does this thing with LOOKUP_RCU. One of the really advanced, really cool things about the kernel's path walking is that it can do it under RCU — it can do it locklessly — and this is a huge performance win; I remember when they added the lockless path walking, because before that it took locks all the time. So in any case, it will attempt to do the RCU lookup first, but that doesn't always work: if you have to go do things that sleep, it will end up having to come out of the RCU path walk, and if we can't do an RCU path walk, we go back in the other way.
It will return ECHILD, which is sort of a nonsensical error code in this sort of code, but we use that to say that the RCU lookup can't be done and you've got to do what's called a ref-walk, where we take references and locks and things like that and walk again. It's a lot less efficient, but it's really not that bad.
From there, if we get an ESTALE — like if you're working with NFS in particular, you might get a stale file handle; sometimes we see this in Ceph too — then we have to do something called a revalidation lookup, where we not only path walk through, but if we have any cached entries, we have to go and validate that they're actually correct.
So if that happens — but basically, at the end of all that, we end up calling into path_openat. If you look here, it's got a special path for handling O_TMPFILE; here's another one for doing an O_PATH open, which is sort of an open that's opening the path name itself — it's useful for certain things.
But for most files we'll end up down here in this link_path_walk — that's where the magic of the path walk really happens. And most of the time we're not terribly interested in most of the components; they're just directories to get through to the next one. But the last little bit has to be opened, and so if we go look at that:
We do two different types of opens. Most of the time, for most file systems, what we'll do is look up first, and then once we get the inode that's associated with the last path component — with that dentry — we'll then open that file. It turns out that that is a hugely wasteful process on a network file system.
An atomic open basically says: instead of doing a lookup, we're going to issue the open directly, and then if it turns out that we get back ENOENT and it wasn't a create, we can just record that it's a negative lookup. And so that saves us some round trips to the server. This was hugely helpful once we moved to:
NFSv4, because originally NFSv4 had to do two round trips to the server to do this. In any case, eventually we're going to end up down in this lookup_open function, which is again hugely complicated. You can see here it does a d_lookup first — tries to look up in the dentry cache — and then it comes down here, but eventually, if we have a way to do an atomic open, we'll do that. So if we look here — that's our first stop inside the actual Ceph code.
We go down to ceph_atomic_open. Now, if it turns out we already know what the lookup result is — if we already have the inode for this thing — we skip this; we do it only when you don't have a dentry at all, or it's a negative dentry, and then we'll go try to do the atomic open.
So in any case, for Ceph, the first thing we'll do is take different steps depending on whether it's an O_CREAT open; we'll try to do an async open in some cases, if that's enabled.
Basically, we prepare an open request here. Inside the kernel client we have this thing called a ceph_mds_request, so whenever we need to call out to the MDS, we'll allocate one of these guys, fill it out, and then send it off to the server — to the MDS — to do its thing. So in any case, this prepare_open_request is what does that. In this case we're doing a read-only open, so we don't care about creates.
Yeah, and here's what it looks like: you've got a bunch of fields here that get filled out when you have the dentry, or an old dentry in some cases — like if you're doing a rename, you have an old dentry.
So we build an open request first, and then we eventually call ceph_mdsc_do_request. This is the part where we call in to actually fire the thing off and send it to Ceph. The do_request just calls submit_request and then waits on the reply. Sorry — that's the wait; I don't want to do that yet. So in any case:
submit_request. What it does here is it will get cap references: if it's got our r_inode set, or if we have a parent pointer, we're going to take cap refs for those and pin them, basically. Then it decides which MDS to send it to — again, it's a pretty complex process, but we're going to skim over that. Basically it's going to call into there, and then check to make sure the session's okay, blah blah blah.
Then eventually we come down here to send_request. That will go and stick the thing onto — hand it off to — the messenger, to be sent out onto the wire. Once we're done there, we put references or whatever, and then in this case we're going to wait for the reply to come in. Yeah, so anyway, down here we've submitted the thing, we call down in here to wait for the reply, and when the reply comes in, we'll process it.
Right, so we're down here: we will return here, and so we've got the result now. At this point — in this case we're not going to get ENOENT, because the file is already there — there's some special handling here, because we are sort of doing a combined open and lookup at the same time.
So this will end up essentially fully instantiating the struct file and giving us a file description that we can then use to do other things. All right, and then all that will unwind: once we've got a file description, that goes back into the VFS layer, the VFS will then attach a file descriptor to it, and then we hand that file descriptor back to userland.
So now, what do we do? We go to the syscall definitions and find the one for read, in fs/read_write.c. All right, and here's read.
Yeah, so it calls into this function called ksys_read, which goes and figures out where the position of the file is and whatnot, and then calls into this function called vfs_read. A lot of the generic VFS-layer handling is prefixed with this vfs_ underscore; when you see those, usually that's a pretty good indicator that it's a generic structure or a generic function that is used across different file systems.
It does some checks: can we actually read from this file descriptor, is the buffer we're dealing with in our address space — and eventually we come down here and it's going to call this rw_verify_area as well, which does some things like making sure we're not trying to read at an awful negative offset into the file or something. There are all sorts of ways that you can try to trick the kernel, so it has to check all this.
So, rw_verify_area, and then we come down here: if we've got this file read op, we'll call that. We don't have that for Ceph — we have a read_iter function instead — and if that is set, we will call new_sync_read.
The reason we have this is that in the original kernel we would just pass down a buffer and such, but about seven or eight years ago we went pretty big into using this struct iov_iter inside the kernel, all over the place. What this is: when you need to iterate over a buffer of some sort — a user buffer, say — we can pass this in, and we can have these iterators of different types. So there are some that refer to userland buffers, some that refer to kernel buffers, some that may refer to a pipe, that sort of thing — all sorts of stuff. And so we can use a lot of the same code without having to worry so much about what the destination or the source buffers look like.
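The iov_iter idea can be sketched as a tagged iterator that the copy routine consumes instead of a raw pointer. This is a toy model — the real struct iov_iter is far richer, and `toy_iter`, `ITER_USERBUF`, `ITER_KERNBUF`, and `copy_to_iter` here are invented names for the sketch:

```c
#include <assert.h>
#include <string.h>

/* What kind of buffer the iterator points at. */
enum iter_type { ITER_USERBUF, ITER_KERNBUF };

struct toy_iter {
    enum iter_type type;
    char *buf;          /* current position in the destination */
    size_t count;       /* bytes remaining */
};

/* Generic copy routine: I/O code calls this without caring what
 * kind of buffer it is filling. A real kernel would dispatch to
 * copy_to_user() for user buffers and memcpy() for kernel ones;
 * in this userspace sketch both reduce to memcpy. */
static size_t copy_to_iter(const void *src, size_t len,
                           struct toy_iter *it)
{
    if (len > it->count)
        len = it->count;        /* never overrun the destination */
    memcpy(it->buf, src, len);
    it->buf += len;             /* advance the iterator state */
    it->count -= len;
    return len;
}
```

With this shape, the same read path can fill a user buffer, a kernel buffer, or a pipe just by handing in a differently-tagged iterator.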
Let me take a pause here for a second. Anybody have any questions? Everybody asleep? No? Okay. In any case, all right, so anyway, here we are at the Ceph read_iter, and basically we'd call this any time we get a read(2) call from userland.
We'll also call this for things like: if you've got an mmapped area on a Ceph file and you fault in that page — you try to read from the mapping — it will turn that into a read to populate the page. Sorry — actually, that will call readpage; it doesn't go through this part. But in any case, the read_iter is where we call in to handle a read, and we do all sorts of stuff.
We have some special handling depending on whether it is an O_DIRECT open or not. If you're not familiar, O_DIRECT is a way to tell the kernel that you want to bypass the page cache and either read into or write from some buffers directly, and this will almost always cause an on-the-wire read or write.
Normally the kernel wants to cache file data so that it doesn't have to go and re-read it whenever it needs to be accessed again. One of the operating principles of Linux, from very far back, is that any RAM you have that's not used is wasted, and so we try to use most of our RAM in the kernel for page cache — for buffering disk reads, or buffering reads from file systems.
In this case, we have to decide whether we can use the cache or not. There are some conditions — if we're not doing direct I/O, and we don't have this Ceph sync flag, which an ioctl can set to force a file to be synchronously accessed:
Then we will try to get the cache cap — the Fc cap. There's also this lazy I/O stuff here, which is somewhat poorly defined, really, but it's there. So the first thing we'll do here is call into ceph_get_caps. If we've already got caps for this inode, we'll take references to them to ensure that they don't get released while we're trying to use them; but if not, then we have to call out to the MDS and request them. ceph_get_caps does that — let's take a real quick look.
Yeah, so we've got a bunch of stuff here — again, a hugely complicated process — to go and try to request caps from the MDS. Basically, if we don't have them, we'll just call try_get_cap_refs here, and that will call out to the MDS to say: grant me these caps if you can. Okay, in this case let's just pretend we got the caps — in most cases we do. So, depending on whether we got caps or not: if we didn't get the cache cap, then we're going to access this thing synchronously, and so we have this direct read/write code here that handles doing direct I/O:
If it's a direct I/O read or write, that is. If it's not direct I/O, we can call a synchronous read here, and that will just go and actually issue a read on the wire for the particular range that we're trying to read — or a number of reads, if it's very large, for instance. But most of the time, we end up getting cache caps and we call down here to generic_file_read_iter.
This calls back out into the VFS, and what it does is — basically, if we're not doing direct I/O or synchronous I/O, then we are doing page cache I/O, and so all the reads and writes in this case will be page-aligned. This is just a generic helper function; it has paths for doing direct I/O and whatnot, and then we'll call those, but eventually it's going to call down here and do this filemap_read.
filemap_read calls down here: we'll grab page cache pages, get them prepped and ready to go, check the size — it does a bunch of other checking too.
Yeah, down here — and you'll notice here too: historically, the kernel has always operated on individual pages. We are in the middle of a huge transition in the kernel right now. The problem is that memory sizes have gotten really huge, and so tracking 4k pages in this day and age is pretty wasteful.
You can imagine: we have to have a structure for every page in the kernel, and we have millions of these things because they're all tracked at 4k granularity. It turns out most MMUs can work with larger pages without any problems, but the kernel is not currently equipped for that, so we're moving to a new structure called a folio.
Not everything in the kernel was really equipped to deal with larger pages — the page cache in particular never did that — so we're trying to move the kernel's page cache to eventually work with larger pages, and as part of that, we're converting a lot of the places where we have traditionally operated with page pointers to this thing called a folio. That's what this is all about.
In any case, here's some batch handling if you've got a big folio, but in any case we have this folio_test_readahead: if we're able to do a readahead, the kernel will try to expand that read to something larger. It may not be in here — yeah, so in any case we call this, and then it calls down to this ondemand_readahead function, and then eventually we're going to get into calling the readahead function for Ceph.
Okay, yeah — Ceph traditionally had its own readahead function, and the readahead operation is actually fairly new. If you look at older kernels you'll see something called readpages, which is the old-style method of doing it, and which was pretty wasteful with page locking. So we've moved to a new one — like I said, the internals are very fluid in Linux, and so we change stuff all the time — and readahead is the new way to do multi-page reads. And so we end up calling into this netfs_readahead.
The netfs layer has a new operations struct right here, so this is where you can see the netfs request ops.
It's a much more natural interface for dealing with network file systems. Rather than the VFS handing us a pile of pages and saying, okay, set all these pages up for a read and fill them, the netfs layer calls into us and says: here's a pile of pages we've already prepared — go and fill them for us. And so we end up with this.
So in any case, eventually we're going to end up with this Ceph netfs issue-read function, and this is where the actual reading happens. The netfs layer has given us this netfs I/O subrequest.
When we issue a read, that read may span multiple Ceph objects, for instance, and so we might have to issue two actual read calls onto the wire to fill it. So the first thing it does is call a number of functions to sort of expand the readahead, and then we also have to clamp the length:
Because we can't read beyond the end of an object in Ceph, we have to clamp the length to where that object ends — that's what this object mapping is all about. But eventually we're going to get down to doing the netfs read, and we set up a new OSD request; that's what this guy does.
That goes down into here: we get what's called an xarray of pages from the iov_iter — the iterator is what we're copying into — so we go and grab an xarray from this thing, and then we call this get-pages-alloc helper on the iterator, which basically grabs a bunch of page cache pages for us — or, sorry:
A
It
grabs
the
page,
cache
pages
in
the
right
locations
and
gets,
and
you
know,
pins
them
basically
and
gives
us
back
an
array
of
these
pages,
and
then
we
hand
that
down
we
stuff,
all
that
into
the
osd
request.
A
We set our callback for when it completes, then we go and start that request, and then we return, so the request is allowed to run asynchronously. If you've got multiple reads, we're going to fire several of them off, boom boom boom, and then collect all the replies.
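The fire-several-off-and-collect pattern is essentially a completion counter. A minimal sketch, with hypothetical names (`read_request`, `subrequest_complete`); in the real client the OSD client's completion callbacks play this role:

```c
#include <assert.h>

/* Sketch of fire-and-collect: each subrequest completes through a
 * callback, and the parent request finishes when the last reply
 * comes in.  Hypothetical miniature, not the kernel structures. */
struct read_request {
    int outstanding;   /* subrequests still in flight */
    int done;          /* set when the last reply is collected */
};

static void subrequest_complete(struct read_request *req)
{
    if (--req->outstanding == 0)
        req->done = 1;   /* last reply: finish the whole request */
}

static void issue_subrequests(struct read_request *req, int n)
{
    req->outstanding = n;
    req->done = 0;
    /* In the real code each of these goes onto the wire and completes
     * asynchronously; here we "complete" them inline. */
    for (int i = 0; i < n; i++)
        subrequest_complete(req);
}
```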
A
Then, when the reply comes in, the OSD reply handler will call finish_netfs_read, and then we go and look at the result. It may be that there's no object there; if there's no object there we get back an ENOENT, and then we pretend that that is a zero-length read. If we get blocklisted, then we have to deal with that.
A
If the read was shorter than we had expected, say we tried to read a whole 4 MB object off the OSD and it turned out it wasn't that long, we need to tell the netfs layer to clear the tail. That's what this flag does. And then eventually we call this netfs_subreq_terminated, which tells the netfs layer, okay, we're done, and here is how it went.
A
It either ended with an error or with a successful read, and then we go and put the page references. We have to hold references to the pages while we're filling them to make sure they don't get purged out of the cache, and then, when we're done, we put those references.
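The reply fixups just described can be sketched as one small function. `fixup_read_result` is a hypothetical name for illustration; the point is that a missing object (ENOENT) is reported as a zero-length read, and a short read gets its tail zeroed so stale data never reaches the page cache:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Sketch of the finish-read fixups: -ENOENT becomes a zero-length
 * read, real errors pass through, and a short read has its tail
 * zero-filled.  Hypothetical helper, not the kernel function. */
static int fixup_read_result(int result, char *buf, size_t expected)
{
    if (result == -ENOENT)
        result = 0;            /* no object there: pretend zero-length */
    if (result < 0)
        return result;         /* real error, e.g. we got blocklisted */
    if ((size_t)result < expected)
        memset(buf + result, 0, expected - result);  /* clear the tail */
    return result;
}
```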
A
Ceph has a lot of really strange handling for things: if we hit the end of the file or a hole, we zero out the end. In any case, we've got all this stuff, and then eventually we bubble all that back up to userland and say, okay, we got a read. So let's say we read 100k or whatever.
A
All right, and now close. If it's a writable file descriptor, we're generally going to flush any dirty data.
A
fput is a wrapper around this function __fput, which does a bunch of other stuff too, but basically we're going to call this, and eventually we call down to the filesystem's release operation.
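The fput/__fput idea is just refcounting: only when the last reference to the struct file is put does the VFS call the filesystem's release hook. A toy sketch with hypothetical names (`toy_file`, `toy_fput`), not the real kernel structures:

```c
#include <assert.h>

/* Miniature of the fput model: the file object is refcounted, and the
 * filesystem's ->release runs only when the count drops to zero. */
struct toy_file {
    int count;                        /* like the kernel's f_count */
    int released;                     /* did ->release run? */
    void (*release)(struct toy_file *);
};

static void toy_release(struct toy_file *f)
{
    f->released = 1;   /* here the fs would flush data, drop caps, etc. */
}

static void toy_fput(struct toy_file *f)
{
    if (--f->count == 0)
        f->release(f);   /* last reference: call down to the fs */
}
```

This is why a close on a duplicated descriptor doesn't trigger the filesystem's release: another reference still holds the file open.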
A
And at that point, that's pretty much it. The VFS will do its thing to free that file descriptor and destruct the struct file as well, and then that's the end.
A
That's about all I'm going to cover today. Any questions?
B
Jeff, I had a question about the function ceph_atomic_open. In the comment it said: if the file or symlink is non-existent, the VFS retries. Does the VFS keep retrying, or what's the idea behind that?
B
If you go to the comment on top, it says if the file is non-existent... yeah.
A
Oh, that looks like that comment's just bogus.
A
If we get a non-existent dentry... well, the calling convention for atomic_open has changed, so I don't think this comment is actually correct. But basically, if it turns out that the file is non-existent, we don't really retry; it just returns. At that point, instead of doing a separate lookup, we just do a...
A
We do a read, or rather an open, and try to get it that way. Yeah, this comment is bogus; I should probably send a patch to get rid of it. At one point we would return 1, and we may still do that internally, but a lot of that has been changed to use this finish_open structure, and finish_open is what actually handles most of this. In fact, take a quick look at it if you want.
B
Can you repeat that? You said one of the network calls in atomic_open is synchronous. Is it a submit-and-wait, or is it...
A
Yeah, it is synchronous, because we can't really return otherwise. Well, we do allow async opens sometimes. Atomic open is called in a very special circumstance: we either don't have a cached dentry for this thing, or we have a negative dentry.
A
So we either don't know what the provenance of this dentry is, or we know that it doesn't exist, and then the VFS will call atomic_open to do the open and the lookup at the same time. That allows us to instantiate the dentry if we need to, and it also handles the case where it turns out we're wrong: when we have negative dentries, sometimes things can change on the MDS and the client doesn't know. If that happens, atomic_open allows us to recover.
A
Say we thought this file didn't exist, but now it turns out that it does, and we're not doing a create.
A
We will issue the open call to the MDS, and when that comes back we can go and instantiate the dentry: this dentry does exist, we can attach the inode to it now and turn it into a positive dentry.
A
This is a pretty complex situation and, quite frankly, Al Viro hates it. We've been trying for about a decade now to figure out a cleaner way to do this, but it's unfortunately not trivial.
A
Atomic open is kind of one of those kludgy things that we added to work around a problem. It was never very elegant to begin with, and it still isn't. But yeah, in this case we will always do a synchronous RPC to the MDS.
A
So we are always going to call the MDS there, because we have to: we either think the thing doesn't exist, or we don't know. Either way, we have to call the MDS and wait for the response to come in.
B
Okay, yeah. And the final question I had: in one of the PRs introducing async I/O in libcephfs, one of the questions asked was, how do we detect end-of-file when reading?
B
And you said you need to, because we read from a bunch of objects, so you always keep track of the inode size, and then you read from those. Is that pretty unique to our filesystem, that we need to do that to detect end-of-file?
A
Yeah, to some degree. We have some similar situations with NFS as well, particularly when you're doing pNFS: it can have a very similar kind of thing, where it's reading from some sort of aggregate of multiple servers, multiple back ends, multiple objects or whatever. But yeah, for Ceph we basically have to query the MDS and ask...
A
How long is this file? Because we can't trust that just because we got a short read, that's where the file actually ends. A short read may just mean that the end of it is sparse, so that part was never written, but the file was maybe truncated out to a larger size.
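That is, EOF is decided by the authoritative inode size, not by how many bytes an OSD happened to return. A minimal sketch of the clamping, with a hypothetical helper name (`clamp_read_to_eof`); bytes below this length that the OSD didn't return are a sparse hole and get zero-filled rather than treated as end-of-file:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: how many bytes a read at `off` should return, given the
 * authoritative inode size.  A short read from the OSD below this
 * length means a sparse hole (zero-fill), not EOF. */
static uint64_t clamp_read_to_eof(uint64_t off, uint64_t want,
                                  uint64_t inode_size)
{
    if (off >= inode_size)
        return 0;                      /* true end-of-file */
    uint64_t avail = inode_size - off;
    return want < avail ? want : avail;
}
```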
B
Okay, so you keep track of the inode size very carefully to make this work. Okay, yes.
A
Okay, yeah. Any other questions?
A
Thanks for coming, it was a pleasure to talk to you all. I'm going to plan to do at least one more walkthrough covering some other areas as well, but it'll probably be in another week or two. So thanks for coming, see you later.