From YouTube: CephFS Code Walkthrough: Kernel Client Overview
Description
Join us every Monday: https://tracker.ceph.com/projects/ceph/wiki/CephFS_Code_Walkthroughs
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
Well, welcome everybody. I'm going to do an overview of the kernel client — the CephFS kernel client. So, first thing: I also have a kind of syllabus here for what I'm going to cover in this talk. It doesn't look like a lot, but that last little bit — walking through opening, reading, and closing a file — will probably take most of the hour.
So in any case, the first thing we should realize is that the Ceph driver is a VFS driver first and foremost. It basically implements a file system, and the magic of Linux really is the fact that it has a unified VFS: you can generally access a file in the same way regardless of what sort of backing it has. That is in contrast to most older operating systems like DOS or even Windows, which often needed really specialized libraries and such to access files that might exist on a different sort of back-end environment.
The other thing you should realize is that the VFS itself is object-oriented. That's maybe a bit of a surprise for people coming from C++, but we do implement object-oriented patterns in C, and the way we do it is a little clunkier than it is in C++, but it does work for us; that's the way we operate. Allocations, frees, and stuff like that, and function pointers instead of actual methods — that sort of thing — but overall, it is an object-oriented system.
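That function-pointer pattern can be sketched in plain C outside the kernel. This is a toy illustration of the technique, not the real kernel types — the names `fs_ops`, `ramfs_ops`, and `vfs_new_inode` are invented for this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* An "operations" struct: a table of function pointers, the C
 * equivalent of a C++ vtable. Each file system fills in its own. */
struct fs_ops {
    int  (*alloc_inode)(void);
    void (*destroy_inode)(int ino);
};

/* One hypothetical file system's implementation of those methods. */
static int next_ino = 1;
static int  ramfs_alloc_inode(void)     { return next_ino++; }
static void ramfs_destroy_inode(int ino) { (void)ino; }

static const struct fs_ops ramfs_ops = {
    .alloc_inode   = ramfs_alloc_inode,
    .destroy_inode = ramfs_destroy_inode,
};

/* Generic "VFS" code dispatches through the table without knowing
 * which file system it is talking to. */
static int vfs_new_inode(const struct fs_ops *ops)
{
    return ops->alloc_inode();
}
```

The generic layer only ever sees the ops table, which is how one VFS can drive ext4, NFS, and Ceph with the same code.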
First of all, if we look at how it's laid out, what you find is that we have this sort of network of structs, and the structs have operations structs in them. So, for instance, the first stop we can look at here is the super block struct, and the super block sort of describes a mounted file system.
So in any case, here's the super_block struct. The file include/linux/fs.h is where most of the file system definitions are.
If you look through it, there's a whole bunch of fields and such, but the one we're primarily interested in is the super_operations, here in the superblock. The superblock in particular has a bunch of different operations structs: it's got ones for quotas, and for exports if you want to export via NFS, and if you're doing fs encryption, that sort of thing, there are separate operations for those as well. But if we look at the super_operations struct, you see we've got a whole bunch of function pointers, and those effectively allow us to do different operations. For example, if we want to allocate inodes, the VFS will call into this alloc_inode hook to get a new inode for a particular file system; same with destroying and freeing — destroy_inode and free_inode.
I know a lot of these things are not named very well, and some of them have morphed over time; some of them had different functions at one point, and calling conventions and other things have changed. That's the other thing you should realize about Linux: the internals are very fluid, and so things change all the time.
Okay, but in any case, we can look here at the super block, and the super block really represents the mounted file system. So whenever we mount a CephFS file system, the kernel will allocate the super block structure and do a bunch of work to fill it out. I'm not going to go into the mounting process, but basically what will happen is we'll get this, and then we will also allocate a ceph_fs_client struct, and a pointer to that will end up in this s_fs_info field. So if we look here.
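The s_fs_info hookup can be sketched in a few lines. These are trimmed-down stand-ins, not the real structures — `toy_fs_client` is an invented stand-in for ceph_fs_client, and the real super_block lives in include/linux/fs.h:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's superblock. */
struct super_block {
    void *s_fs_info;        /* per-filesystem private data */
};

/* Stand-in for the fs-specific client struct (e.g. ceph_fs_client). */
struct toy_fs_client {
    int blocklisted;
};

/* At mount time, the fs allocates its client struct and stashes a
 * pointer to it in the superblock. */
static void toy_mount(struct super_block *sb, struct toy_fs_client *fsc)
{
    fsc->blocklisted = 0;
    sb->s_fs_info = fsc;
}

/* Later, fs code recovers its private data from any superblock it
 * is handed by the VFS. */
static struct toy_fs_client *toy_sb_to_client(struct super_block *sb)
{
    return (struct toy_fs_client *)sb->s_fs_info;
}
```

The void pointer is what lets the generic superblock carry arbitrary per-filesystem state without the VFS knowing its type.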
That hangs off the super block struct, and it has a bunch of other fields as well. Here's one — just a boolean for being blocklisted, for instance. There's a boolean for have_copy_from2, which is a particular operation that the OSDs can do, and so on. We've got a bunch of others, and a bunch of stuff for debugging — debugfs entries as well.
There are workqueue structs in here too — we have an inode workqueue and a cap workqueue; I'll talk about workqueues a little later. But in any case, we've got one of these objects per mounted file system. Beyond that, there's struct inode.
So here's the struct inode. A struct inode, as most of you know, represents a file on disk, basically — a file or a directory on disk. Other things are backed by them as well: symlinks, named pipes, and all those sort of exotic inode types also have a struct inode. Anyway, we've got a bunch of these — the running kernel will have tons of them at any given time; they're cached and whatnot. But in any case, the struct inode also has an associated Ceph struct.
We can find the ceph_inode_info here, so this is what a real Ceph inode actually looks like, and you'll notice that this object has a VFS inode embedded inside it. In some cases, some of these structures in the kernel will instead allocate an auxiliary structure and just keep a pointer to it.
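The embed-and-convert trick works because the embedded struct sits at a known offset, so `container_of()` can recover the outer struct from a pointer to the inner one. A minimal sketch, with `toy_inode_info` as an invented stand-in for ceph_inode_info and the macro re-derived here rather than taken from the kernel headers:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the generic VFS inode. */
struct inode {
    unsigned long i_ino;
};

/* Fs-specific inode embedding the generic one at a fixed offset. */
struct toy_inode_info {
    unsigned long snap_id;
    struct inode vfs_inode;     /* embedded, not pointed-to */
};

/* Same idea as the kernel's container_of(): subtract the member's
 * offset to get back to the start of the enclosing struct. */
#define my_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Convert a generic inode pointer back to the fs-specific one. */
static struct toy_inode_info *toy_inode(struct inode *inode)
{
    return my_container_of(inode, struct toy_inode_info, vfs_inode);
}
```

This is why the layout has to be predictable: ceph_inode_info, the VFS inode, and the netfs context must all sit at known offsets from one another for these conversions to work.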
This is fairly new, but the netfs layer that we're starting to use in Ceph has an associated struct per inode where it tracks its state, so we embed one of those inside there as well. And we lay these out in a particular way, because occasionally we have to go back and forth between the ceph_inode_info, the VFS inode, and the netfs context, so these have to be in predictable locations.
Beyond that, we've got a bunch of other fields. Here's the vino, which tracks both the snap id and the inode number; there's the i_ceph_lock, which is a special spin lock that we use to protect most of the Ceph-specific stuff; and so on and so forth. We've got a bunch of other fields too that I won't go into now. After the inodes:
The struct dentry. A dentry is what represents a pathname component. When you go to do a lookup or something like that — we pass in path names all the time — this is one of those things in the kernel that is hugely performance-critical. We have to deal with pathname walking all the time: most of the system calls that we do — many of them take a path name, and so we have to be able to walk that path and do it efficiently, and the dentry is what makes that work. Linux has the most advanced method on the planet for doing all this, much better than any other commercial operating system.
So we track each pathname component as an individual object called a dentry, and it keeps track of things like who the parent of this dentry is, and then every dentry has an inode associated with it — or not.
In any case, the dentry cache in particular is highly tuned, and there's a great document in the kernel sources on pathname lookup and how that all works — quite complex. But in any case, the dentry object is what's of interest here, and you see the operations struct: the dentry_operations.
Okay, so here's the dentry_operations struct; there are a bunch of different ones depending on how dentries get hashed and revalidated, and so on and so forth. And then we also have this field called d_fsdata, and so we'll have a ceph_dentry_info — that's what it's called. Every time we create a dentry, a Ceph dentry, we allocate one of these structures and set the pointer to it, and it also has a back pointer to the dentry as well. In here there are things like whether we have a lease for it, what sort of generation it has, a flags field, and so forth.
All right, and beyond that, finally, the last object I'll cover is struct file, which represents an open file description. So if you go to open a file, we have a record of that open file with all sorts of info we have to track: where the current position inside the file is, how the file was opened — whether it was read-only or read-write, that sort of thing. All that stuff is tracked here inside this thing we call struct file, which technically we call a file description. Most of us are familiar with file descriptors.
The file descriptor is just a number, but it is a number that is an index to a particular structure in the kernel, and that structure is the struct file. And of course it's got its own operations struct too.
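The descriptor-versus-description split can be modeled in a few lines. This is a toy model of the concept only — the real per-process table is more elaborate — and `install_fd`, `fd_to_file`, and `MAX_FDS` are invented names for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* A file *description*: the open-file state itself. */
struct file {
    long f_pos;     /* current position within the file */
    int  f_mode;    /* how it was opened: 0 = read-only here */
};

/* A file *descriptor* is just an index into a table of pointers
 * to descriptions, one table per process. */
#define MAX_FDS 16
static struct file *fd_table[MAX_FDS];

/* "Open" installs a description and returns the lowest free slot. */
static int install_fd(struct file *filp)
{
    for (int fd = 0; fd < MAX_FDS; fd++) {
        if (!fd_table[fd]) {
            fd_table[fd] = filp;
            return fd;
        }
    }
    return -1;      /* table full, analogous to EMFILE */
}

/* Every read/write syscall starts by resolving fd -> struct file. */
static struct file *fd_to_file(int fd)
{
    return (fd >= 0 && fd < MAX_FDS) ? fd_table[fd] : NULL;
}
```

This is why two descriptors can share one position (after dup) while two separate opens of the same file do not: sharing happens at the description, not the number.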
And so there's a whole bunch of file operations here. Once you've opened the file, you can read from it and you can write to it. We also have read_iter and write_iter, which were added later, and this is another thing you'll notice in the kernel: because we have so many file systems, it's really hard to change these operations all at once. So what we find a lot of the time is that people will add new ones, patch some of the file systems to use them, and then leave the rest to be done later. We have that here: there are some file systems that were never converted to use read_iter or write_iter, so they use the old-style read/write construct, and we have handling for old-style read and write operations.
So we handle both in some places — we'll see that point later. And then we have this iterate and iterate_shared here; a lot of these names are not very descriptive, unfortunately. iterate_shared is all about readdir.
All right, I'll take that as a note. So now what I'm going to do is cover a real simple thing: let's suppose we want to open a file on a Ceph file system — let's open it read-only — then read the first 4k of that file, or whatever the first page is, and then close that file. Let's walk through how that actually works.
A
So
the
first
thing
we
do
is
whenever
you
call
into
the
kernel
or
whenever
you
call
a
system,
call
right
what
we
usually
call
it
through
libsy,
so
libc
will
then
go
and
stuff
the
right
arguments
into
all
the
registers
or
into
onto
the
stack
dispatch
into
the
kernel.
You
know
on
your
architecture,
and
you
know
it:
does
it
a
different
way,
but
anyway
it
will
do
that,
and
so
the
first
thing
we're
going
to
do
is.
A
So
when
we
go
to,
you
know.
B
A
B
A
And
we
do
you
know,
but
basically
you
know
the
kernel
needs
to
handle
legacy
syscalls
as
well.
So
in
this
case
we're
gonna
call
the
legacy
open
syscall,
and
it's
going
to
do
this.
Call
this
deuces
open
function.
If.
Anyway, in any case, if you look at all these SYSCALL_DEFINE macros, you'll notice — this one's called DEFINE4, this one DEFINE3 — there are different definitions depending on the number of arguments.
And okay, here we go: so now, for openat2, we've got a directory file descriptor, a file name, and the open_how struct, which gives us some information about what it's supposed to do when it opens — not just the open mode, but also some things like whether it needs to follow links.
That kind of stuff. So the first thing we do is call this getname. You'll notice up here this `const char __user *filename`, and this __user annotation lets us know that this is not a kernel pointer; this is a userland pointer. The address space inside your process is different from the one inside the kernel, and so when we get a pointer from userland, we have to interpret it in the context of the address space of that process.
So what we do is call this getname on the file name here, and what that does is go and allocate a big chunk of memory, and then the getname function will go and copy that string in there very carefully — it's going to be strncpy_from_user, basically — copying it into a buffer that the kernel can work with. Because when we are grabbing stuff from userland, we can't use the memory directly that was pointed at, because it could change. If you pass in a path name and the kernel starts processing it, and then some other thread comes in and overwrites that path name with some garbage, or with something that would cause a buffer overrun, that could be real problematic later. So we make a copy first, and then we vet that copy very carefully.
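The copy-then-validate discipline can be shown in a userspace sketch. This mimics the shape of what getname()/strncpy_from_user() do, not their actual implementation — `copy_name` and `NAME_MAX_LEN` are invented for the illustration:

```c
#include <assert.h>
#include <string.h>

#define NAME_MAX_LEN 64

/* Copy the untrusted string into a private buffer FIRST, then
 * validate only the copy. Once copied, a concurrent writer to the
 * source buffer can no longer change what we are about to use. */
static int copy_name(char *dst, const char *untrusted_src)
{
    size_t i;

    /* Bounded copy; the kernel uses a faulting user-space copy,
     * here a plain byte loop stands in for it. */
    for (i = 0; i < NAME_MAX_LEN - 1; i++) {
        dst[i] = untrusted_src[i];
        if (dst[i] == '\0')
            break;
    }
    dst[NAME_MAX_LEN - 1] = '\0';   /* always NUL-terminated */

    /* Validate the private copy, never the source. */
    if (dst[0] == '\0')
        return -1;                  /* empty name, like -ENOENT */
    return 0;
}
```

Checking the source buffer and then copying it later would reintroduce exactly the time-of-check/time-of-use race the kernel is avoiding.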
So anyway, from there we look at the file descriptor flags — the fd flags that get passed in — and then we call this function called do_filp_open. Sometimes you'll see a file in the kernel called filp, for "file pointer"; it's just a naming convention that's sometimes used. But in any case, we call this do_filp_open.
So we have this do_filp_open, and this is essentially where the path walk happens. Again, we have to walk a path name, which is just a string, and this is a hugely complicated process, but basically we call into set_nameidata here, which creates the structure called struct nameidata. So this thing here is sort of a state-tracking structure for a path walk: the nameidata is basically the thing that keeps track of where we are in the path-walking process and how that path walking needs to be done. Again, I'm not going to go into this in great detail, because it's just way too complicated and there are better articles for it.
In any case, we do all this, we go down into do_filp_open, and then we do a path_openat, and you'll see here first it does this thing with LOOKUP_RCU. One of the really advanced, really cool things about the kernel's path walking is that it can do it under RCU — it can do it locklessly — and this is a huge performance win; I remember when they added the lockless path walking, because before that it took locks all the time. So in any case, it will attempt to do the RCU lookup first, but that doesn't always work: if you have to go do things that sleep, it will end up having to come out of the RCU path walk, and if we can't do an RCU path walk, we go back in the other way.
It will return ECHILD, which is sort of a nonsensical error code in this sort of code, but we use that to say that the RCU lookup can't be done and you've got to do what's called a ref-walk, where we take references and locks and things like that and walk again. It's a lot less efficient, but it's really not that bad.
From there, if we get an ESTALE — like if you're working with NFS in particular, you might get a stale file handle; sometimes we see this in Ceph too — then we have to do something called a revalidation lookup, where we not only path walk through, but if we have any cached entries, we have to go and validate that they're actually correct.
So if that happens — but basically, at the end of all that, we end up calling into path_openat. If you look here, it's got a special path for handling O_TMPFILE; here's another one for doing an O_PATH open, which is sort of an open that's opening the path name itself — it's useful for certain things.
But for most files we'll end up down here in this link_path_walk — that's where the magic of the path walk really happens. And most of the time we're not terribly interested in most of the components; they're just directories to get through to the next one. But the last little bit has to be opened, and so if we go look at that:
We do two different types of opens. Most of the time, for most file systems, what we'll do is look up first, and then once we get the inode that's associated with the last path component — with that dentry — we'll then open that file. It turns out that that is a hugely wasteful process on a network file system.
An atomic open basically says: instead of doing a lookup, we're going to issue the open directly, and then if it turns out that we get back ENOENT and it wasn't a create, we can just record that it's a negative lookup. And so that saves us some round trips to the server. This was hugely helpful once we moved to:
NFSv4, because originally NFSv4 had to do two round trips to the server to do this. In any case, eventually we're going to end up down in this lookup_open function, which is again hugely complicated. You can see here it does a d_lookup first — tries to look up in the dentry cache — and then it comes down here, but eventually, if we have a way to do an atomic open, we'll do that. So if we look here — that's our first stop inside the actual Ceph code.
We go down to ceph_atomic_open. Now, if it turns out we already know what the lookup result is — if we already have the inode for this thing — we skip this; we do it only when you don't have a dentry at all, or it's a negative dentry, and then we'll go try to do the atomic open.
So in any case, for Ceph, the first thing we'll do is take different steps depending on whether it's an O_CREAT open; we'll try to do an async open in some cases, if that's enabled.
Basically, we prepare an open request here. Inside the kernel client we have this thing called a ceph_mds_request, so whenever we need to call out to the MDS, we'll allocate one of these guys, fill it out, and then send it off to the server — to the MDS — to do its thing. So in any case, this prepare_open_request is what does that. In this case we're doing a read-only open, so we don't care about creates.
Yeah, and here's what it looks like: you've got a bunch of fields here that get filled out when you have the dentry, or an old dentry in some cases — like if you're doing a rename, you have an old dentry.
So we build an open request first, and then we eventually call ceph_mdsc_do_request. This is the part where we call in to actually fire the thing off and send it to Ceph. The do_request just calls submit_request and then waits on the reply. Sorry — that's the wait; I don't want to do that yet. So in any case:
submit_request. What it does here is it will get cap references: if it's got our r_inode set, or if we have a parent pointer, we're going to take cap refs for those and pin them, basically. Then it decides which MDS to send it to — again, it's a pretty complex process, but we're going to skim over that. Basically it's going to call into there, and then check to make sure the session's okay, blah blah blah.
Then eventually we come down here to send_request. That will go and stick the thing onto — hand it off to — the messenger, to be sent out onto the wire. Once we're done there, we put references or whatever, and then in this case we're going to wait for the reply to come in. Yeah, so anyway, down here we've submitted the thing, we call down in here to wait for the reply, and when the reply comes in, we'll process it.
Right, so we're down here: we will return here, and so we've got the result now. At this point — in this case we're not going to get ENOENT, because the file is already there — there's some special handling here, because we are sort of doing a combined open and lookup at the same time.
So this will end up essentially fully instantiating the struct file and giving us a file description that we can then use to do other things. All right, and then all that will unwind: once we've got a file description, that goes back into the VFS layer, the VFS will then attach a file descriptor to it, and then we hand that file descriptor back to userland.
So now, what do we do? We go to the syscall definitions and find the one for read, in fs/read_write.c. All right, and here's read.
Yeah, so it calls into this function called ksys_read, which goes and figures out where the position of the file is and whatnot, and then calls into this function called vfs_read. A lot of the generic VFS-layer handling is prefixed with this vfs_ underscore; when you see those, usually that's a pretty good indicator that it's a generic structure or a generic function that is used across different file systems.
It does some checks: can we actually read from this file descriptor, is the buffer we're dealing with in our address space — and eventually we come down here and it's going to call this rw_verify_area as well, which does some things like making sure we're not trying to read at an awful negative offset into the file or something. There are all sorts of ways that you can try to trick the kernel, so it has to check all this.
So, rw_verify_area, and then we come down here: if we've got this file read op, we'll call that. We don't have that for Ceph — we have a read_iter function instead — and if that is set, we will call new_sync_read.
The reason we have this is that in the original kernel we would just pass down a buffer and such, but about seven or eight years ago we went pretty big into using this struct iov_iter inside the kernel, all over the place. What this is: when you need to iterate over a buffer of some sort — a user buffer, say — we can pass this in, and we can have these iterators of different types. So there are some that refer to userland buffers, some that refer to kernel buffers, some that may refer to a pipe, that sort of thing — all sorts of stuff. And so we can use a lot of the same code without having to worry so much about what the destination or the source buffers look like.
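The iov_iter idea can be sketched as a tagged iterator that the copy routine consumes instead of a raw pointer. This is a toy model — the real struct iov_iter is far richer, and `toy_iter`, `ITER_USERBUF`, `ITER_KERNBUF`, and `copy_to_iter` here are invented names for the sketch:

```c
#include <assert.h>
#include <string.h>

/* What kind of buffer the iterator points at. */
enum iter_type { ITER_USERBUF, ITER_KERNBUF };

struct toy_iter {
    enum iter_type type;
    char *buf;          /* current position in the destination */
    size_t count;       /* bytes remaining */
};

/* Generic copy routine: I/O code calls this without caring what
 * kind of buffer it is filling. A real kernel would dispatch to
 * copy_to_user() for user buffers and memcpy() for kernel ones;
 * in this userspace sketch both reduce to memcpy. */
static size_t copy_to_iter(const void *src, size_t len,
                           struct toy_iter *it)
{
    if (len > it->count)
        len = it->count;        /* never overrun the destination */
    memcpy(it->buf, src, len);
    it->buf += len;             /* advance the iterator state */
    it->count -= len;
    return len;
}
```

With this shape, the same read path can fill a user buffer, a kernel buffer, or a pipe just by handing in a differently-tagged iterator.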
Let me take a pause here for a second. Anybody have any questions? Everybody asleep? No? Okay. In any case, all right, so anyway, here we are at the Ceph read_iter, and basically we'd call this any time we get a read(2) call from userland.
We'll also call this for things like: if you've got an mmapped area on a Ceph file and you fault in that page — you try to read from the mapping — it will turn that into a read to populate the page. Sorry — actually, that will call readpage; it doesn't go through this part. But in any case, the read_iter is where we call in to handle a read, and we do all sorts of stuff.
We have some special handling depending on whether it is an O_DIRECT open or not. If you're not familiar, O_DIRECT is a way to tell the kernel that you want to bypass the page cache and either read into or write from some buffers directly, and this will almost always cause an on-the-wire read or write.
Normally the kernel wants to cache file data so that it doesn't have to go and re-read it whenever it needs to be accessed again. One of the operating principles of Linux, from very far back, is that any RAM you have that's not used is wasted, and so we try to use most of our RAM in the kernel for page cache — for buffering disk reads, or buffering reads from file systems.
In this case, we have to decide whether we can use the cache or not. There are some conditions — if we're not doing direct I/O, and we don't have this Ceph sync flag, which an ioctl can set to force a file to be synchronously accessed:
Then we will try to get the cache cap — the Fc cap. There's also this lazy I/O stuff here, which is somewhat poorly defined, really, but it's there. So the first thing we'll do here is call into ceph_get_caps. If we've already got caps for this inode, we'll take references to them to ensure that they don't get released while we're trying to use them; but if not, then we have to call out to the MDS and request them. ceph_get_caps does that — let's take a real quick look.
Yeah, so we've got a bunch of stuff here — again, a hugely complicated process — to go and try to request caps from the MDS. Basically, if we don't have them, we'll just call try_get_cap_refs here, and that will call out to the MDS to say: grant me these caps if you can. Okay, in this case let's just pretend we got the caps — in most cases we do. So, depending on whether we got caps or not: if we didn't get the cache cap, then we're going to access this thing synchronously, and so we have this direct read/write code here that handles doing direct I/O:
If it's a direct I/O read or write, that is. If it's not direct I/O, we can call a synchronous read here, and that will just go and actually issue a read on the wire for the particular range that we're trying to read — or a number of reads, if it's very large, for instance. But most of the time, we end up getting cache caps and we call down here to generic_file_read_iter.
This calls back out into the VFS, and what it does is — basically, if we're not doing direct I/O or synchronous I/O, then we are doing page cache I/O, and so all the reads and writes in this case will be page-aligned. This is just a generic helper function; it has paths for doing direct I/O and whatnot, and then we'll call those, but eventually it's going to call down here and do this filemap_read.
filemap_read calls down here: we'll grab page cache pages, get them prepped and ready to go, check the size — it does a bunch of other checking too.
Yeah, down here — and you'll notice here too: historically, the kernel has always operated on individual pages. We are in the middle of a huge transition in the kernel right now. The problem is that memory sizes have gotten really huge, and so tracking 4k pages in this day and age is pretty wasteful.
You can imagine: we have to have a structure for every page in the kernel, and we have millions of these things because they're all tracked at 4k granularity. It turns out most MMUs can work with larger pages without any problems, but the kernel is not currently equipped for that, so we're moving to a new structure called a folio.
Not everything in the kernel was really equipped to deal with larger pages — the page cache in particular never did that — so we're trying to move the kernel's page cache to eventually work with larger pages, and as part of that, we're converting a lot of the places where we have traditionally operated with page pointers to this thing called a folio. That's what this is all about.
In any case, here's some batch handling if you've got a big folio, but in any case we have this folio_test_readahead: if we're able to do a readahead, the kernel will try to expand that read to something larger. It may not be in here — yeah, so in any case we call this, and then it calls down to this ondemand_readahead function, and then eventually we're going to get into calling the readahead function for Ceph.
Okay, yeah — Ceph traditionally had its own readahead function, and the readahead operation is actually fairly new. If you look at older kernels you'll see something called readpages, which is the old-style method of doing it, and which was pretty wasteful with page locking. So we've moved to a new one — like I said, the internals are very fluid in Linux, and so we change stuff all the time — and readahead is the new way to do multi-page reads. And so we end up calling into this netfs_readahead.
The netfs layer has a new operations struct right here, so this is where you can see the netfs request ops.
It's a much more natural interface for dealing with network file systems. Rather than the VFS handing us a pile of pages and saying, okay, set all these pages up for a read and fill them, the netfs layer calls into us and says: here's a pile of pages we've already prepared — go and fill them for us. And so we end up with this.
So in any case, eventually we're going to end up with this Ceph netfs issue-read function, and this is where the actual reading happens. The netfs layer has given us this netfs I/O subrequest.
When we issue a read, that read may span multiple Ceph objects, for instance, and so we might have to issue two actual read calls onto the wire to fill it. So the first thing it does is call a number of functions to sort of expand the readahead, and then we also have to clamp the length:
Because we can't read beyond the end of an object in Ceph, we have to clamp the length to where that object ends — that's what this object mapping is all about. But eventually we're going to get down to doing the netfs read, and we set up a new OSD request; that's what this guy does.
That goes down into here: we get what's called an xarray of pages from the iov_iter — the iterator is what we're copying into — so we go and grab an xarray from this thing, and then we call this get-pages-alloc helper on the iterator, which basically grabs a bunch of page cache pages for us — or, sorry:
A
It
grabs
the
page,
cache
pages
in
the
right
locations
and
gets,
and
you
know,
pins
them
basically
and
gives
us
back
an
array
of
these
pages,
and
then
we
hand
that
down
we
stuff,
all
that
into
the
osd
request.
A
We set our callback for when it completes, then we go and start that request, and then we return, so the request is allowed to run asynchronously. If you've got multiple reads, we're going to fire several of them off, boom boom boom, and then collect all the replies.
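The fire-several-off-and-collect pattern is essentially a completion counter. A minimal sketch, with hypothetical names (`read_request`, `subrequest_complete`); in the real client the OSD client's completion callbacks play this role:

```c
#include <assert.h>

/* Sketch of fire-and-collect: each subrequest completes through a
 * callback, and the parent request finishes when the last reply
 * comes in.  Hypothetical miniature, not the kernel structures. */
struct read_request {
    int outstanding;   /* subrequests still in flight */
    int done;          /* set when the last reply is collected */
};

static void subrequest_complete(struct read_request *req)
{
    if (--req->outstanding == 0)
        req->done = 1;   /* last reply: finish the whole request */
}

static void issue_subrequests(struct read_request *req, int n)
{
    req->outstanding = n;
    req->done = 0;
    /* In the real code each of these goes onto the wire and completes
     * asynchronously; here we "complete" them inline. */
    for (int i = 0; i < n; i++)
        subrequest_complete(req);
}
```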
A
Then, when the reply comes in, the OSD reply handler will call finish_netfs_read, and then we go and look at the result. It may be that there's no object there; if there's no object there we get back an ENOENT, and then we pretend that that is a zero-length read. If we get blocklisted, then we have to deal with that.
A
If the read was shorter than we had expected, say we tried to read a whole 4 MB object off the OSD and it turned out it wasn't that long, we need to tell the netfs layer to clear the tail. That's what this flag does. And then eventually we call this netfs_subreq_terminated, which tells the netfs layer, okay, we're done, and here is how it went.
A
It either ended with an error or with a successful read, and then we go and put the page references. We have to hold references to the pages while we're filling them to make sure they don't get purged out of the cache, and then, when we're done, we put those references.
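The reply fixups just described can be sketched as one small function. `fixup_read_result` is a hypothetical name for illustration; the point is that a missing object (ENOENT) is reported as a zero-length read, and a short read gets its tail zeroed so stale data never reaches the page cache:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Sketch of the finish-read fixups: -ENOENT becomes a zero-length
 * read, real errors pass through, and a short read has its tail
 * zero-filled.  Hypothetical helper, not the kernel function. */
static int fixup_read_result(int result, char *buf, size_t expected)
{
    if (result == -ENOENT)
        result = 0;            /* no object there: pretend zero-length */
    if (result < 0)
        return result;         /* real error, e.g. we got blocklisted */
    if ((size_t)result < expected)
        memset(buf + result, 0, expected - result);  /* clear the tail */
    return result;
}
```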
A
Ceph has a lot of really strange handling for things: if we hit the end of the file or a hole, we zero out the end. In any case, we've got all this stuff, and then eventually we bubble all that back up to userland and say, okay, we got a read. So let's say we read 100k or whatever.
A
All right, and now close. If it's a writable file descriptor, we're generally going to flush any dirty data.
A
fput is a wrapper around this function __fput, which does a bunch of other stuff too, but basically we're going to call this, and eventually we call down to the filesystem's release operation.
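The fput/__fput idea is just refcounting: only when the last reference to the struct file is put does the VFS call the filesystem's release hook. A toy sketch with hypothetical names (`toy_file`, `toy_fput`), not the real kernel structures:

```c
#include <assert.h>

/* Miniature of the fput model: the file object is refcounted, and the
 * filesystem's ->release runs only when the count drops to zero. */
struct toy_file {
    int count;                        /* like the kernel's f_count */
    int released;                     /* did ->release run? */
    void (*release)(struct toy_file *);
};

static void toy_release(struct toy_file *f)
{
    f->released = 1;   /* here the fs would flush data, drop caps, etc. */
}

static void toy_fput(struct toy_file *f)
{
    if (--f->count == 0)
        f->release(f);   /* last reference: call down to the fs */
}
```

This is why a close on a duplicated descriptor doesn't trigger the filesystem's release: another reference still holds the file open.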
A
And at that point, that's pretty much it. The VFS will do its thing to free that file descriptor and destruct the struct file as well, and then that's the end.
A
That's about all I'm going to cover today. Any questions?
B
Jeff, I had a question about the function ceph_atomic_open. In the comment it said: if the file or symlink is non-existent, the VFS retries. Does the VFS keep retrying, or what's the idea behind that?
B
If you go to the comment on top, it says if the file is non-existent... yeah.
A
Oh, that looks like that comment's just bogus.
A
If we get a non-existent dentry... well, the calling convention for atomic_open has changed, so I don't think this comment is actually correct. But basically, if it turns out that the file is non-existent, we don't really retry; it just returns. At that point, instead of doing a separate lookup, we just do a...
A
We do a read, or rather an open, and try to get it that way. Yeah, this comment is bogus; I should probably send a patch to get rid of it. At one point we would return 1, and we may still do that internally, but a lot of that has been changed to use this finish_open structure, and finish_open is what actually handles most of this. In fact, take a quick look at it if you want.
B
Can you repeat that? You said one of the network calls in atomic_open is synchronous. Is it a submit-and-wait, or is it...
A
Yeah, it is synchronous, because we can't really return otherwise. Well, we do allow async opens sometimes. Atomic open is called in a very special circumstance: we either don't have a cached dentry for this thing, or we have a negative dentry.
A
So we either don't know what the provenance of this dentry is, or we know that it doesn't exist, and then the VFS will call atomic_open to do the open and the lookup at the same time. That allows us to instantiate the dentry if we need to, and it also handles the case where it turns out we're wrong: when we have negative dentries, sometimes things can change on the MDS and the client doesn't know. If that happens, atomic_open allows us to recover.
A
Say we thought this file didn't exist, but now it turns out that it does, and we're not doing a create.
A
We will issue the open call to the MDS, and when that comes back we can go and instantiate the dentry: this dentry does exist, we can attach the inode to it now and turn it into a positive dentry.
A
This is a pretty complex situation and, quite frankly, Al Viro hates it. We've been trying for about a decade now to figure out a cleaner way to do this, but it's unfortunately not trivial.
A
Atomic open is kind of one of those kludgy things that we added to work around a problem. It was never very elegant to begin with, and it still isn't. But yeah, in this case we will always do a synchronous RPC to the MDS.
A
So we are always going to call the MDS there, because we have to: we either think the thing doesn't exist, or we don't know. Either way, we have to call the MDS and wait for the response to come in.
B
Okay, yeah. And the final question I had: in one of the PRs introducing async I/O in libcephfs, one of the questions asked was, how do we detect end-of-file when reading?
B
And you said you need to, because we read from a bunch of objects, so you always keep track of the inode size, and then you read from those. Is that pretty unique to our filesystem, that we need to do that to detect end-of-file?
A
Yeah, to some degree. We have some similar situations with NFS as well, particularly when you're doing pNFS: it can have a very similar kind of thing, where it's reading from some sort of aggregate of multiple servers, multiple back ends, multiple objects or whatever. But yeah, for Ceph we basically have to query the MDS and ask...
A
How long is this file? Because we can't trust that just because we got a short read, that's where the file actually ends. A short read may just mean that the end of it is sparse, so that part was never written, but the file was maybe truncated out to a larger size.
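That is, EOF is decided by the authoritative inode size, not by how many bytes an OSD happened to return. A minimal sketch of the clamping, with a hypothetical helper name (`clamp_read_to_eof`); bytes below this length that the OSD didn't return are a sparse hole and get zero-filled rather than treated as end-of-file:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: how many bytes a read at `off` should return, given the
 * authoritative inode size.  A short read from the OSD below this
 * length means a sparse hole (zero-fill), not EOF. */
static uint64_t clamp_read_to_eof(uint64_t off, uint64_t want,
                                  uint64_t inode_size)
{
    if (off >= inode_size)
        return 0;                      /* true end-of-file */
    uint64_t avail = inode_size - off;
    return want < avail ? want : avail;
}
```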
B
Okay, so you keep track of the inode size very carefully to make this work. Okay, yes.
A
Okay, yeah. Any other questions?
A
Thanks for coming, it was a pleasure to talk to you all. I'm going to plan to do at least one more walkthrough covering some other areas as well, but it'll probably be in another week or two. So thanks for coming, see you later.