From YouTube: CephFS Code Walkthrough: kclient overview
A: So, anyway, this is a talk about, a little overview of, the kclient code. I'm not going to go into a lot of real depth here; I'm just going to give sort of a lay of the land in case you're interested in doing some work in the kclient and want to get started.
So the first thing, when you pull down a kernel tree, is to understand what the kclient is, and really what it is, is a re-implementation. Ceph is a big, complicated tree that we've all done some work in, but the kclient is really an independent implementation of all that functionality. It doesn't really share much with the userland client, aside from maybe a few header files here and there. So we've got several pieces in the kernel to handle all the Ceph client stuff, and the first piece is libceph, which is a kernel module; really, it's just the underlying transport layer for the Ceph code in the kernel.
All of this code is contained in net/ceph, where most of it is anyway, aside from a few header files that are under an include directory. In here, you can see there's a bunch of stuff that you can pretty much understand from the names: there are parts for handling authentication, there's some stuff for handling crypto. The messenger is here, and there are v1 and v2 variants. We have an osdmap, there's the OSD client code, the mon client code, stuff for handling OSD and mon maps, et cetera, et cetera. In any case, all of this code basically lives in here. I'm not going to go over it in great detail, but I'll talk about a couple of pieces of it in a minute. Aside from that, we have two more pieces in the kernel. There is drivers/block.
And so this is the rbd driver. It's not very big; basically it just calls down into libceph to do most of what it does. This is how the RADOS block device is implemented, and you can see it's mostly self-contained in a single file, and it's not huge. It's maybe a couple thousand lines... let's see... or seven thousand lines, excuse me. And then we've got:
fs/ceph, which is where all the CephFS code lives. This is a complete VFS-layer kernel driver for the CephFS client. (rbd, when you look in there, what it's doing is creating a block device driver.) Now, when you think of file systems in the kernel:
When we're dealing with a file system, most of what we're doing is responding to user-driven events, things that come in from, say, syscalls, or it could be nfsd activity or smbd activity, probably, these days; any of that kind of stuff. Most of it comes in through the VFS layer, and the VFS layer is essentially object-oriented too, in a sort of poor man's object orientation, like a lot of the kernel is. Anyway, there's a main header file for it.
So, essentially, in include/linux/fs.h we have a bunch of different objects. The canonical one for a driver is this struct file_system_type thing, and it has a couple of operations vectors in it: there's init_fs_context, there's mount, which is for the old mount API, and there's kill_sb, which will kill the superblock eventually, once you're ready to destroy it.
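For illustration, here is a minimal sketch of what such a file_system_type declaration looks like; the cephfs_fs_type and cephfs_init_fs_context names here are illustrative stand-ins, not the actual fs/ceph definitions:

```c
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/module.h>

/* Hypothetical fs_context init hook; the real one parses mount
 * options and sets up per-mount state. */
static int cephfs_init_fs_context(struct fs_context *fc);

static struct file_system_type cephfs_fs_type = {
	.owner		 = THIS_MODULE,
	.name		 = "ceph",		/* what mount -t matches on */
	.init_fs_context = cephfs_init_fs_context, /* new mount API */
	.kill_sb	 = kill_anon_super,	/* tear the superblock down */
};

/* register_filesystem(&cephfs_fs_type) at module init is what makes
 * the "ceph" type available to mount(2). */
```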
So what happens when you go to mount? We end up with a struct super_block object, and a superblock is what represents a mounted file system in the kernel. We've got a bunch of fields in here, and a lot of these may look self-explanatory, but in particular we've got s_fs_info. This field will end up pointing at what we call the ceph_fs_client. So in this generic superblock you'll have a generic pointer, a void * pointer, and it points to the per-superblock info that we keep for Ceph. From here, there are two main things that we deal with.
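As a sketch of that pattern (the accessor name below is hypothetical; fs/ceph defines its own inline for this):

```c
#include <linux/fs.h>

/* Forward declaration of the kclient's per-superblock state. */
struct ceph_fs_client;

/* s_fs_info is a bare void *, so every file system wraps the cast
 * in a little accessor like this (name is illustrative). */
static inline struct ceph_fs_client *sb_to_fs_client(struct super_block *sb)
{
	return sb->s_fs_info;
}
```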
The big one here is struct inode. This represents an inode in the kernel, and these structures are usually embedded inside of the ceph_inode_info. The superblock is a little different: it has a pointer to its private info, but when we get down to inodes, what we do is embed them. Then, from here, we've got:
Struct dentry, which represents a path name component, per superblock. Linux actually has probably the most advanced dentry caching, or lookup handling, in the world; it's one of the places where it really shines compared to anything else that's out there. But in any case, we have:
There's also a struct file, which represents an open file; I'll look at that in a minute. Each of these structures, or objects, usually has an operations struct hanging off of it. This is sort of where all the class methods for the thing end up being. So you've got a bunch of dispatch operations here for various things, like for a lookup, if you wanted to look something up in a directory.
When you're going to do the lookup, it's going to call this lookup vector. We've got a bunch of other stuff in here: there's create, to create a file in there, and link, unlink, symlink, all that stuff. Whenever a syscall comes in, we usually end up doing some pathname walking to go do lookups, and then we end up calling a sort of terminal operation that will actually do the thing we want to do. One of the particular ones
to note is this atomic_open, which matters for network file systems in particular. Usually, when we go to open a file on, say, a local file system, we're going to do a path walk down to where the thing is, look it up, turn it into an inode, and then turn around and issue an open call against the thing. But that's a bit of a waste for a network file system, because if you have to do a lookup and then go back and open the thing, that's two round trips to the server. We don't want to do that.
What we'll do instead is just issue an open request, and if it comes back with an ENOENT or whatever, then we just say: okay, this dentry just didn't exist. So we have a combined lookup-and-open call here in atomic_open. In any case, you can see all this stuff; some of these operations are also somewhat filesystem-specific,
like fiemap, which some file systems just generally don't implement. If you don't define these operations, usually what happens is that if they're set to NULL you get a generic error back when you try to call them, or there's some sort of fallback code that will do something different. And so each of these objects has one of these operations structures hanging off of it. There's an inode_operations.
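Here is a minimal sketch of the shape of such a table, with hypothetical handler names (the field names are the real VFS ones, though their exact signatures have varied across kernel versions):

```c
#include <linux/fs.h>

/* Hypothetical handlers standing in for the real fs/ceph/dir.c ones. */
static struct dentry *myfs_lookup(struct inode *dir, struct dentry *dentry,
				  unsigned int flags);
static int myfs_unlink(struct inode *dir, struct dentry *dentry);
static int myfs_atomic_open(struct inode *dir, struct dentry *dentry,
			    struct file *file, unsigned open_flags,
			    umode_t mode);

static const struct inode_operations myfs_dir_iops = {
	.lookup		= myfs_lookup,      /* resolve one path component */
	.unlink		= myfs_unlink,      /* remove a name */
	.atomic_open	= myfs_atomic_open, /* combined lookup + open */
	/* .create, .link, .symlink, ... the other terminal operations */
};
```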
Here's the super one, for if you want to mount or unmount, that kind of stuff, and to create new inodes in the superblock. Because an inode can be different sizes for different file systems, we have to have an allocator method in here as well.
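A sketch of that allocator hook, again with made-up names (alloc_inode is the real super_operations field; a real file system would use a slab cache rather than kzalloc):

```c
#include <linux/fs.h>
#include <linux/slab.h>

/* Illustrative per-fs inode with the VFS inode embedded at the end,
 * mirroring the layout the talk describes for ceph_inode_info. */
struct myfs_inode_info {
	u32 flags;               /* fs-private state would go here */
	struct inode vfs_inode;  /* the embedded generic inode */
};

static struct inode *myfs_alloc_inode(struct super_block *sb)
{
	struct myfs_inode_info *ci;

	ci = kzalloc(sizeof(*ci), GFP_KERNEL);
	return ci ? &ci->vfs_inode : NULL;
}

static const struct super_operations myfs_super_ops = {
	.alloc_inode	= myfs_alloc_inode,
	/* .free_inode, .statfs, .put_super, ... */
};
```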
Now, let's talk more about the Ceph-specific stuff. So, like we were talking about, most of the info for the CephFS client is in fs/ceph/super.h.
Here, for instance, is ceph_inode_info, which is the inode representation for a Ceph inode. If you look in here, we have a bunch of fields: Ceph-specific layout info, say, and there's stuff for caps, an rbtree to hold cap structures, which we'll talk about a little later. And then, finally, down here is where the VFS inode is actually embedded. Note that we embed a full VFS inode at the end of the thing, so
to allocate an inode, you essentially get all this junk at the top and then the VFS inode allocated after it. You can see here, too, we've also got some ifdef'd-out pieces like this: if you don't compile with FSCACHE, you won't get this field in here, because we don't need it.
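Because the VFS inode is embedded, going from the struct inode the VFS hands you back to the containing fs-private structure is just pointer arithmetic; a sketch of the standard idiom (the wrapper name is illustrative; fs/ceph has its own inline for this):

```c
#include <linux/fs.h>
#include <linux/kernel.h>

struct myfs_inode_info {
	u32 flags;
	struct inode vfs_inode;  /* embedded VFS inode */
};

/* Given the embedded inode, recover our wrapper structure. */
static inline struct myfs_inode_info *MYFS_I(struct inode *inode)
{
	return container_of(inode, struct myfs_inode_info, vfs_inode);
}
```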
Also, when you go to mount the thing: per superblock, when we mount, what happens is we open a bunch of sockets to different servers. We first have to figure out where the MDS is, then we go talk to it, and then we might have to go fetch some stuff off an OSD later. And so, at the superblock level,
we have this ceph_fs_client struct, and it actually points to another thing called the ceph_mds_client, which is the part that talks to the MDS. We've also got some things in here like workqueues, that sort of thing, and again ifdef'd pieces: if you don't compile with debugfs, you won't get any of that stuff, and the same with FSCACHE.
The interesting bit here is this ceph_mds_client. This is the thing that represents our connection to the MDSes as a whole, and you can see here we've got an array of sessions, and then in each session,
each of these guys has a connection as well, and this thing dials back into libceph; it represents the connection to the Ceph daemon on the other end.
The ceph_connection represents the actual TCP socket that it uses to talk to the daemon on the other end, and there are some other pieces in here too. One of the interesting bits: we've had this sort of unfortunately named mutex in here, and pretty much everything is done under it. The libceph code is highly serialized, probably unnecessarily so, and my suspicion is that this kills performance. One of the things I'm actually looking at doing right now, or have been looking at over the last month or two, is trying to at least get some pieces of the activity out from under this mutex, so we don't need to serialize quite everything. In any case:
So let's look at what happens. Let's say we've mounted a file system, and now we want to open a file, maybe, and write to it. Okay, that's a pretty simple sort of operation that we do.
The first thing we're going to do is walk down the path until we get into CephFS, maybe doing lookups for the different path components, until we get to the file we want to open. And in the directory where we're going to open that file, we're going to issue an atomic_open. When you get to atomic_open, you get a bunch of different parameters.
There's a directory; here's the dentry, the name of the entry that we want to open; here's a struct file that is not quite filled out yet, because we haven't actually done the open; and then there are some open flags, and then a mode, for if we're doing a create of a new file.
So essentially what we'll do is walk down this thing. If the name is longer than NAME_MAX, we should send back ENAMETOOLONG. Then we've got some other stuff too: if this thing is a create, we're going to do some setup, and if we're not being looked up, then this dentry is not negative. So there's a whole bunch of rules for doing atomic opens.
It's probably more complex than it needs to be; Al Viro has been raging about this for years, but he just doesn't want to change how it works. In any case, eventually we're going to come down here and prepare an open. This is actually the request that we're going to send to the MDS to say: hey, we want to open this file. So we set up some stuff here again.
We fill out the request, this MDS request, which is what this req thing is here, and then we submit it to the engine to do its thing. This is a synchronous request in most cases, definitely a synchronous request here, and then, when the reply comes back:
If we get ENOENT, we say: oh, that file doesn't exist, and we send that back to the caller to say hey, that file wasn't opened, this file doesn't exist. Or otherwise, if we've got an O_CREAT... so we've got a bunch of rules here that happen, but at the end of it all, we will call this finish_open, which will finish filling out the struct file, and let's say we've got a full one at that point.
And then we pass it back to the user. So now let's say we've done that, and now we're going to issue a write. The first thing that happens is it calls this function, and the write gets turned into, well, we have this sort of generic iterator called an iov_iter. We get a write request from userland, or maybe from a splice request, or who knows what, but it gets turned into this iov_iter, and it says: write this data
to these positions in the file. We've also got this iocb, which is sort of an I/O context. So the first thing it's going to do is look in the iocb and find the file; this ki_filp is what points to the struct file, and then from that we can get the inode that we're going to talk to.
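A sketch of how a write_iter handler gets from the iocb to the file and inode (the handler name is made up; ki_filp, ki_pos, file_inode(), and iov_iter_count() are the real VFS pieces):

```c
#include <linux/fs.h>
#include <linux/uio.h>

/* Hypothetical ->write_iter handler, just showing the navigation the
 * talk describes: iocb -> struct file -> inode. */
static ssize_t myfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;       /* the open file */
	struct inode *inode = file_inode(file);  /* the inode behind it */
	loff_t pos = iocb->ki_pos;               /* where the write lands */
	size_t count = iov_iter_count(from);     /* how much data */

	/* ... check caps, decide buffered vs. sync, do the write ... */
	return count; /* a real handler returns bytes written or -errno */
}
```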
If we're trying to do an O_APPEND write, it will do a getattr to try to get the size of the file. This is terribly racy, but it does sort of semi-work. Then there's the question of when we want to do a synchronous write; I'll talk about that in a second, but if we've got certain conditions, then we're going to want to do the write synchronously, for instance if it turns out that the thing was setuid.
And then what we do is start checking for caps. So the next thing we do here is call down and say: okay, we're going to get caps. We have to have the Fw caps, the file write caps, but we also want the file buffer (Fb) cap, or maybe the lazy I/O cap. Now, the MDS will issue these Fb caps if it wants to allow the client to cache, or to buffer writes, and if this is the case, then we can go down here. (Also, here is where we increment the i_version.) Down here we'll look and see: okay, what did we get? This "got" represents what caps we got, and if we didn't get the file buffer caps, we're going to need to do the whole synchronous thing.
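In pseudo-C, the decision described here looks roughly like the following; CEPH_CAP_FILE_WR and CEPH_CAP_FILE_BUFFER are real cap bits from include/linux/ceph/ceph_fs.h, but get_caps(), do_buffered_write(), and do_sync_write() are simplified stand-ins for the real fs/ceph machinery:

```c
#include <linux/fs.h>
#include <linux/uio.h>
#include <linux/ceph/ceph_fs.h>

static ssize_t myfs_do_write(struct kiocb *iocb, struct iov_iter *from,
			     struct inode *inode)
{
	int got = 0;
	int err;

	/* Wait until the MDS grants at least Fw; ask for Fb too. */
	err = get_caps(inode, CEPH_CAP_FILE_WR, CEPH_CAP_FILE_BUFFER, &got);
	if (err < 0)
		return err;

	if (got & CEPH_CAP_FILE_BUFFER)
		return do_buffered_write(iocb, from); /* page cache path */

	return do_sync_write(iocb, from); /* no Fb: write through to OSDs */
}
```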
If we're doing an O_DIRECT write, we can't use the page cache there. If the thing was opened O_SYNC, or there's been a writeback error recently, we also switch to doing sync writes. In any case, we come down here, we get some snap context info, and then, if we're doing a synchronous write, we call down into here and do that.
When it's not buffered, we actually go and issue a synchronous write to the server. But let's say we have the caps and we're going to do a buffered write.
In addition, every inode also has what we call an address space, and that's what we use when we're going through the page cache. When we don't have caps, or we're doing particular types of writes like O_DIRECT, we don't use the page cache; but when we do, we have to call into the operations here, and the first stop for this is Ceph's write_begin. What it does is go and try to fault in the page.
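The entry points involved hang off the inode's address_space; a sketch of the shape of that table, with invented handler names (note that the real field names and signatures have shifted across kernel versions, e.g. readpage becoming read_folio):

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical handlers standing in for the real fs/ceph/addr.c ones. */
static int myfs_read_folio(struct file *file, struct folio *folio);
static int myfs_writepages(struct address_space *mapping,
			   struct writeback_control *wbc);

static const struct address_space_operations myfs_aops = {
	.read_folio	= myfs_read_folio,  /* fill one folio from the server */
	.writepages	= myfs_writepages,  /* flush dirty pages back */
	/* .write_begin / .write_end bracket each buffered write:
	 * write_begin faults the folio in (reading from the server if
	 * needed), the VFS copies the user data in, and write_end
	 * marks the folio dirty and up to date. */
};
```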
We've ceded a lot of this code, or moved a lot of this code, into the netfs layer now, but effectively what it does is call into that layer. In the netfs layer we now have this set of Ceph netfs read ops, and it has a bunch of operations to do things like issue read operations; we can also resize them, that kind of thing.
So here we're given this page. We have struct folio now, but essentially a folio represents a page, and we're given a page here and we're going to go and do this operation to fill it. And so what this does is:
By the way, if anyone has questions, please speak up and ask; you don't have to wait for me to finish.
I'll take that as a no. In any case, we have this OSD request that we build here: we call down into ceph_osdc_new_request, passing a whole bunch of parameters based on what we're trying to do, which is, you know, to do a read. So we're building a read request here, and then we also build this other structure, which I won't go into right now, but
in any case, essentially this is the thing that holds all the pages; it represents a range of the file.
The thing will then turn around and start the request running, and then eventually, once it finishes, we call this finish_netfs_read function, which will then go and handle whatever came back.
So when we pass this thing down here to the libceph engine, what happens is we've given libceph an array of pages and said: okay, plop all the data that you read into here. These pages will end up being populated, and then we call this finish_netfs_read, which goes and checks the result.
If there's no object where we went to go do this read, we basically zero-fill the page and then pass it back; ditto here if we've got different errors, then we handle them as well. And then we call netfs_subreq_terminated, which will then go and finish off the request.
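A rough sketch of that completion logic, modeled on what is described here; the surrounding netfs API has changed between kernel versions, so every name below (my_read_subreq, zero_fill_range, subreq_terminated) is an illustrative stand-in, with only the control flow mirroring the real code:

```c
#include <linux/errno.h>
#include <linux/mm.h>

struct my_read_subreq {
	struct page **pages;	/* where libceph deposited the data */
	unsigned int nr_pages;
	size_t len;		/* length of the requested range */
};

static void myfs_finish_read(struct my_read_subreq *subreq, int err,
			     size_t bytes_read)
{
	if (err == -ENOENT) {
		/* Reading a hole: no RADOS object backs this range, so
		 * it reads as zeroes rather than failing. */
		zero_fill_range(subreq->pages, subreq->nr_pages);
		err = 0;
		bytes_read = subreq->len;
	}
	/* Hand the outcome to the generic machinery, which unlocks the
	 * folios and completes the read. */
	subreq_terminated(subreq, err ? err : (ssize_t)bytes_read);
}
```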
Okay, so now, back to our write: at that point, we pass things back to the netfs and the VFS layer.
It will then take the pages that it got from userland and copy them into the page cache pages that we're using, and then Ceph's write_end will get called. This will go and mark the pages up to date, so that the page cache knows it can satisfy reads out of these pages, and then we maybe go and increase the size of the file, which is what this does; because if we wrote to the end of the file, the file's now longer, so we have to go and do that. And we have to mark these pages dirty, so that later on the kernel will turn around and write these pages back to the server. At this point, all we did was read in the unwritten parts of the page, copy the data into them, and mark the thing dirty so that we can write it back later. Then, at some point, we unlock the pages and put them, and then we might also check and see if we have caps we need to deal with.
From there: okay, we've got the data in the page cache, and now, later on, let's say someone calls sync, or fsync, or something like that. Now we have to write the pages back. At that point, we usually get a call from the VFS to do something called writepages, and we have the ceph_writepages_start vector that will do this. What this thing does
is walk down the tree of pages that we've got, the "array" of pages; it's not really an array, but it looks like one, and set them up to be written back, and then start. And you can see in here (sorry, I'm probably going kind of fast) there's a bunch of special cases: oh, if we're beyond the size of the file we may have to invalidate some things; we might have to deal with snapshots.
It's pretty complicated, but essentially at some point we will come down here and set the page writeback bit in the page, which basically tells the kernel: hey, I'm writing this thing back. And then we clear the dirty bit for the page. That means that if someone comes in later and dirties the page again, we will ensure that we don't miss a race and miss the write. And then, anyway, we gang up a whole list of these pages.
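The flag dance described there uses real page cache primitives; a sketch of the per-page sequence (page-based names shown; newer kernels use the folio variants):

```c
#include <linux/mm.h>
#include <linux/page-flags.h>

/* For each page we are about to send to the OSD: */
static void start_page_writeback(struct page *page)
{
	/* Clear the dirty bit in a way that catches re-dirtying races
	 * with concurrent writers. */
	if (clear_page_dirty_for_io(page))
		set_page_writeback(page); /* tell the kernel: I/O in flight */
}

/* And once the OSD write reply comes back: */
static void finish_page_writeback(struct page *page)
{
	end_page_writeback(page); /* no longer under writeback */
}
```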
It's going to walk down and build this up until we've finished writing back what the VFS says it wants written back. So we gang up all these pages, and then we call ceph_osdc_new_request; we generate a new write request, in this case.
A lot of times we're writing back data because we're trying to free memory. If we are under memory pressure and we can't allocate pages in order to write that data back, then we have a big problem; we can't go anywhere. So, in any case, we set our callback at the end to writepages_finish, because at the end of the thing we're going to need to handle some stuff, and then we go and start marshalling up pages.
The OSD client engine will submit the request to the OSD and collect the reply, and then, when it gets the reply, it will call writepages_finish to finish things up. At that point it will come down here, mark the pages clean, and end the page writeback, as you can see here, so we can tell the kernel: okay, it's not under writeback anymore. We might update some metrics, and then we put whatever cap refs we've held, because we take references to the buffer caps while we're doing all this, and then eventually we put the request, and the write is done.
Okay, so that's all fine and dandy. That's a good example of how complex it can be to actually write out data; reading is also pretty complex in the same way. But in addition to all that, we also have some more autonomous activity that sort of happens in the background, or that's driven by the MDS.
So, I mentioned that ceph_connection earlier. Each connection has what's called a dispatch routine, or a bunch of operations, actually, just like we have operations for the inodes that are driven by syscalls. When we get certain socket activity or certain connection activity, we will call different operations vectors in this struct. So, for instance, for the MDS, this is the MDS operations vector. You can see here we've got a bunch of stuff that we do depending on which sort of message comes in: if you get a reply to one of our requests, it will handle the reply; if you've got caps, you've got to handle caps, and so on and so forth.
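A sketch of what such a dispatch hook looks like, with invented names throughout (the real one in the kclient switches on the Ceph message type):

```c
#include <linux/printk.h>

/* Illustrative dispatch callback for an incoming message on a
 * connection; the types, constants, and handlers are stand-ins. */
static void myfs_dispatch(struct my_connection *con, struct my_msg *msg)
{
	switch (msg->type) {
	case MSG_CLIENT_REPLY:	/* reply to one of our MDS requests */
		handle_reply(con, msg);
		break;
	case MSG_CLIENT_CAPS:	/* asynchronous cap grant/revoke */
		handle_caps(con, msg);
		break;
	default:
		pr_warn("unknown message type %d\n", msg->type);
	}
}
```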
Caps: we have a structure, an object that we track just like everything else, to represent a cap, and these are held in an rbtree in the inode. And so when we get, say, a request from the MDS to flush some caps, or maybe it wants us to start writing back, or to revoke some caps, or to grant some caps, we get this call from the MDS.
And we will call Ceph's cap handler. On an open or something like that, we might add some new cap structures to the inode, just in the response to the request itself, a normal MDS request; but we can also get an asynchronous grant from the MDS saying: hey, okay, maybe some caps came free that you had requested earlier.
So, depending on what sort of cap message we get, there's a switch statement here for the different types of cap operations, and, for instance, if we get a cap revoke or grant, we're going to call this thing, which is handle_cap_grant.
That basically walks down the tree and handles all the cap grants or revokes, and at the end of that we might end up kicking off writeback. Particularly, let's say, someone tried to open a file that we had open and had buffered writes for, before we had written them back; the buffer caps get revoked, and we might kick off writeback then.
That's how we handle asynchronous cap requests from the server, and things like returning caps to the MDS.
Let's see... yeah, that's really about all I had, to go over sort of the 10,000-foot view of the client. It gets pretty complex down in there, but maybe that gives you at least a start on what to do. Does anybody have questions?
C: You worked on something generic for all file systems, so I mean, can you briefly say what was the solution you came up with to handle errors when you're trying to write back pages?
A: Yeah. Oh, that; that's not really directly related to Ceph, but I can talk about it. So, effectively, when you write data to the inode, or to the page cache, eventually we have to write it back, but a lot of times that writeback can happen behind the scenes. You might not even be aware that it's happening; it may be flushing pages out because of memory pressure. But at some point, the idea is that you will call fsync, and at that point, if there's been an error while you were writing back:
Yeah, so we put some infrastructure in several years ago, called errseq, which sort of splits up an integer that we track in the inode. It can basically hold an errno up to -4095, and then the rest of the 32-bit integer will hold a counter, and that counter tells us whether
this error has been seen yet or not on a particular struct file. I could go into that; it's a little more complex. But essentially, here's what we did.
So if we have a 32-bit integer, we declare 12 bits of that to represent the errno that we're going to store. Usually, when we store an error in an integer, we have a bunch of bits left over if it's only ever going to store an errno.
So when an error is recorded, we bump this counter every time. Let's say you wrote back once and you got an error, okay, and then someone called fsync right after; they get their error. And then maybe you write back some more data and you get another error, and then it's hard to tell whether:
If you call fsync two times in a row, you don't want to report two errors in a row when there hasn't been another writeback in between. What we want to do is ensure that we report the latest error only once per fsync, or msync, or whatever. So, yeah:
Essentially, we have this counter in here, and then we have this SEEN flag, and all this thing does is: in each struct file, we keep another 32-bit integer, and we basically take a snapshot of whatever the value was right after we did the fsync.
So if you do an fsync, we take a snapshot of what it was at that time, and we report an error if it looks like we need to; but when we look at the struct file's version of this counter, if it looks like this error has already been reported, we don't.
Say there's a new writeback error: we bump this counter, and then, when we go to report, we compare it to the one that's in the struct file from the last time we did it. And we say: well, the counter got bumped, so even though the error number is still the same, we know that there was a new error recorded since the last time we reported this error, and then we can report an error again on the subsequent fsync.
So that's the idea, basically: it allows you to record errors in one place, but report them over multiple file descriptors. Does that sort of answer your question? I could probably give a whole talk on this alone.
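The errseq_t API that came out of this work lives in include/linux/errseq.h; a minimal sketch of the record/check pattern (the shared cursor normally lives in the address_space; it is a static here just to keep the sketch self-contained):

```c
#include <linux/errseq.h>
#include <linux/errno.h>

static errseq_t wb_err;

static void writeback_failed(void)
{
	/* Recording side: note the error and advance the counter. */
	errseq_set(&wb_err, -EIO);
}

static int my_fsync(errseq_t *since)
{
	/* Reporting side: each struct file carries its own 'since'
	 * snapshot; this returns any error recorded after it and
	 * advances the snapshot, so each new error is reported only
	 * once per file description. */
	return errseq_check_and_advance(&wb_err, since);
}
```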
C: Okay, yeah, thanks. I'll look at the documentation. I had this question when you were talking about the issues that you might have during page writeback, so that covers part of it.
A: Yeah, so the problem we used to have was that we tracked a lot of these errors with, basically, a flag that was just in the inode. So what would happen is: you might have this file open multiple times, with lots of people writing to it, and then when a writeback error happened,
we would just set that flag, and then, whenever someone called fsync, they would see the error and clear the flag. So if another person called fsync on a different file descriptor, they wouldn't see an error, and they would think: oh, my writes must have gone through.
They'd think they were fine, but they might not be. And so this was the mechanism that we came up with to basically ensure that, when you have people writing to multiple file descriptors on the same file, they all will see errors if there was one. That way it's reliable: we report errors everywhere. So, yeah, take a look.
B: Hey Jeff, quick question: do we use readdirplus in the kclient?
A: Yeah, we do. Well, sort of: there is no real readdirplus operation in the kernel. You just do a readdir, but what we do is, we will do a readdir and then pre-populate inodes and dentries in the dentry cache and in the inode cache.
So we have a readdir; well, it's called iterate now, and the readdir infrastructure is actually pretty complex at the VFS layer. But yes, essentially, when we get the dentries back, we usually also get inode attributes for them as well, and so we will pre-populate the inode cache. We'll allocate inodes and fill in all the info, so that later, when someone goes to do, say, a stat or something like that against that particular entry, we don't need to go to the server for it, because we've already got it from the last readdir request.
B: Okay. So I remember at some time there was this readdirplus call added, where in one call you can fetch the directory entries and the stat information too. I mean, currently doing a stat obviously always fetches the stats for an inode from the dcache or something, I mean, from the kernel cache.
But this was one call that fetches the entries and then all the stat information for those entries. I mean, it was added in FUSE, if I'm not wrong, but maybe it was wired up through the VFS and every other file system; I'm not sure, but it looks like it.
A: Sorry, are you talking more about the readdir, or are you talking more at the VFS layer? There is no readdirplus system call, okay, so we don't have anything like that. Now, the NFS server may get a readdirplus request; in fact, an NFSv4 readdir request is also like readdirplus, it also gets to fetch attributes. Okay.
A: Yeah, they didn't go in. I mean, we did add statx a while back, which is like an extended stat thing, and there's been some discussion about doing a readdirplus operation, but it's not really clear what applications would use it. I could see it being used by, like, Samba maybe, or the userland NFS servers, maybe, but it's not as useful as you might think. And so, effectively:
I mean, we are populating the cache when we do a readdir in the background anyway. So, like I said, say we get a readdir request: we will go ahead and get the list of dentries that are in the directory, but we'll also get more:
The MDS will also forward along a lot of the inode info as well, for each of those inodes, and again we populate the cache with that. So if you do a readdirplus system call, the only part you're really saving is the context switches from
the multiple system calls. And with something like Ceph, the time that you are spending in those is pretty negligible; most of your time is dominated by the round trips to the MDS. So it turns out not to be as big a win as you might think it would be, to just cut out that sort of system call activity.
It's a bigger deal if you've got something like a local file system; maybe there was some talk about that. That's really where I think you would see it. Like, xfs might want to do that, because dealing with local disk is pretty fast, and so cutting out the system call overhead, because any time you do a system call you have to do a context switch, can be a big deal there, but less so on network file systems.
B: Any other questions? Yeah, it was probably related to xfs; someone added a bulkstat call or something, but I'm not sure if those made it. It was along similar lines: getting a list of entries and all the stat information, so you don't have to do the stat call again. So yeah, I think you're coming from the same place.
A: Okay, if not, I'll take that as a no. Thanks for your time, everybody; have a good day.