A: Hello everybody, and welcome to another Ceph Code Walkthrough. I'm here with Patrick, and we're going to be going over the CephFS codebase, in particular the metadata servers, which keep various metadata on objects and inodes and such. So Patrick, will you please take it away?
B: That's the one we're going to be looking at through most of this talk: the src/mds directory, which is all the source code for the MDS itself, which manages the metadata in CephFS. There's also the src/client directory, which we're not going to be getting into, but that's where libcephfs and the FUSE client live, along with all the code for handling talking to the MDS and also the OSDs. And then, finally, there's another module here called osdc, the OSD object cacher, which is used by both the MDS and the client.
B: The client uses it to read and write file data to a range of objects, and the MDS also uses the object cacher to delete files — delete the range of objects backing a file's data — and also to handle its journal. And then, finally, there's another bit of code, for those who want to get even more adventurous, in the Linux kernel itself, in the fs/ceph directory.
B: That's where the CephFS kernel driver lives; again, we're not getting into that today. So, getting into the MDS: a good place to start, as usual, would be the main function, and that is in this ceph_mds.cc, which, unlike the rest of the MDS code, is actually just in the top-level source directory of the Ceph tree. This is just a standard Ceph daemon that configures signal handling and reads arguments.
B
So
here's
the
main
function,
it
configures
its
thread,
name
parses
arguments.
Does
some
global
initialization,
that's
common
to
all
theft
demons
and
nothing
too
special
about
or
unique
to
the
mds
in
this
code,
fairly
common
to
all
the
sf
demons.
The
interesting
bits
here
are
when
the
the
mds
actually
creates
a
messenger
sets
some
policies
for
the
messenger
sets
the
buying
address
for
the
address
and
port
that
the
clients
are
going
to
connect
to
configures
and
monitor
clients.
B: Then it creates a class that lives in the src/mds directory tree, and that's going to be the entry point for all the work that the MDS is going to do. And finally, this main function just waits in the messenger loop; that's the place where it sits, letting the messenger pick up messages, which the MDS daemon will actually handle. When this method ends, that's when the daemon is shutting down.
B: So, getting into the MDS: we'll now talk about the MDS daemon.
B: So this is the startup code for the MDS, and also the handling of the standby state for the MDSs. All MDSs start in a standby state, waiting to be given a place in an MDS cluster for a file system by the monitors, so the MDS daemon will sit and listen for a new MDSMap, and that's handled in this handle_mds_map function.
B: Open the .cc file at the bottom and you'll see this method is actually called in a method called handle_core_message, and this is driven by this ms_dispatch2 method. This is a method common to all messenger dispatchers in Ceph: this ms_dispatch method is called when any new message arrives, and the MDS daemon is going to try to actually process that message. So it's going to try to handle it as a core message.
B
And
it
goes
to
a
bunch
of
different
message
types.
So
is
it
a
minor
map?
Is
it
an
mds
map
and
so
on?
It
is
an
mds
map
is
processed
here.
Coming
back
to
this
handle
mds
map
method,
and
this
involves
decoding
the
actual
mds
map
from
the
message.
B: And then finally processing it: it's going to do a number of things for checking the state. So it's going to do a number of state transitions according to what the monitors tell it it should be doing. There's a bunch of sanity checking in here; one of particular note is: if it gets removed from the MDSMap, then the MDS is actually just going to respawn itself.
B: So again, there's a bunch of state-transition code here for handling the MDSMap. Notably, if it's actually not a standby MDS but an active MDS, it's going to hand this new MDSMap off to the MDSRank, which we'll be talking about next, where further processing may occur.
B: Let's move on to the MDSRank. The MDSRank is a fairly recent bit of code in CephFS.
B: It was added about four years ago as an abstraction for the state the MDS holds when it actually is a member of a CephFS file system and not a standby, and this is where the MDS will handle moving between major states — like during recovery, or during failover when an MDS is starting up. It goes through a number of states, including replay, rejoin, resolve, etc., and this is all handled within the MDSRank class.
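The recovery sequence described here can be sketched as a tiny transition table. This is a hedged illustration only: the state names mirror the CephFS daemon states mentioned in the talk (replay, resolve, reconnect, rejoin, active), but the linear `next_state` function is a simplification for illustration, not the real MDSRank transition logic.

```cpp
// Illustrative, simplified sketch of the failover state sequence a
// recovering MDS rank moves through. Not the real CephFS code.
enum class MDSState { Standby, Replay, Resolve, Reconnect, Rejoin, Active };

// Assumes a single linear recovery sequence (the real logic has more
// branches, e.g. resolve only applies with multiple active MDSs).
inline MDSState next_state(MDSState s) {
    switch (s) {
        case MDSState::Standby:   return MDSState::Replay;    // picked up a rank
        case MDSState::Replay:    return MDSState::Resolve;   // journal replayed
        case MDSState::Resolve:   return MDSState::Reconnect; // peers resolved
        case MDSState::Reconnect: return MDSState::Rejoin;    // clients reconnected
        case MDSState::Rejoin:    return MDSState::Active;    // cache rejoined
        case MDSState::Active:    return MDSState::Active;    // steady state
    }
    return MDSState::Active;
}
```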
B: It's also going to handle some queuing for continuations in the MDS according to changes in state.
B: We also have these methods for when the MDS hits certain bugs. For example, if the MDS finds damaged metadata, it will call these methods for actually notifying the monitors that the rank is damaged and then committing suicide, so it doesn't cause further damage to the metadata.
B: Another bit of code here: the MDS keeps track of the MDSMaps from other ranks, and uses that to make sure that the MDS cluster is consistent in terms of what the state of all the MDSs is.
B: And, let's see, the MDSRank is also responsible for handling its admin socket commands. So if you ever tell an MDS to drop its cache or list its sessions, that's handled in this method.
B: So, as we saw earlier in the MDS daemon, it actually will dispatch certain... actually, I don't believe we saw that. Let me...
B: All right, so here's the ms_dispatch method in the MDS daemon. If it doesn't handle the message as a core message — for example, an MDSMap update or a monitor map update — then it checks whether it actually has a rank, and if it does, it'll call the MDSRank's ms_dispatch method.
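The two-tier dispatch described here — core messages consumed by the daemon, everything else forwarded to the rank if one exists — can be sketched roughly as follows. All type and member names here are made up for illustration; the real MDSDaemon/MDSRank interfaces differ.

```cpp
// Hypothetical sketch (not the Ceph Messenger API) of the MDS daemon's
// two-tier dispatch: try handle_core_message first, then fall back to
// the rank-level dispatcher when the daemon holds a rank.
enum class MsgType { MonMap, MDSMap, ClientRequest, ClientSession };

struct FakeRank {
    int handled = 0;
    void ms_dispatch(MsgType) { ++handled; }  // rank-level handling
};

struct FakeDaemon {
    FakeRank* rank = nullptr;   // null while the daemon is in standby
    int core_handled = 0;

    // Returns true if the message was consumed as a "core" message.
    bool handle_core_message(MsgType t) {
        if (t == MsgType::MonMap || t == MsgType::MDSMap) {
            ++core_handled;
            return true;
        }
        return false;
    }

    void ms_dispatch(MsgType t) {
        if (handle_core_message(t)) return;  // core messages stop here
        if (rank) rank->ms_dispatch(t);      // otherwise hand off to the rank
    }
};
```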
B: There are various checks here making sure that the sources of the message are valid; then we're calling into this dispatch method, and again more checks are being done. For example, the MDS determines whether it's laggy: this is checked by the MDS sending a beacon, or heartbeat, to the monitor cluster periodically, and if it doesn't receive an acknowledgement response back from the monitors, then it begins to believe that it's laggy, or that there's a network partition. In that situation, the MDSRank will actually just put the message on a queue, waiting to be processed later, until the MDS is no longer laggy. This is one of those queues we talked about earlier, and this would be a continuation for that.
B: This is all done through the Server dispatch, in addition to the client request, which is the message that the MDS is going to be processing the most, perhaps along with capability updates: various peer requests from other MDSs, heartbeats from other MDSs. This drives the metadata balancing; we'll take a look at that later.
B
Here's
where
the
mds
locker
will
receive
updates
to
the
to
the
capabilities
or
from
other
mds's
locked
messages
that
update
the
locks
that
the
mds
has
maintained
for
all
the
metadata.
We'll
look
at
that
again,
some
more
later
and
also
client
capability
updates
from
from
the
clients
themselves,
also
go
to
the
locker,
so
the
mds
rank
is
sort
of
the
gateway
to
all
the
dispatch
loops
for
the
other
parts
of
the
mds.
B: Right, let's move on.
B: So again, the monitors keep track of all of the MDSs in the file system, and this is done through an FSMap. It's similar to the OSDMap; it's going to keep track of — and I should start out by describing the old way. The old way we kept track of the MDSs in the cluster was through what was called an MDSMap, and this is still present in CephFS, but that was because we only had one file system.
B
There
was
only
one
mds
map
and
back
when
john
spray
was
working
on
adding
multiple
file
system
support
to
cefs.
We
added
this
new
class
called
the
fs
map,
which
maintains
multiple
nds
maps
for
each
file
system
we
might
have,
and
that
will
so
here's
the
fs
map
that
the
monitor
is
keeping
track
of.
You
want
to
look
at
the
data.
B
This
is
hard
to
navigate
on
a
small
screen,
so
here's
the
main
data
structures
that
the
fs
map
is
keeping
track
of
the
main
one
here.
That's
interesting
is
this
file
systems
list.
B
So
looking
next
at
the
file
system
class,
this
is
mostly
a
wrapper
around
mds
map
which,
as
I
said
now,
we're
having
a
number
of
mds
maps
according
to
the
number
of
file
systems.
B
So
this
class
just
holds
three
field.
Three
three
members
used
to
be
two
as
of
just
a
week
or
two
ago,
this
new
one.
This
mirror
info
is
brand
new,
but
this
is
just
the
cluster,
the
file
system
id
and
then
also
the
mds
map
for
it.
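The relationship just described — an FSMap holding one Filesystem per file system, each pairing a file system ID with its own MDSMap — can be sketched like this. Struct and field names are assumptions for illustration, not the real Ceph types.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hedged sketch of FSMap -> Filesystem -> MDSMap containment.
struct MDSMapSketch {
    uint64_t epoch = 0;     // bumped on every change to the map
    std::string fs_name;    // the file system name
    int max_mds = 1;        // number of active ranks requested
};

struct FilesystemSketch {
    int64_t fscid = 0;      // file system cluster id
    MDSMapSketch mdsmap;    // the per-filesystem MDS map
};

struct FSMapSketch {
    std::map<int64_t, FilesystemSketch> filesystems;

    void add_fs(int64_t fscid, const std::string& name) {
        FilesystemSketch fs;
        fs.fscid = fscid;
        fs.mdsmap.fs_name = name;
        fs.mdsmap.epoch = 1;
        filesystems[fscid] = fs;
    }
};
```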
B
All
the
data
members
are
at
the
bottom
of
the
class,
so
the
mds
map
keeps
track
of
a
number
of
things,
including
this,
the
epoch,
which
allows
us
to
keep
track
of
changes
that
have
been
made
to
the
to
the
mds
map.
This
is
mostly
useful
for
for
the
monitors,
keeping
track
of
changes,
but
also
clients
that
need
to
know
if
there's
been
an
actual
update
to
the
map
or
not
here
we're
going
to
keep
track
of
the
the
file
system
name.
B: The max_mds: so when you're setting max_mds on a file system to increase the number of file system ranks, that's where this change would be set. Then there's a number of data pools: if you use the fs add_data_pool command, that data pool ID would be added to this vector. You can't actually set a file layout in CephFS with a separate data pool without first adding it to this list of data pools in the MDSMap.
B
That's,
of
course,
a
form
of
protection,
because
you
don't
want
clients
to
be
writing
to
arbitrary
pools,
we're
also
keeping
track
of
a
number
of
mds
ranks
that
are
are
available,
so
the
in
set
is
the
number
or
the
various
mds
ranks
that
are
in
the
file
system.
This
will
just
be,
could
actually
just
be
an
integer
now,
rather
than
a
set,
it's
going
to
be
a
consecutive
mds
rank
0
to
n
n
being
max
mds
for
the
cluster.
B: Then, finally, we also have this up map, which keeps track of which MDS global IDs are actually associated with a given rank.
B
So
this
is
another
class
or
struct
in
the
mds
map
which
supports,
or
is
really
just
keeping
track
of,
the
number
of
fields
associated
with
an
mds
daemon,
including
its
global
identifier,
the
name
of
the
demon
so
like
mdsa,
which
rank
is
following,
if
any,
what
stated
it
is
in
and
the
file
system
that
it
wants
to
join.
This
is
a
new
feature
we
added
in
ffs
to
have
and
to
allow
you
to
have
an
mds
join
a
particular
file
system.
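The bookkeeping just described — a rank-to-GID up map plus per-daemon info keyed by GID — can be sketched as follows. Struct and field names are illustrative assumptions, not the real MDSMap types.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hedged sketch of the MDSMap's "up" map and per-daemon info records.
struct MDSInfoSketch {
    uint64_t global_id = 0;
    std::string name;    // e.g. "mds.a"
    std::string state;   // e.g. "up:active"
};

struct MDSMapUpSketch {
    std::map<int, uint64_t> up;                  // rank -> daemon gid
    std::map<uint64_t, MDSInfoSketch> mds_info;  // gid  -> daemon info

    // Look up which daemon currently holds a rank, if any.
    const MDSInfoSketch* find_by_rank(int rank) const {
        auto it = up.find(rank);
        if (it == up.end()) return nullptr;
        auto jt = mds_info.find(it->second);
        return jt == mds_info.end() ? nullptr : &jt->second;
    }
};
```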
B: And also what addresses the MDS is on. So this is all kept track of in the MDSMap. And perhaps I should have mentioned the states that an MDS can be in: that's here in this daemon state enum, and you've probably seen most of these if you've ever administered a CephFS cluster before. Creating is done when we're creating a new MDS rank, or restarting an MDS rank after shrinking a cluster and then growing it again; and then there are a number of states associated with failover, when an MDS starts up taking over for a rank.
B: It goes through a number of these states, and I showed earlier that the MDSs keep track of what MDSMap other MDSs are aware of; that's mostly to have consistent operation when they're doing recovery as a group, and that's especially important during cache recovery — recovering the state of the global cache.
B: So the next thing we're going to look at is this MDS Server — let me take a drink of water — and this is the module that's predominantly handling client request dispatch. So a good place to start here would be this dispatch method.
B: Again, remember that the MDSRank calls this dispatch method if it's one of the types of messages that the Server would handle. Here we have this switch based off of the message type, and the Server will fork off — didn't mean to say fork — will call off to a method for handling each different type of message. So here's the client reconnect: during MDS failover, one of the states is reconnect.
B: And then certain types of messages require waiting for active. So if the message requires that the server be active — for example, we're just doing a standard mkdir or open — then that message will be retried once the MDS rank becomes active.
B
Then.
Finally,
another
switch
on
the
message
type
here,
we're
looking
at
the
client
session.
So
whenever
a
client
opens
a
new
session
with
mds.
This
is
where
all
the
bookkeeping
for
that
would
would
be,
would
be
done
in
this
handle
client
session
message.
Then
we
have
into
client
requests.
This
is
where
the
the
main
entry
point
for
for
all
the
the
request
handling
in
in
the
mds
client
reclaim
is
a
new
feature.
B
So,
let's
look
at
handle
client
requests,
so
any
client
message
that
will
access
metadata
on
in
cefs
will
go
through
this
code
path
and
there's
a
number
of
different
types
of
client
requests.
B: Here we're doing some bookkeeping for the request. So if you've ever seen the MDS make note of a slow operation for a client — like a getattr that's being served too slowly — this is part of that bookkeeping. The MDS will track every request in this MDRequest class, which is in this Mutation.h.
B
I
mean
I'm
going
fast.
It's
going
to
keep
track
of
all
the
metadata
requests
that
the
clients
have,
and
this
is
going
to
keep
track
of
locks.
The
the
distributed
metadata
locks
that
the
the
client
acquires
as
part
of
a
request
and
any
other
metadata,
but
also
how
long
it
takes
to
actually
process
the
request.
B: And then here we pull the message off the client request, because we're going to take a look at it again, and here we're doing a number of checks. For example, this is a check to see if the file system is full, so we're looking at the type of metadata operation we're doing — whether it's doing a mutation.
B
If
the
cluster
is
full,
then
it's
going
to
respond
with
if
there's
no
space,
otherwise
we're
going
to
get
into
more
specific
code
paths
and
that's
going
to
be
as
a
switch
based
off
of
the
type
of
operation
we're
doing.
B: We have getattr, which is looking up by path; setattr, which would be flushing metadata mutations on the client side to the MDS — for example, a chmod; there's not a separate chmod RPC, it would all be processed through setattr. We have our xattr operations, directory readdir, setting up POSIX file locks, creating a new file, link, unlink, rename, mkdir, mknod, etc.
B: Again, notice there's no chmod; that's all done through setattr. All right, so we don't have time, of course, to go through all of these, so I thought we would just take a look at handle client getattr, which is akin to a stat in POSIX.
B: So here again we're getting the client request message, and we're going to be doing some basic checks on the message itself to make sure that it's a valid request. For example, the file name path can't be empty; if it is, we return that it's invalid.
B: Okay, so as part of a getattr request, we also issue what's called a capability in CephFS. That's a type of lease — although leases and CephFS capabilities are separate concepts; there are also leases in CephFS, and they're slightly different — but there's an old file system concept from the '90s called a lease, which is a type of lock that clients get on metadata to ensure that the metadata is not changed, so that they can cache the metadata and improve the latency of operations locally.
B: So the MDS has fine-grained capabilities tracking the different rights that clients have on a given inode, including reading and writing the file, the file layout, the number of hard links on the file, and the state of the authorizations on the file — like the UID or the permission bits, etc.
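The fine-grained rights described here can be pictured as a bitmask, with a "shared" bit allowing caching and a matching "exclusive" bit additionally allowing local modification. Note this is a made-up encoding for illustration: the real CephFS constants (CEPH_CAP_LINK_SHARED, CEPH_CAP_XATTR_EXCL, and so on) use a different layout.

```cpp
#include <cstdint>

// Simplified, illustrative capability bits; values are NOT Ceph's.
enum : uint32_t {
    CAP_AUTH_SHARED  = 1u << 0,  // may cache uid/gid/mode
    CAP_LINK_SHARED  = 1u << 1,  // may cache the link count
    CAP_XATTR_SHARED = 1u << 2,  // may cache extended attributes
    CAP_XATTR_EXCL   = 1u << 3,  // may locally modify extended attributes
    CAP_FILE_RD      = 1u << 4,  // may read file data
    CAP_FILE_WR      = 1u << 5,  // may write file data
};

// Either the shared or the exclusive bit allows caching xattrs...
inline bool can_cache_xattrs(uint32_t caps) {
    return (caps & (CAP_XATTR_SHARED | CAP_XATTR_EXCL)) != 0;
}

// ...but only the exclusive bit allows modifying them locally.
inline bool can_modify_xattrs(uint32_t caps) {
    return (caps & CAP_XATTR_EXCL) != 0;
}
```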
B: Apologies — I don't know what this FIXME is for — but here we're looking at the different things that the getattr is asking for. For example, if we want the link count — the number of hard links for the file — we would be setting this CEPH_CAP_LINK_SHARED bit, and we would add a read lock for getting the link lock on the inode.
B: This ref here is a kind of poorly named variable; it's the actual inode associated with what we're doing a getattr on. We'll see this later, but there are a number of metadata locks on the metadata itself, corresponding to who has permission to change it, or whether we have a collection of read locks on that lock — multiple clients that want to have the same access.
B: The xattr lock: the inode — the client — has a collection of extended attributes set on a file, so it would have this lock as well. That's all controlled by what the client asks for, and the client can limit itself to what it wants, and that can help make sure the request is processed more quickly: if we don't have all the clients asking for write locks, we don't have to revoke a lot of permissions on other clients and slow everything down.
B: So the clients can be minimalistic in what they ask for. After adding all the different kinds of locks that the client needs in order to actually get this desired state, it's all driven through this MDS Locker acquire locks. The Locker is going to be responsible for looking at each of the locks being asked for, trying to change the state of the distributed locks across all the MDSs and also the clients, and trying to poke the state of the locks in a direction that gets us to a point where we can actually finish this request. You can see, if we don't succeed, we're going to return, because the operation will be retried later — but that's part of trying to acquire locks if there are conflicts; for example, another client has exclusive extended-attribute permissions and we're trying to get all the file's inode state.
B: Likewise, the Locker is also going to handle issuing capabilities. So as part of acquiring the locks on this file, the Locker is also going to take care of the details of issuing the capability on the inode in question, so that the client knows what it can cache according to the getattr. In a general stat operation, that would be pretty much everything, and it would have read capabilities on all the associated metadata.
B: As far as what a read capability would look like: for example, with the xattrs, that would be CEPH_CAP_XATTR_SHARED — shared, because it's going to be sharing the data with other clients; it wouldn't have permission to write — whereas CEPH_CAP_XATTR_EXCL (exclusive) would grant permission to make modifications to the extended attributes.
B
And
finally,
we
have
this
check
to
make
sure
that
the
client
has
permission
to
read
that
inode
and
that
would
be.
For
example,
we
have
in
the
step
x,
capabilities
for
clients.
B
We
can
limit
access
based
off
of
the
network
address
or
what
path
the
client
should
be
able
to
to
read,
and
that's
all
done
through
this
check
access
method,
and
if
the
client
lacks
permission
to
read,
then
we
return,
and
here
we're
going
to
set
a
number
of
fields
in
the
reply
for
this
get
adder
request,
which
we're
also
calling
that
this
debug
message
and
then
we
finally
respond
to
the
request.
B
See
here
here
we're
actually
gonna
reply
to
the
client
request
and
make
a
m-client
reply
message.
B
And
there's
various
bookkeeping
done
on
the
metadata
mutation
so
that
mutation
class
I
talked
about
earlier.
They
were
marking
an
event
that
we're
replying
to
the
message.
So
if
you
did
see
a
slow
operation,
you
could
see
what
events
have
been
processed
and
when.
B
Okay,
so
let's
go
back
to
the.
B
You
know
client
get
at
her.
We
spoke
a
bit
at
length
about
what
acquire
locks
is
doing.
So
now
we
can
go,
have
a
look
at
that.
B: So here we're getting our mutation reference for the actual operation, the vector of locks that the mutation requires, and some other metadata concerning the auth pins. So here we're just...
B
The
locker
is
an
incredibly
complex
bit
of
code
because
we're
handling
a
distributed
locks
across
mdf's
and
clients,
but
for
those
who
are
interesting
and
we're
not
going
to
go
through
this
really
in
depth.
But
I
wanted
to
point
out
here's
where
the
caps
that
we
would
eventually
be
issuing
to
the
client.
So
there
might
be
a
set
of.
I
knows
that
we're
going
to
issue
caps
for.
B: All right, so again, the Locker is a very complex bit of code. There's a state machine in these lock .cc and .h files; this is some fairly old code, which is responsible for managing the different states of the locks associated with each type of lock that an inode might have.
B: That's all managed here. All right, so let's move on to — I don't think we'll get through all of this.
B
Mdcash.H,
so
this
is
the
structure
that
manages
the
global
cat
or
the
cash
for
the
for
a
given
mds
rank,
and
this
is
where
all
the
management
for
changes
to
the
cache
are
gonna
take
place.
Access
to
like
traversing
a
file
path
is
all
gonna
go
through
that
amd
cache.
B
So,
there's
a
number
of
structures
that
keep
track
of
of
the
the
metadata
in
the
md
cache.
So
you
know
the
bottom
again,
there's
the
various
trucks
that
we're
keeping
track
of
here's
the
the
inode
map.
So
it's
going
to
associate
the
inode
number
that
we're
all
familiar
with
and
the
actual
ci
node,
which
is
a
structure
for
the
cache
inode.
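The inode map just described — inode number to in-memory cached inode — can be sketched minimally like this. `CInodeSketch` and `MDCacheSketch` are illustrative stand-ins, not the real CInode/MDCache classes.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in for the real CInode (the cached in-memory inode).
struct CInodeSketch {
    uint64_t ino = 0;   // the inode number
    std::string note;   // illustrative payload only
};

// Minimal sketch of the MDCache's inode map and its lookup wrapper.
struct MDCacheSketch {
    std::unordered_map<uint64_t, CInodeSketch*> inode_map;

    void add_inode(CInodeSketch* in) { inode_map[in->ino] = in; }

    // Mirrors the get_inode() wrapper the talk mentions: find an
    // in-memory inode by number, or null if it is not cached.
    CInodeSketch* get_inode(uint64_t ino) {
        auto it = inode_map.find(ino);
        return it == inode_map.end() ? nullptr : it->second;
    }
};
```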
B: The Filer, which is again used for managing — being able to delete — file data, and then also setting backtraces on files in the default data pool, which allows us to keep track of the file path the file data is associated with, if we need to do some kind of recovery. And then there's various state we need to keep track of in the cache — for example, uncommitted peer renames.
B
Alder
is
going
to
be
used
when
we're
doing
rename
requests
that
affect
multiple
nds's.
B: The cache has a set of subtrees. If you're familiar with using multiple active metadata servers in CephFS, we have this concept of a subtree: we'll actually split the file system tree into multiple pieces called subtrees — they can be hierarchical and nested — and these are distributed across MDSs.
B
Each
mds
rank
will
keep
track
of
a
list
of
its
own
subtrees
that
it
can
see,
and
if
you
have
multiple
layers
of
subtrees,
some
mdss
may
not
actually
be
aware
of
the
full
subtree
map
across
all
the
mds
ranks,
and
it
does
it
keeps
track
of
the
ones
they
can
see.
However,
and
it
also
keeps
track
the
sub
trees
that
under
their
boundaries
to
the
to
that
subtree.
B: One of the primary methods of the metadata cache class is this path traverse method, and we saw it called earlier in the getattr code path. It's going to take this metadata request, a factory method for setting up continuations if we have to retry the command later (so we can keep track of when it should finally be retried), the file path we're trying to traverse, and various flags; that's all done in this method. So we're going to be looking at various flags — like whether or not we want the directory entry associated with the path, or if we need to get certain read locks as well — and that can all be batched together as we traverse the path.
B
File
names
from
the
path,
so
this
is
all
managed
here
and
there's
a
number
of
lock
operations
that
we
might
need
to
do
associated
with
traversing
the
path,
and
that's
also
bundled
up
into
this
method
as
well,
for
example,
as
part
of
path
traversal,
we
might
need
to
acquire
certain
read
locks
on
on
a
directory
fragment
as
we
as
we
traverse
the
path,
and
if
we
don't
have
it,
then
we
need
to
retry.
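The component-by-component walk described here can be sketched as a toy path traversal: split the path into file names, then resolve one directory entry at a time, failing when a component is missing so the request can be retried or rejected. The real path traverse also deals with locks, frozen trees, and remote subtrees, which this hypothetical sketch omits entirely.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy directory tree: parent path -> child entry names ("" is the root).
struct DirTree {
    std::map<std::string, std::vector<std::string>> entries;

    bool has_entry(const std::string& dir, const std::string& name) const {
        auto it = entries.find(dir);
        if (it == entries.end()) return false;
        for (const auto& n : it->second)
            if (n == name) return true;
        return false;
    }

    // Resolve "path" one component at a time; returns false as soon as
    // a component is missing (where the real code would retry or fail).
    bool traverse(const std::string& path) const {
        std::istringstream ss(path);
        std::string component, cur;
        while (std::getline(ss, component, '/')) {
            if (component.empty()) continue;       // skip leading/double slashes
            if (!has_entry(cur, component)) return false;
            cur += "/" + component;                // descend into the child
        }
        return true;
    }
};
```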
B
Similarly,
we
also
have
this
frozen
path,
so
I
spoke
about
how
the
sub
trees
would
be
distributed
across
multiple
ndss.
B
B
B
B: So again, there's also this get inode method, and this is just a wrapper around the inode map, allowing us to find an inode; this is used in several places in the MDS, for obvious reasons. And then there's also getting a directory fragment.
B: So this dirfrag_t is a structure for keeping track of the inode and the associated fragment of a directory. Talking briefly about fragments in CephFS: a directory may be sharded across multiple objects, and this is for efficiency and performance reasons.
B
We
don't
want
in
an
object
that
represents
a
directory
in
rados
who
have
too
many
omap
key
value
pairs,
and
so
that
may
be
fragmented
into
multiple
objects
and
those
directory
fragments
can
be
distributed
across
multiple
mdss
or
or
just
kept
separate
for
for
space
for
performance
reasons,
but
we
can
also
again
distribute
them
across
mds's
so
that
we
can
improve
throughput
on
a
large
directory
by
allowing
operations
on
a
given
fragment
to
go
to
separate
mds's.
B: Okay, well, I kind of went through the code path for processing a request — kind of a whirlwind tour of the MDS — and I think where we just stopped was about as good a place as any.
B: Thank you. All right, so if there are no questions, I'll just briefly touch on a few other files. So we have this — the MDCache, you know, is going to keep track of various metadata objects, for example starting with the root directory, so you can use path traversal, but also the number of inodes that are in the cache. I talked about how the MDSs keep track of locks for each type of metadata object.
B: There's an abstract base class that keeps track of a number of different basic operations on cache objects, including inodes, directories, directory fragments, and also directory entries. And so there are a number of pins that we might set on a cache object. For example, here's a pin for a request: when we pin something in the MDS cache, we don't want it to be trimmed from the MDS cache if the cache is too full, because there's an outstanding client request that is trying to utilize this object.
B
This
this
object,
and
so
this
is
the
type
of
pin
we
might
put
on
on
the
the
cache
object,
there's
also
locked,
pins
or
or
replicated
pins.
So
if
metadata
is
replicated
across
the
number
of
mds's
for
for
performance
reasons,
we
don't
want
it
to
be
removed.
While
it
still
has
outstanding
replicas,
and
then
we
have
the
number
of
state
opera
states,
for
example,
state
auth.
This
is
the
most
important
one.
B: This just indicates whether the MDS holding this object is the authority for that object and therefore has the capability to actually make modifications to the object, assuming it has the required locks. And so there are a number of basic operations we have on here, like setting the state and checking whether we're the authority.
B: So again, that's just an abstract class. So let's look at the CInode class — the cached inodes themselves; the C stands for cache, of course. Each inode in memory is going to inherit from this InodeStoreBase class, and this is all of the metadata that's actually persisted to the RADOS metadata pool. There are three primary structures associated with that; let me dig them up here.
B
We
are
well
there's
a
number
of
things,
actually
there's
the
symbolic
link
string
if
it
is
a
symbolic
link,
the
the
tree
of
directory
fragments.
So
if
this
is
a
directory
inode,
we
would
keep
track
of
what
the
directory
fragmentation
is
on
that
on
that
directory.
B
Another
one
is
the
the
inode
data
itself,
so
there's
the
same
relatively
new
bit
of
code
here
I
know
constant
pointer,
which
is
just
this:
a
shared
pointer
of
this
memphis
inode,
which
itself
is
just
this
inode
t
structure
and
we'll
take
a
very
quick
look
at
that.
So
a
lot
of
the
metadata
types
and
step
effects
are
in
this
mds
types,
header
file-
and
you
can
see
that
here.
B: Coming back to InodeStoreBase: this is just all of the data that's stored in the metadata pool — for example, here's also the xattr map associated with an inode. All right, and then that's inherited by InodeStore — InodeStore is some C++ organizational class, and InodeStoreBare more of that — and finally we have CInode.
B: And locks, and what state the inode is in — for example, is the rstat information dirty, or is it frozen?
B
Are
we
exporting
the
inode
and
that's
all
managed
here
another
the
one
more
interesting
bits
from
our
mds
hacking
standpoint?
Is
this
projected
inode?
So
this
is
where
we
would
actually
make
mutations
to
the
inode.
There's
a
there's,
a
vector
of
of
mutations
on
the
ino,
which
we
call
projected
inodes.
B
So
anytime
we
make
a
change
to
the
x-matter
map
or
the
inode
data
like
the
uid.
We
would
store
that
in
this
projected
inode
struct.
You
can
see
the
inode
pointer
and
the
x
saturn
map
pointer
here
and
when
those
changes
are
finally
persisted
to
the
mds
journal,
then
we
would
pop
that
projection
and
actually
set
it
in
the
the
main
part
of
the
inode
class,
but
as
part
of
journaling.
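The project-then-pop cycle just described can be sketched as follows: mutations are staged as "projected" copies of the inode data, and only once the change is journaled is the oldest projection popped and applied to the stable copy. Field and method names here are illustrative assumptions, not the real CInode API.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the persisted inode data (a tiny subset, for illustration).
struct InodeDataSketch {
    uint32_t uid = 0;
    uint64_t version = 0;
};

// Hedged sketch of projected inodes: staged mutations applied in order.
struct CInodeProjSketch {
    InodeDataSketch stable;                  // the committed inode data
    std::vector<InodeDataSketch> projected;  // pending, not yet journaled

    // Start a mutation: copy the newest data and stage the change.
    InodeDataSketch& project() {
        InodeDataSketch next = projected.empty() ? stable : projected.back();
        next.version++;                      // each projection bumps the version
        projected.push_back(next);
        return projected.back();
    }

    // Called once the journal entry is persisted: pop the oldest
    // projection and make it the stable copy.
    void pop_and_dirty() {
        if (projected.empty()) return;
        stable = projected.front();
        projected.erase(projected.begin());
    }
};
```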
B: As part of journaling, we have this projected-inode concept. So I think I'm out of time. Again, there are other cache objects, like the cache directory, the cache directory entry, and the cache directory fragment; they have similar operations on them. And then, finally, there are also the capabilities themselves, which are just an in-MDS-memory concept for the locks that clients have through capabilities.
A: Patrick, I'm looking in chat right now; I don't see any questions. Anybody still on the call have any questions? All right.
A: Well then, I think that wraps it up for another Ceph Code Walkthrough. Some thanks to Patrick in the chat. So thank you, Patrick, for taking the time — even though it was broad, this was a very in-depth walkthrough, I feel — and I appreciate everybody joining us live as well. We do these every month, so we'll have another one next month. And thanks, everybody, for joining us.