From YouTube: CephFS Code Walkthrough: MDS Journal Machinery
All right, so that's it. So the MDLog — the metadata journal for the MDS — is the primary holder of all metadata mutations that the MDS makes as it goes. Any change that needs to be durable that the MDS makes will end up in the MDS journal, with a little asterisk next to that; we'll get into it later.
Some of the things the journal contains are the recent metadata mutations. That's the obvious stuff: if I do a chmod, or if I create a new directory or move a directory, any of those events will always end up in the MDS journal.
The journal also tracks which clients were connected: the MDS needs to know all the different clients that were connected so that it can make sure those clients reacquire their caps and reestablish state with the MDS, and the MDS will also move on from the state of waiting for clients to reconnect once it knows that all of them have connected.
It's also important, to improve MDS failure-recovery time, that the MDS knows when all the clients have actually reconnected. You can compare that to, say, NFS, where it has to wait for the full grace period whenever it does a failover, because it doesn't know which clients were connected.
We also have the purge-state synchronization bits: the purge queue operates somewhat independently from the MDS journal, but periodically there are certain events that need to be journaled so that the two stay in sync. Then, finally — and this is the big one, especially recently — there's the subtree map.
And then, formerly, it also logged which files were opened by clients. That was actually something that was changed by Zheng two or three years ago: he added the new open file table, and that was to improve MDS failover times and to reduce the size of the MDS journal.
If you had a large number of clients and they had a large number of files opened, the MDS would spend a lot of time logging to its journal which files were being opened by clients, so that was all moved out into a separate set of objects to reduce the load on the journal.
Okay, so the MDS journal as a data structure is important for the MDS because it's a way to do serial writes to a conceptually large object on top of the distributed data store that is Ceph. Obviously in Ceph you can have billions of objects, and they can be scattered all over the OSDs.
But actually knowing which objects to read and write is something you have to code up, and the journal represents that. The journal is, in a lot of ways, a circular buffer.
It occupies a finite set of potential objects, and the reads and writes go to a certain subset of those objects. The journal is constantly growing towards the end, and the trimming is constantly catching up with it, and all of that is managed in a separate bit of code in the osdc library — we'll get into it in a moment — which handles all the details of that data structure.
The MDS journal — especially if you read some of the academic literature associated with Ceph — behaves very similarly to, or I should say the genesis of the idea of the journal began with, the log-structured file system. So if you want to learn about journaling file systems, you would look back to that old paper.
That evolved into the idea of keeping all the file system state in regular data structures like B-trees or whatever, and then having a separate journal for making sure data is durable and also to allow for recovery. The MDS follows that philosophy. The directories are stored in individual objects on RADOS, and we store all the directory hierarchy state in those objects, but we also have the MDS journal, which provides the fault tolerance and the ability to do failover. It also provides a kind of hot cache for the MDS that is taking over, so that it can load the metadata most recently used by the clients and get its cache hot by the time it finally turns active.
Okay, so let's move on to how it's structured. Again, I said that the MDS journal goes out to a series of objects; you can kind of think of it as a circular buffer, but only part of the buffer is ever allocated on disk.
We call them journal segments, and the MDS only ever writes out an entire segment. Each segment is a set of log events that have been grouped together, and the MDS must flush the journal any time, say, a client requested a sync, someone executed the 'flush journal' command on the MDS, or some locking logic requires it. There's no minimum number of log events that need to be written out; the MDS always tries to delay writing out the journal segment as long as possible. There is a periodic five-second sync in the MDS, so eventually metadata does become durable, even if the load on it is extremely light. And once the log object becomes full of journal segments — yeah, that's all there is to say about that.
All right. And then, as far as where the journal is laid out, we have this concept of what's called a journal pointer. It's actually a fairly new concept in CephFS — it was added in about 2014 by John Spray.
For those of you who don't know John Spray, he was the previous team lead for CephFS, from back in 2015 to 2017, and he had been working on Ceph and CephFS for a while at the time. The journal pointer was added for — well, I'll get into that later, but the journal pointer is just a pointer to where the current MDS journal is, and the reason it exists is to facilitate doing things like disaster recovery on the journal.
If you need to write out a new journal — you want to set up a new series of objects that the MDS should write its journal out to — the journal pointer exists to atomically update where the MDS journal is. You can kind of think of it as a double pointer.
And in the xattr information for that object you can find the current journal, so it will point to the current journal head. For most Ceph installs, especially new installs, the journal head will be at 0x200.
However, it can also be put in a different location, and we'll see that for the long-running cluster it's actually at the 0x300 location — so this would be 0x300. If you're doing some kind of recovery event, or reformatting the journal (which has happened in the past), the journal pointer would be updated.
And that's all you ever see. As far as the actual on-disk information stored with the journal pointer, it's extremely simple. You can use a rados command-line invocation to read the journal pointer, and it's just these two fields, the front and the back; here the front corresponds to 0x300. This output is actually coming from the long-running cluster.
I said the xattr — it's actually stored in the data blob of the 0x400 object. Here we're importing it as the JournalPointer type and decoding it as JSON, and again we can see that it's at 0x300. The long-running cluster has, as its name would suggest, been around for a very long time.
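
For reference, the kind of invocation being described here looks roughly like this — a sketch assuming a metadata pool named cephfs_metadata, MDS rank 0 (so the pointer object is 400.00000000), and that your build's ceph-dencoder has the JournalPointer type registered:

    # Fetch the rank-0 journal pointer object (0x400 + rank) from the metadata pool
    rados -p cephfs_metadata get 400.00000000 journal_pointer.bin

    # Decode it: "front" is the current journal inode (0x200 on a fresh cluster,
    # 0x300 on the long-running cluster), "back" is normally unused
    ceph-dencoder type JournalPointer import journal_pointer.bin decode dump_json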
We've been constantly upgrading it, and it has served as a nice way to catch bugs that may occur with the legacy data structures that can exist anywhere in Ceph — in the OSD maps, in the monitor stores, on the OSDs themselves, or legacy MDS structures which may exist somewhere in some directory in CephFS. So this cluster has been around for a very long time; it has survived journal reformattings and whatnot.
You can see that, unlike a newer cluster created with vstart, its journal head is actually at 0x300. So if you went looking for the journal on the long-running cluster where you'd expect it — in the 0x200 range — you'd be surprised: it's not there.
All right, and then the journal header. I said that the journal pointer acts like a double pointer; if you look back here, we have the 0x200 objects and potentially 0x300 — I put question marks there; it could be somewhere else.
In particular, where the journal objects are: the journal is logically a large circular buffer, but only a subset of the objects of that buffer are actually physically present. There are millions of different objects that could be in that set, and only some of them are used, so you need to know where they are. That's in this head object, which we can get and then pipe into ceph-dencoder to get the journal header.
This is using the same logic that the ceph-fuse client library uses when it uses the osdc library — the Filer logic concerning where to actually store the data objects for a file: what's the stripe size, what's the object size, which pool ID, which pool namespace. All of that is stored in the layout, and the MDS journal uses the same information to lay out its journal.
Similarly, the expire position notes where the MDS has — literally — expired journal segments. What's an expired journal segment? It's a segment that the MDS has made durable by writing all of the metadata mutations for that segment out to the physical directory objects.
So, for example, if I create a file in a directory and that event is noted in one of these log objects that I want to expire, I have to go out to the directory object containing that file and actually create the file inode in that directory object's omap. Once that's complete, the log segment is eligible to be expired, and once all the log segments in a log object have been expired, I can advance the expire position of the MDS journal.
Now, just because a log object becomes expired doesn't mean you necessarily want to get rid of it. Because remember, I said that the MDS uses the journal to facilitate recovery, but it also uses the journal to maintain a hot cache of recent metadata, so you don't necessarily want to expire and delete everything up to the current write position.
If you're expiring things as fast as you're writing them, you want to keep a certain buffer of recent events so that a new MDS can still get a hot cache.
So the journal keeps track of where the oldest expired object is, and it also keeps this trimmed position — that's what we use to keep track of what has actually been deleted. So there will be a number of objects which are expired and trimmable, a number of objects that are expired but that we don't want to trim yet, and then the current object that we're writing to; and there will actually also be a group of objects which have not been expired
(I didn't show them here) but are complete and fully written. So the trim position is constantly being updated whenever the MDS journal data structure — the osdc Journaler — is trimming old objects, which is really just deleting them; and then there's the expire position, which is constantly being updated as the MDS fully writes out all the changes in the journal to the backing directory objects.
You'll also notice that if you just look at an MDS that's been running for quite a while — you've made a few directories or files, and you wait for those directories to show up in RADOS — you'll be waiting a very long time, because the expiring of the log object, and of the log segments in that object, occurs only as needed as the MDS journal grows.
And so those directory objects may not be created until a lot of load is placed on the MDS. If you want to actually see all of those changes flushed to disk, you have to issue a 'flush journal' command to the MDS, through the admin socket interface — or, with Pacific and onwards, through the ceph tell interface — and that will force
all of the current log segments to be expired, meaning that all the changes corresponding to those log segments will actually be written out to the backing directory objects; and then, through some commands to the journal interface, all those segments will also be trimmed.
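
The flush being described can be issued roughly like this (a sketch; the file system name 'cephfs', rank 0, and daemon name 'a' are placeholders):

    # Pacific and later: through ceph tell
    ceph tell mds.cephfs:0 flush journal

    # Older releases: the same command through the daemon's admin socket
    ceph daemon mds.a flush journal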
All right. Another way to look at the journal header — without actually running rados commands and then manually running the decoder — is to just run the cephfs-journal-tool. I said this was new; it's relatively new — it was done in 2014, again by John Spray. He added this tool to allow us to do some kinds of disaster recovery on the journal, but also just to inspect its contents, and so this is a good tool to become familiar with as a CephFS developer,
just to learn how the MDS journal works, and also to learn more about certain types of recovery situations. It's fairly common, especially in community clusters, that something really goes wrong, and if it's really bad they may end up doing a full journal reset, which is just wiping out all of the MDS's journal and creating a new one — so whatever has already been expired from the journal onto the metadata pool is what they get.
And yeah — again, this is from a vstart cluster, and we can see the write position is just a little bit ahead of the expire and trim positions. Those two actually have not been updated yet, because nothing has been expired or flushed from the vstart cluster's journal.
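
The header inspection being shown would look something like this (assuming a file system named 'cephfs' and rank 0):

    # Decode the journal header: write/expire/trim positions, layout, stream format
    cephfs-journal-tool --rank=cephfs:0 header get

    # Overall consistency check of the journal
    cephfs-journal-tool --rank=cephfs:0 journal inspect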
You can also use cephfs-journal-tool to look through the different types of events that are in the journal. Here we do an 'event get list' — I'm just looking at the latest 15 journal entries — and you can see we've written out a subtree map. The subtree map test is something you usually only see in vstart clusters; then new session information markers,
cap updates, those types of things. In the next slide you can see there's a lot of fancy filtering you can do. If you only want to look at the events associated with a particular inode, you can do that with a selector; you can look at particular paths, or fragments (the directory fragments); or, if you just want to see everything that a client's been doing, there's a selector for that as well.
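
A few hedged examples of the listing and filtering being described (the file system name, inode number, path, and client id below are placeholders):

    # Dump a human-readable list of journal events
    cephfs-journal-tool --rank=cephfs:0 event get list

    # Filter to events touching a particular inode, path, or client session
    cephfs-journal-tool --rank=cephfs:0 event get --inode=1099511627776 list
    cephfs-journal-tool --rank=cephfs:0 event get --path=/home/alice list
    cephfs-journal-tool --rank=cephfs:0 event get --client=4361 list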
Here is the reset command I was talking about earlier. This will allow you to do a full wipe of the MDS journal, and it's only done during certain disaster-recovery situations — the journal has gotten corrupted for some reason, an upgrade went horribly wrong; there can be all sorts of reasons. Generally it really should not ever be done, but you'll also notice that community members, when they become desperate, often run this reset command.
So unfortunately it is fairly common to see in mailing-list posts. And then there are these import and export commands, so you can actually export the full journal out to a file and import it back.
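
Roughly, the export, reset, and import operations look like this (sketch only; reset is destructive and loses anything not yet expired to the metadata pool):

    # Save a copy of the journal to a local file before touching anything
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

    # Last-resort disaster recovery: wipe the journal and start a fresh one
    cephfs-journal-tool --rank=cephfs:0 journal reset

    # A previously exported journal can be written back
    cephfs-journal-tool --rank=cephfs:0 journal import backup.bin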
All right. I have, in the slides — if you're following along on the Google Drive, I think I sent out which code we're going to look at — but I can't display both at the same time, so we'll just look at the code in the screen share. Okay, so starting with osdc/Journaler: this is all the code that handles the data-structure logic of the journal.
There's nothing MDS-specific in this library, and again it's used by both the ceph-fuse client library and the MDS, which should make sense, because both actually need to understand how to manipulate the data objects for files.
Now, if you're just in the top-level source tree, you might be looking around for a journal, and you find this directory and think that's where the MDS journal is — but that's actually not where it is. That is code used by RBD; nothing in CephFS uses it, so just be aware of that.
Now, this is actually a really important comment. If you ever have to dig into this Journaler code, I strongly suggest you fully understand this comment before doing anything. The main point — just to summarize what it talks about — is that the current trim position, the current expire position, and the current write position are not updated synchronously with wherever the current write object or expire object actually is; they're written out periodically by the Journaler.
The Journaler just assumes that these are moderately out of date, and it will continue looking forward a few objects until it finds whatever the latest written object is, whatever the latest trimmed object is, and so on. So whatever the head object says is the current write position — if you were doing manual debugging, it may not actually be that object; it could be a few objects later. That's just something to be aware of.
I'm not going to go through this code much; it's just a data structure. Again, it's a logical circular buffer with only a subset of the objects actually allocated, and there's not a lot to say about it. The code for the Journaler reuses a lot of data-structure code from elsewhere, notably the Filer.
A
Oh,
is
the
font
large
enough
and
near
any
complaining,
so
I
just
assume
it
is.
Yeah, it looks good. Okay, all right — so here's the JournalPointer object. Again, it's basically a double pointer; there's not really a lot to say about it. There are the front and back links for which journal we're accessing, and during a recovery situation there may be two, but generally, on a stable cluster, the back link will be null and only the front one is the current journal.
So here is the MDLog, which is the set of code that manages writing out all of the journal events for the MDS and interfaces with the osdc Journaler.
Great. So here we're just creating an empty log. This is done when the MDS is starting up for the first time, opening the MDS log, and performing any kind of replay.
This is the PendingEvent — a structure we're going to see later on. Whenever the MDS needs to write out a log event, it creates a PendingEvent structure and puts it on a queue, because part of the design of the MDLog is that there's one thread for actually doing the writes to the Journaler, and then there are the other threads that come in and create events.
That's all I'll say about the header; let's move on to the code. So here's creating an empty log. This is the MDS inode log offset — the inode number for the Journaler — which for a new cluster will be 0x200.
So that's what this log offset is, and then you add the node ID, which is really the rank of the MDS. For rank zero it's just adding zero, so that's why the rank-zero MDS journal is at 0x200; if you were rank one, it would be 0x201, and so on. And there's actually a maximum number of MDS ranks — 256, which is not something we often see in the wild — and there's no real reason
we have to have that limit, other than that certain on-disk metadata structures assume there are only 256 ranks: once you add 256 (0x100) to 0x200, you get 0x300, and that could be, like, a recovery journal or something.
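
As an illustration of that arithmetic (the pool name below is a placeholder): the journal inode is 0x200 plus the rank, and its objects in the metadata pool are named with the inode number in hex plus an object index:

    printf '%x.%08x\n' $((0x200 + 0)) 0    # rank 0 -> 200.00000000
    printf '%x.%08x\n' $((0x200 + 1)) 0    # rank 1 -> 201.00000000

    # Listing the metadata pool shows the journal objects laid out this way
    rados -p cephfs_metadata ls | grep '^2' | sort | head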
So that's one of the reasons that limit exists — a bit of a diversion. Here we're actually creating the Journaler in the metadata pool, for example with the 0x200 file object, and we're also creating a journal pointer that points to that inode and saving it out, and then we just wait for all of that to complete.
All right, and then here's writing the head object. All right — submit_entry.
This is the main method that most of the other MDS code will be calling.
This is when we are going to submit a log event to the current journal segment — and actually, the journal segment has already been associated with the log event by this point, for certain structural reasons of associating state with the log events. Here we'll be creating one of those PendingEvents I talked about earlier, and that's really just the log event itself plus the continuation here.
So this is a context — an MDS context — which is how we do our continuations within CephFS, or within the MDS. When this actually becomes durable, this context completion will get completed, and that would result in, for example, telling the client that the request is now fully durable: we would get a safe reply from the MDS indicating that the request is now durable.
So here we have, for example, the maximum number of log events per segment, or the check for whether the log segment's size is starting to exceed the size of the backing objects — in which case we're going to start a new segment and end up writing out the current one. And here is that subtree map test log that I mentioned earlier.
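
The thresholds being referred to are ordinary MDS config options; a hedged way to check them (option names as in current releases, defaults may vary):

    ceph config get mds mds_log_events_per_segment   # max events before a segment is rolled
    ceph config get mds mds_log_segment_size         # 0 means "use the journal object size"
    ceph config get mds mds_log_max_segments         # how many segments before trimming kicks in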
This is only done when we have the mds_debug_subtrees config turned on; then we'll actually write out the subtree map test. As the comment indicates, it's just there to catch replay bugs — it's not an event you normally see.
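
That debug knob is just a config option; enabling it (a sketch) makes the MDS journal these extra consistency-check events:

    ceph config set mds mds_debug_subtrees true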
So that gets put in a pending queue — this is a helper method, but actually, if we looked at it,
you can see there's a submit_entry lock, the mutex that is guarding this queue, and here we're adding — this is the important bit — adding that to the pending-events list. So let's move on.
So this is the thread that's just sitting in the MDS, constantly reading pending events from the queue and writing out the log segments as they become complete.
So, let's see — while the MDS is not stopping, we're going to continually run this thread, grabbing an event off the pending-events queue; if it's empty, we wait for an event to arrive. Here we read the event data, and then we check whether there's a log event associated with it — there isn't always going to be one; sometimes you're just waiting for a flush.
And here we're going to get the current write position of the Journaler and associate that with the log segment — which I believe is mostly for debugging — and then, finally, here is where we're actually appending the buffer list containing the log segment out to the Journaler.
Anyway, as this actually gets flushed —
this is a continuation you'll often see in the MDS logs; if you look at a debug log, you'll see lots of these. It's really just a wrapper for completing another context, which would come from another part of the MDS that wants to know that something has become journaled — durable — but it also records the current write position.
There will be a number of these for every log event. You can see this one was created with data.fin — again, the data object is just the pending event that we read off the queue, and the fin is the finalizer associated with it: the continuation that the MDS is going to run when the log event becomes durable.
Starting a new segment is something we do when we want to force certain events prior to that segment to be flushed or expired. This is something you might see, for example, when we're telling the MDS to flush its journal: in order to fully flush its journal, we need to tell it to start a new log segment, so that all the previous events in the prior segment can actually become eligible for expiration.
So that's one of the circumstances where you might have this starting of a new log segment, and generally it's not something you need to think about when you're interacting with the MDLog. But the important thing here is preparing a new segment.
So here is where we actually allocate a brand-new log segment with the current event sequence number, which is just keeping track of the number of events that have gone through the MDS, and —
oh, here it is. So we actually create the new segment, and the first thing the MDLog does whenever it creates a new segment is create a new subtree map event. Every time the MDS starts a new journal segment, it writes out the subtree map — that's done here — so it's going to ask the metadata cache to create the subtree map, which is just a special event in the MDS journal that records the entire subtree map
as of that current point in time and writes it as the first entry associated with that journal segment. This is done to simplify recovery during MDS failover. The reason I bring this up — especially if you're looking into performance issues right now — is that one quirk of the MDS is that if you have hundreds of subtrees, that subtree map can get extremely large, and that means every time the MDS is preparing a journal segment, it will have this gigantic subtree map taking up space in the journal.
The MDS ends up spending a ton of time writing out the subtree map, and also acquiring the big MDS lock in order to do that, so it's very expensive. This is actually the motivation for one of Zheng's recent changes, which he had done before he left — and he actually just renewed the pull request a week or two ago — to pull the subtree map out of the MDS journal, and that is all centered around fixing this bit of code writing out the subtree map at the start of each segment.
Let's move on to the replay thread; this is called when the MDS is doing replay. This might have to be the last thing we talk about, because it's already ten minutes till the top of the hour and I wanted to leave some time for Q&A. So here we have the recovery thread of the MDLog, and this is really just going through the different events of the MDS journal and reading them off the metadata pool.
So here we're loading the journal pointer, and then the current journal. Here's the back journal: if we're doing some kind of reformatting of the journal, or there was a journal reset, this might be non-null, but generally it will be null, and if it is null, then we just load the Journaler with the front object of the journal pointer, and here we're going to recover it. The recovery is this call —
this is the osdc Journaler class method to recover the journal. Again, remember that the journal head may be out of date, so we're asking the Journaler class to load the current write position and expire position and figure out where the actual latest write position is — that's what the Journaler recovery is doing — and once that's done, we've figured out what the actual write, trim, and expire positions are.
So here's the error handling.
And then here, in the replay thread — this is called if we're in standby-replay or if the MDS is in the replay state — is where we actually read the events off the journal and decode each event, and
then here is where we do the replay of the event. That's just taking the updates out of the log event — creating a directory or whatever — and applying them to the metadata cache, and that allows the MDS to rebuild the metadata cache whenever it's replaying events off the journal.
Okay, so I think I've run out of time — I do want to leave a little bit of time for Q&A — and I'll talk about the other parts of the metadata journal sometime in the future. Again, I'll open the floor to any questions or comments about what I've gone through so far.
You might do it if you need to do some kind of recovery and you want to make sure that everything currently in the journal is written out, but generally you would probably be doing that recovery manually with the recover_dentries command of the cephfs-journal-tool, prior to perhaps deleting the journal through a reset command.
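
The manual-recovery sequence being alluded to is roughly this (sketch only, destructive; the fs name is a placeholder — consult the disaster-recovery documentation before running any of it):

    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:0 journal reset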
I do a lot of journal flushes, just because I want to manipulate the directory objects or I want to ensure that everything is synced out — but yeah, as an admin, there are not really a lot of reasons to do it.
Oh — if you want to drop the MDS cache, you probably want to flush the journal first. In fact, the 'drop cache' code in the MDS actually flushes the journal too, and that's necessary because you need to expire everything in the MDS journal so that a lot of the metadata in the MDS cache becomes unpinned and eligible to be dropped. If you don't flush it, the MDS may not drop anything — you'd have, you know,
hundreds of megabytes of cache that just isn't actually dropped. A little bit of a diversion, but having the MDS drop its cache was an interesting idea as part of doing performance tests; however, due to how slowly you have to drop the cache in order to prevent destabilizing the MDS,
it's not very useful in practice, I find. At least from a performance-testing perspective, if you're doing tests, you're better off just recreating the file system, because it just takes too long to do it in a safe way.
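
The cache-drop command being discussed — which, as described, flushes the journal first — is invoked roughly like this; the rank spec and timeout are placeholders:

    ceph tell mds.cephfs:0 cache drop 300   # timeout in seconds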
But I don't think dropping the cache has become very common for community admins — you don't see it there very often; I mean, I don't hear it talked about on the mailing list, and that's really the only way I would know.
So, a question from chat: what's the average size of the CephFS journal in comparison to the data stored? I'd have to go read the numbers for this, but the MDS journal, I think, can get to a few hundred megabytes in size; it tries to cap itself around that size by forcing trimming.
And: Patrick, can you talk a bit more about the optimization regarding the subtree map? I wasn't able to follow that — what's the issue, and what is being done to solve it, with regard to the subtree map being stored in the journal?
So that's not a scaling problem by itself, at that level — but once we added ephemeral pinning, where you can have a number of subtrees created according to the size of the directory, and especially with distributed ephemeral pinning — which, at least in the first iteration of the design, gave you a subtree for every subdirectory under a directory —
think of the canonical case I always mention: home directories. We want every user's home directory to be pinned to a particular MDS through ephemeral pinning, so you would have a subtree for every home directory, which obviously could be hundreds or thousands, and that did not scale. We had scaling issues: rank zero, which was authoritative for the home directory, would be overloaded with journaling, slow to respond to lock requests; it just was —
The initial idea was to somehow figure out how to stop writing out the subtree map every time we do a journal segment — is there some clever way we can reconstruct the subtree map while reading the journal? Zheng ultimately didn't elect to do that; he pulled the subtree map out of the MDS journal completely.
I have not actually dug into that part of his PR yet, as far as where he's storing the subtree map now — it might just be a separate journal. Which reminds me: I mentioned I was going to talk about the purge queue. Purging was originally part of the MDS journal — since I have a little bit of time left, I'll just mention that the MDS would, when an unlink event was going to be expired, actually delete the corresponding entry from whatever directory object held
the link. That was low-hanging fruit for improving the performance of the MDS, because deleting subdirectory trees is fairly common and it doesn't necessarily need to be done immediately during event expiration. So the purge queue within the MDS exists to handle the logic of actually deleting files, truncating files, and also stray reintegration, outside of the journal machinery of the MDS.
So the purge queue actually maintains its own separate journal — using the same Journaler class — for all the files that need to be deleted, and it does that out of band from the MDS journal. This is kind of continuing a trend of splitting things out of the MDS journal, because it has become a large data structure that doesn't scale very well for all the different things the MDS actually needs to do.
So the subtree map is now the new victim to be removed from the MDS journal.
As far as whether or not we will merge Zheng's PR, I still have to go through it, and I'm not exactly sure whether it'll be the right approach, but I generally trust Zheng to make good choices in that regard, so it'll probably be merged — especially since, apparently, he's still working on CephFS. I wasn't sure if he'd be around to clean up any bugs that may surface as a result of merging his PR, so we'll see about that.
So, just to close this topic out: I think what we will find going forward is that there will probably be other things we want to pull out of the MDS journal as far as making the MDS faster.
Anything we can pull out of the journal into a separate thread and data structure will probably end up being a large performance improvement for the MDS, so that's something to consider in future work for CephFS.
One example of that is the recursive-unlink RPC that we want to implement, where we would no longer journal, or have a number of RPCs with the client and a number of journal events associated with, deleting a directory tree. That could be something we potentially pull out into the purge queue.
And then, as another example: again, I mentioned that every time we have a new client, we write that out to the journal. If we ever get to a situation where we have thousands or tens of thousands of clients, then, yes, we have no idea what scaling issues exist at that point; we may need to rethink how we journal client sessions in the MDS in that particular situation. But I don't think we're anywhere close to that yet.