From YouTube: CephFS Code Walkthrough: MDSMonitoring
Description
Schedule: https://tracker.ceph.com/projects/ceph/wiki/CephFS_Code_Walkthroughs
A: So today we're going to talk about the MDS monitor. I believe I've done this at least once before, in a watercooler session.

B: For one of our first walkthroughs, I thought I'd go through this. It's often a little mystifying.
A: So what is the MDS monitor? The MDS monitor is part of the Ceph monitors; it's one of the Paxos services that we have. The monitors make changes to the MDS cluster by making modifications to the FSMap or the MDSMaps, and then distribute those MDSMaps to all the clients and MDSs according to what those entities have requested.

Some entities, like the MDSs, only care about the MDSMap for the file system that they're serving; some clients may want to subscribe to the entire FSMap to be able to see all file systems; and some clients will, like the MDSs, only care about an MDSMap for a particular file system, which is generally the case.

The other function of the MDS monitor is to monitor the health of the MDSs. For example, failing an MDS is really just removing it from the FSMap; creating a new file system is creating a new file system and MDSMap in the FSMap; things like that. So that's everything the MDS monitor is for, as far as purpose.
B: So, within the monitor there are these components, primarily.

A: The MDS monitor refers to two given FSMaps, and each FSMap refers to one or more MDSMaps; well, actually, it could be zero or more, if there are no file systems. The MDS monitor remembers the current official FSMap, which I've marked as epoch e-1 (FSMap e-1), and that is set in stone: it can't be changed once created, once the monitors have agreed on what its current state is. So it's immutable. And then there's FSMap e, which is the next FSMap that the MDS monitor may be making changes to. Here I removed one of the MDSMaps, just to indicate that maybe a file system got deleted, and that would be the delta between the two.

And then there's this FSCommands module within the monitor, which is just an abstraction for performing a number of file system commands on the FSMaps: for example, creating a new file system, failing all the MDSs on a file system, or changing fs settings on a file system. Those are all routed through FSCommands, which also modifies the current FSMap that's pending. You'll often see, within the MDS monitor, the next epoch referred to as "pending". The pending one is never shown to clients until it's finally distributed among all the monitors and they agree on the changes as part of their Paxos algorithm. So any changes to the current pending FSMap are never revealed to clients or MDSs; only when consensus is reached by the monitors do we finally effect those changes, which may be telling an MDS that it's been removed from the MDSMap, or updating clients about the new MDSMap.
A: So here's a look at the MDSs, the monitors, and the clients, and what kind of messages we see between these entities. The big one is going to be from the MDSs to the monitors: they periodically send what's an MDSBeacon message, which is really just a heartbeat message for the most part, telling the monitors that the MDS is still alive. That is sent, I believe, every mds beacon interval seconds, usually about two seconds, from the MDSs to the monitors.

When the monitors receive that, let's say the MDS sends it to one of the peon monitors; it could be any of the monitors that it sends the message to, but generally, once it picks a monitor, it stays that way for the duration of the MDS's lifetime, unless it loses its connection to that monitor. The peon (assuming the MDS sent it to a peon) would forward that message, via an MForward wrapper, to the monitor leader, and the leader is actually the one that's going to make any necessary notations about the beacon: for example, is the MDS no longer laggy, or is there a state change that needs to be performed? Then it sends the response back to the peon monitor, which forwards it, and finally the MDS gets the MDSBeacon ack.

The beacon messages we're going to get into more later, but one of the reasons they're very important when looking at the MDS monitor is that they drive state changes for the MDSs. If, for example, an MDS is in the replay state, so it just took over for a failed rank, only when it's done replaying is it going to tell the monitors: okay, I'm ready to move on to the next state.
A: And it'll mark itself as (I always get them all jumbled in my head) something like up:resolve, and it will request that the monitors change its state to resolve. Then the monitors, or rather the leader, will drive the change in the FSMap and MDSMap to the updated state, and only once all the monitors agree on the new FSMap epoch will it finally respond to the MDS that, okay, you can go to up:resolve; and that would be included in the MDSBeacon ack.

Also, the monitors distribute MDSMaps periodically via an MMDSMap message. Those get sent to the MDSs whenever there's an MDSMap change, and the same thing happens for the clients of the file system, which I have in the top right. And then on the bottom left we have this MCommand message, which would carry something like an "fs new" or "fs set" or "mds fail", any of those kinds of commands; it gets sent to the mon, and if it corresponds to an MDS command that's handled by the MDS monitor, it would get processed there.
C: I have one, Patrick: is it always the leader that distributes the MDSMap, or can a peon also do it, once all the monitors have agreed upon the Paxos state?
A: Yeah, they can all distribute the MDSMap once they've agreed. Usually the MDS will pick a monitor at random, and it gets everything from that monitor, including the MDSMap. The leader doesn't need to get involved in distributing the MDSMap to the MDSs, and that's one of the nice things about having several monitors: you can actually distribute that load out. So yeah, the leader is not involved in that.
A: Alright, so we're going to look at some code, and I will need to change what I'm sharing.

D: You need to open it up, I think, permissions-wise.
A: All right, so the first bit of code we're going to look at is called the PaxosFSMap, and this is fairly new code in the monitor; I think I added this about two years ago. The reason we're starting with this is that it's kind of unique to the MDS monitor, and I think, in retrospect, if we were starting the monitor from scratch, we'd probably use something like this throughout all the monitor services, because this is the thing that gets screwed up the most in the monitor and causes all sorts of problems. That is, as I mentioned in the diagram, the current epoch: the one that has been distributed to MDSs and clients and is immutable, and is stored in the Paxos service, like the MDS monitor. Well, sometimes people write code and they accidentally modify the immutable FSMap, and that causes all sorts of problems: it doesn't do what they expect, and sometimes changes get lost, etc.
So what I did was add this class, which just protects the current FSMap and then what we call the pending FSMap, the next epoch. The PaxosFSMap is inherited by the MDSMonitor; take a quick look at the MDSMonitor. Let me get back to that code: so this is private; these members, the fsmap and the pending fsmap, can't even be accessed by the MDSMonitor class.
The only way to get at these members is through the methods, and we have getters for the fsmap and also the pending fsmap. These are public, so they can be called by, for example, the OSDMonitor, which does happen in at least one place: within the OSDMonitor, it actually needs to look at the FSMap, but when it does use these methods, it only gets a const version of it. So we can be fairly certain that it's not changing it without us being aware of it.

The protected methods are designed to be used by the MDSMonitor itself. So here is how it actually gets a writable version of the map, and it can only get a writable version of the pending FSMap; and we have a check here to make sure that it's the leader monitor that's doing the changes, so that we don't have a code path where somehow a peon monitor is changing the pending FSMap, because that's never what you want to have happen. And then we have two methods here which actually create the next pending FSMap; so the leader, any time it's finished...
And then we have this decode method, and this will primarily be called by the peons. Whenever a peon gets a new FSMap from the leader, it's going to decode the buffer list, update its current FSMap, and then also null out the pending FSMap, to make sure that there are no invalid accesses to it; because, again, the pending FSMap should only be read and written by the leader.
So we have these protection methods in, and there were a number of changes that had to be made to the MDS monitor to use this. But overall, this actually turned out to be a very positive change, I feel, because it caught a lot of bugs and is preventing, through code, any number of new bugs. But again, this is a protection unique to the MDS monitor; you're not going to see this in other services.
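The protection just described can be sketched as a small class. This is an illustrative model with hypothetical names, not the actual Ceph code: the committed map is only handed out const, the writable handle to the pending map asserts leadership, and the peon decode path drops the pending map so stale references cannot be used.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Stand-in for the real FSMap: an epoch plus a field we can mutate.
struct FSMap {
  unsigned epoch = 0;
  int num_filesystems = 0;
};

// Minimal sketch of the PaxosFSMap idea (hypothetical names).
class PaxosFSMapSketch {
public:
  // Public const getter: safe for other services to call; they cannot
  // mutate the committed epoch through it.
  const FSMap& get_fsmap() const { return fsmap_; }

protected:
  // Only the owning monitor service calls this, and only on the leader.
  FSMap& get_pending_fsmap_writeable(bool is_leader) {
    if (!is_leader)
      throw std::logic_error("peon must not modify the pending FSMap");
    return *pending_;
  }

  // Leader: start the next epoch from the committed map.
  void create_pending() {
    pending_ = std::make_unique<FSMap>(fsmap_);
    pending_->epoch = fsmap_.epoch + 1;
  }

  // Peon path: a freshly committed map arrives; adopt it and drop the
  // pending map so there are no invalid accesses to it.
  void decode_committed(const FSMap& committed) {
    fsmap_ = committed;
    pending_.reset();
  }

private:
  FSMap fsmap_;                     // committed, immutable epoch e-1
  std::unique_ptr<FSMap> pending_;  // next epoch e, leader-only
};

// Expose the protected API for demonstration purposes.
struct DemoMonitor : PaxosFSMapSketch {
  using PaxosFSMapSketch::create_pending;
  using PaxosFSMapSketch::get_pending_fsmap_writeable;
  using PaxosFSMapSketch::decode_committed;
};
```

The point of the design is that a write to the pending map cannot leak into the committed map, and a peon touching the writable handle fails loudly instead of silently corrupting state.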
A: All right, and then just to look at the MDSMonitor side of this. So again, the MDSMonitor is a Paxos service, and there are two methods we should look at for the MDSMonitor, as far as this PaxosFSMap is concerned. The first one is update from paxos. Even on clusters with all the debugging set to the lowest defaults for the mons, you'll still end up seeing these new FSMaps get printed out in the monitor log, so you can rely on that being there. And then here's a call to check subs: any time the monitors update the FSMap, they also need to notify all the clients who have subscriptions to the MDSMap, and they will go through all those and update them; we'll take another look at this later.
This would be called by... it is just one of the abstract methods in the PaxosService class. Whenever it's time to create the next map that the Paxos service is distributing, it calls this create pending method, and that would happen after all the monitors have consensus on what the current version of the FSMap is; then the leader is going to create the next pending one. So that happens here, and again, this is wrapped up in this PaxosFSMap precisely to prevent the MDS monitor code from creating a new pending map anywhere except here.
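The lifecycle just described can be modeled as a toy. This is a sketch under stated assumptions (the names are illustrative, not the real PaxosService interface): after consensus, the framework calls an update-from-paxos hook on every monitor, which logs the new map and notifies subscribers, and the leader then opens the next pending epoch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy sketch of the service lifecycle the talk describes.
struct ToyService {
  unsigned committed_epoch = 0;
  unsigned pending_epoch = 0;
  std::vector<std::string> log;  // stands in for the monitor log

  void update_from_paxos(unsigned new_epoch) {
    committed_epoch = new_epoch;
    // The real MDSMonitor prints the new FSMap even at low debug levels.
    log.push_back("new fsmap epoch " + std::to_string(new_epoch));
    check_subs();
  }
  void create_pending() { pending_epoch = committed_epoch + 1; }
  void check_subs() { log.push_back("notify subscribers"); }
};

// Drive one consensus round the way the framework would on the leader.
inline void commit_round(ToyService& svc) {
  svc.update_from_paxos(svc.pending_epoch);
  svc.create_pending();
}
```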
A: So the next code to look at: as I mentioned earlier, the beacons are what drive most of the state changes that are automatic in the MDS monitor, and these come from the MDSs. The first method that's going to be called whenever there's a beacon is this preprocess beacon method, and that happens both on the peons and on the leader.
So, as the name of the method suggests, we're just doing some basic pre-processing on the beacon prior to actually making any changes that would be a result of that beacon. For example, here's a simple permissions check that even the peons can do: if it's not an MDS with the right cap, then the beacon's got insufficient privileges and we're just going to ignore it. And here we're going to check that the fsid for the Ceph cluster matches what the message has: is this an MDS that's actually supposed to be part of a different Ceph cluster, things like that. And then some compat checks. For the most part, these result in just ignoring the message.
The peons don't reject messages, and this is actually sometimes a problem within the mons: if something goes wrong, the monitors just ignore it, and so what you'll end up seeing is a client that hangs, because it's expecting a response from the monitors that it's never going to receive. This can be either a good thing or a bad thing, depending on who you ask; confusing hangs, in my opinion, are kind of a bad thing. But that's something that's kind of prevalent in the monitors: a lot of the default behavior for handling a problem is to just ignore it.
So you'll see that. And then here's the key bit in this preprocess method: if it's not the leader, if this monitor is a peon, then we're just going to return false, and that just indicates to the caller, the PaxosService... Let me see if I still have this code.
Yeah, so preprocess query is one of the MDSMonitor's methods, and you can see it just calls preprocess beacon below if it's an MDSBeacon message. When the PaxosService calls this preprocess query, if the return value is true, that means the message was processed and there is no further work necessary as far as pre-processing for the message. But if it returns false, which for the peons will generally always be the case, then the beacon needs to be forwarded to the leader, and that is handled here in the PaxosService code of the monitor: if it's not the leader, then it's just going to forward that request to the leader, so further processing will happen on the leader.
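The true/false contract just described can be sketched in a few lines. This is a simplified model with hypothetical types, not the real signatures: a return of true means "fully handled during pre-processing" (including silently dropping a bad message), while false means the message needs the prepare phase, which a peon satisfies by forwarding to the leader.

```cpp
#include <cassert>
#include <string>

// Sketch of the preprocess/prepare split (hypothetical types).
struct Message { std::string type; bool valid_fsid = true; };

enum class Outcome { Dropped, Answered, ForwardToLeader };

inline bool preprocess(const Message& m, bool is_leader) {
  // Cheap checks every monitor can do: wrong cluster fsid => ignore.
  if (!m.valid_fsid) return true;  // "handled" by silently dropping it
  // A peon cannot make state changes, so it reports "not handled".
  if (!is_leader) return false;
  return false;  // leader: fall through to the prepare phase for writes
}

inline Outcome dispatch(const Message& m, bool is_leader) {
  if (preprocess(m, is_leader))
    return m.valid_fsid ? Outcome::Answered : Outcome::Dropped;
  return is_leader ? Outcome::Answered  // leader runs the prepare phase
                   : Outcome::ForwardToLeader;
}
```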
So if you're trying to debug issues with changes to the MDS monitor, you generally always need to be looking at the leader and not any of the peons, because all the beacon processing and all the file system commands are eventually going to be processed on the leader.
A: So finally, when the leader does get the message, it's also going to go through this preprocess business and then do some basic checks on the beacon. For example: if the gid of the MDS does not even exist in the current FSMap, and it's not asking for the boot state (that is, it's not a new MDS)... That could happen periodically, for example, if the MDS gets kicked out for being laggy because there was a network partition, and then it comes back 20 minutes later thinking it's still in the MDSMap; the monitor says, no you're not, and it sends it an empty MDSMap to cause it to reboot.
That is this method; we're just getting a const version, a const reference, to the current FSMap, so it can't be modified, and we're not calling get pending fsmap. We only call the writable getter where needed, and you can actually see exactly where in the MDSMonitor we get this reference: it's not in preprocess beacon. We're only using the existing FSMap, so we're not making any changes to the FSMap in this method; we're just doing basic pre-processing on the beacon. For example, if the current version of the MDS is laggy, we're definitely going to note the beacon.
And you'll see this called in a few places in MDSMonitor; it's basically just doing some internal bookkeeping, the monitor noting that the MDS has been seen recently and should not be considered laggy as of this time. The interesting bit about that being in the pre-processing is that pre-processing is done, I believe, with fast dispatch, so it's done basically as soon as the message is right off the wire. That's important to prevent an MDS from being marked laggy and removed from the cluster just because the monitor is under load: maybe it's doing a lot of work of some kind, it's getting a lot of messages, something is slowing down the monitor. You don't want your MDSs to get kicked out just because the beacon messages are not being processed quickly enough.
So that's one of the functions of pre-processing, and you'll notice, especially if you look back in the history of the MDS monitor code, that there have been numerous changes to try to avoid that particular situation of the MDS monitor falsely believing that the MDSs are laggy or disconnected and then removing them.
So that's been a recurring issue in the MDS monitor code. And then there's various code here to check if there's going to be a state change; eventually, if there is, we're going to actually send an ack message back and then finally do some further processing on it, and the further processing is going to be done in prepare beacon.
So this is where we're actually going to make state changes to the FSMap, potentially, and because of that, we're going to get a writable reference to the next FSMap; that's what this pending reference will be.
A: Right, and so there are a few things prepare beacon is doing. It's going to record the health checks from the beacon: every time the MDS sends a new beacon, it will have all of its health checks in it. For example, if it's got an oversized cache, or a client is not releasing its caps fast enough, you will see those messages in these, what we call, MDS health metrics; and the monitor is going to look for differences and actually note those, for example, in the cluster log: for example, this MDS health message was cleared. All that work happens here early on.
And then, finally, the MDS monitor is going to look for state changes. So, is this going to be a brand new MDSMap? When an MDS first boots, it's going to send the state boot to the monitor, and here we're checking one condition, mds enforce unique name, which is by default true.
So if there's another MDS with the same name as this one that's claiming the boot, then we're just going to go and find the old MDS instance with that name, and we're going to kill it. As part of killing it, you have to blocklist it, so here we're waiting to see if the OSD monitor's OSDMap is writable.
If it is, we're going to blocklist it; that happens in fail mds gid. But if it's not writable (and that just corresponds to machinery in the monitors about whether or not you can do a write at the moment to the pending OSDMap), then we're going to wait for it to be writable and then retry this beacon message. But here we actually fail the instance with that name.
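The "wait for writable, then retry" pattern just described can be sketched as a callback queue. This is a minimal model with hypothetical names, not the real OSDMonitor interface: when the pending OSDMap cannot take a blocklist entry right now, the handler parks a retry callback that replays the whole operation once the map opens up.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the wait-for-writeable machinery (hypothetical names).
struct OsdMonSketch {
  bool writeable = false;
  int blocklist_entries = 0;
  std::vector<std::function<void()>> waiting_for_writeable;

  void wait_for_writeable(std::function<void()> retry) {
    waiting_for_writeable.push_back(std::move(retry));
  }
  // Called when a proposal completes and the pending map opens up again.
  void on_writeable() {
    writeable = true;
    auto cbs = std::move(waiting_for_writeable);
    for (auto& cb : cbs) cb();
  }
};

// Try to blocklist a daemon; park a retry if the OSD map is busy.
inline bool blocklist_or_wait(OsdMonSketch& osdmon, int addr) {
  if (!osdmon.writeable) {
    osdmon.wait_for_writeable([&osdmon, addr] {
      blocklist_or_wait(osdmon, addr);  // replay the whole operation
    });
    return false;
  }
  ++osdmon.blocklist_entries;
  return true;
}
```

Replaying the whole operation, rather than resuming mid-way, keeps the retry path identical to the first attempt, which is why the monitor re-runs the beacon message instead of saving partial progress.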
So there's various bookkeeping here for the beacon. Here's the else branch for whether it's a booting MDS, and that means that there's some kind of state update eventually. So here, again, we're checking if the gid exists in the pending FSMap. Maybe, for example, an mds fail command was issued and the MDS got removed from the next FSMap, and at the same time as that's occurring, the monitor is also looking at a beacon from that MDS.
So if, in the pending FSMap, that MDS has also been removed, then it's going to hit this code path, and it's going to need to wait for the current FSMap to be distributed amongst all the monitors, for it to reach consensus; and then it's going to execute this lambda context to send an empty MDSMap to the MDS that sent that beacon.
That would be the situation where we would hit this code path. And then, finally, some of the interesting bits. So here we're going to clear the laggy state if it was laggy, and, moving on, here's where we're handling some of the various states. So, if the MDS is stopped, that means that the rank is down: the MDS was in the stopping state and then it says it's stopped.
It's finished; so here we're going to call this FSMap stop method on this gid, and the FSMap will do some bookkeeping to record that the rank is stopped; and then we just remove the stopped gid from various bookkeeping structures in the MDS monitor. And here is an indication that the rank is damaged.
So again, we're also going to blocklist the MDS: we're going to check to make sure that the OSD monitor's OSDMap is writable, and then finally we're going to mark that the rank is damaged, here in the pending FSMap. And I'm looking for where we blocklist them... and maybe we don't, because it's presumably going to shut down on its own. So I'm not sure why this check was here to begin with; that might just be unnecessary.
Here we go: we're calling the OSD monitor's blocklist method directly on the addresses of the MDS that sent us the damaged notification, and that's just to handle the case where the MDS sends the damaged message but then somehow continues operating; the monitors want to make sure that it's dead and does not exist. We would get this state if, for example, the MDS was failed, or rather, terminated.
So it got some signal, like SIGTERM, and the MDS is just going to tell the monitors: hey, I'm going away and I'm not coming back. That allows the monitors to immediately do a replacement, rather than the MDS just going away and then the monitors having to wait for the full MDS heartbeat grace time period, which I believe is by default 15 seconds.
Excuse me. And then, at that time, it would do the replacement. So instead, the MDS immediately just sends off a beacon to the monitor saying, I no longer exist; the monitors are going to blocklist it and then let the MDS know that it got the message, so it sends a beacon back saying that it got it, and then the MDS would finish shutting down. The next big one, just scrolling down a bit, is this one.
A: So we're going to check that the MDS is not currently in the standby state, and that the state the MDS is requesting is not equal to its current state; and then we're going to make sure that the state transition is valid. If it's not valid, then we're going to make a note of it.
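The validity check just mentioned amounts to a lookup in a table of allowed transitions. The table below is a hypothetical, much-simplified version for illustration; the real set of MDS states and allowed transitions lives in the Ceph source and is larger.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified table of allowed MDS state transitions.
static const std::map<std::string, std::set<std::string>> kAllowed = {
  {"up:boot",      {"up:standby"}},
  {"up:standby",   {"up:replay"}},
  {"up:replay",    {"up:resolve", "up:reconnect"}},
  {"up:resolve",   {"up:reconnect"}},
  {"up:reconnect", {"up:rejoin"}},
  {"up:rejoin",    {"up:active"}},
  {"up:active",    {"up:stopping"}},
  {"up:stopping",  {"up:stopped"}},
};

// Mirror the guard described above: ignore no-op requests, and reject
// transitions that are not in the table.
inline bool transition_ok(const std::string& cur, const std::string& req) {
  if (cur == req) return false;  // no change requested
  auto it = kAllowed.find(cur);
  return it != kAllowed.end() && it->second.count(req) > 0;
}
```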
And then not do anything. And then here we're going to modify its state according to whatever it requested; and then, finally, after making any changes to the pending FSMap, we have to wait for the finished proposal: you can't tell the MDS about its change to the state of the FSMap until it's actually complete.
The reason for that is that all of the blocklisting and such is done by adding to the blocklist of the OSDMap. So if I want to blocklist an MDS so it can't access the OSDs anymore, I need to update the OSDMap by adding its addresses to the OSDMap blocklist section; and so, whenever I need to blocklist an MDS, I have to wait for the OSD monitor to be writable.
D: Okay, okay; so because MDSs are, yeah, also OSD clients, you want to blocklist them. Okay, right.
A: All right, the next one is prepare command; this corresponds to prepare beacon. So here we've got a MonCommand, and this is basically a message wrapping up some API request to the monitors.
This would be the message that carries, for example, an mds fail command, or fs new, or fs set; any of those commands get wrapped in this MonCommand message. Here we're getting the command map from the JSON, getting the prefix of the command (which would correspond to, say, mds fail), getting the session, and ensuring it has sufficient access; and then, here again, we're getting the pending FSMap, because as part of doing these commands, we're going to have to make changes to the FSMap. And here we go through a number of handlers to see if we can process those messages there.
These are going to correspond to what's in the FSCommands class, and we'll get to that in a moment; this is just some common handling, an abstraction for handling those commands. And if we can't, then we go to this poorly named method, filesystem command, which has really just become an MDS command method.
If we go to that, you'll see most of these commands that we're handling; if the prefix is, for example, mds set state... some of these are just development commands that aren't worth talking about. Let's talk about mds fail, because that's something people actually run. So here we're getting the role-or-gid argument, and then getting the gid from that argument by applying this gid-from-arg helper method in the MDS monitor; and if the gid doesn't exist, we complain that it does not exist; various checks; and then, finally, we fail the MDS, which is something we call throughout the MDS monitor.
And fail mds, I believe, will return EAGAIN if the OSD monitor is not writable; and if that occurs, then we wait for it to be writable and retry this message. So the check for the writable OSDMap happens in fail mds in this case. All of these just make changes to the pending FSMap. All right.
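The command handling just walked through is, in essence, a prefix-keyed dispatch table. This is a generic sketch with hypothetical names, not the real monitor code: the prefix parsed from the command JSON selects a handler, and an unknown prefix returns an error.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Sketch of prefix-based command dispatch (hypothetical registry).
using Handler = std::function<int(const std::string& arg)>;

struct CommandTable {
  std::map<std::string, Handler> handlers;

  int dispatch(const std::string& prefix, const std::string& arg) const {
    auto it = handlers.find(prefix);
    if (it == handlers.end()) return -22;  // -EINVAL: unknown command
    return it->second(arg);               // 0 on success
  }
};
```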
A: So this is a method in the MDS monitor that's just periodically called, approximately every five seconds, I believe, by the Paxos service. It just lets you do basic upkeep on the FSMap. So here, we're again getting the pending FSMap writable handle, and we're going to do a number of checks on it. For example, here we're going to check health; this is just going to make sure that the MDSs are healthy.
Make sure it's fully populated. So you'll see again, as I said before with the beacon: noting when to remove an MDS because it's timed out is actually fairly complicated, and there have been a number of changes over the last decade to the monitor to handle various corner cases of MDSs going laggy.
So that's all handled here. If one of them is laggy, then we're just going to remove it; so we have this vector of gids to remove, and you'll see the various notes: for example, this one's being marked laggy, this one is going to be removed. And then here we actually go through the to-remove vector, find replacements for those MDSs, and drop them.
A: The current MDSMap is from here: we get the FSMap, get the file system, and get the MDSMap. We want to actually check if the current epoch of the FSMap is resizeable, and then the MDSMap corresponds to the pending FSMap. So, you see, this fsmap is the writable handle here, and then this method also wants to look at the current epoch as well, so it gets a handle to the current MDSMap. So this is the pending fs, and then this is the pending MDSMap.
The names could perhaps be improved. So we're just making sure that either the current FSMap is resizeable or the pending FSMap is resizeable; if neither is true, then we're going to say that the MDSMap is not currently resizeable, and we're not going to make any adjustments to the number of MDSs in the cluster based off of max mds.
Otherwise, we look at, for example: if the number of MDSs is less than max mds, then we're going to try to grow the cluster, and here we would find a replacement by asking it to fill in this rank. And if the FSMap is able to do that, then we're going to promote it: find a replacement for the rank, promote it to that given rank, and then we're done. The MDS monitor, at one point, actually allowed you to promote several MDSs.
So if I set max mds to 10, it would try to promote, like, nine MDSs all at once, to all those ranks, if it had sufficient standbys available. Behavior like that turned out to be a source of bugs, so at one point we changed it so that things happen in a sequential fashion. Now the MDS monitor will only promote one rank, and then it's going to wait for all the ranks to be active, and only then is it going to add one more MDS to the cluster. And that's also true of stopping.
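The sequential scaling rule just described can be sketched in a few lines. This is a hypothetical model, not the real FSMap logic: on each tick, promote at most one standby, and only when every existing rank is already active.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A rank's state: "active", "creating", etc. (illustrative model).
struct Rank { std::string state; };

// Promote at most ONE standby per tick, and only when every existing
// rank is already active; returns the number of ranks added (0 or 1).
inline int maybe_grow(std::vector<Rank>& ranks, int max_mds, int standbys) {
  if ((int)ranks.size() >= max_mds || standbys <= 0) return 0;
  for (const auto& r : ranks)
    if (r.state != "active") return 0;  // wait for in-flight promotions
  ranks.push_back({"creating"});        // promote exactly one standby
  return 1;
}
```

Growing one rank at a time means a single misbehaving promotion stalls the process visibly rather than leaving nine half-promoted ranks, which is the class of bug the sequential change was meant to eliminate.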
A: The MDS monitor will only stop the largest rank and wait for it to fully stop, and then it moves on to the next highest rank and stops that rank. So that happens in this method, maybe.
A: And then, finally, we go through again and look to maybe promote any standbys for failed ranks. So if a file system has a failed rank and it's waiting for a standby to promote, that would happen in this method; and if, as part of doing this, we also have to propose a new OSDMap (that would happen, for example, if we failed an MDS, since we then need to also do a blocklist and update the blocklist), we also have to request the proposal to the OSD monitor.
So here, this code is looking to see if we can... I think this was done in, like, 2011. It's going to update the clients of the monitors with the new FSMap; so any time there's a change to the MDSMap, the peons and the leader of the mons will send out new MDSMaps to the clients. The code path that's going to get hit most often here is this one.
We have a sub that's requesting the MDSMap. So: is it a client? Has it requested a particular namespace? If you recall, in the client we referred to file systems as namespaces, so that terminology leaked into the MDS monitor at the time all that code was written. And here it's going to look up the file system and then find the fs id, the FSCID; that's what this code path is...
These code paths are really looking for the FSCID, and then, once it has it, it's going to send off the MDSMap corresponding to the FSCID. So here is the lookup where it gets the MDSMap, and then, finally, that gets sent in this MMDSMap message and shipped off to whoever is asking for this subscription.
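The subscription path just described boils down to a two-step lookup. This is a toy model with hypothetical types, not the real FSMap API: resolve the file system name to its FSCID, then fetch the MDSMap stored under that FSCID, and the result is what gets wrapped in the MMDSMap message.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy model of serving an MDSMap subscription (hypothetical types).
struct MDSMapStub { unsigned epoch = 0; };

struct FSMapStub {
  std::map<std::string, int> name_to_fscid;  // fs name -> FSCID
  std::map<int, MDSMapStub> mdsmaps;         // FSCID -> MDSMap
};

inline std::optional<MDSMapStub>
lookup_sub(const FSMapStub& fsmap, const std::string& fs_name) {
  auto it = fsmap.name_to_fscid.find(fs_name);
  if (it == fsmap.name_to_fscid.end()) return std::nullopt;
  auto mit = fsmap.mdsmaps.find(it->second);
  if (mit == fsmap.mdsmaps.end()) return std::nullopt;
  return mit->second;  // this is what gets wrapped in an MMDSMap
}
```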
A: All right, so we've talked about beacons a lot, so I thought we'd finally look at what a beacon is, and there's not a lot to say; it's a fairly simple message. Probably the most complicated part about this message is the number of health warnings that we've got in here, these, what we call, MDS metrics; these get shipped off to the monitors.
The main bits that are interesting here are the name of the daemon (every MDS has a unique name) and the daemon state. This would be what the MDS is saying it wants its next state to be; in the steady-state general case, it's going to be asking for the active state repeatedly and nothing will change. And here are the MDS health metrics that it includes.
And then also the file system for the MDS, or the file system name; and then one last thing, the sequence number, which is going to get bumped every time the MDS sends a new beacon to the monitors. It sets the sequence number so that it can keep track of what the monitor has seen so far, and that's one of the ways you can tell if it's got a laggy connection with the monitors.
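The beacon payload just described can be sketched as a struct. The field names here are illustrative, not the real MMDSBeacon wire format: a unique daemon name, a desired next state, the file system name, the health metrics, and a sequence number bumped on every send so the daemon can compare what it sent against what the monitor has acknowledged.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Rough shape of the beacon payload (illustrative field names).
struct BeaconSketch {
  std::string name;        // unique daemon name
  std::string want_state;  // e.g. "up:active" in steady state
  std::string fs_name;     // file system this MDS serves
  std::map<std::string, std::string> health_metrics;
  uint64_t seq = 0;        // bumped on every send
};

struct BeaconSender {
  uint64_t last_seq = 0;
  uint64_t last_acked_seq = 0;

  BeaconSketch next(const std::string& name, const std::string& state) {
    BeaconSketch b;
    b.name = name;
    b.want_state = state;
    b.seq = ++last_seq;
    return b;
  }
  void handle_ack(uint64_t seq) { last_acked_seq = seq; }
  // A growing gap between sent and acked sequence numbers is one sign
  // of a laggy connection to the monitors.
  uint64_t unacked() const { return last_seq - last_acked_seq; }
};
```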
A: And then there's also the Beacon class. This is a class that operates outside of the MDS lock, for the most part, and it's just going to be doing things like handle mds beacon: whenever it gets an acknowledgement from the monitors for its beacon message, it gets an MMDSBeacon back, and here it's just going to note things like the sequence timestamp the monitor sent back, and detect whether it's no longer laggy. For the most part, beacons are sent through this send method.
This send method; and predominantly these are sent by this sender thread that's initialized in the Beacon init, and so every interval seconds it's going to call this send method and then just wait to send the next one. So again, this operates outside of the MDS locks; you'll see regularly in the MDS logs that the MDS's Beacon class is just sending off a new beacon to the monitors, and here it notes the sequence it's sending, updating the last sequence number, setting health metrics on the beacon.
A
And again, as I was telling you, we're trying to avoid delays: if the MDS or the monitors are under load, we don't want that to delay beacon processing. So you'll notice (let me find it, all right) here's fast dispatch: whenever we get a beacon response from the monitors, it's going to fast-dispatch this handling. And that's one of the reasons this operates outside of the MDS lock, because you should not acquire mutexes as part of fast dispatch.
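The constraint can be illustrated with a minimal sketch (Python, illustrative only; lock names are assumptions): the fast-dispatch path only touches the beacon's own short-lived lock, never the big MDS lock, so the messenger thread can never block behind a long-running MDS operation.

```python
import threading

class MiniMDS:
    """Illustrates the fast-dispatch rule: beacon-ack handling must not
    take the big mds_lock, only the beacon's own briefly held lock."""

    def __init__(self):
        self.mds_lock = threading.Lock()      # may be held for a long time
        self.beacon_lock = threading.Lock()   # only ever held briefly
        self.last_acked_seq = 0

    def fast_dispatch_beacon_ack(self, seq):
        # Runs on the messenger thread. Taking mds_lock here could stall
        # the messenger behind a busy MDS, so we only take beacon_lock.
        with self.beacon_lock:
            if seq > self.last_acked_seq:
                self.last_acked_seq = seq

def demo():
    mds = MiniMDS()
    with mds.mds_lock:  # simulate the MDS being busy under its big lock
        # The ack is still processed because fast dispatch avoids mds_lock.
        mds.fast_dispatch_beacon_ack(7)
    return mds.last_acked_seq
```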
A
So again, when I was going through the MDS fail handling, I noticed there were a bunch of handlers that could handle a given MCommand message.
A
A lot of the file system handlers were moved to this FSCommands class, and that was just to simplify a lot of the repetitive code, like batching proposals and actually requesting a new proposal. All that code was abstracted out, and we have these classes now.
A
They can say whether they handle a given message, or whether an op is even allowed. That's perhaps the interesting one here, is_op_allowed: this was based off of Rishabh's recent work to add
A
authorization to the cephx caps that say which file systems a client has access to, and can even see the MDS map for. And that's here: we're actually getting a copy of the current FSMap and we're filtering it based off of what the session is allowed to see, so it can only see a given file system name.
A
This filter method filters out all the other file systems, and then after that we check whether it has access to that file system by trying to get it. If the file system doesn't exist, perhaps because it was filtered out, then we return an error message that the file system is not found, with some exceptions, for example if it's an fs rm command.
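The filtering logic described here can be sketched like so (a Python illustration of the idea; the real code is the FSMap filter method and the FSCommands access checks, and the function names below are made up):

```python
def filter_fsmap(fsmap, allowed_fs_names):
    """Return a copy of the fsmap containing only the file systems
    the session's caps allow it to see."""
    return {name: fs for name, fs in fsmap.items() if name in allowed_fs_names}

def get_fs_or_error(fsmap, allowed_fs_names, fs_name, command="fs status"):
    """Filter first, then try to get the fs; a filtered-out fs is
    indistinguishable from a nonexistent one, so caps don't leak names."""
    filtered = filter_fsmap(fsmap, allowed_fs_names)
    fs = filtered.get(fs_name)
    if fs is None and command != "fs rm":
        return None, "Error ENOENT: filesystem '%s' not found" % fs_name
    return fs, None
```

Note the ordering: the copy is filtered before any lookup, so the handler below never even sees file systems the session cannot access.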
A
So that's one of the examples of abstracted handling there. And just to look at a given file system handler: usually, when writing a new one of these, it's mostly just an exercise in copy and paste; they're templated pretty well. So here we're checking if the OSD monitor is writable, because of what this fs fail command is going to do, if you're not familiar with it
A
already: it's going to take every rank in the file system, both the rank itself and any standby-replay daemons for that rank, and it's going to remove them all from the MDS map, effectively failing them.
A
So as part of doing that, you have to blocklist them, so it's going to check that the OSD monitor is writable. If it's not, then we have to wait before we execute this command.
A
If it is writable, then we're going to get the file system associated with this fs name, and then we're going to mark it not joinable, so no new MDSs can join the file system. That would prevent, for example, the MDS monitor tick method, which goes through the file systems looking for failed ranks, from promoting any standbys to a given rank, because the file system is marked not joinable.
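Putting those steps together, the shape of the fs fail flow is roughly the following. This is Python pseudocode of the flow just described, under my own assumed names, not the actual C++ handler:

```python
def handle_fs_fail(osdmon_writable, fsmap, fs_name, blocklist, retry_later):
    """Sketch of the fs fail flow: wait for a writable OSD monitor,
    mark the fs not joinable, then fail every rank and standby-replay."""
    if not osdmon_writable:
        # Failing daemons requires blocklisting them via the OSD monitor,
        # so if it isn't writable we must retry the command later.
        return retry_later()

    fs = fsmap[fs_name]
    fs["joinable"] = False  # stop tick() from promoting standbys

    # Snapshot the GIDs first, then remove, so we never mutate the
    # mds_info map while traversing it.
    gids = list(fs["mds_info"].keys())
    for gid in gids:
        blocklist(gid)            # blocklist the daemon's session
        del fs["mds_info"][gid]   # remove it from the MDS map
    return "failed %d daemons" % len(gids)
```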
A
The interesting thing here, and this is really a C++ thing: here we're getting a vector of the GIDs we're going to fail, pushing them back into the vector. The reason for doing that is that you don't want to modify the map that's returned by get_mds_info by removing MDSs during the traversal.
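The same caveat applies in any language: erasing from a container while iterating over it is undefined behavior in C++ and a runtime error in Python. A minimal illustration (Python, my own example, not the Ceph code):

```python
def fail_all(mds_info, should_fail):
    """Remove matching entries from mds_info safely: snapshot the keys
    first (the analogue of copying GIDs into a vector), then erase."""
    to_fail = [gid for gid, info in mds_info.items() if should_fail(info)]
    for gid in to_fail:
        del mds_info[gid]  # safe: we iterate the snapshot, not the dict
    return to_fail

# The naive version fails at runtime:
#   for gid in mds_info:
#       del mds_info[gid]  # RuntimeError: dict changed size during iteration
```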
A
So, just some of the interesting history that happened in the last 10-ish years. In 2020 we added MDS affinity, the mds_join_fs config that MDSs get set with. That's being used both in cephadm and Rook to indicate which file system an MDS has been created for, and that's just to simplify the previous situation we had: cephadm was creating MDSs with very specific names, and the names also included the file system that the MDS was created for, and it was really weird to have
A
those MDSs join other file systems; it just looked like a mix, and it was really hard to track whether or not you had a stray MDS and all that. And you also may want to have MDSs with different hardware for a given file system.
A
That was to simplify some config settings we had, like where you had to specify that a given MDS was allowed to do standby replay, and that was really confusing because generally you wanted to set that on all of them, and really it made more sense as a file system flag. So this file system flag, allow_standby_replay, was added. In 2018 we added the paxos FSMap, so the get_writable_fsmap methods and all that; that's when those protections were built in.
A
In 2018 we added (Doug actually did a lot of this work) the incremental max_mds-controlled deactivation, so we don't stop a bunch of ranks at once. If you reduce max_mds to one from, say, five, you don't try to stop ranks one through four all at once; you do it in an incremental fashion.
A
That proved to be an extremely big stability win for multi-MDS. From 2017 to 2018 you'll see a lot of changes in the MDS monitor trying to fix some issues we kept seeing in upstream QA, with beacons being lost or not being recorded properly.
A
So we had tons of messages and weird states about MDSs being replaced because they're laggy, so there was a lot of code churn there, and I think there are still maybe a few latent bugs, especially when the MDS is talking to a peon monitor: some of the beacons are not being processed quickly enough for some workloads, and that's causing the MDSs to be removed falsely. So I think there are some latent bugs that need fixing. 2017: last-rank deactivation.
A
Oh, I may have these mixed up: 2018 was the incremental max_mds-controlled activation, I should say. So if you increase max_mds, it only increases MDSs incrementally; last-rank, I think, was the stopping change. Anyway, it's a mix of code there. And then in 2017 the FSCommands class was added, and that corresponded roughly with the 2016 changes that John did to add multiple file system support. So all that was done fairly recently.
A
In 2013, allow_new_snaps. This was in response to a number of bugs that were suspected with snapshotting in file systems, so Greg Farnum added the allow_new_snaps setting on file systems to indicate that new snapshots were allowed to be created on a file system. And we also used that setting to detect if snapshots were ever allowed on a file system, because we needed to do certain upgrade checks.
A
There were certain upgrade sequences required to upgrade a file system that had had snapshots at some point. In 2011, standby replay was added; it was fairly long ago that this feature came to CephFS, and only recently, I think, did it become more usable: with this allow_standby_replay setting it became much easier to set up. And then in 2009, subscriptions: Ceph has always had some kind of subscription to the MDS map by necessity, but a lot of the code to create subscriptions dates back to then. That's about
A
as far back as I went. And if you look at the code of the MDS monitor, you'll see at least half of the commits are trying to fix weird consensus bugs, or beacons being lost in certain code paths or not being recorded properly, so the monitors falsely believe an MDS is gone.
A
Okay, all right, that's it for the slides. Any questions?
A
Yes, absolutely. All the monitors have their own store where they keep the MDS maps, with some lookback; eventually things get garbage collected after a while, so it doesn't remember every MDS map it ever created.
A
The actual machinery for storing the MDS map is handled at a higher level than the MDS monitor, though; it doesn't need to worry about the details of saving the MDS map to persistent storage, which is handled by other code. And yeah, that's all I'll say about that. Does that make sense, or are there any other follow-up questions?
D
And the other question I had is: when you make any changes to the FSMap and you get a new version of the FSMap, the pending one, are the MDSs automatically subscribed to it, and is that how they know about the change? Because they keep subscribing, or the monitor sends those messages to the MDSs saying that the map has been updated?
A
So there are different kinds of subscriptions. One is that you might just want the next epoch and you don't care about any follow-ups, and then there are other subscriptions where you want to be continually updated for all versions, and the MDSs will do the latter.
A
I haven't looked at that code very hard, so I may be wrong on the details, but I'm pretty sure that's how it works. It may be that every time you get an epoch, a new MDS map, you have to ask for the next one immediately, but I think, in order to reduce the traffic on the mons, that was probably changed to just be automatic.
A
Sorry, Craig is not very happy right now. Any last questions?
A
All right, thanks everybody for attending the walkthrough. I'll see you all tomorrow. Bye.
Thanks, Patrick, nice talk.