From YouTube: 2015-JUL-30 -- Ceph Tech Talks: CephFS
Description
A detailed look at CephFS. What it is, how it works, and what the development roadmap looks like.
Finally, things like permissions and directories are intrinsically useful: they're not just holdovers. If you build a new system based on objects, you might find yourself implementing some kind of hierarchical concept on your own. So sometimes, if it's what your application needs, having a file system that has those things baked in is very useful. That said, I wouldn't use a file system for everything.
Some applications that expect a file system interface use it badly, in a way that assumes the latencies associated with local file systems. The classic example is people running a lot of 'ls -l' calls from their applications, which will go through and stat every file in a directory. That worked fine on a local file system, but it presents challenges for us when implementing a distributed file system: providing that kind of functionality without unacceptable latency overhead. And finally, the statefulness of file systems is challenging, whereas an object store is much simpler in that respect.
That limits your ability to talk to all the existing applications that have been developed and debugged against other file systems, so implementing a POSIX interface makes us compatible with a whole lot of software. We get great scalability in our data storage because we store our file data directly in RADOS and inherit all of its useful properties.
We get scalability in our metadata by allowing users to have multiple metadata servers that act as a cluster. We also add some functionality on top of the basics you would expect from a POSIX file system: we have snapshots, which can be taken at a per-directory level, and we have recursive statistics, which let users see statistics at a per-directory level without having to recurse down the file system.
The way all of that is built is outlined in a fair bit of detail in the paper I reference at the bottom of this slide, which is Sage's from way back in 2006. Some of the stuff in there has changed since, but a lot of it is still relevant, and that longevity is kind of amazing when you consider how rapid the development on Ceph and CephFS is. The project is actually over 10 years old now, and the file system was one of the earliest parts to exist.
So this diagram will probably be a little familiar to people who've been to previous talks about RADOS as well. The little turquoise squares are OSDs, and the new part of this is the little squares with an 'M', which are our metadata servers, or MDSes. At the top of this diagram you have your client host, which is running some CephFS client code. When we say client, we essentially mean a mount: when you have /mnt/cephfs or something like that,
that's a client, and the client is sending two types of information to the cluster: data and metadata. Metadata is things like opening a file or getting the attributes of a file, and data is the actual reads and writes within a file. As I mentioned, the data goes directly to RADOS, so it doesn't have to go through the metadata servers; there's no extra bottleneck there. And there are multiple metadata servers within the cluster.
The way that we store the file data within RADOS is worth describing. In a similar way to RBD and RGW, we support striping and chunking of the data in CephFS. We already have a unique identifier for every file, and that's its inode number, so objects are named after the inode, followed by a period and then the offset within the file, where the offset is a count in units of the chunk size
that's selected for the file, which is 4 megabytes by default. Users can change those settings on a per-file or per-directory basis, and they do that using virtual extended attributes, which are accessible using any existing system tools. So you don't need special Ceph-specific tools on the client to do that. And in addition to specifying the striping of a file, the layout lets you say which RADOS pool you want to store the data in, so you can have multiple RADOS pools in use for CephFS.
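As a concrete illustration (a hedged sketch of the virtual extended attribute interface, not taken from the slides; the pool name cephfs_data_ssd is made up for the example), reading and changing a layout from a client mount looks roughly like this:

    # Read the layout of an existing file (run on a CephFS mount)
    getfattr -n ceph.file.layout /mnt/cephfs/myfile1

    # Change the layout for new files created under a directory:
    # 8 MB objects instead of the default 4 MB...
    setfattr -n ceph.dir.layout.object_size -v 8388608 /mnt/cephfs/mydir
    # ...and send their data to a different (hypothetical) pool
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/mydir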
We put the metadata into directory fragment objects within RADOS, and these are things that take advantage of RADOS's omap interface: RADOS lets you create an object and then use it as a key-value store, an omap. In CephFS we want to support really large directories, so we don't want to create arbitrarily large omap objects. So we break directories up into fragments based on a hash of the directory entry names within a directory, and within that omap
the keys are the file names, or their dentry names, and the values are the directory entries, which in CephFS also include the inode. So we embed the inodes directly with the dentries, so that when somebody retrieves a directory they get all the data they need right there, and that reduces latency when somebody is traversing the file system. The takeaway from that is that there is locality in the way that we store files within a directory.
This is a simple example of what kind of objects you end up with after creating a directory and a file within your CephFS file system. We create a directory called mydir and write 12 megabytes of data to myfile1 within it. On the left-hand side of this slide you see two objects in the metadata pool. These are directory fragment objects. The top one is for the root directory, the slash directory, which has a magic inode number that is already known to everybody working with the file system.
That inode number is just 1, so that directory fragment object contains a single entry for mydir. These are omap key-value pairs, and the value contains the inode for mydir, which is 10000000001 (forgive me if I don't pronounce the right number of zeros; that's just how we print them). And then for that mydir inode there is a directory fragment object again, which contains an omap key for myfile1, whose value contains an inode numbered 10000000002, and in the data pool you've got the objects for the file.
There are three because the default chunk size is 4 megabytes and we wrote 12 megabytes, and they just contain the data. The first object in a file has this extra extended attribute called parent, which contains the full path to the file as of the time it was created, and the use for that will become clear very shortly.
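A hedged sketch of how you could poke at those objects with the standard rados CLI; the pool names and the exact object names (hex inode plus chunk index or fragment id) are illustrative, based on the inode numbers described above:

    # Directory fragment objects in the metadata pool
    rados -p cephfs_metadata ls            # e.g. 1.00000000, 10000000001.00000000
    rados -p cephfs_metadata listomapvals 10000000001.00000000   # dentry -> embedded inode

    # Data objects in the data pool (3 x 4 MB chunks of myfile1)
    rados -p cephfs_data ls                # e.g. 10000000002.00000000 ... 10000000002.00000002
    rados -p cephfs_data getxattr 10000000002.00000000 parent    # the (binary) backtrace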
So this structure on disk is optimized for the lookup-by-path case, where something says: I want to open /mydir/myfile1. We can go and read the root directory fragment, find mydir, read the mydir directory fragment, find myfile1, and then go and access the objects on disk for that file. That's fairly straightforward.
In order to look up by inode number, we have this extra attribute on the first data object of files. We call it a backtrace, and we can get to that directly by inode number, because we name our data objects after the inode number, and that allows us to have support for hard links and for NFS file handles.
So, for example, if you lose your metadata pool and you want to try to rebuild your data as best you can, you do have some record in the data pool of what the path originally was for the files. And this is what it looks like when you do a lookup by inode: you initially go and read the backtrace from the data object, and then the rest of the lookup is just the same by-path process going through the directory fragments, except that the path came from the backtrace instead of having been given to us by the client.
So that's how the data and metadata are stored within RADOS. All of that work of storing the metadata is done by the metadata servers. These are daemons, much like the mons or the OSDs, written in C++, and initially, when you start up an MDS daemon on a host, it does nothing at all: it communicates with the mon and goes into a standby mode.
The MDS ranks also have some of their own data, per-rank data, such as a journal, and things like an allocation table for assigning new inodes, which is per rank as well. By storing those in RADOS we can fail over MDS ranks really quickly, because there is just no data at all left behind on the MDS servers themselves. You start a new one, it gets assigned the rank, and it picks up where the old one left off by reading all that metadata from RADOS.
The actual assignment of which piece of metadata goes to which rank is done dynamically, and it's done in terms of subtrees. So in this diagram the colors represent different MDS ranks, and the gray, MDS rank 0, starts at the root; that's where all the metadata would have been initially, and then over time, as the system is used, hot directories get reassigned to different MDSes. And if there is a particularly hot directory within a parent directory that was also quite hot, this can be recursive as well.
So you can see in this diagram that there are colors within colors. That might seem like a lot to keep track of, but in reality all the MDSes actually need to know about this situation is not where each individual inode is, but just where the subtree boundaries are. I'm just going to pause for a second and see if there are any questions in the chat. Okay, I should have said at the start: feel free to just drop any questions that pop into your head into the chat as we go.
So when the MDS is making updates to this metadata, if you had a whole bunch of clients making updates to a whole bunch of files, you would find that things like incrementing the size of a file to reflect data appended to it would generate a lot of little pieces of I/O to update the metadata objects on disk. So we don't do that. Metadata ops are written to a journal; there's a journal for each MDS rank, and when we have received some updated metadata, we've written it to the journal.
We also have that in an in-memory cache, a cache of inodes and dentries and directories. It will remain in that in-memory cache until it falls off the end of the journal, and we actually use pretty big journals: the default size of the journal is in the hundreds of megabytes, and you can make it a lot bigger than that
if you want. That's partly because it's nice not to have to worry about hurrying to evict, or rather expire, things from the journal, but it also lets us do failover even more efficiently, because after we replay the journal, which is a great big journal with a large collection of recent metadata operations in it, our cache will be warmed up with everything that was recently operated upon by clients.
Typically you want to size this for the amount of RAM in your metadata server, and you want to provision servers that have plenty of RAM for use as MDSes. The mds cache size parameter lets you control that; it's a limit expressed as a number of inodes. Controlling cache size has actually been kind of a tricky area recently, and that's because the clients have to be involved in the process too. If a client has a file open, or a file in its cache, the MDSes can't necessarily remove it from their cache.
They have to pin it as long as a client is using it, and so in order for the MDSes to shrink that cache, they have to ask the clients to shrink the client caches. As you can imagine, that's a distributed systems problem, and therefore it's hard. If you follow the mailing list, you will have seen various people asking about some of the warnings that we've added recently for clients failing to respond to capability releases and that kind of thing, and those messages are about this.
So I mentioned that the client maintains a cache. The client-MDS protocol is kind of interesting. It's implemented twice: once in our userspace client, which has a FUSE interface, and once in our kernel client, which is part of the upstream Linux kernel. The clients start up and they learn the addresses of the MDSes from the mons, so when you mount a file system you don't type in the address of an MDS, because the MDSes are completely dynamic.
The capabilities a client holds roughly summarize to: for this file, you are allowed to write to it, you're allowed to read from it, you're allowed to update the metadata for it, you're allowed to update extended attributes for it, and that kind of thing. And that means that when you've got multiple clients which are taking an interest in the same file, although we have to do this locking to maintain POSIX semantics, it's not an all-or-nothing thing.
When a failure happens, there can also be client operations that were in flight but weren't quite finished yet. That might sound like kind of a low-level implementation detail, but this stuff is worth knowing even as an administrator, because you will see the MDS going through these stages as it starts up after a failure or when you first start it: it'll go through replay of its own journal, it'll go through a reconnect phase where it's waiting for the clients to come back, and it'll then go through a client replay phase where the incomplete client operations are getting replayed.
So if you see a system stuck, or seemingly stuck, in any of those states (they can take some time), then it's useful to know what that means and what's actually going on within the system. That's a common failure mode in distributed file systems; it's not just CephFS that has this dance you have to go through after a failure. If the clients are unresponsive or one of them has died, we do cope with that, but it's not necessarily immediately obvious.
In case it is still alive, we do something that in some systems would be called fencing, and then we let another MDS start and take on that role. That all happens completely autonomously; there's no admin intervention required. Clients do a similar thing, except instead of pinging the mons (because there may be a very large number of clients, and we don't want to overload the mons with that),
the clients ping the MDSes, and then the MDSes individually decide if a client has been too late, and if it has, they will drop any resources it's holding, so that other clients can get access to them. Let me drop into the chat again. Okay, there's a question: is the dynamic subtree partitioning in place already, as it's said to be unstable? So yes, it's in place, and if you want to know how stable it is, you have to test it.
So that's what CephFS is and how it works. Now, how do you use it? First of all, how does one get it? Well, it's packaged and released as part of Ceph; on some systems it might be a separate package called ceph-mds that you need for your MDS daemons, but it's within the whole Ceph release cycle. You can use ceph-deploy to create MDS daemons, and there's a manual process you can do too, which is documented, and the various orchestration frameworks that have modules for Ceph, many of them, I imagine, know how to do it.
Some low-level things are exposed in the form of admin sockets. Mons, MDSes, and OSDs all have these things called an admin socket, which allows you to log into the node and talk to the daemon locally, and there are quite a few admin socket commands on the MDSes, some of which we will eventually expose up via the mons as well. We tend to add new functionality in the form of an admin socket command first, and then that's exposed elsewhere later.
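For example (a minimal sketch, assuming an MDS daemon whose id is 'a' and the default socket path), run locally on the node hosting the daemon:

    # List the admin socket commands the MDS supports
    ceph daemon mds.a help
    # Equivalent form, talking to the socket file directly
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok help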
So what deploying it looks like in a terminal is actually pretty brief. You've got one command with ceph-deploy to deploy an MDS, you need to create a data pool and a metadata pool for your file system, and then you use the fs new command to configure the file system. Once you've done that, the MDS that you just deployed will be informed by the mon that the file system is now available; it'll come up and take rank zero, start operating as an active MDS, and at that point you can go ahead and mount it.
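Roughly what that terminal session looks like, as a hedged sketch with made-up host names, pool names, and PG counts:

    ceph-deploy mds create mds-host1                  # deploy an MDS daemon
    ceph osd pool create cephfs_data 64               # data pool
    ceph osd pool create cephfs_metadata 64           # metadata pool
    ceph fs new cephfs cephfs_metadata cephfs_data    # configure the file system

    # Then mount it, either with the kernel client...
    mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
    # ...or with the FUSE client
    ceph-fuse /mnt/cephfs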
Here's another practical example. I mentioned we have recursive statistics that give you real numbers for what is inside a directory. Sorry, this is all a bit difficult to read because of the line wrapping. The top half of this is what we're all familiar with from a local file system like ext4: you go and do an ls on a directory, and ext4 claims the directory is four kilobytes.
It's a little bit strange, but we're all very familiar with that if we're familiar with Linux. Now see how that works with CephFS: if I go and look at one of my directories in a CephFS file system, it's telling me 16 megabytes in this example, and that is the size of the files within the directory, or within any children of the directory.
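The recursive statistics are also exposed as virtual extended attributes; a hedged sketch of querying them on a directory (attribute names as I understand them, values illustrative):

    getfattr -n ceph.dir.rbytes   /mnt/cephfs/mydir   # recursive total bytes, e.g. 16777216
    getfattr -n ceph.dir.rfiles   /mnt/cephfs/mydir   # recursive file count
    getfattr -n ceph.dir.rentries /mnt/cephfs/mydir   # recursive files plus subdirectories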
Now, snapshots are also exposed directly within the file system, and no special tools are needed to create and manage snapshots. Every directory in the CephFS file system has a magic .snap directory. This doesn't correspond to a real piece of on-disk metadata; it's something virtual: the MDSes see you accessing the .snap directory and translate that into internal operations. In this example I'll just step through it: we create a file called history, and then we take a snapshot by making a directory inside .snap.
It's a bit counterintuitive that we're sort of repurposing mkdir, but file systems don't give you a way of adding new commands, so instead of a 'make snapshot' command you have a make directory command. At the point that we've created that snapshot, we can go back up into the backups directory, delete the history file,
do an ls and see that it's really gone, but then, if we do an ls in .snap/snap1, we'll see that it's still there. So once you've created a snapshot, they show up as if they were directories within the .snap folder. And similarly, if you want to get rid of a snapshot, you can get rid of it with the remove directory command.
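The whole walkthrough condenses to a few ordinary shell commands; a sketch of the sequence described above (directory and snapshot names illustrative; depending on the release, snapshots may first need to be explicitly enabled since they are still experimental):

    cd /mnt/cephfs/backups
    mkdir .snap/snap1        # take a snapshot of this directory
    rm history               # delete the file from the live tree
    ls                       # it's gone...
    ls .snap/snap1           # ...but still present in the snapshot
    rmdir .snap/snap1        # remove the snapshot when done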
The statistics that you can get out of any Ceph daemon are particularly useful for MDSes, so you can run this command on any type of daemon, not just MDSes, to get insight into what's going on in your system. If you want to know 'why is my client stuck?' or 'why is my file system not pushing RADOS as hard as I'm expecting it to?', it's very useful to look at these stats, especially the rates of client requests. The sixth column from the left-hand side, the hcr or handle_client_request
column, is kind of interesting, especially when you compare it to the next column along, which is objecter writes; the objecter is an internal name for the component that issues RADOS writes from the MDS. So you can actually see the journaling going on here: we're getting a fairly steady stream of client requests coming in, but we're going through several seconds of not doing very much in terms of RADOS writes, and then a little flurry of updates, and that corresponds to expiry of log segments within the MDS's log.
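To the best of my knowledge the command behind that table is the daemonperf view of the MDS performance counters; a hedged sketch (daemon id assumed, column names abbreviated and version-dependent):

    # Print a top-style table of MDS perf counters once per second
    ceph daemonperf mds.a
    # The underlying raw counters are also available via the admin socket
    ceph daemon mds.a perf dump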
There are varying degrees of testing that have been done so far on this. We recently had some very useful feedback about how the NFS server especially was handling, or not handling, cache pressure properly. So as with any of these upstream components, please do try these things out, but if you find bugs, then be ready for that and be ready to report them to us.
Okay, there's a question again: does the MDS use CRUSH? So CRUSH is the algorithm that's used within RADOS for deciding how to locate data across a population of disks. In order to place something within that population, well, CRUSH tells you where to put placement groups, and then, to decide what placement group an object goes into,
you take the hash of the object name. So in RADOS we scatter the objects across all of those placement groups, and the placement groups get placed on OSDs using CRUSH. The MDS, however, doesn't use CRUSH, and that's because the MDS isn't necessarily aiming for a uniform distribution of data. What the MDS is aiming to do is continuously monitor what the hot spots are in the metadata hierarchy and then decide, based on that, dynamically where to move things around.
So in the metadata servers, when we're assigning metadata from one MDS rank to another, we're doing that explicitly, and the placement in terms of which MDS a piece of metadata belongs to is determined dynamically, whereas within RADOS the place that an object lives is implicit in its ID and the place that the placement group lives is implicit in the CRUSH calculation. So I hope that answers that. And there's a question about permissions and access control, which I will come to a little bit later in the presentation.
Now I want to talk about the most recent developments in CephFS, what's happened over the last year or so. Our focus at the moment is very much on getting to the point where more people will be comfortable using CephFS in production. So, at the moment, CephFS is functional. It works. You can install it and put your data on it and go and read it back and it'll still be there.
One of the downsides to POSIX file systems generally is that, because of the tightly coupled nature of all the inodes and directories and dentries, if you poke a hole anywhere in there and just knock out one piece of metadata, you're potentially going to remove access to all the metadata beneath that point in the tree, and that makes it more fragile.
So these numbers are a little out of date, because Hammer has been out for a while now, but during the Firefly-to-Hammer period you can see many hundreds of commits and many thousands of lines of code added to the CephFS directory, and to the Ceph QA suite as well for testing specifically the file system, and a pretty steady turnover of bug tickets. So, as bugs come up in the file system, either reported by users or found by automated tests, we're continuously fixing them, and we're also backporting some file system fixes.
Although there isn't quite the same level of support for long-term releases of the file system yet, because it's not in production as widely as the other components, we are still making an effort to certainly not break things, not break backwards compatibility (we don't do that), but also to make sure that when there are bugs which are affecting the people who are early adopters of CephFS, they get taken care of. So I'm going to really quickly run through a sort of grab bag of the various things that have been added.
Similarly, if something seems to be stuck or not progressing, you can now use the op tracker component to get a very detailed, into-the-internals view of what's going on within the MDS. This is a little less user friendly, because it is very internal information, but if you've got a system that's stuck and you need to send some information to developers, or to whoever is supporting your system, it allows you to give them that level of information.
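A minimal sketch of pulling that information out of the op tracker via the admin socket (daemon id assumed; exact command availability may vary by release):

    # Show metadata operations currently in flight in the MDS
    ceph daemon mds.a dump_ops_in_flight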
We want to check that if a directory says it has 16 megabytes of data in the files within it, those files really exist and really do have that total size; that the metadata we have in memory in our cache really matches what's on disk; and that our metadata for files matches reality. So if we've got metadata saying the size of a file should be 200 megabytes, is it really 200 megabytes? This is partly about detecting damage from loss of objects on disk, although RADOS is very resilient.
If you do have, for example, a three-disk failure and you lose some subset of your placement groups, we want to make it so that that won't kill your entire file system. You'll only lose the data you've really lost, rather than having your whole file system go down. But as well as that data loss case, it's also about making the system resilient to bugs, because bugs happen, and we need to make sure that when somebody hits an issue, we can take them through a process of recovering and fixing their system.
So some parts of this recovery and repair capability are starting to come online, in the master branch at least. There's a brand new cephfs-data-scan tool which enables you to scan through the data pool and essentially scrape out the files by exhaustively examining the objects in the pool. A little bit more selectively, we can identify which files in the data pool appear not to be referenced by the metadata, so things which are orphans, and take actions such as removing them, or recovering them, so creating metadata that will reference them.
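A hedged sketch of what using that tool looks like (run offline, with the file system and MDSes stopped; the pool name and exact argument forms vary by release):

    # Pass 1: scan all data objects and reconstruct file sizes and layouts
    cephfs-data-scan scan_extents cephfs_data
    # Pass 2: scan inode backtraces and rebuild or relink metadata for orphans
    cephfs-data-scan scan_inodes cephfs_data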
There will also be a new capability at the RADOS level to have many workers share out the namespace in a RADOS pool and work on it in parallel. So this tool will take advantage of that, and what it will look like is that you'll have tens, or however many you want, instances of this program running in parallel to scrape the data out of your system in the case of a disaster.
There is a sort of sibling tool that has existed for a little bit longer; this has been in the last couple of releases. It's called cephfs-journal-tool, and what this gives you is the ability to recover from damaged journals. We've seen at various points bugs or incidents that have led to people having damaged journals, and historically that would break your system really badly. It would not necessarily be completely unrecoverable, but it was pretty hard to recover from, because without these journals the MDSes couldn't even start up.
So this is an offline tool which, if your MDSes won't start up because something's gone wrong with your journals, lets you interrogate what's there, identify specifically which parts of the journal appear to have become unreadable or unusable, and then take action to fix that, such as by blanking out parts of the journal that you no longer want to touch because you know they're broken, or by trying to scrape out as much metadata as we can from the journal before purging it from disk.
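A hedged sketch of the kind of session that might look like (run offline, against rank 0 by default; subcommand names as I recall them):

    cephfs-journal-tool journal inspect                   # report journal integrity and damage
    cephfs-journal-tool event recover_dentries summary    # scrape recoverable metadata out of the journal
    cephfs-journal-tool journal reset                     # then blank the damaged journal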
There are a number of new features which enable you to have some visibility of your clients on your cluster. So there is a session ls admin socket command that allows you to list what client sessions exist. We've also added client metadata, which is transmitted from the clients to the MDSes, and it tells you things like the kernel version, the hostname, and the path that something has mounted, which means that rather than saying 'client 237x had an issue',
we can say 'the client with this hostname had an issue'. It's a simple thing, but it makes a real difference if someone's trying to work out which client is clobbering their system. Client eviction, so killing the session of a client which is known or believed to be dead, or which is misbehaving, used to operate in a slightly best-effort way, and that has been tightened up now, so it's now possible to properly blacklist a client and fence a client.
So even if you have a misbehaving client, you can ensure that you can safely remove it from the system. And that's an example of what it looks like when you run session ls: you get a bunch of useful information about your clients. In the future it would be useful to extend this to environment-specific pieces of information, like what HPC job a client is part of, what VM a client belongs to, or what container it belongs to.
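A minimal sketch of listing those sessions on an MDS (daemon id assumed; the output is JSON whose exact fields vary by version, but it includes the client id plus the hostname, kernel version, and mount path metadata mentioned above):

    ceph daemon mds.a session ls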
There have been various client improvements, especially to cache trimming, which didn't work too well a year ago and now works a lot better. In the FUSE client there is new flock support, so if your application relies on that, you can now use it with the FUSE client. And there is a new quota feature in the FUSE client. This is implemented client-side, so at the moment it's specific to the FUSE client and not available in the kernel client, and there are some caveats around the quota support.
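The quotas are configured through virtual extended attributes too; a hedged sketch with illustrative limits:

    # Limit a directory subtree to roughly 10 GB and 10,000 files (enforced by the FUSE client)
    setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/mydir
    setfattr -n ceph.quota.max_files -v 10000       /mnt/cephfs/mydir
    getfattr -n ceph.quota.max_bytes /mnt/cephfs/mydir   # read a limit back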
So, in addition to adding all of those useful capabilities, there's been a lot of work going into testing and QA as well, and that's really, ultimately, the answer to whether a given release of CephFS is ready: does it pass the tests? Ceph has the teuthology test framework, which a lot of people in the community will be familiar with and which is used across Ceph, and a bunch of new functional tests have gone in there.
There's some ongoing work to improve the access control features around CephFS. Historically you could kind of do a trick where you would set layouts on particular folders that use particular pools and then make sure that certain clients could only access certain pools at the RADOS level, but it didn't prevent clients from doing naughty metadata operations. So it was kind of a little bit fuzzy, and there are new features going in here now to have robustly enforced client access controls using the auth caps mechanism that exists throughout Ceph. And then there are
all these other areas, like hardening the multi-MDS support and the rebalancing, all that good stuff; getting the snapshots working even better than they are now and getting the testing around that which gives us the confidence that they're working; and integration with cloud and container environments. So, for example, the Manila project in OpenStack is of a lot of interest to us, because that provides an avenue by which people can use CephFS with their cloud environment. And just really quickly,
at the end: if you are trying out CephFS right now as an early adopter, these are links you definitely should know if you don't already: the mailing list, the IRC channel, the issue tracker, and the documentation, including the troubleshooting documentation. And when you encounter an issue, check whether the most recent release fixes it, because stuff is getting fixed all the time, and that includes in the kernel, if you're using the kernel client.
If you're reporting an issue, tell us as much about your configuration as you can, especially what versions you're using, whether you're using the kernel client or the FUSE client, what you are doing with the file system, and what kind of workload you are running. And ideally, if you can reproduce an issue with verbose logging enabled, that makes us really happy, and it makes for a really good tracker ticket, if you can do that. So with that I will wrap up and go see if there are any more questions in the chat, or if anyone's talking in IRC.
I don't know off the top of my head if there's a hard limit, but there's a practical limit, because when you create a pool you're creating PGs, and PGs consume resources on the OSDs. So you don't want to create an indefinitely large number of pools. This is a RADOS thing, by the way, not a CephFS thing.
The solution to that is something called RADOS namespaces, and what that allows you to do is create subdivisions, namespaces, within a pool without creating any more PGs and without consuming any more physical resources. So one of the things we'd like to do in the near future is allow CephFS layouts to specify not just a pool but also a namespace, so that people can divide things up using namespaces rather than pools and avoid spurious pool creation.
Yes, please. It may also be affected by what version of FUSE itself you're using, but I imagine if you're using kernel 4.1 you're probably using a recent version of FUSE, so there is a possibility that it could be our bug, but I'm not aware of a bug in that area, so we'll see what's going on. Question: am I going to post the slides? This whole talk is videotaped; it's going to be on YouTube. Then someone asks: how does MDS fencing work?
How do you guarantee that a fenced-off MDS does not subsequently modify metadata? So fencing, or blacklisting, is something that's implemented at the RADOS layer; it's one of the very, very useful primitives that RADOS gives us. When you fence a RADOS client, in this case an MDS, what we do is write an entry to the OSD map.
Everyone then learns that there is a more recent OSD map ('here it is'), and at that point you can guarantee that the clients won't be allowed to work with an older version of the OSD map, so they will have seen the blacklist, and, more importantly, the OSDs enforce the blacklist. On top of that RADOS mechanism, within CephFS we also have a similar structure called the MDS map, and that has an OSD map version in it that reflects the version at which we last did something. So what happens is
we blacklist MDS A and we create a new version of the OSD map, let's say version 99, that includes that blacklist entry. We write 99 to our MDS map, and in the same transaction we write to our MDS map that MDS A has failed. So anybody seeing the MDS map that says MDS A has failed will also see that they need OSD map 99 before taking any actions which assume that MDS A is dead, for example after the point that MDS B has taken over the rank.
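The CLI view of that mechanism, as a hedged sketch: CephFS writes these blacklist entries automatically during failover, but the same RADOS primitive is visible and usable by hand (the address and nonce shown are illustrative):

    ceph osd blacklist ls                               # show current blacklist entries
    ceph osd blacklist add 192.168.0.10:6800/12345      # manually fence a client instance
    ceph osd blacklist rm  192.168.0.10:6800/12345      # remove the entry again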
There's a question: will CephFS work with a cache tier of SSD pools in front of slower backing pools, and are there any specific implications? So CephFS can use a cache tier, because RADOS exposes cache tiers pretty transparently when you're using an overlay mode. You essentially just create a cache tier on top of the pool, set it as the overlay for that pool, and then point CephFS at the underlying pool, and it will pick up the overlay just the same as any other RADOS client.
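A hedged sketch of setting that up with made-up pool names, following the generic RADOS cache tiering commands:

    ceph osd tier add cephfs_data cephfs_cache           # attach the cache pool to the base pool
    ceph osd tier cache-mode cephfs_cache writeback      # run it in writeback mode
    ceph osd tier set-overlay cephfs_data cephfs_cache   # make it the overlay for the base pool
    # CephFS keeps pointing at cephfs_data and picks up the overlay transparently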
The caveat is that you can't use erasure-coded pools directly, so you would have to use a cache pool if you want to use erasure-coded pools. You should also be aware that that's not something we've tested a lot, and you might find you have really quite interesting performance characteristics if you do something like that. So, for example, if your data pool was on a cache tier and you had a lot of hard links, which we have to resolve by going and reading backtraces from the data objects, you might see some surprising behaviour there.
Okay, last question: thoughts on newstore and its effect on CephFS. The only knock-on effect of newstore on CephFS is that it's coupled to some of the sharded object listing feature that we need for making cephfs-data-scan scalable. So that's the only coupling there is; other than that, nothing specific.
I don't see any more questions coming in, so I think that's just about the end. Stay tuned for our next Ceph Tech Talk, which will be on the 27th of August; that's a Thursday again. Keep an eye on the Ceph Tech Talk page in case that changes, but other than that, thanks John, this was great, and we'll see you guys next month.