From YouTube: 2016-FEB-25 -- Ceph Tech Talks: CephFS
Description
A detailed update on the current state of CephFS as we prepare for the Jewel release.
http://ceph.com/ceph-tech-talks
A
Welcome everyone back to the second installment of Ceph Tech Talks of 2016. If you missed last month, we had a great one with the folks doing PostgreSQL on Ceph under Mesos with Aurora and Docker, so that was a pretty awesome conversation. This month we're having another great one: a CephFS update. This will be the last major CephFS update we see here before the next major release, Jewel, which will include CephFS and all the new hotness moving from the nearly-awesome camp into the fully-awesome camp. So we're all very excited about that. So without further ado, I will let John give you a rundown of what's new in CephFS and where we're headed. No pressure.
B
Right, okay, so hello everyone, thanks for coming. I'm a developer at Red Hat and I work primarily on the Ceph file system. I will just share my screen so you can see my slides.
B
While I've got my slides up, I won't be able to see the BlueJeans chat, so if anyone has any questions, go ahead and ask them there, and maybe, Patrick, you can just interrupt me verbally if that comes up ["Sure"]. So today I'm going to give a very brief recap of what CephFS is and what its architecture is, but primarily I want to talk about what's new in the Jewel release of CephFS, which is what we've been working on over the past six months to a year.
B
So CephFS, the file system interface to Ceph, is one of the three applications that you get with a Ceph cluster. Going from left to right: we have the RADOS Gateway, which provides an S3- and Swift-compatible object interface to the cluster; RBD, the RADOS Block Device, which provides disk-image-style access to the cluster; and then CephFS.
B
Although
three
sit
on
top
of
radius,
which
is
the
underlying
resilient
object,
store
that
surfers
built
on
so
surface,
is
a
POSIX
file
system.
What
that
means
is
it
has
a
high
level
of
consistency
than
a
to
the
pool,
NFS
or
file
system?
Would
you
get
all
the
same
semantics
when
it
comes
to
locking
and
concurrency
that
you
would
on
a
local
file
system
like
EXT,
for
the
data
is
stored
directly
in
the
radars
cluster?
B
So data goes directly from wherever you've mounted your file system to RADOS; it doesn't go through any intermediate server. For the file system we have a separate metadata server, the Ceph MDS, which can act in a cluster to spread the load of filesystem metadata across multiple servers so that it isn't a bottleneck, and we do a little bit more than the average POSIX file system.
B
So
in
additional
to
normal
file
system
operations,
we
have
a
couple
of
special
editions
which
are
the
ability
to
take
per
directory
snapshots
of
the
file
system
and
also
maintaining
recursive
statistics.
So,
for
example,
you
can
look
at
the
statistics
on
a
directory
and
your
file
system
and
see
the
total
size
of
funds
within
it
without
having
to
run,
for
example,
DF
to
iterate
through
the
file
system,
which
is
a
comparatively
expensive
thing
to
do
in
a
network
file
system,
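For reference, those recursive statistics are exposed to clients as virtual extended attributes on directories; a minimal sketch, assuming a CephFS mount at /mnt/cephfs and a directory called mydir (both placeholders):

    # Recursive byte count and recursive file count for the whole subtree
    getfattr -n ceph.dir.rbytes /mnt/cephfs/mydir
    getfattr -n ceph.dir.rfiles /mnt/cephfs/mydir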
So, in visual form: the client is at the top of the diagram.
B
It sends the metadata to special metadata servers, which are represented by the three little icons in this diagram, and it sends the data to OSDs directly. The motivation for CephFS, or for distributed file systems in general, is, firstly, that you have a lot of existing workloads that expect a file system, and those workloads aren't going away. You might have some applications that will use an object store, but you will always — well, at least for a long time —
B
You'll still have file system workloads. File systems are useful because they're very familiar to everybody, as well as having applications that already depend on them. You also have system administrators who know how to deal with them, and you have other storage systems that know how to interoperate with a file system, such as backup systems. It's not just legacy systems, either: new systems use file systems too — for example, in emerging container environments like Docker, what they call volumes, their units of persistent storage, are themselves file systems. So why don't we just use a file system for everything?
B
So the reason we don't use a file system for everything is that they're actually harder, in some ways, than other ways of accessing storage. Unlike in an object store, we're not dealing with a flat representation of the data: the pieces of data we're storing — the files and the directories — are not independent. They have relationships to one another; they form a hierarchy, and that means that spreading them out across a cluster of servers is a more challenging problem than doing a similar thing with an object store.
B
It's also challenging to deal with some applications. If you've got an application which is written against a local file system, it has expectations about latency and performance that don't necessarily make sense in a distributed, high-scale environment. The classic example is when people like to run ls -l on a directory: it seems like a very innocuous operation, but on a distributed network file system you are potentially issuing a very large number of metadata reads in order to retrieve all the metadata — for example, to know what colour the files should be in your terminal when you run ls -l.
B
Concurrency in a distributed file system is also more complex than in other forms of storage. When you've got multiple file system mounts coming from different hosts, if one client is opening a file and writing to it, and another client would like to open it and maybe read from it at the same time, then in order to enforce POSIX semantics there's a fair amount of complexity that has to exist within the metadata server and within the client to make that happen.
B
So those are the downsides to a file system. Why is it hard? Why has it taken such a long time for this and other distributed file systems to reach a level of maturity where you would use them alongside your object and block interfaces? To give you a more concrete illustration of what CephFS is and how you use it, here are, paraphrased,
B
the commands that you would use if you have an existing RADOS cluster and you want to start running CephFS. The ceph-deploy tool knows how to set up a metadata server, so you'll need one of those. For a CephFS file system you need a data pool and a metadata pool, so you create two RADOS pools. And finally there is a command called "ceph fs new", which configures the file system and tells the Ceph cluster which pools you would like to use for it.
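Roughly, those steps look like this (the host name, pool names, and placement-group counts here are just examples):

    # Deploy a metadata server on a host
    ceph-deploy mds create mds1

    # Create a data pool and a metadata pool
    ceph osd pool create cephfs_data 64
    ceph osd pool create cephfs_metadata 64

    # Create the file system, telling Ceph which pools to use
    ceph fs new cephfs cephfs_metadata cephfs_data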
B
Once you've created your file system and your metadata server is up and running, you can start using the file system by mounting it from a client. The example at the bottom there is how you mount a Ceph file system using the kernel client, which is part of the upstream kernel. The command line that you would use for the userspace FUSE client is a little different, but it's the same workflow.
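The two mount styles look roughly like this (the monitor address, credentials, and mount point are placeholders):

    # Kernel client: mount via a monitor address
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

    # FUSE client: same workflow, different command
    ceph-fuse /mnt/cephfs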
B
So that was your lightning tour of CephFS and what it is; I'm now going to step through all of the new stuff. The biggest thing, and the thing that I think a lot of people have been waiting for, is the ability to scrub the file system for errors and repair it when something goes wrong. This kind of functionality is critical to moving CephFS into a production-ready state, because it means that not only have we stabilized the software somewhat and removed a lot of bugs, we're
B
Now,
in
a
position
where,
even
if
we
do
encounter
those
or
unexpected
issues,
we
have
the
tools
that
we
need
to
detect
issues
and
correct
them.
On
a
customer
system
in
general,
the
resilience
and
self-repair
or
of
the
data
and
metadata
stored
in
surface
is
actually
the
underlying
radars
clusters
job.
B
So
when
surface
rights,
some
I
nodes
or
some
file
data
into
radars,
it's
getting
replicated
and
when
one
of
those
copies
one
one
of
those
disks
dies,
it's
Renesys
job
to
deal
with
that,
so
the
scrubbing
repair
stuff
and
set
the
fest
isn't
about
the
everyday
data
resilience
because
that's
already
taken
care
of
it's
about
disasters.
It's
about
the
unforeseen,
serious
software
bugs
which
clearly
we
go
to
great
lengths
to
avoid,
but
are
possible
or
scenarios
where
r
anals
can't
do
its
job
anymore.
B
We
now
have
that
capability
for
many
forms
of
metadata
damage,
principally
the
lots
of
objects,
which
is
the
kind
of
scenario
that
you
would
see
if
you
had
a
triple
failure
of
desks
and
radius
and
you've
lost
a
certain
number
of
your
placement
groups,
but
also
for
corruptions,
which
we
don't
generally
expect
to
see
in
a
Reynolds
cluster.
But
that's
really
more
of
a
proxy
for
what
would
happen
if
we
encounter
some
unexpected
software
bug.
If
we
encountered
some
structure
on
disk,
which
just
didn't
make
sense,
so
we
couldn't
decode
it.
B
That
would
be
a
corruption.
These
tools
are
for
using
disasters
and
they
require
expertise.
They're
primarily
intended
for
vendors,
providing
support
for
CFS
to
be
able
to
intervene
in
extreme
situations
and
repair
systems.
They
are
not
something
that
ordinary
users
would
be
using
every
single
day
so
to
go
into
more
detail
about
what
the
new
scrubber
past
up
looks
like
historically,
if
SEF,
if
set
the
first
encountered,
something
it
didn't
like
on
the
radio
stir
a
couple
of
versions
ago.
B
The
metadata
server
would
generally
assert
out,
which
clearly
looks
like
a
crash
to
the
user
and
essentially
is
a
slightly
safer
form
of
crash.
In
the
last
release
of
SEF,
we
added
the
ability
to
mark
MDS
ranks
which
are
the
roles
in
a
cluster,
that's
occupied
by
an
MDS
demon
as
damaged
so
that
when
they
encountered
something
they
couldn't
handle
on
disk
rather
than
crashing,
they
would
report
at
and
go
into
an
official
damaged
state
and
wait
for
intervention
to
fix
them
in
jewel,
we're
more
fine-grained
than
that
again.
B
So,
when
something
bad
is
found
on
it
within
the
radiance
cluster,
the
metadata
server
will
be
able
to
identify
where,
in
the
metadata
tree
hit
an
issue.
For
example,
if
a
particular
directory
has
damaged
metadata
on
disk
and
mark
that
directory
as
damaged
meanwhile,
the
MDS
demon
will
stay
up,
users
will
continue
to
be
able
to
access
the
rest
of
their
data
and
if
they
try
to
access
that
particular
broken
part,
they'll
get
an
appropriate
eio
code
from
their
clients.
B
There
are
some
tools,
server
side
for
dealing
with
this
situation,
so
there's
a
new
damage,
LS
command,
which
ninety-nine
point
nine
percent
of
the
time
will
give
you
zero
entries.
But
if
something
has
gone
seriously
wrong,
it
will
allow
you
to
see
exactly
where,
in
the
metadata
tree,
you've
got
issues
those
corresponding
damage,
RM
command,
which
is
for
removing
entries
from
that
list.
If
you
know
that
you've
fixed
some
fixed
something
using
the
repetitive
and
the
third
come
on,
there
is
for
the
situations
where
we
have
a
non
localized
form
of
damage.
B
If
something
is
wrong,
with
an
entire
MDS
rank.
So,
for
example,
if
some
critical
data
structure
has
been
critical,
global
data
structure
has
been
damaged.
That
would
be
reported
in
the
health
status
and
there
is
this
repaired
command
for
telling
staff
that
you
have
done
some
intervention
in
the
back
in
the
background
and
that
it
should
not
start
using
that
rank
again.
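As a sketch of what those look like on the command line (the MDS addressing and the damage ID are placeholders, and the exact form varies a little between versions):

    # List damage entries known to an MDS
    ceph tell mds.0 damage ls

    # Remove a damage entry once it has been dealt with
    ceph tell mds.0 damage rm <damage_id>

    # Tell the cluster that a rank previously marked damaged is fixed
    ceph mds repaired 0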
B
So, the types of repairs that we can do: if we see inconsistent statistics in the system, we can repair that online.
B
That's the process by which the MDS daemon will traverse a tree of metadata and work out things like what the file count for a directory should be, or how many children a directory should have recursively. Currently, all the other repairs — such as when we have orphaned files — happen offline, which means you need to stop the MDS and go and fix things in the background. So that's the online capability that we have here; I'll come to the offline part in a second.
B
Firstly,
we
can
scrub
a
particular
path
so
that
path
that
the
user
passes
in
will
be
a
file
or
a
directory,
and
the
MDS
will
go
and
check
what's
on
disk
versus
what
it
has
in
cash,
and
if
there
is
nothing
in
cash,
it
will
just
check
that
it
can
load
it
from
desk
and
it'll.
Tell
you
if
everything
is
ok
if
what's
old
desk
seems
to
be
healthy.
Secondly,
there
is
a
recursive
flag
for
that
which
will
do
the
same
thing.
We
will
go
all
the
way
down
into
a
directory.
B
Thirdly,
you
can
pass
them
repair
flag
to
it,
which
will
go
through
this
procedure,
but
in
the
process,
if
it
finds
any
statistics
that
it
doesn't
like
it'll
rewrite
them.
That's
the
online
repair
capability
that
I
mentioned
a
second
ago
and
fourthly,
there
is
c'mon
called
tagged
path
and
what
that
does
is
similar
to
a
scrub,
but
instead
of
going
through
and
just
checking
all
the
metadata.
It
also
goes
an
update
to
the
data
pool
where
the
file
content
is
stored
and
tags.
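Around the Jewel release these were admin-socket commands on the MDS; roughly (the daemon ID, paths, and tag string are placeholders):

    # Check a single path against what is stored in RADOS
    ceph daemon mds.<id> scrub_path /some/directory

    # Recursively scrub a subtree and repair bad recursive statistics
    ceph daemon mds.<id> scrub_path /some/directory recursive repair

    # Tag the data objects under a path (used later to narrow a data scan)
    ceph daemon mds.<id> tag path /some/directory mytag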
B
So
the
scrub
commands
will
give
you
a
message
when
they
complete
they'll,
actually
give
you
a
little
JSON
structure
that
will
tell
you
about
the
what,
if
any
issue
they
found
and
they'll
also
admit,
cluster
love
messages.
So,
if
you're
not
around
to
see
the
result
of
the
command,
if
you
have
a
long
run,
command,
you'll
also
be
able
to
see
in
the
class
to
log
if
they
found
many
issues
so
that
process
of
iterating
over
the
metadata
is
what
we
call
forward
scrub.
B
The
backward
scrub
is
where
we
iterate
over
all
the
data
objects,
so
that
number
of
objects
is
of
the
order.
The
amount
of
data
in
your
system's,
so
your
data
files
are
stored,
chunked
into
four
megabytes
objects
and,
depending
on
how
many
files
and
how
big
they
are,
that
will
influence
how
many
data
objects
you
have
so
you're
iterating
over
all
of
them,
which
is
hopefully
you
can
see
that
that
means
this
is
an
unusual
thing
to
do.
You
don't
want
to
do
this
continuously.
B
It's
not
absolutely
fast
thing
to
do,
and
you
really
would
only
do
it
in
a
disaster
to
find
any
orphaned
files
or
recover
any
invalid
file.
Size
information
by
searchingly
objects
is
an
exhaustive
search
of
the
data
objects
in
order
to
mitigate
the
fact
that
it's
an
exhaustive
search
and
we've
added
the
ability
to
run
the
workers
in
parallel
and
I'll
show
you
what
that
looks
like
on
the
next
slide,
so
the
backwards
grow
up
process
is
done
using
the
seven
test,
data
scan
tool
and
it's
a
two-step
thing.
First,
the
scan
extends
command.
B
In general, we can recover file names because CephFS stores something called a backtrace on data objects. That's not guaranteed to be there, and if it isn't, then we will inject lost files into a lost+found directory — just the same as, if you ran fsck on an ext4 filesystem, you would potentially see things getting linked into the lost+found folder. We have that concept in CephFS as well. To make this more efficient and avoid iterating over absolutely every object, you can run the "tag path" command that I showed on the previous slide.
B
We don't currently have a very user-friendly way of running this: if you want to run a large number of workers in parallel, it's up to you to orchestrate running them across a collection of clients — you would probably want to write a shell script of some kind to do this. Each individual worker, though, is fairly simple to invoke: there's just a --worker_n and a --worker_m argument.
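A rough sketch of the two passes, with the work split across four workers (the pool name and worker counts are examples; scan_inodes is the second pass of the tool):

    # Pass 1: scan the data objects (run one of these per worker, e.g. in parallel shells)
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
    # ... workers 2 and 3 likewise

    # Pass 2: scan inodes and re-link recovered files into the metadata
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 cephfs_data
    # ... repeat for the remaining workers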
B
Just to reiterate: this is disaster recovery functionality. Don't go running all these repair commands just in case they fix something, because it is possible for them to make things worse as well as better. They are invasive things, designed to operate on an offline cluster. There's also some future work to be done here.
B
The
forward
scrub
functionality
needs
to
be
more
multi,
MDS
aware
so
currently
we're
sending
these
commands
directly
to
a
single
MDS
and
operating
on
whatever
happens
to
be
in
the
caption
pattern
book
were
in
the
metadata
they've
allocated
to
that
mvs
at
one
time
it
doesn't
handle
distributing
this
across
most
or
demons
in
the
multi-layer
situation.
I'm
also
not
currently
running
this
in
the
background,
opportunistically
the
way
that
we
do
with
radar
scrubs.
So
at
some
point
it
would
be
nice
to
extend
this
to
share
your
filesystem
scrubs
to
happen.
B
So
the
next
thing
I
want
to
talk
about
is
improvements
to
authorization
in
service
the
clients
in
a
guinness
ffs
file
system.
The
servers
that
are
mounting
it
need
to
talk
to
the
South
monitors.
The
South
OST
is
to
store
their
data
and
the
ceph
nds
demons
to
do
their
metadata
operations.
We
already
have.
The
ability
to
limit
hell
is
DS
how
clients
could
talk
to
is
DS,
so
we
can
tell
them.
B
They
could
only
talk
to
a
particular
data
pool
and
then
we
could
create
layouts
in
set
of
s
four
files
and
directories
that
puts
the
data
in
particular
directories
and
particular
data
tools,
so
that
you
can
have
some
level
of
separation
to
what
clients
can
see
in
which
clients
could
touch
each
other's
data.
Historically,
you
didn't
have
any
finer
grained
control
over
which
parts
of
the
metadata
the
clients
could
see.
So
that's
been
fixed.
B
What that looks like is shown here in this example. The typical use case for this is that we have a client and we want that client to only be able to see data within a particular pool and only be able to see metadata within a particular directory. So if we have a directory "foo" and we have a pool "foo_pool", and we've linked those two up by setting a CephFS layout — the file system layout on /foo points at foo_pool —
B
We
can
then
craft
one
of
these
authorization
caps
for
a
client
that
tells
it
NBS
allow
audibly
path,
equals
Buddha
to
limit
it
to
that,
and
the
existing
ability
is
still
add
to
restrict,
which
is
DZ
control
to
as
well.
And
then
the
new
part
is
what's
involved
that
once
you
have
a
client
that
has
capabilities
like
this,
it
needs
to
be
started
in
a
certain
way
in
order
to
work
so
because
it
can
no
longer
see
the
root
of
the
file
system.
It
can
only
see
this
directory.
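A sketch of such a cap (the client name, path, and pool name are placeholders; the exact cap syntax is per the CephFS documentation for your release):

    # A client restricted to /foo on the MDS side and to foo_pool on the OSD side
    ceph auth get-or-create client.foo \
        mon 'allow r' \
        mds 'allow rw path=/foo' \
        osd 'allow rw pool=foo_pool'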
B
You need to pass the -r flag, which tells the client which directory to treat as its root, and it needs to have a root that it is actually permitted to read. Once you've gone through this setup, you effectively have a client which can only access one part of your filesystem, and as far as that client is concerned, that is the root of the file system.
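For example, with the FUSE client (the client ID, path, and mount point are placeholders):

    # Mount only the /foo subtree, treating it as this client's root
    ceph-fuse --id foo -r /foo /mnt/foo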
B
Next,
up
on
a
similar
theme,
I
want
to
talk
about
improvements
to
finally
outs
so
that,
in
addition
to
having
a
file
or
directory
layout
that
points
files
to
a
particular
pool,
we're
going
to
be
able
to
point
files
into
a
particular
greatest
namespace.
So
you
may
not
be
familiar
with
rados
namespaces.
They
are
a
cheaper
way
than
pools
to
subdivide
your
cluster,
so
pools
involve
creating
pgs,
pgs
consume,
CPU
and
RAM,
and
what
we
want
is
a
logical
separation
between
one
set
of
objects
and
another
set
of
objects.
B
We
would
like
a
lighter
weight
to
way
of
doing
that
and
that's
something
that's
existed
in
raid
us
for
quite
some
time.
It's
called
the
namespace
and
they're
implemented
effectively
as
just
a
prefix
to
object
names,
so
they
create
a
different
logical
namespace,
as
opposed
to
a
pool
which
is
we're
going
to
physically
store
the
data
and
handle
it
separately.
B
So
there
is
an
existing
ability
to
write,
OSD
or
caps
that
limit
you
to
a
namespace.
So
if
we
could
write
our
files
to
different
namespaces
for
different
set
of
clients,
then
this
would
be
a
good
way
of
providing
security
that
preventive
to
clients
from
seeing
each
other's
files
without
the
overhead
of
creating
whole
different
pools.
B
So
that's
what
has
been
done?
The
existing
layout
fields,
you
have
the
ability
to
center
pool
and
then
the
ability
to
configure
how
the
object
was
going
to
be
striped
with
striking
a
strike.
Count
object
size.
So,
there's
now
just
an
additional
feel:
they're
called
pool
namespace,
there's
a
caveat
here,
which
is
that
on
the
client
side
or
the
information
for
a
file
that
it
needs
to
access,
the
data
itself
is
going
to
go
into
that
namespace.
B
but we do also store these backtraces — which are an implementation detail of CephFS, written from the MDS for each file — and, as was the case with customizing the pool that's used for a file, we will still write the backtraces from the MDS to the default pool and default namespace. I don't really have time to go into why that is, but it's something to be aware of.
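A sketch of setting that field through the layout virtual xattrs (the paths and namespace name are placeholders):

    # New files created under this directory go into RADOS namespace "ns1"
    setfattr -n ceph.dir.layout.pool_namespace -v ns1 /mnt/cephfs/somedir

    # Or set it directly on an empty file
    setfattr -n ceph.file.layout.pool_namespace -v ns1 /mnt/cephfs/somefile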
B
Those two features are actually somewhat motivated by what I'm going to talk about now, which is OpenStack Manila. OpenStack is the open source cloud framework for building private or public clouds. It has a number of different services that make it up — Nova, the compute service; Cinder, the block storage service; Neutron, the networking service — and one of the services that was added a little bit more recently than the better-known ones is Manila, which is a service for provisioning and accessing shared file systems
B
As
part
of
your
cloud,
Manila
provides
Model,
T
users,
where
it
allows
them
to
request
a
piece
of
file
system
storage
that
it
calls
a
share
and
Manila
has
a
plug-in
framework
that
enables
you
to
write
drivers
for
it.
So,
for
example,
there
is
a
HDFS
driver.
There
is
a
cluster,
a
fast
driver
and
there
are
drivers
for
integrating
this
with
various
proprietary
storage
appliances
as
well,
so
we've
gone
ahead
and
written
a
second
that's
driving
for
this
and
in
sep
of
us.
B
Similarly to what I was talking about with the MDS auth caps that limit access by path, the -r flag is needed by the client. So when we've created a share for a user, we give them a path that is something like /manila/ followed by an alphanumeric ID, and they need to pass that into the -r flag of their FUSE client in order to be able to mount that share.
B
We
take
advantage
of
the
recursive
statistics
feature
of
service
in
order
to
get
the
capacity
statistics.
So
when
we're
reporting
to
Manila
that
a
given
share
is
using
a
certain
number
of
lights
because
it
share
is
just
a
directory
when
we're
getting
that
data
from
is
the
our
stats
within
set
of
s
and
using
these
new
or
caps,
we
were
able
to
make
sure
that
clients
only
have
permission
to
access
the
particular
directory,
which
corresponds
to
a
share
that
they
have
been
explicitly
granted
access
to.
B
That's
very
important,
because
an
openstack
cloud
is
a
multi-talented
environment.
So
vanilla
really
represents
a
use
case.
That
brings
together
a
number
of
these
different
features
that
we
been
working
on
recently,
as
well
as
a
number
of
useful
features
that
have
existed
in
south
of
us
for
some
time
and
it
wraps
them
up
in
a
way
that
makes
the
whole
thing
a
lot
more
accessible
to
users.
So
they
don't
have
to
type
those
long
step
or
get
or
create
commands
anymore.
B
The next topic is multiple file systems: historically you could only have one CephFS file system in a Ceph cluster. There is no actual fundamental reason for that — it's sort of a convenience-of-implementation thing — but there are good reasons you might want to have multiple file systems, which means having multiple MDS clusters that are all sharing one RADOS cluster. So if you want separate file systems, rather than having to have two Ceph clusters, you can now back it all onto the same RADOS cluster. You might want to do this — have multiple file systems — if you want to physically isolate some workloads.
B
So
if
you
want
to
make
sure
for
security
or
quality
of
service
reasons
that
two
different
workloads,
maybe
one
is
very
mission
critical
and
what
is
experimental
are
just
going
to
go
through
physically
separate
MDS
service.
You'll
be
able
to
do
that.
There's
also
a
disaster
recovery
use
case,
which
is
quite
remote
avator
for
us
here
that
currently,
if
you
were
going
through
some
of
the
repair
procedures,
that
I
was
talking
about
earlier
you're
kind
of
trying
to
do
a
lot
of
stuff
in
place
and
that's
kind
of
a
scary,
uncomfortable
thing
to
do
so.
B
So
if
you're
worried
about
hitting
issues,
whether
they're
bugs
or
performance
issues
or
stability
issues
in
set
of
s,
then
it's
a
nice
way
of
protecting
yourself.
If
you
can
say
well,
I've
got
a
stable
workload.
That's
working
really!
Well
I'm
going
to
keep
that
running
on
this
MDS
in
this
file
system
and
then
what
I
want
to
try
Ewing
something
different
I'm,
going
to
run
that
on
a
different
MDS
with
a
different
file
system,
so
that
one
thing
doesn't
interfere
with
the
other,
so
the
FS
new
command.
B
That
I
mentioned
at
the
beginning,
that's
something
that
was
added
a
little
while
ago,
the
FS
new
FS,
LS
and
SRM
commands,
which
very
much
suggest
you
should
be
able
to
have
more
than
one
but
his.
We
would
give
you
an
error.
If
you
try
to
create
a
second
one.
Well
now
you
can
run
it
all
the
ones,
so
you
run
FS
new
second
time
you
give
it
a
different
couple
of
coolness
to
use
and
you'll
get
a
second
file
system.
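A sketch of that (pool and filesystem names and PG counts are examples; the opt-in flag is the one described just below):

    # Acknowledge the caveats and allow multiple filesystems
    ceph fs flag set enable_multiple true --yes-i-really-mean-it

    # Create a second filesystem from its own pair of pools
    ceph osd pool create cephfs2_data 64
    ceph osd pool create cephfs2_metadata 64
    ceph fs new cephfs2 cephfs2_metadata cephfs2_data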
B
Hopefully
before
doing
that,
you
have
created
enough
MDS
demons
to
actually
operate
the
new
file
system.
So
if
you
only
had
one
and
the
sd1
and
you
had
one
file
system,
then
you
create
second
file
system.
Well,
the
second
file
systems
not
gonna,
be
able
to
come
up.
Yet
in
this
initial
implementation,
the
MDS
demons
are
all
treated
equally.
B
We
have
a
flag
on
the
cluster
that
you
have
to
explicitly
set
to
indicate
that
you
are
aware
of
the
caveats
and
that
you're
not
going
to
get
angry
with
us
if
something
goes
wrong
for
legacy
clients
which
don't
support,
explicitly
selecting
which
file
system
you'd
like
to
connect
to,
we
have
an
ability
to
configure
which
the
default
file
system
file
system
should
be.
So
if
you
create
three
file
systems-
and
you
want
the
second
of
those
two
either
one
that
old
clients
will
get
when
they
try
and
connect
to
the
system
you
can.
B
You
can
do
that,
whereas,
if
you're
using
the
new
client-
or
I
should
say
the
latest
version
of
the
fuse
client,
you
can
pass
an
option
on
the
command
line
to
say
which
file
system
you
want
to
use.
If
you
omit
that
option
and
then
like
a
legacy
client,
you
will
get
the
default
file
system.
So,
if
you're
not
actively
trying
to
use
this,
if
you
just
have
one
file
system,
everything
will
just
still
work
the
same
on
the
client
side
and
on
the
server
side.
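With the Jewel-era FUSE client, that selection was made with a client option, roughly like this (the filesystem name and mount point are placeholders):

    # Mount a specific filesystem rather than the default one
    ceph-fuse --client_mds_namespace=cephfs2 /mnt/cephfs2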
B
There is a bunch of follow-on work to improve this. Much like with data storage, it would be nice to be able to use RADOS namespaces for metadata storage as well, so that if I'm creating a second file system I don't have to needlessly create a second metadata pool — I could just use a different namespace within my existing metadata pool. There's also a bunch of authorization work needed here to make it suitable for a lot of use cases, especially multi-tenant use cases.
B
So
currently,
there
is
nothing
to
limit
our
client
to
connecting
to
particular
file
systems,
so
any
client
can
connect
to
any
file
system
and
any
MDS
can
act
as
a
server
fatty
filesystem.
The
file
system
functions
stuff
doesn't
exist
in
the
kernel,
client
yeah.
So
currently
the
ability
to
pick
on
the
client
by
client
basis,
which
file
system
you
get
is
limited
user
space,
client
and
as
I
was
mentioning.
The
M
de
esas
are
all
considered
equal
and
you
can't
currently
set
up
any
clever
policies
that
say
this
MDS
is
for
that
file.
B
So I'm going to wrap up with some tips for anyone who is thinking of becoming, or already is, an early adopter of CephFS.
B
This
is
a
stock
slide
that
I
put
in
a
number
of
presentations
so
come
to
the
mailing
list,
come
and
look
at
the
issue.
Tracker
look
at
the
online
reference
for
how
to
configure
lobbying
and
debugging,
and
so
you
can
get
more
information
about
any
issues
you're
having
and
then,
when
you
all,
are
having
an
issue.
B
Please
consider
installing
the
most
recent
development
release
or
if
you
use
in
the
kernel,
play
a
more
recent
Colonel,
because
ffs
is
very
actively
developed
and
there's
a
lot
of
difference
between
the
code
from
six
months
ago
in
the
code
for
today,
if
you're
getting
in
touch
with
us,
please
let
us
not
as
much
detail
as
you
can.
How
are
your
MDS
is
configured?
How
many
of
them
have
you
got?
Which
client
are
you
using?
Is
it
the
colonel?
Is
it
users
base
and
so
on?
B
What
are
you
doing
with
the
file
system
so
there's
a
big
difference
for
a
distributed
file
system
between
a
workload
week
or
just
a
single
client
versus
where
you've
got
multiple
whites,
accessing
the
same
files
and
and
so
on.
So
really
just
as
much
information
as
you
can
gather
using
the
tools
and
that
as
much
information
as
you
can
give
us
when
you're
reporting
issues
is
really
helpful.
B
A
B
Okay, yes, I should have mentioned that. So the Jewel release of Ceph, which is the release happening this spring, is going to be the first one where we're going to start calling the upstream open source release of CephFS stable, and we are going to be encouraging people to start evaluating it and start testing it. As for officially supported releases of products based on this code — whether by Red Hat or any other vendors — that's TBA, but yeah, Jewel is the first stable release of CephFS. You also asked about snapshots, and snapshots are not, I
B
Think
currently
we're
not
including
that
in
all
sort
of
statement,
instability
for
jewel,
but
there
has
been
recent
work
on
stabilizing
slap
shots
so
what
they
they're
coming
along
as
well
so
I'm
being
vague
about
everything
apart
from
jewel
and
this
the
jewel
is
going
to
be
jewel,
is
going
to
be
the
upstream
release
of
CFS.
The
folks
should
start
testing
and
evaluating
and
complaining
to
us
about
once
with.
A
Now
there
are
to
expand
on
that
a
little
bit.
Typically,
what
you
see
when
an
open
source
when
an
open
source
version
of
stuff
is
released
and
is
stable
and
whatnot.
Typically,
it's
about
a
six-month
window
to
when
the
Red
Hat
supported
packages
come
out
that
include
those
things
you
know
modulo
any
no
problems
concerns
things
that
we
want
to
expand
on
before
putting
it
in
the
you
know:
Red
Hat's,
F
storage
product,
that
kind
of
thing,
so
the
commercially
supportable
is
typically
about
six
months
behind.
A
B
B
So there's two questions you might be asking when you ask about the relation of files to objects. You might be asking what's the relationship between files in CephFS and objects in RGW, and the answer is that there is no relation — they're separate. They're both in the same RADOS cluster, but a thing that is a file and a thing that is logically an object cannot see each other.
B
As
for
the,
the
other
interpretation
of
the
question,
which
is
how
are
the
files
and
seth
has
mapped
to
objects
in
rados,
they
are
striped
in
the
same
way
that
they
are
in
IBD
or
a
GW.
So
you
can
configure
that
with
the
settings
on
file
layout,
but
the
default
which
most
people
I
think
go
ahead
and
use
is
to
simply
stripe
chunk
objects
into
four
megabytes
hunks.
So
if
you
write
a
for
Megan
by
objects,
that's
going
to
be
a
sorry.
B
If
you
write
a
four
megabyte
file,
that's
going
to
be
more
object
and
if
you
were
an
a
leg
by
far,
let's
go
be
two
objects.
There
is
a
little
bit
of
our
head.
That
goes
with
that.
In
the
cases
where
you're,
storing
files
in
non-default
pools
or
namespaces,
you
will
get
the
data
objects,
the
file
which
will
follow
those
rules
about
for
matin
by
chunking,
and
then
you
will
get
an
additional,
very
small
object
for
each
file
as
well,
which
tracks
some
some
other
metadata
about
it.
B
You
also
ask
if
active,
active
MTS's
will
be
supported,
so
the
short
answer
is
no
in
the
jewel
release
where
all
of
the
work
we've
done
on
stabilization
and
repair,
and
so
on
has
focused
on
the
single
active
MVS
case.
So
the
case
we're
encouraging
people
to
evaluate
is,
and
one
active,
MDS
and
then
a
standby
or
a
stamp
I
replay,
and
yes,
so
standby
replay
is,
in
general,
pretty
good
idea
to
get
a
fast
failover.
That's
the
mode
where
the
standpipe
continuously
replays,
the
Journal
of
the
active
India's.
B
However,
having
multiple
active
I'm
ds's,
the
code
is
all
in
there.
You
know
you
can
install
it
and
enable
it
and
you
will
in
general,
find
that
it
works.
But
that
is
not
something
we've
focused
on
stabilizing.
So
if
you're,
if
you're
putting
things
in
a
more
production,
ready
and
less
production-ready
bucket,
then
active
active
goes
and
be
less
prepped
and
ready
bucket.
We
have
a
question:
what
about
quota
in
the
kernel
client?
Is
it
ready?
A
Kernel client quotas, yeah.
B
How is active/passive HA among the MDS nodes handled — is the standby MDS hot, or does failover need to be handled externally by something like Pacemaker?
It is hot, so there is no Pacemaker in this. The way this works is that you start as many MDS daemons as you like; they all communicate with the Ceph monitor cluster and essentially register themselves in a list of eligible standbys, and then the Ceph monitor cluster looks at your configuration.
B
So none of that requires any external input, unless you want to preempt it to get past a timeout: by default, if an MDS just falls off the network, we will wait 30 seconds for it before promoting another daemon in its place. You can preempt that using the "ceph mds fail" command, but aside from that, it's an entirely autonomous thing.
B
And
you
can
also
set
a
flag
which
causes
that
and
standby
demon
to
continuously
replay
the
metadata
log.
The
metadata
journal
from
the
guy
who
it
is
the
potential
replacement
for
which
means
that,
at
the
point
that
it
is
asked
to
take
on
that
role
as
the
replacement,
it
already
has
the
metadata
in
cash
and
it
doesn't
have
to
reload
or
the
metadata
Reyes.
That
gets
you
up
faster
failover
at
the
cost
of
some
extra
read
iOS
four,
following
the
journal
and
at
the
cost
of
being
a
slightly
more
handcrafted
configuration.
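A sketch of that flag in ceph.conf for a standby daemon (the section and daemon names are placeholders; option names as in the Jewel-era documentation):

    [mds.b]
        # Continuously follow the active MDS's journal so failover is fast
        mds_standby_replay = true
        # Optionally pin this standby to a specific active MDS by name
        mds_standby_for_name = a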
B
So
in
general,
what
makes
a
good
monster
that
doesn't
necessarily
hardware
wise,
doesn't
necessarily
make
a
good
MDS.
Aside
from
that,
you
just
have
all
the
general
issues
associated
with
putting
mixing
different
demons
on
the
same
server.
B
So
if
we
often
get
asked,
can
we
run
ones
and
oh
SDS
on
same
server
and
so
there's
a
similar
kind
of
set
of
issues
that
you
need
to
work
through
if
you're
thinking
about
doing
that,
but
in
future
we're
looking
at
things
like
cgroups
configurations
that
will
enable
people
to
run
montuno
SDS
in
the
same
place
and
that
same
kind
of
work
will
at
some
point
also
apply
to
willebrand
MDS
demons.
In
the
same
places
month,
so
the
answer
isn't
know
you
can
run
them
on
the
same
servers.
B
It's
not
necessarily
a
bad
idea,
but
you
have
to
really
think
through
that
configuration
Brian
asks.
How
can
you
scale
setup
has
to
meet
demands
of
a
growing
open,
star
cluster
if
you're
using
it
for
Manila
so
essentially
in
the
same
way
as
we
would
for
any
other
large
workload.
So
step
of
us
ten
years
ago,
long
before
I
worked
on
it.
B
The
the
surface
architecture
was
originally
designed
to
deal
with
HPC
clusters,
which
would
have
very
large
numbers
of
clients,
and
so
in
Manila,
where
we
potentially
also
have
very
large
numbers
of
clients
talking
to
lots
of
different
directories
directing
as
manila
shares.
It's
the
it's
a
similar
use
case
to
when
it
comes
to
scaling,
and
so
it's
the
same
answers
it
scales
by
increasing
number
of
MDS
is
in
a
cluster,
although
currently
that
multiple
active
MBS
functionality
isn't
stabilized
the
other
aspect
to
scaling
with
Manila
is.
B
We
might
also
look
at
integrating
the
multi
file
system
functionality
with
Manila
so
that
in
cases
where
people
knew
that
they
had,
for
example,
one
tenant
was
going
to
use
this
set
of
file
systems
and
another
tenant
was
going
to
use
another
set
of
file
systems.
You
might
consider
crafting
a
configuration
where
you
had
an
MDS
for
this
tenants
file
systems
and
MDS
for
this
other
tenants
file
systems
rather
than
having
a
completely
elastic
cluster
mdss.
B
So
it's
a
mixture
of
the
elastic
scaling
with
the
size
of
the
mps
cluster
and
then
maybe
a
little
bit
of
manually
configured
magic
for
deploying
more
individual
demons
that
mine
operators,
paulo
separate
clusters,
but
the
middle
of
stuff
is
really
very
new
at
the
moment.
So
you'll
be
interesting
to
see
how
that
pans
out.
A
Alright, it looks like we don't have any more questions. That was great — thank you, John, very much. Just a reminder to everybody here: this was recorded and will be posted up on the YouTube channel, which is also going to be linked from the Ceph Tech Talks site on ceph.com. So if you want to replay or revisit this, it will be there. Otherwise, stay tuned for our next one, which is on the 24th of March — same time, same bat channel. So we'll hopefully see you all back then.