From YouTube: Ceph Month 2021: CephFS update
Presented by: Patrick Donnelly
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
So, to begin, we'll again be going through the various features we had in Pacific. You'll notice that many of these slides are based off of what Sage already presented, with some modifications; I'll try to go into more depth during this talk, and feel free to stop me at any time to ask any questions you might have.
A
I've also included in the slides, which will be distributed later, the blog post I wrote up, which also covers a lot of these discussion topics in somewhat more detail.
So, the project has been following five themes revolving around the features we've been developing for the last few releases, and we'll begin with usability.
A
So the highlight feature for Pacific is that multi-file-system support is now stable, and it's fairly simple now to create file systems. No longer do you need to do a bunch of pool creation, then a `fs new`, and then spin up MDSs; in Pacific this is now largely automated for you.
A
We have this `fs volume` interface that allows you to create a file system quite easily, including all of its pools, following the best recommendations for CephFS, and automatically deploy MDSs using the deployment backend for Ceph, either cephadm or Rook. You can bring up as many file systems as you need and then also remove them as needed.

Another usability feature we've developed in that same vein is the MDS autoscaler, which will start and stop MDSs based off of the needs of the file systems you have.
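As a sketch of that interface (the file system name `myfs` is a placeholder; consult your release's docs for exact flags):

```shell
# Create a file system: data/metadata pools are created with recommended
# settings and MDS daemons are scheduled via cephadm or Rook.
ceph fs volume create myfs

# List the file systems managed this way.
ceph fs volume ls

# Tear it down again, including its pools and MDSs.
ceph fs volume rm myfs --yes-i-really-mean-it
```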
A
This is a module that's not enabled by default; you have to turn it on, but once on it should create MDSs using the deployment tool, either cephadm or Rook, in response to changes in the configuration of your CephFS file system. For example, if you increase `max_mds`, the module will automatically deploy another MDS in response to that change.
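A minimal sketch of turning that on (the module name is `mds_autoscaler`; `myfs` is a placeholder):

```shell
# Enable the autoscaler module (off by default).
ceph mgr module enable mds_autoscaler

# Raising max_mds now causes the orchestrator to deploy another MDS;
# lowering it stops the surplus daemons.
ceph fs set myfs max_mds 2
```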
A
We've also developed a new tool, currently of developer-preview quality, called `cephfs-top`, which lets administrators monitor the usage of the Ceph file system with details that were previously not trivial to gather. In particular, clients now communicate certain information to the MDS, which we call metric gathering, concerning performance statistics that the MDS would not otherwise be aware of: for example, how effective the client caches are and how much read and write I/O they're putting on the Ceph cluster.
A
This is communicated to the MDS, and the MDS is able to communicate with the Ceph manager to provide a summary of statistics for all the clients in the cluster and display a simple ncurses UI for the administrator, showing the various client sessions in the Ceph file system, what they're doing, and who the top consumers are. Again, this is a tech preview, but we are always spending time to improve it, and we are eager for any feedback that the community has.
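To try it, something like the following should work on Pacific (the `stats` manager module feeds cephfs-top, and `client.fstop` is the user the tool looks for by default):

```shell
# Enable the manager module that aggregates client metrics.
ceph mgr module enable stats

# Create the read-only user cephfs-top uses by default.
ceph auth get-or-create client.fstop \
    mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r'

# Launch the ncurses UI.
cephfs-top
```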
A
In that same vein, we've also been working on `cephfs-shell`. To my knowledge, we haven't had too much feedback from the community; this has been around for about two or three releases now. It's a simple Python utility that mounts CephFS and allows you to execute some commands on the file system without having to mount it using FUSE or the kernel.
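For example (invocation details may vary by release):

```shell
# Interactive session against the cluster in /etc/ceph/ceph.conf.
cephfs-shell

# One-off commands without entering the shell.
cephfs-shell -c /etc/ceph/ceph.conf "mkdir /dir1"
cephfs-shell "put localfile.txt /dir1/localfile.txt"
cephfs-shell "ls /dir1"
```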
A
We also have a new snap-schedule manager module, which also must be enabled, that allows you to schedule snapshots on a Ceph file system at a given period and also retain snapshots according to a retention schedule. That lets you, for example, set up snapshots to be taken every day for a week and then delete any snapshots that are older than that time frame.
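The daily-for-a-week example looks roughly like this (paths are relative to the file system root):

```shell
# The module must be enabled first.
ceph mgr module enable snap_schedule

# Snapshot the root once a day...
ceph fs snap-schedule add / 1d

# ...and retain 7 daily snapshots.
ceph fs snap-schedule retention add / d 7

# Inspect the schedule.
ceph fs snap-schedule status /
```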
A
This is a module that's designed to be used in tandem with the new cephfs-mirror project that we'll talk about later.
A
There is an `nfs` manager module that allows you to create NFS clusters to export CephFS and to set up a group of exports to be served by that NFS cluster.
A
You can set up the NFS clusters in active-active configurations, and in the near future you'll even be able to set up HA using cephadm with haproxy; that's work that Sage Weil is doing right now. You can also use Rook for the HA component, although the Rook CRD is under active development right now, so that's not quite ready yet. The NFS clusters can be automatically deployed using the Ceph orchestrator.
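A sketch using the Pacific-era syntax (cluster name, file system name, and placement are placeholders; the export syntax changed in later releases):

```shell
# Deploy a two-daemon NFS-Ganesha cluster via the orchestrator.
ceph nfs cluster create cephfs nfs1 "2 host1,host2"

# Export the file system "myfs" under the pseudo-path /cephfs.
ceph nfs export create cephfs myfs nfs1 /cephfs
```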
A
We've also been working on adding encryption support to the kernel client, and there have been a few changes that have also been necessary in the MDS.
A
Some of these changes are being adjusted constantly, and the MDS side is not quite ready because the kernel client is still under active development, but we have several changes already on the MDS side for Pacific, and we plan to backport the remaining changes necessary to get complete encryption support ready in the MDS, so that the kernel client may mount CephFS subvolumes and have them be encrypted.
A
All right, I think we might have a question. Yes, we're doing an fscrypt deep-dive talk tomorrow at 9:30 a.m. Eastern time, if you want to learn more about that. Coming back to robustness: we now have feature-bit support for turning on and off required file system features.
A
This replaces the previous behavior of setting a minimum client release, which has historically not been a great design, especially when the kernel client may only selectively backport certain features to bring it to parity with a Ceph release.
A
So a more robust solution is to selectively enable and disable the features that you want; the API for that is all set up and documented in the CephFS docs, if you want to learn more.
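As a sketch (`myfs` and the feature name are placeholders; `ceph fs feature ls` lists the valid names):

```shell
# List known client features and their bit numbers.
ceph fs feature ls

# Require all clients of myfs to support a feature; clients lacking it
# are rejected or evicted.
ceph fs required_client_features myfs add metric_collect

# Drop the requirement again.
ceph fs required_client_features myfs rm metric_collect
```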
We've also stabilized multi-MDS file system scrub. In the past, before Pacific, you would have to bring your CephFS file system down to `max_mds` equals one, reducing the number of ranks to one, before you could run a scrub.
A
This was due to bugs that we knew existed, producing false scrub errors or incorrect scrubbing when executing a scrub with multiple ranks.
A
We did a significant amount of development to improve that, and we now support having multiple actives: you no longer need to change `max_mds` in order to execute a scrub.
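So with Pacific a scrub can simply be started against rank 0 while other ranks stay active, roughly:

```shell
# Recursive scrub of the whole tree; no need to set max_mds to 1 first.
ceph tell mds.myfs:0 scrub start / recursive

# Check on progress.
ceph tell mds.myfs:0 scrub status
```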
In the kernel client, we now have support for messenger v2. Thanks to efforts by Ilya, you can just specify a kernel mount option to turn that on. Also in the kernel client, we have support for recovering mounts from blocklisting, so you no longer need to remount your kernel client to bring it back after it becomes blocklisted by the MDS; that can be turned on with `recover_session=clean`.
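Both options go on the kernel mount, for example (user and path are placeholders; `ms_mode` also accepts `secure`, `prefer-crc`, and so on):

```shell
# Mount with messenger v2 (crc mode) and automatic recovery from
# blocklisting instead of a forced remount.
mount -t ceph :/ /mnt/cephfs \
    -o name=fsuser,ms_mode=crc,recover_session=clean
```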
A
This doesn't actually allow workloads that were doing writes to continue, because that's generally a non-recoverable error.
A
If an application writing to the mount tries to come back, it won't, but applications that had read handles on files will continue to function, and new applications running on the mount can do reads or writes. ceph-fuse has a similar behavior with the `client_reconnect_stale` configuration; note that if you want to do this, you should disable the page cache.
A
Finally, as far as testing CephFS is concerned, we've dedicated a lot of effort to cleaning up our test infrastructure and making sure we're testing client and MDS configurations across different types of tests in more consistent ways. We've doubled the number of tests for CephFS as a result, from about 2,500 jobs to 5,000, so we're now testing a lot more upstream and gaining a lot more confidence in the stability of the system.
A
Moving on to performance: another big feature, which was partially available in Octopus as a development preview, is ephemeral pinning; this is now stable. This is policy-based subtree pinning. Pinning subtrees has been around since, I believe, Luminous, allowing you to assign a given directory subtree to a particular MDS rank. Ephemeral pinning lets you set a policy saying how you want a directory or a group of directories to be pinned, without actually assigning the pinned directory to a particular rank.
A
It allows the cluster to intelligently distribute the ephemerally pinned directories and also rebalance them if the number of MDSs in the cluster changes. There are two different kinds of ephemeral pins: distributed pins, which automatically shard subdirectories (think of a home directory), and random pins, which give a probabilistic chance that a directory loaded into the cache from the metadata pool, or a newly created directory, is ephemerally pinned.
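The policies are set with virtual extended attributes on a mounted file system, for example:

```shell
# Distributed pin: shard the children of /home across all active ranks.
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home

# Random pin: each descendant directory has a 0.1% chance of becoming
# an independently pinned subtree.
setfattr -n ceph.dir.pin.random -v 0.001 /mnt/cephfs/scratch
```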
A
This is a pretty exciting change for us, and we believe it could have some very attractive performance aspects for a lot of workloads; we're hopeful we'll get some good feedback on this.
A
Next, we've also spent a lot of time improving the capability and cache management in the MDS for larger clusters. This has been an issue we've been dealing with for several releases now, and we believe the current state of CephFS is much better and a lot more stable than it used to be. The most recent changes we've made improve the cap-recall defaults for larger production clusters, in response to some feedback.
A
That feedback came from Dan van der Ster of CERN. We also improved the capability-acquisition throttling for some client workloads. What that means is that certain workloads, like the `find` command executed on CephFS, could acquire huge numbers of capabilities (caps) via readdir calls; that would cause the mount they were executing on to get way more caps than it should, and the MDS, due to its own throttles, would not recall them fast enough.
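The relevant knobs can be tuned through the central config store; these are illustrative values rather than recommendations (Pacific already ships improved defaults):

```shell
# Let the MDS recall more caps per client, faster.
ceph config set mds mds_recall_max_caps 30000
ceph config set mds mds_recall_max_decay_rate 1.5

# Throttle how many caps a single session may acquire before readdir
# replies are delayed.
ceph config set mds mds_session_cap_acquisition_throttle 500000
```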
A
Many of these changes are in the kernel client, which is currently the only client that does this, and have already been backported to some of the more stable kernels; in particular, RHEL 8.4 has this. This feature needs to be turned on with the `nowsync` flag, which allows the kernel client to perform directory operations asynchronously.
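Opting in is just a mount flag (only on kernels that carry the feature):

```shell
# Allow asynchronous creates/unlinks on this mount.
mount -t ceph :/ /mnt/cephfs -o name=fsuser,nowsync
```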
A
Moving on to multi-site: as I spoke of earlier with the snap-schedule module, we also have snapshot-based mirroring through the cephfs-mirror tool. This allows you to configure replication targets, remote Ceph clusters to be mirrored to, configured for any directory.
A
The cephfs-mirror daemon is an analog of the rbd-mirror daemon; it is used to push data from the locally snapshotted CephFS file system to another CephFS file system located on the remote cluster.
A
The feature is snapshot-based, so you need to have a snapshot of the file system before it'll actually sync anything to the remote cluster.
A
We have an initial implementation in Pacific that we're doing aggressive feature development on right now. It supports a single-daemon configuration, but that'll soon change to allow for multiple active cephfs-mirror daemons that automatically balance the workload; HA support is already present.
A
We also recently updated it to improve incremental updating: it looks at the directory tree, similar to how rsync does, to choose which files to actually sync. If you have multiple snapshots, it'll only incrementally send what is necessary to do the sync rather than resyncing the entire tree. Again, we'll be developing this aggressively, and there will be more backports to Pacific to improve how it functions.
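End to end, a minimal setup on the source cluster looks roughly like this (peer names and paths are placeholders; a bootstrap-token flow also exists):

```shell
# Deploy the mirror daemon and enable the mirroring manager module.
ceph orch apply cephfs-mirror
ceph mgr module enable mirroring

# Enable mirroring for the file system and register the remote peer.
ceph fs snapshot mirror enable myfs
ceph fs snapshot mirror peer_add myfs client.mirror_remote@site-b myfs

# Mirror snapshots of one directory.
ceph fs snapshot mirror add myfs /projects/build
```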
A
Moving on to ecosystem: we've also spent a lot of time improving the use of CephFS in Kubernetes CSI environments, where CephFS is used for RWX or ROX volumes (PVCs and PVs). This is all orchestrated through the volumes plugin in the Ceph manager, which provides an API for creating and deleting PVs through what we call the subvolume interface.
A
For Pacific, we stabilized the snapshots interface for subvolumes and stabilized the interface within CSI. We have also been adding new authorization API support for OpenStack Manila, which is in the process of being updated to use this new API, and we've added ephemeral pinning support for subvolumes.
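The subvolume interface that CSI and Manila drive looks like this from the CLI (names and sizes are placeholders):

```shell
# Create a 10 GiB subvolume in file system myfs.
ceph fs subvolume create myfs vol1 --size 10737418240

# Snapshot it (the interface stabilized in Pacific).
ceph fs subvolume snapshot create myfs vol1 snap1
ceph fs subvolume snapshot ls myfs vol1

# Resolve the path a client should mount.
ceph fs subvolume getpath myfs vol1
```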
A
The current code is a developer preview, but we've heard a lot of positive feedback on the community mailing list already. It's under active development, and one of the next major steps will be to integrate ceph-dokan into our upstream QA so it's tested more regularly.
A
For performance, what we would like to do next is add support for asynchronous rmdir and mkdir, and potentially also link and rename. This will get us to a point where rsync can run extremely quickly; those are the last remaining RPCs that we would need to make asynchronous to improve that kind of workload tremendously.
A
We also have asynchronous metadata operation support in libcephfs currently under development; we expect that should be merged for Quincy. There are also some opportunities to improve the performance of ceph-fuse: there have been some recent changes to libfuse that we believe we could integrate into ceph-fuse to take advantage of the latest kernel modifications, helping us achieve better performance.
A
We are also developing fscache support within the kernel client; Jeff Layton has been working on this. The fscache interface is currently being refactored and reworked by David Howells, and the Ceph kernel client is being constantly updated to track his latest changes. We're hopeful that we can get something like that complete by the Quincy release.
A
But of course, the kernel has its own release schedule, so they may not match up, or the kernel support may be finished before Quincy is even released.
A
Another performance feature that we're looking into is a recursive unlink RPC. Recursive unlink on a distributed file system is usually pretty slow, and a lot of workloads just want to throw a tree away and forget it.
A
That's especially true within the new volumes plugin, which needs to regularly delete subvolumes; it would be much faster to just tell the MDS that an entire directory tree can go away. That's a workload we'd like to support through a new recursive unlink RPC, and probably it'll be exposed to the user as some kind of dot-trash directory, to allow applications to do that without linking to libcephfs.
A
From a usability standpoint, we're looking to support MDS rolling upgrades. We realize it's very difficult for Ceph clusters to upgrade the MDS due to the awkward procedure that must be followed, and this is something we want to try to address with the Quincy release.
A
There are already some changes in flight to improve how we're handling the compatibility sets for the MDSs and the file systems, to eliminate the need to actually stop MDSs and standbys prior to doing any package upgrades.
A
This should simplify the procedure a lot, and then we can move on to rolling upgrades of multiple-active file systems. We're also working on a libcephfs SQLite VFS, which will be another client for CephFS; this will be the companion VFS to the new libcephsqlite.
A
libcephsqlite is in Pacific; there's a link to the blog post about it in the slides. The libcephfs SQLite VFS will function similarly, in that it allows you to put a SQLite database on CephFS without actually mounting CephFS using the kernel client or ceph-fuse; it can speak the CephFS protocol directly and avoid doing any mounts.
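For comparison, the existing RADOS-backed libcephsqlite is used like this (pool and database names are placeholders); the CephFS-backed VFS described here would presumably look similar:

```shell
# Load the ceph VFS and open a database stored directly in a RADOS pool.
sqlite3 -cmd '.load libcephsqlite.so' \
    -cmd '.open file:///mypool:/mydb.db?vfs=ceph'
```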
A
Then we've also got the MDS memory target, which is a configuration companion to the `osd_memory_target` variable. This is to improve how the MDS utilizes memory, by monitoring its own memory use and then adjusting its own cache as appropriate in order to meet the desired memory target. We have some code that's already been worked on and just needs to be picked up and polished for Quincy.

For CephFS multi-site replication, we're currently working on more testing for the HA components of cephfs-mirror that are already present, and on adding active-active support. Both of those features we're planning to backport to Pacific.
A
You won't need to wait for Quincy for those to be available. We also want to make it more automatic to set up snapshots and sync them to the remote cluster, to make that interface a little cleaner. Some of the other things we're looking at are more sophisticated models for doing multi-site replication, without any firm plans to do this for Quincy: potentially bidirectional or loosely, eventually consistent synchronization mechanisms, and some kind of simple conflict-resolution behavior in tandem with bidirectional sync. The actual details of that have yet to be sorted out, as has how much that feature is in demand by the community. So if you have any thoughts or a strong desire for this type of system, please do share on the mailing list.
A
So we've got a few minutes for Q&A, if anyone would like to ask questions.
B
I have a question; Dan from CERN. Hi, Patrick. I'm wondering if there's any work on multi-threading?
A
Any work on multi-threading: the answer is there has been work, but due to the size of the problem not a lot of progress has been made, mostly just a lot of design discussion. So I would say no, not really any progress made.
B
Okay, yeah, because we really noticed our MDS has hit the CPU limits. We were running 10 before, but this was operationally kind of weird, so now we try to run maybe three or four, and CPU seems to be a bottleneck.
B
Okay. The other thing I was wondering is: it happens very rarely, but whenever users hit a crash of an MDS and then the journal can't be replayed, they have to enter into the disaster-recovery procedure. This is extremely scary. I'm wondering if there are any ideas to make that somehow less scary, and maybe even more automatic?
A
Yeah, disaster recovery on CephFS can be frankly terrifying for users. Which steps you should follow is not always clear, and I think the MDS rightfully decides to turn itself off if it detects any kind of metadata damage and expects the administrator to come in and take a look.
A
There are tools to do that, of course, but which tools to use is not always clear. I think, as far as improving the usability of CephFS, we could go a long way by being more clear about the kind of metadata damage we've discovered, and in places where we do detect it there should be some suggestions on where to look.
A
We could suggest what tools to use, and maybe try to classify the errors numerically so that they can be looked up more easily, so administrators have somewhere to go immediately to see what kind of metadata damage has been discovered, whether it's important, and whether it can just be deleted.
B
[inaudible question]
A
Yeah, there were plans to do that, so you're not crazy, but because of the recent changes in RADOS to reduce the number of PGs that a pool needs to have, you can have a pool with one PG and you don't see any errors or warnings anymore in Ceph.
B
[inaudible question]
A
Obviously, yeah, it would be possible, but the benefits there... yeah, I'm uncertain.
A
Okay, thank you all for attending. Please check the Etherpad for the upcoming events on the calendar; I'll go ahead and link that in the chat.