Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Patrick Donnelly, Red Hat CephFS Lead
Today I'm going to talk about the status and future of the Ceph file system. My name is Patrick Donnelly; I work at Red Hat and I'm the CephFS team lead. To begin, I'm going to give an introduction to CephFS in case you're not familiar with it, then give an overview of the features we released in Luminous last year, and finally wrap up with the changes we've made in time for Mimic, which will be released in a few months.
So CephFS is one of the components on top of Ceph. As Sage talked about earlier in the keynote, Ceph is a unified storage system: it offers many different ways to access the storage, all built on RADOS underneath, and CephFS is just one of those use cases. In fact, it was the original use case of Ceph. CephFS is a POSIX-compatible distributed file system.
The FUSE client is sometimes preferred if you are not able to use the kernel client, for example because you can't use the latest kernel version for whatever reason, or because you don't have control of the kernel in use by your clients. The kernel client is generally going to give you much better performance than the FUSE client.
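As a quick illustration of those two access paths, here is roughly what each mount looks like; the monitor address, mount point, and credentials are placeholders rather than anything from the talk:

    # Kernel client: mount CephFS with the in-kernel driver
    sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

    # FUSE client: the same file system through ceph-fuse in user space
    sudo ceph-fuse --id admin /mnt/cephfs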
One of the hallmarks, and a necessary requirement, of CephFS as a POSIX distributed file system is coherent caching across all of the clients. The MDS issues capabilities to the clients to give them permission to read and write to files, so you don't have to worry about clients reading stale data, or about any kind of eventually-consistent file system model you may be familiar with from other vendors.
Here we have three metadata servers, shown above in red. There are two active metadata servers and one standby. The two active metadata servers cooperatively distribute the metadata load from the clients, that is, the clients looking up files or mutating metadata, and all of these metadata reads and changes go to the metadata pool, which is stored in RADOS. The clients interact with the active metadata servers through the CephFS protocol to do opens, mkdirs, and directory listings.
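To make the pool layout concrete, this is roughly how a CephFS file system is created on top of RADOS pools; the pool names and placement-group counts here are illustrative, not from the talk:

    # One RADOS pool for file data and one for metadata
    ceph osd pool create cephfs_data 64
    ceph osd pool create cephfs_metadata 64

    # Tie them together into a file system served by the MDS daemons
    ceph fs new cephfs cephfs_metadata cephfs_data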
That wraps up the introduction to CephFS; now I want to move on to one of the features we debuted in Luminous last year. In Jewel we released CephFS as a stable system, with the caveat that only one active metadata server was considered a stable configuration. In Luminous we've corrected that, and now you can have multiple active metadata servers, which allows you to scale the metadata load linearly with the number of active metadata servers you have. Setting the number of actives that you want is as simple as running a ceph fs set command: you modify the max_mds setting on the file system to control the number of actives you want. A short time after you change it, the monitors will promote one of the standbys to active, and you can see here that we have two active metadata servers available for the file system.
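Assuming a file system simply named cephfs (the name is a placeholder), the command looks roughly like this:

    # Ask for two active MDS daemons; a standby is promoted to fill rank 1
    ceph fs set cephfs max_mds 2

    # Confirm that two ranks are now active
    ceph fs status cephfs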
What we found is that there were instances of imbalance resulting from the balancer; in particular, one metadata server would become overloaded while another metadata server would be doing almost nothing. We also observed volatility, that is, a subtree would be passed back and forth between metadata servers without ever settling down into any kind of stable distribution.
Some of the drawbacks: your MDSs can become unbalanced due to this pinning. If you have a subtree that you've pinned manually and it's overloading that particular MDS, the balancer is not going to help you with that; you'll have to resolve it yourself, either by splitting the subtree further or by undoing the pin and letting the balancer handle it dynamically. And then, of course, you're introducing the possibility of human operator error into your pinning policies.
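For reference, a pin is expressed as an extended attribute on the directory at the top of the subtree; the path and rank below are only illustrative:

    # Pin everything under /mnt/cephfs/projects to MDS rank 1
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects

    # Remove the pin and let the dynamic balancer manage the subtree again
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/projects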
In Luminous you can now have directories with more than 100,000 files. In Jewel we had backported a change which prevented a directory from exceeding 100,000 files, and this was to prevent performance anomalies: we saw performance degradation because an entire large directory would be stored in a single RADOS object, which would exceed the maximum object size and cause performance problems in your cluster, and so we created that limit. Now that Luminous fragments large directories across multiple objects, you don't have to worry about it anymore.
For the MDS, before this change you would provide a configuration variable, the MDS cache size, which counted the number of inodes the MDS was permitted to hold in cache. Unfortunately that's a poor proxy for memory usage, because you have to empirically determine how much memory a given number of inodes uses, and it's not applicable to all workloads, because some inodes, for example directories, might require more memory than, say, a regular file. So it might work for some workloads but end up using a lot more memory for others.
So what we did in this change was allow you to specify the amount of memory you want to limit the MDS cache to. Internally this uses C++ memory pools to track the memory the MDS cache is actually using. It's still a soft limit, so your MDS can go above the number of bytes that you set.
However, we also have the MDS health cache threshold, which specifies when the MDS should start complaining to the monitors and issuing cluster health warnings that it is using more memory for cache than its limit. The default is 50% more, and you'll start seeing notices in the cluster log saying that the MDS is having trouble trimming its cache.
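A minimal sketch of the two settings; the 16 GiB value and the mds.a daemon name are assumptions for illustration, not figures from the talk:

    # Soft limit of 16 GiB on MDS cache memory
    ceph tell mds.a injectargs '--mds_cache_memory_limit 17179869184'

    # Raise a health warning once usage exceeds 150% of that limit (the default)
    ceph tell mds.a injectargs '--mds_health_cache_threshold 1.5'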
In practice, we recommend allocating approximately twice the cache limit in RAM for the MDS, both to allow the MDS to go over its cache limit by some amount and to acknowledge that the MDS of course uses RAM for other things as well; so about two times the limit is what we recommend. If you want to read more about this, there's a blog post linked in the slide deck, which will be online, that you can read.
Snapshots are now considered stable; we've been getting asked about that a lot. This was largely due to work by Zheng Yan, who's in the audience. Snapshots in CephFS are done per directory, so you create a snapshot by doing a mkdir, as in the third line of that code, inside a hidden .snap directory that's present in all directories. You provide the name for the snapshot, and that's all you need to do; CephFS handles the details in the background.
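A minimal sketch of that workflow from a client mount; the directory and snapshot names are placeholders:

    cd /mnt/cephfs/mydata

    # Create a snapshot of this directory and everything beneath it
    mkdir .snap/before-upgrade

    # List existing snapshots, and remove one when it is no longer needed
    ls .snap
    rmdir .snap/before-upgrade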
Another popular feature of CephFS that we've completed is kernel quota support. This was a cooperative effort by Luis Henriques of SUSE and Zheng Yan. Similar to subtree pinning, you specify the quota by using the extended attribute interface. CephFS provides two different limits: you can set the maximum number of bytes for a given subtree, or the maximum number of files.
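Both limits are set with setfattr on the directory at the top of the subtree; the path and values are only examples:

    # Cap the subtree at roughly 100 GB of data
    setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/home/alice

    # Cap the subtree at 10,000 files
    setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/home/alice

    # Setting a limit back to 0 removes the quota
    setfattr -n ceph.quota.max_bytes -v 0 /mnt/cephfs/home/alice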
Next, we also improved the cache limiting by memory. There were some structural issues that we knew about when we were creating the change, namely that the items in the MDS cache weren't fully tracking the containers they used: the cache items, the directories and inodes, were using containers such as a C++ standard map whose allocated space we were not accounting for, so the memory pool tracking was off by a constant factor. We have fixed that, and you can see the two issues referenced in the slide deck. A backport of this fix is coming in Luminous 12.2.5, so you can expect it there as well. Just to give you a quick example that I'll run through quickly due to time.
Before, the MDS cache size was approximately 65% of the total RAM in use by the MDS; that is, the cache usage as understood by the MDS, based on its own tracking, was 65% of its actual use of RAM. Afterwards it's approximately 80%, so we're closer to the true RAM usage. It will never actually converge on the complete RAM usage, because the MDS uses RAM for other things, of course, not just the cache. And here's another look at the MDS with much larger cache sizes; this is actually where we noticed the issue.
We've also moved the client session timeouts into the FSMap, and this is so that we get a consistent view of how clients are evicted if they don't communicate with the MDS after a certain period of time. This was mostly necessary so that we have consistent behavior across multiple MDSs, because it was possible to configure only one MDS while the others behaved differently. In particular this was important for NFS Ganesha, which is able to export CephFS and issue delegations to its NFS clients.
If the MDS revokes capabilities that are held by NFS Ganesha, then Ganesha needs to revoke the delegations held by its clients, and it could easily run into these timeouts. So you're now able to set these timeouts based on your needs; for example, if you're doing NFS exports you might want to set them higher. NFS Ganesha is also able to observe these timeouts by accessing a copy of the FSMap.
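Because the timeouts now live in the FSMap they apply per file system; a sketch of what setting them looks like, with placeholder values and the assumption that the file system is named cephfs:

    # Allow clients 120 seconds of silence before their session is considered stale
    ceph fs set cephfs session_timeout 120

    # Automatically close unresponsive sessions after 600 seconds
    ceph fs set cephfs session_autoclose 600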
Finally, another planned feature for Mimic: we're trying to integrate the NFS gateway and have an integrated NFS gateway in Ceph for exporting CephFS. This serves as a third alternative client for accessing CephFS, through NFS. In the figure in the top right we have, for example, some virtual machine that's mounting an NFS server, Ganesha, in the middle, and then Ganesha forwards all of those NFS requests, turning them into the equivalent CephFS requests, which get passed on to the MDSs and OSDs.
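From the client's point of view this is just an ordinary NFS mount; the gateway hostname and export path below are placeholders:

    # Mount CephFS indirectly through an NFS-Ganesha gateway
    sudo mount -t nfs -o vers=4.1 ganesha-gw.example.com:/cephfs /mnt/nfs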
So Ganesha acts as a gateway for this. We also want to have a solution that can be applied in other situations: for example, if you don't want to use the FUSE client because it's too slow, or you can't use it because of the privileges it might require, or you can't update the kernel client, then this is another option for you.
The important aspect of this is that it also lets us set up high availability and scale-out of the NFS gateways, and do that consistently across different types of deployments. The way we're planning to do this for high availability is to have the Ganesha containers managed by Kubernetes, which handles the lifecycle of those containers.
The Ceph manager will be what actually creates and manages these containers in Kubernetes, and it has the option of creating multiple containers for a given share to handle scale. Then we'll take advantage of the Kubernetes load balancer, through the proxy service mechanism, so that multiple Ganesha containers can serve multiple clients behind a single IP address; that will allow you to do dynamic scaling. This is a big figure that I'm going to gloss over due to time; it works, for example, with Manila.