From YouTube: CephFS in Jewel, Stable at Last
Description
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute.
Hello. Is my audio working? Yes, it is working. Okay, excellent. So, I'm Greg Farnum. I'm the CephFS tech lead; I've been working on the project for, wow, like seven years now, and I'm here to talk about stable CephFS in the upstream Jewel release.
I realize there have been a lot of Ceph talks, but I haven't seen many that talked about how Ceph actually works. I'm going to blaze through that pretty quickly, and then we're going to talk about CephFS:
what actually works in the upstream stable release; all these things that you might have heard about over the past many years that aren't done yet, so that you know what to expect; and some pain points that I expect people to see, or maybe not, and I'd like to know if you don't see them.
So, Ceph is built on top of RADOS, the Reliable Autonomic Distributed Object Store. It started as a long-term research project at UC Santa Cruz, where Sage did his PhD thesis, and it's now supported by Red Hat and a whole bunch of other people.
I apologize if I missed anyone; I don't actually keep track anymore. So it's, you know, a bunch of people providing commercial support for this open source upstream project, with whatever their downstream spins on it are. In the Ceph project, we have RADOS at the bottom. That's sort of our base storage layer that provides all the primitives that all the other projects use to build up their services.
There's the librados API library that allows clients in the system to talk to the actual storage cluster, and then we have the RADOS Gateway, which is a RESTful S3- and Swift-compatible object storage service; the RADOS Block Device, which is a virtual block store; and CephFS. For a long time we've been calling those first two awesome, and CephFS almost awesome, and now CephFS has many awesome things, so I'm very excited about that.
So within the RADOS cluster, you have a whole bunch of servers. Some of them will be monitors, the MONs, and then you'll have OSDs, or object storage daemons, and the application just talks to whichever systems it needs to. An object storage daemon is a regular Linux process running, at the moment, on top of a Linux file system on top of a disk.
There are experimental and developmental backends in progress that strip out the Linux file system entirely and just run on the disk directly. Within the cluster you'll have tens to tens of thousands of OSDs. These provide the actual data storage, and unlike in many clustered storage systems, each of the OSDs is intelligent and, with a very small amount of data, works together with the others to maintain the replication and consistency of objects. So it's not like you have a central manager. We have monitors, but they're not going around saying, 'hey, OSD 5,
you need to push data to OSD 3.' The monitors maintain a very small amount of state, and that state is the cluster membership: saying, you know, we have OSDs 0 through 10,000, and all of them are up except for OSDs 5, 8, and 9,873. The purpose of the monitors is just to say who's alive, who's dead, what actually exists, and sort of what the rules are for how data goes into the cluster.
So when you want to read or write an object, you need to find out where it is. There are a couple of different strategies for doing that. In a lot of storage systems you just have some kind of central service that says, 'hey, object foo is over on these storage nodes,' but that's sad, because it means you incur the lookup latency every time you want to access an object, and because it means you need a storage server that can hold the locations of all the objects.
So within RADOS we use a calculated placement algorithm. It's called CRUSH: Controlled Replication Under Scalable Hashing. I think the important part about it is that it's a mathematical algorithm which takes a very small amount of input: it takes in the map that the monitors maintain and the name of the object that you want to look up, and it says, 'okay, that object right now, according to this map, lives on these two or three OSDs.' If OSDs get marked as failed in the cluster, then CRUSH automatically says, 'okay, I know the data doesn't live there;
now it lives over here in these different places.' It's a very fast calculation, and it's stable, so when you run it from different clients you get the same results. And it allows you to do much cleverer things than sort of normal consistent hashing: you can do replication across racks, or across machines, or across data centers. You can set up your own rules, so that maybe you want to have power supplies or power circuits as failure domains, maybe you don't. It's very configurable.
Within the storage system you have a bunch of different namespaces called pools. That's relevant to CephFS because, as you'll see later on, we have a data pool and a metadata pool, but you might also have a pool for your RADOS block devices and a pool for your RGW objects. Each of those pools is sliced up into shards, which we call placement groups.
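As a rough illustration of the calculated-placement idea, here is a toy Python sketch. It is not the real CRUSH algorithm, just the shape of it, with invented pg_num and replica counts: hash the object name to pick a placement group, then make a deterministic pseudo-random choice of OSDs from the current cluster map, so every client holding the same map computes the same answer without asking a lookup service.

```python
# Toy illustration of calculated placement (NOT the real CRUSH algorithm).
# Any client holding the same cluster map computes the same mapping; there is
# no central lookup table, and a new map yields a new mapping.
import hashlib
import random

def place(object_name, osds_up, pg_num=128, replicas=3):
    """Map an object name to (placement group, list of OSDs holding it)."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], 'little')
    pg = h % pg_num                      # object name -> placement group (shard)
    rng = random.Random(pg)              # deterministic choice, seeded by the PG
    return pg, rng.sample(sorted(osds_up), replicas)

cluster_map = set(range(8))              # OSDs the monitors currently report as up
print(place('foo', cluster_map))         # every client computes the same answer
print(place('foo', cluster_map - {3}))   # marking an OSD out changes the result
```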
Those placement groups are what actually get moved around on the OSDs; they're called that because they move as a unit. So when an OSD fails, you don't move every object to different nodes in the cluster individually; you move the placement groups as units, and the way that works is through the peering process. The monitors maintain OSD maps describing the state of the cluster, and each of those OSD maps is numbered with an epoch; the epochs just increment forever.
So in this example, we've just pushed out a new map epoch, and the placement group has members 11, 5, and a third OSD. Let's say that OSD 11 was not previously part of the set of OSDs serving this placement group, but OSD 5 was. So OSD 5 gets the new epoch of the map and says, 'okay, 11 is now a member of this placement group, and he wasn't before, so
I'm going to tell 11 that he needs to come up, that he is a member of this placement group.' And 11 then gets that notification and says, 'okay, well, now I need to talk to everybody so that I can get all the data for this PG 42.' So the OSD says, 'all right, let me go back; I have a history of the maps, and I want to see who all is responsible for storing this data.'
A
But
that's
not
you
know
the
specific
reason
doesn't
matter.
It
goes
through
the
same
process
for
basically
any
kind
of
cluster
change
and
he'll
go
and
you'll
say
all
right.
So
in
this
case
maybe
OST
11.
Actually
just
the
placement
groups
have
logs.
So
the
OST
11
might
just
say:
hey
five
I
need
everything.
That's
changed
in
seatback
1984,
because
hey
I
was
in
charge
of
it
back
then,
but
maybe
instead,
it's
more
complicated,
and
so
you
know
what
we
don't
actually
want
to
talk
about
this
right
now.
Sorry. Okay, so the librados API provides access to sort of the functionality of the RADOS cluster. It is an object-oriented API: you say, 'I want to do this operation, or this set of operations, on object foo.' But it's a very rich API; it's not just put, get, and delete. You can say, 'I want to write these 100 bytes at offset 57'; you can say, 'hey, if the object has this version number, I want to set this xattr to something different.'
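To make that concrete, here is a minimal sketch using the python-rados bindings; the pool name 'data' and the config path are assumptions, and error handling is omitted.

```python
# Sketch: a few librados operations from Python (package: python-rados).
# Assumes a reachable cluster, a client.admin keyring, and a pool named "data".
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')          # a pool is a namespace of objects
    ioctx.write_full('foo', b'hello world')     # create or replace the whole object
    ioctx.write('foo', b'X' * 100, offset=57)   # rich API: 100 bytes at offset 57
    ioctx.set_xattr('foo', 'color', b'blue')    # per-object extended attribute
    print(ioctx.read('foo', length=16, offset=0))
    ioctx.close()
finally:
    cluster.shutdown()
```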
Similarly, we have the RADOS Block Device. It runs as a user-space library inside of QEMU/KVM, or as part of the Linux kernel, and it translates block device commands at those layers into operations on the RADOS cluster. It's got all kinds of great features; it's the number one OpenStack Cinder storage solution. You should use it, hooray. All right, so: CephFS. And feel free, by the way (I usually start out
my talks this way): if you have any questions, just raise your hand or get my attention during the talk, because I'm not quite sure how much time we'll have left at the end. Maybe we'll have gobs of time, I've re-jiggered this a little bit, maybe we'll run out.
Okay, so CephFS. It is, in fact, a file system. Hooray, everyone loves a scalable file system. You mount it from multiple clients; you can write from client A, and read the data that client A wrote from client B.
It is a Linux, basically-POSIX file system, in the same way that all Linux file systems are basically POSIX. It's not close-to-open semantics like NFS; it's like, you know, you write to your ext4 volume and you read from your ext4 volume, and it works that way. So that's sort of the catch-all: it's got coherent caching between all the clients and the servers. Your Linux host mounts it either via the upstream Linux kernel module or via our user-space client, ceph-fuse.
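For a feel of the client side, here is a hedged sketch using the python-cephfs (libcephfs) bindings; the same thing works as plain file I/O on a kernel or ceph-fuse mount. Paths are made up and method signatures can differ slightly between releases.

```python
# Sketch: talk to CephFS through libcephfs from Python (package: python-cephfs).
# Assumes a running cluster and a readable /etc/ceph/ceph.conf; check
# `pydoc cephfs` for the exact signatures shipped with your release.
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                                    # attach to the filesystem
fs.mkdir('/demo', 0o755)                      # metadata operation: goes to the MDS
fd = fs.open('/demo/hello.txt', 'w', 0o644)   # file data goes straight to the OSDs
fs.write(fd, b'written from client A\n', 0)
fs.close(fd)
print(fs.stat('/demo/hello.txt'))             # another client sees this immediately
fs.unmount()
fs.shutdown()
```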
So the client gets the cluster maps: here's your metadata server map, and here's your OSD map. Then the client talks to the metadata server for all metadata updates: for saying, 'hey, what does the root directory look like,' and 'what are the contents of my home directory,' and 'hey, I want to change the mtime on this file.' And it talks directly to the OSDs for all data updates, so for all writes. The filesystem is very consistent, and that also means that under many circumstances it's much, much faster than you'd expect from a POSIX filesystem.
If you have a bunch of different clients mounted, but they each have their own sort of hierarchy, like they're each in their own home directory and that's the only thing that user cares about, then they can just cache that entire tree locally on the client side. All the stats will be satisfied locally from the client side, without going over the network or anything. But if they are sharing things, then sort of the server will say, 'hey, I'm making a change;
your cache is invalid; throw away this information.' So that means that clients can be very fast when they're the only one working on stuff, but if there are people sharing data, then they never see anything stale. There's no opportunity for any kind of split-brain that you might have seen in other storage systems; it just works. Scaling the data path within CephFS is pretty trivial: all the data is stored in RADOS, and the file system clients write directly to RADOS.
You scale it the same way you do an ordinary RADOS cluster. If you want more throughput, you can put in faster SSDs; you might be able to say, 'hey, I'm writing files and they're split into four-megabyte chunks, but these are 10-gigabyte files, so I want to use 64-megabyte chunks when I'm splitting them up across the RADOS cluster'; sort of all the other tricks
you want, at least until you're limited by latency instead of throughput, at which point, you know, we need to make the OSDs faster, and that's being worked on.
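The chunking described above is the file layout. On a mounted CephFS you can adjust it per file (or per directory) through the ceph.file.layout virtual xattrs before any data is written; a hedged sketch, with the mount point as an assumption:

```python
# Sketch: ask for 64 MiB objects for a big file via CephFS layout vxattrs.
# Assumes CephFS is mounted at /mnt/cephfs (kernel or ceph-fuse) and that the
# file is still empty; layouts cannot be changed once data has been written.
import os

path = '/mnt/cephfs/bigfile.dat'
open(path, 'w').close()                                          # empty file
os.setxattr(path, 'ceph.file.layout.object_size', b'67108864')   # 64 MiB objects
os.setxattr(path, 'ceph.file.layout.stripe_unit', b'67108864')
print(os.getxattr(path, 'ceph.file.layout').decode())            # inspect the result
```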
Scaling metadata is a little harder, but we do have some good tricks. First of all, unlike in some storage systems, you don't store the entire file system hierarchy and namespace in the metadata server's memory.
When you want to access a directory, the metadata server goes and looks it up off of disk, caches it in its in-memory cache, and throws it away when it runs out of room. That means that your metadata server's cache can be sized for how much active data you have: if you have, you know, 100 million one-gigabyte files, but you only ever look at 50,000 of them in a day, the cache only needs to hold those.
Second awesome thing within CephFS: we actually have a security model now. For a long time we didn't, and, you know, there's still a ways to go, but we do have a way to deny clients certain levels of capabilities. Clients start out with nothing at all; it's a capability model, and you grant accesses. You can say, 'I want this client to be able to, you know, read the entire file system but not write to it,' or you can say, 'hey, I want this client to be able to read and write only to,
you know, /home/client-a, or whatever.' You can say that they are allowed to act only as file system user ID 98, or 1017, or whatever, for real security. The MDS capabilities control only what happens on the metadata server: what metadata they're allowed to look at and what metadata they're allowed to change. They do not impact what actual file data they can read and write from the OSDs.
These capabilities are reasonably secure: they're encrypted by the monitors and can't be tampered with by the client; they just get passed along when the client opens up sessions to the metadata servers and the OSDs, and they say what the client is allowed to do.
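As an illustration, restricting a client to one subtree looks roughly like this; a hedged sketch driving the ceph CLI from Python, where the client name, path, and pool are invented and the exact capability syntax varies a bit between releases.

```python
# Sketch: mint a CephFS client key that may only read/write under /home/client-a.
# The entity and pool names are examples; cap syntax is release-dependent.
import subprocess

subprocess.run([
    'ceph', 'auth', 'get-or-create', 'client.client-a',
    'mon', 'allow r',
    'mds', 'allow rw path=/home/client-a',   # metadata access limited to this subtree
    'osd', 'allow rw pool=cephfs_data',      # data access limited to the fs data pool
], check=True)
```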
Yep, okay. Another awesome thing: we have features for doing scrub and repair on the file system.
Now, a couple of years ago, people would test the file system and they would say, 'usually it works great, but I had this crash and now my MDS won't start up, and it says that there's a journal error.' And we would be like, 'well, okay, can you zip up the journal for us and send it to us?' And we'd look at it, and we're like, 'all right...'
So the first thing we have is what we call forward scrubbing. With forward scrubbing, you can give the metadata server a path and say, 'I want to scrub from here,' and it will go off in the background, start at whatever path you gave it, and say, 'all right, what do all the files in here look like? Oh look, I have some directories; let me go down into the next directory,' and so on; and when it reaches a leaf directory, it looks at all the files.
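On releases that ship it, forward scrub is kicked off through the MDS admin socket; a hedged sketch, where the daemon id is an assumption and `ceph daemon mds.<id> help` lists what your build actually supports.

```python
# Sketch: ask the active MDS (assumed to be "mds.a") to forward-scrub everything
# under / in the background. Run on the host that has mds.a's admin socket.
import subprocess

subprocess.run(['ceph', 'daemon', 'mds.a', 'scrub_path', '/', 'recursive'],
               check=True)
```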
For repair, you say: take everything that's in the journal and flush it down, and then you go to the data-scan tool. The data-scan tool makes use of the fact that RADOS is an object store. Unlike on your normal hard drive, when we're doing a file system repair we don't need to crawl over each block and say, 'does this block look like maybe it's an inode? I think maybe it is, so I'm going to try and reclaim this file and put it in lost-and-found.'
Instead, we can iterate through all the objects in the RADOS pool and say, 'hey, we know what the object names look like: this is, you know, a file object, and so I know that I have a file whose inode number is 1776.' We do that iteration using some of our helper classes; we examine the object name and, presuming it's a file, we send the information about that object back to where the file is rooted.
So the first object in every file has a special piece of data on it called a backtrace, and a backtrace is just the path of the file, but it's versioned, so it can be stale.
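If you're curious, you can see that piece of data from plain librados; a hedged sketch, where the pool name and object name are invented and the backtrace itself is a binary-encoded blob (commonly stored in an xattr named 'parent') rather than readable text.

```python
# Sketch: peek at the backtrace stored on the first object of some file.
# Data objects are named "<inode in hex>.<object index>"; this inode is made up.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('cephfs_data')
raw = ioctx.get_xattr('10000000000.00000000', 'parent')   # first object of the file
print(len(raw), 'bytes of encoded backtrace')
ioctx.close()
cluster.shutdown()
```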
So we say, 'hey, you know, once upon a time we were in the home directory, and it was version 2 of the home directory, and then it was in the greg directory at version 9, and then it's the file foo.' And so if we find this object 1000.1, we would say, 'hey, there's a 1000.0'...
well, I've gotten my numbers confused, anyway. For that zeroth object we do a second pass that goes and looks at just the first object of each file, and it will say, 'hey, I believe that I am in the greg directory, which has this inode number, so I'm going to send off my information to that directory,
saying I exist.' And with that we can reassemble it. It might be slightly out of date, but we can reassemble a tree with everything in the cluster that we know is coherent. And because we are running directly against RADOS through the librados API, and running part of the code on the OSDs, we can do this in parallel across the cluster. It's not one serial worker; we can spin up a whole bunch of them on different machines.
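That parallelism is exposed through the data-scan tool's worker options; a hedged sketch, with the pool name assumed and the flags worth checking against your release's disaster-recovery documentation.

```python
# Sketch: run worker 0 of 4 of cephfs-data-scan; the other three workers can run
# the same phases on other machines. All scan_extents workers should finish
# before any scan_inodes worker starts.
import subprocess

DATA_POOL = 'cephfs_data'   # assumed data pool name

for phase in ('scan_extents', 'scan_inodes'):
    subprocess.run(['cephfs-data-scan', phase,
                    '--worker_n', '0', '--worker_m', '4', DATA_POOL],
                   check=True)
```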
All right, so more awesome things: we have a hot-standby MDS. Nothing ties metadata to a particular server. As I've sort of implied, we keep a journal of the metadata mutations in RADOS, and we keep the actual metadata file and directory objects in RADOS. So if we want to, we can just move the metadata server over, and the way you'd do that, if you were being polite, is you'd say, 'hey, turn off this metadata server, turn on this other one;
you are running as that guy now.' But in particular, you can spin up as many backup ones as you want. We call these standbys and standby-replay servers, and the standby-replay ones in particular are nice, because they will actually sit around and read the MDS log and replay all the operations in memory. They don't make any writes, but they'll just run it over and over in memory and say, 'hey, did you do more operations? Let me do that operation in my memory too.' And the reason you might want to do
that is because it warms up the metadata server's cache. So if your active metadata server dies, your passive one can go, 'oh hey, I just happen to have all of the things in memory that people are interested in, and I don't need to go around and grab those hundred thousand, or million, or however many inodes off of disk in that number of I/Os; I just have them ready to go.'
So if you do have a crash, the replay is reasonably fast. You need to replay whatever amount of the metadata server log you haven't already replayed; you need to load all the necessary inodes out of the cluster, if you don't already have them in memory; and then we have a very short window, I think it defaults to 30 seconds, where clients can come back and say, 'hey, I had some operations that you haven't acknowledged.'
So that's the end of the happy things for the moment. There are some parts of CephFS that you might have heard about in the past that are not ready yet. One of those that's almost awesome is having more than one active MDS server. If you've been in a talk about CephFS, or maybe even just Ceph, in the last six years, you've probably seen a slide that looks sort of like this,
where we say, 'hey, no metadata is stored on the MDS servers, so we can just split up the metadata between more than one active MDS server.' And, you know, it's great; it's cooperative partitioning. Each server keeps track of how hot the metadata it's working on is, and if one of them gets too much hotter than the others, it will migrate subtrees in order to keep the heat distribution across the cluster similar. This is pretty cheap.
We've been building repair tools so that, if there's a disaster, we can get you your data back, so that you can run away as quickly as possible, or come back for another bruising because it wasn't our fault, whatever. But in general, MDS failure recovery is a lot more complicated if you have more than one active MDS. The picture I painted when you have a single one is pretty simple, but when you have more than one MDS, operations like renames that might cross directories get a lot more complicated.
Also almost awesome: directory fragmentation. Directories are generally loaded from disk as a unit, which means that if you have a hundred-thousand-file directory, which you can have (I mean, depending on your workload, that's not unreasonable), then whenever you access one file in that directory, the MDS goes off and gets back all hundred thousand of them into its cache and says, 'hey, now I have the file I want.
Oh, but also, you know, my cache size is only a hundred thousand inodes, and so I had to throw everything away.' Which means that if you're doing repeated accesses on one very large directory, it can be very sad. Or, once we have multiple active MDS servers running, maybe you have one really hot directory and you want to split it up across the different servers for faster throughput. So we have a feature where you actually can split up directories into multiple objects; that's the fragmentation part.
It probably works; honestly, it's just not tested well enough, so we have it turned off by default. We need to turn it on in our nightly testing: we don't have a lot of large-directory workloads, and we don't have anything specifically going in and, like, making a large directory, making sure that the split works the way we expect it to, making some change, making sure that things keep working.
It's basically just a QA workload and, honestly, it's the kind of thing we could put off, so we put it off. Also almost awesome: snapshots. Everyone likes snapshots, and our snapshots are almost really, really awesome. Instead of being divided into subvolumes and taking snapshots of a subvolume, you can just say, 'hey, I want a snapshot of that guy's home directory; I want a snapshot of this person's home directory; you know what, I want a snapshot of just the log directory inside of that guy's home directory.' It doesn't need to be the whole thing.
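Mechanically, taking one of those snapshots on a mounted CephFS is just creating a directory inside the hidden .snap directory of the subtree you care about; a sketch, with the mount point assumed and snapshots assumed to be enabled on the filesystem.

```python
# Sketch: snapshot just one user's log directory on a mounted CephFS.
# Assumes the filesystem is mounted at /mnt/cephfs and snapshots are enabled.
import os

target = '/mnt/cephfs/home/greg/logs'
os.mkdir(os.path.join(target, '.snap', 'before-cleanup'))   # take the snapshot
print(os.listdir(os.path.join(target, '.snap')))            # list snapshots

# The snapshotted files are readable under .../logs/.snap/before-cleanup/, and
# removing the snapshot is just rmdir on that same directory.
```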
And it's really cool that the file data is stored using RADOS object snapshots; that's a primitive RADOS has, and it's very efficient at that level. But it makes the directory structures and the inode structures a lot more complicated, because you can do those sorts of snapshots inside of existing snapshots, and you can rename files from inside of a snapshot to outside of a snapshot.
We need to keep tracking all the metadata to keep things consistent, and it's just complicated. So every so often one of our developers will go off and be like, 'hey, I wrote a bunch of new snapshot tests, I found a bunch more new bugs, I fixed them all, it's passing now.' But then, you know, he writes more tests and it's like, 'oh, we found more bugs.'
So we need lots of testing, lots and lots of testing. And then, especially when you add this in with multiple active metadata servers, you could have snapshots where part of the snapshot data is on metadata server A and part of it is on metadata server B, and that makes things even more exciting from a coding perspective, and from sort of a recovery perspective, when you have snapshot operations that are happening but one of the servers fails and you need to recover stuff.
Really, the only thing missing here, or the biggest thing missing here, is just testing, and of course back in February we didn't want to turn it on for a long-term-support release, and we didn't want to make such a brand-new feature part of our stable announcement. We do have a few very small known issues under edge cases; I think for all the ones we have, we actually have pull requests pending, they just aren't done yet. And the security model here is just a little bit iffy.
But this will probably be turned on for Kraken, which is our next release in about six months, unless we come up with something very surprising; and I think that's the first time I've said that out loud, so there we go. All right, so: some pain points that you might see if you deploy CephFS in testing, or do something with it that we weren't expecting. Number one is file deletion.
File deletion works; don't get scared. Like, you delete a file, it does go away. But, you know, a file can be very large; it can consist of thousands of RADOS objects, and, depending on how fast your cluster is, it might take a lot of time to actually send out that number of operations and do the actual deletes on the disks.
So when you unlink a file from the client side, it sends an operation to the MDS, and the MDS says, 'all right, this file is in the deleting state; I unlinked this file, and hey, no one else in the tree has a link to it, so I'm deleting it.' In fact, what we do is we move it into what we call the stray directory, saying that it's not part of the filesystem anymore, and then later we say, 'oh hey, we can in fact delete this file.'
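If you want to keep an eye on that queue of unlinked-but-not-yet-purged files, the MDS exposes counters for it over its admin socket; a hedged sketch, with the daemon id assumed and counter names that can shift a little between releases.

```python
# Sketch: check how many "stray" (unlinked, not yet purged) inodes mds.a holds.
# Run where mds.a's admin socket is available.
import json
import subprocess

out = subprocess.check_output(['ceph', 'daemon', 'mds.a', 'perf', 'dump'])
counters = json.loads(out)
print(counters.get('mds_cache', {}).get('num_strays'))
```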
We do have fixes in progress: we have a pull request pending that reduces the memory pressure a great deal, so that will help. And one of the more urgent things in our task queue is that we need to build a proper throttling system, so that we can say, 'hey, I know this file is being deleted, and by the way, I just finished deleting this one, so give me the next one from the queue.'
It's not terribly complicated work; it'll probably be done in the not-too-distant future, but it needs to happen. So, you know, if you're deleting lots and lots of things, that's something to be aware of. The second major pain point is client trust. Sort of inherent to the CephFS protocol right now is that clients are, on some level, trusted.
We have coherent caching, which means that if a client has information cached about something, we can't change it until the client has told us that it's dropped the cache.
Now, if a client goes unresponsive, then we will time it out after 30 seconds or whatever. But if it keeps on saying, 'yes, I'm alive, but I can't give you this capability back,' or 'I can't give you the locks on this information back, because I'm still writing data out and I can't release it until I'm done writing this data,' then we can't kill it. So clients can deny writes to anything
they can read. And because the data lives on the OSDs, which have their own security capabilities, anything the clients can read or anything the clients can write in the OSD cluster, they can trash. You can fix that by giving clients separate namespaces; you can fix the write-denial by not sharing files across tenants; and sort of the biggest one is that clients can mount a DoS attack against the MDS they attach to: they can just keep on saying, 'hey, create this directory at a deeper and deeper level, create these hundred thousand files,' whatever.
Once we have multiple file systems in a cluster, then that will work, but at the moment there's sort of a minimum level of trust you need to have in all of the clients which are connecting directly to your RADOS cluster or directly to your Ceph filesystem. If you don't trust your clients, you should put them through Samba or Ganesha as a gateway instead. And finally, the final pain point is debugging live systems. We have some pretty cool tools: you can go to the metadata server and dump every operation that's currently in flight;
we can see what clients are connected to a metadata server and some of the information about what they've got going on; and we can dump the contents of the metadata server's cache and get a lot of useful information about the state of the system out of there.
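Those live on the MDS admin socket; a hedged sketch of poking at them, with the daemon id assumed (and `ceph daemon mds.<id> help` as the authoritative list for your version).

```python
# Sketch: the main live-debugging entry points on an MDS admin socket (mds.a).
import subprocess

for cmd in (['dump_ops_in_flight'],                   # metadata ops currently in flight
            ['session', 'ls'],                        # connected clients and their state
            ['dump', 'cache', '/tmp/mdscache.txt']):  # write the cache contents to a file
    subprocess.run(['ceph', 'daemon', 'mds.a'] + cmd, check=True)
```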
But we can't say what's happening on one specific inode. We don't have a real great way, if you have an operation on a client that's stuck, to answer, 'why is this rename taking forever?'
These are the ones we have on the list; we have some stuff in progress for them, but it's sort of... we need to get it out and see what people actually want, because I don't know: I'm a developer, my file systems last for, like, 12 minutes before I tear them down and start new ones, and it's great. We need to see what diagnoses people actually need before we can start building the appropriate tools to track them.
[Audience question] So, the message I've heard for a long time is 'don't use it in production,' and now you're saying it's awesome. But is it awesome, or is it almost awesome? I'm still a little shaky on production usage.
The upstream community is leery of using the words 'production ready'; you can talk to downstream people who provide actual support for that, because in the upstream community it's all, you know, 'hey, I have this problem,' and someone on the mailing list is interested in it. What we're saying is that it's stable. That means that we are really very confident that, if you run the system the way we tell you to run it, you won't have problems, and you have to not turn on the features that I've said are almost awesome.
You have to go set flags via the monitor to turn those on; they're locked out, so you have to acknowledge that we don't think you should turn them on yet, and that if you turn them on, you might lose data. They also irrevocably mark your cluster state, so that if we're debugging things, we know these have been turned on.
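For example, on Jewel, turning snapshots on means explicitly telling the monitors you accept the risk; a hedged sketch (the exact subcommand has moved between releases, and similar acknowledge-the-risk flags gate multiple active MDSes and directory fragmentation).

```python
# Sketch: opting in to an "almost awesome" feature (snapshots) on a Jewel cluster.
# The command form changed in later releases; treat this as illustrative only.
import subprocess

subprocess.run(['ceph', 'mds', 'set', 'allow_new_snaps', 'true',
                '--yes-i-really-mean-it'], check=True)
```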
But if you run it in the configuration that we tell you is stable, then we're very confident that you aren't going to lose data. It might or might not have the performance characteristics you're looking for, which is sort of where the concern is, right? Like, depending on what you want out of a file system,
performance is sort of part of the basic requirements, but very different people have very different needs about what exactly 'performant' means and in which ways. So we're saying it's stable: we're not going to lose your data, we're very confident we're not going to lose your data, and if some disaster does befall you, then we've built the recovery tools so that we can get your data back and let you move on to something else. That's where we are.
Is erasure coding supported under CephFS? The Ceph file system expects a replicated pool, so no erasure coding right now. There are RADOS features coming, in Kraken if we're very lucky, maybe in L, which will allow overwrites on erasure-coded pools, and then I think it should be fine. And we don't like to recommend cache tiering, but that does make it look like a replicated pool from CephFS's perspective; it doesn't care.
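For reference, the cache-tier arrangement described above looks roughly like this; a hedged sketch where the pool names and PG counts are invented, and real deployments need hit-set and sizing tuning beyond what's shown.

```python
# Sketch: put a replicated cache pool in front of an erasure-coded base pool so
# that clients (including CephFS) effectively talk to a replicated tier.
import subprocess

def ceph(*args):
    subprocess.run(['ceph'] + list(args), check=True)

ceph('osd', 'pool', 'create', 'cold-ec', '64', '64', 'erasure')   # EC base pool
ceph('osd', 'pool', 'create', 'hot-cache', '64')                  # replicated cache
ceph('osd', 'tier', 'add', 'cold-ec', 'hot-cache')
ceph('osd', 'tier', 'cache-mode', 'hot-cache', 'writeback')
ceph('osd', 'tier', 'set-overlay', 'cold-ec', 'hot-cache')
```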
The question is: if you have standbys or standby-replays in the map, and the active metadata server dies, do they take over automatically? And the answer is yes. The active metadata server maintains a heartbeat connection with the monitors (I think the default timeout is 30 seconds, but you can tune it); the monitors declare it dead, and they say, 'oh hey, we have someone else who can take over right now,' and they push out a new map that says this one is in charge.
So, we don't have really great performance numbers. The information we've had most recently says that, depending on what you're doing, you can expect on the order of five to ten thousand metadata server operations per second. Keep in mind that's not like an HDFS namenode, where a stat counts as an operation; that's operations that change the state of the system, because otherwise the clients have it cached. And an inode plus a dentry takes, depending on what you're doing, between about 2 and 4 kilobytes of memory.