Ceph creator Sage Weil speaking at the Storage Developer Conference in Santa Clara in Sep 2012.
My name is Sage, I'm from Inktank, and today I'm going to talk about the Ceph distributed storage system. Just a brief outline of what I'm going to talk about: a little bit about why you should care about another storage system, what Ceph is at a high level, how it works and what it does. I'll talk a bit about the distributed object storage layer, and specifically about some of the interesting features of the object API that talks to that distributed object store.

So to begin with, why should you care about yet another storage system? You've heard many of these talks during the course of the conference. I think there are a few reasons. One is simply a matter of requirements: people have very diverse storage needs. Some people need object storage because they're building the next web 2.0 application and they're going to dump a bazillion images into some big distributed server farm.

Other people need block devices for running virtual machines and so forth for their public or private cloud infrastructure, or maybe they want to replace their legacy SAN that's too expensive. Other people need a shared POSIX file system because they're running legacy applications, or because that's what their users demand, and so forth. And other people are doing big data types of things, where they have structured data and, frankly, they don't really know what they want.

Maybe it's file, maybe it's object; it's maybe not particularly clear. But common across all of these things is that people really need systems at scale. So when you're building out a large infrastructure for your enterprise, you need to be able to incrementally add nodes to go from terabytes to petabytes to exabytes.
Ideally, you want cost to be a linear function of the size of your cluster, and performance as close to linear as you can get. You don't want the sort of exponential curve that you get when you're buying more expensive options. You would like incremental expansion, so you don't have to deal with forklift upgrades, and ideally no vendor lock-in, so you have a choice in what kind of hardware you run on and what kind of software you run on. So what a lot of organizations are demanding is really an open source solution that they can run on whatever hardware they choose. That's sort of the ideal situation, of course, from our perspective, and that obviously begs the question of what exactly Ceph is. Well, it tries to address many of these concerns. First and foremost, Ceph is a unified storage system.
The idea here is that we can deal with multiple interfaces to storage: object storage, using our native APIs or RESTful APIs compatible with S3 or Swift; virtual disks or block devices, with features like thin provisioning, snapshots and cloning; and finally a POSIX distributed file system, so you can actually store files and directories and so forth in the cluster. And you do all of this with the same unified storage infrastructure, so it's sort of an API stack. At the bottom you have this component called RADOS; that's a reliable distributed object store, and that's the thing that scales to thousands or tens of thousands of storage nodes, makes sure that all your objects are replicated across multiple nodes, dynamically moves data around as cluster state changes, and handles the key reliability and scalability pieces. And then, on top of that, you can talk to that distributed object store in a number of ways.
You can use the native librados API directly if you just need raw object storage for your custom application or something. There's the RADOS Gateway component that sits on top of that API and gives you S3 and Swift compatible object storage using RESTful APIs. There's an RBD component that gives you a virtual disk: that's essentially a logical disk that's striped over objects that are then stored in this distributed object store, so it's shared, reliable, all that good stuff. And finally, there's a distributed file system that also leverages the reliable storage abstraction to build a higher-level service where you have POSIX semantics, with files and directories and so forth.

Ceph is open source; it's licensed under LGPLv2. It's a copyleft license, but you're free to link to proprietary code, so it's very easy to integrate into other projects. And, unlike some other projects, there's no copyright assignment, so the project copyright isn't held hostage by a single company that can relicense it and extort money and so forth; it tends to be very friendly in that sense. There's an active community of users and developers, and there's also commercial support.
You manage it as a single unit, and for the most part the cluster works away behind the scenes to deal with all the details of moving data around, all in software. The Ceph distributed object store is based on an object storage model, and the basic idea is that you have some number of pools of storage that are durable, logical collections of objects; each pool is effectively an infinite namespace that can collect many objects.

So the question is: why do we start with objects? Why build a distributed object store instead of a distributed file system first? The first reason is that objects are much more useful than starting with blocks. In contrast to, you know, just a drive, where you have blocks that are sequentially laid out, you can't really name them, and you have to deal with all the allocation details and so forth, objects are named.
So that's what the Ceph architecture does, and here's a slightly different picture of it. You start with a number of disks. On top of each disk you slap a local file system, typically Btrfs or something like that; you can also run on ext4 or XFS, although Btrfs is sort of where we're going in the future. And then on top of that file system you have an object storage daemon, the Ceph OSD, which manages that particular local set of data and then communicates with the other daemons in order to provide a higher-level abstraction. You typically have a whole bunch of these inside a node, and then you have a bazillion of these nodes to form your larger storage cluster.

You additionally have some number of monitor nodes that are responsible for essentially herding the cats. They deal with cluster membership and state; they use Paxos to make sure that we know who is participating in the cluster and what their role is at a particular point in time. But these guys aren't actually involved in any of the data path; they're only involved in cluster management and cluster state, in contrast to the object storage daemons. You need at least three of them.
One of the key problems in designing a system like this is deciding how your data should be distributed. Our requirements are pretty simple: we want all objects to be replicated some number of times; it's totally tunable, but usually two or three is what people choose. We also want those objects to be automatically placed and balanced in a dynamic cluster, because these systems are going to change over time as disks fail and new storage is deployed and so forth. And we also want to consider the physical infrastructure.

So we want to make sure that if we're replicating objects, we place replicas in different racks of the data center, so that a single power circuit failure won't affect the availability of my data, for example. There are sort of three basic approaches you can take for deciding where to store data. One is to pick a location and remember where you put it, so when you come back a week later you try to go back to the same place, and hopefully your data is still there.
The problem with that is that if something happens, say that host failed or a rack failed or so forth, that won't actually be the case, so it's not a very good strategy. A more typical approach is that you pick a location for the data and then you write down where you put it, in some sort of metadata server or index.

A very different approach is to use a hash function to determine where your data should be stored. Essentially you calculate a location based on the current state of the cluster, and that tells you where to store it; and then, when you read it, you perform the same calculation and it tells you where to go to find it. To do this, Ceph uses a function called CRUSH. It's a pseudo-random placement algorithm that is a fast calculation of where to store data; there's actually no lookup involved.
So we don't have to bother with maintaining an index; we just calculate the result whenever we need to find it. It's a repeatable, deterministic calculation, of course, but the key property is that it maintains a stable mapping, so that if you have 100 servers and you add one, typically 1% of the data is going to move to that new server. You don't get the sort of random reshuffling that you would have with a naive hashing algorithm. The other key thing about CRUSH is that it's very flexible, in that you can specify rules that determine how your replicas are placed in the cluster. So, for example, you can specify that I want three replicas and I want them all to be in the same row of the data center, so that my replication traffic doesn't traverse my spine network core routers or something like that.
In a bit more detail, what this looks like is that you have essentially a pool with a bunch of objects, and you hash the names of the objects, essentially modulo the number of what we call placement groups. We're sharding this logical pool into a bunch of different pieces, and this gives us some number of placement groups, which are rainbow colored here if the animation works correctly. Then, for each of these placement groups, we feed it into the CRUSH algorithm, and that calculates where in the cluster those guys are going to be stored.
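To make that two-step mapping concrete, here is a minimal sketch: hash the object name to a placement group, then map the placement group to a set of OSDs. This is not the actual Ceph code; the hash choice, the PG_NUM value and the crush_map_pg placeholder are illustrative assumptions, and real CRUSH is a hierarchical pseudo-random descent over the weighted cluster map rather than a simple lookup.

```python
import hashlib

PG_NUM = 64  # assumed number of placement groups in the pool

def object_to_pg(object_name):
    """Step 1: hash the object name into one of the pool's placement groups."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    return h % PG_NUM

def crush_map_pg(pg_id, osd_ids, replicas=2):
    """Step 2 (placeholder): a stand-in for CRUSH that deterministically picks
    `replicas` distinct OSDs for the placement group.  Real CRUSH walks a
    weighted hierarchy (rows, racks, hosts) described by the cluster map."""
    chosen = []
    i = pg_id
    while len(chosen) < replicas:
        candidate = osd_ids[i % len(osd_ids)]
        if candidate not in chosen:
            chosen.append(candidate)
        i += 1
    return chosen

# Any client holding the same map performs the same calculation; no lookup table.
pg = object_to_pg("myimage.0000000000000001")
print(pg, crush_map_pg(pg, osd_ids=[0, 1, 2, 3, 4, 5]))
```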
So you get this declustered approach, where your placement groups are scattered pseudo-randomly across the cluster. In this particular example, our rule specifies that we will always choose one node from the top row and one node from the bottom row; say they're different hosts or different racks or something like that. The way this works is that the distributed object store, RADOS, periodically publishes what's called an OSD map. That's essentially a snapshot of the current state of the cluster: which OSDs are participating, the CRUSH map that specifies how data is mapped onto those nodes, all the current IP addresses, and so forth. Using that particular map, we can calculate where any particular object or piece of data should be stored in that storage cluster. The object storage daemons are then responsible for safely replicating the data in the storage cluster, and a new map is published over time whenever, say, a node comes up or goes down.
Those nodes are responsible for using peer-to-peer protocols to migrate data to the new location specified by that map, and they use gossip protocols to efficiently share these map updates, so that they stay in sync about what the current distribution of data should be in the system. This is a very decentralized, distributed approach that allows massive scale, because you don't have any central coordination. Aside from the fact that you're publishing maps that say which nodes are up and down, nobody has to say, "you, take this piece of data and move it over there." Instead, the leaf nodes, the OSDs, can do that on their own, because they all have a shared view of reality based on these OSD maps. A client then also gets a copy of this map. Say it needs to store a particular object: it can do the CRUSH calculation and it'll know that, you know, it's in the green placement group stored on these two nodes. It has complete knowledge of where all data is stored.
That's all by virtue of this mapping algorithm. What happens, then, if you have a node that fails? Say we lose the node holding the yellow and the orange placement groups. In that case, the other OSDs that are replicating those placement groups realize that their peer went down, essentially because they got a map update. They can see that the replicas are no longer there, they identify who the new home is for those data objects, and they actually migrate them. This is a fully peer-to-peer process; nobody has to do it for them, so again it scales very well. And then a client, if it needs to read that object, will get a new copy of the map, and it can go to the new location and find the data where it should be.

So librados is the low-level API that talks directly to this distributed object store.
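As a rough illustration of what consuming that API looks like, here is a small sketch using the Python librados bindings; the pool name, object name and ceph.conf path are assumptions made up for the example.

```python
import rados

# Connect to the cluster described by the local ceph.conf (path is an assumption).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on a pool, the logical container of named objects.
ioctx = cluster.open_ioctx('data')

# Write and read back a named object; placement is computed with CRUSH,
# so no central lookup is involved on either operation.
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()
```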
librados also provides a number of other interesting features that make this particular object API very interesting to consume. One of those is atomic transactions: a client request can actually contain multiple operations in a single request, and those will be sent to the object storage node and applied atomically. Either they all succeed and commit atomically, or none of them commit and it fails, which is kind of nice. So this gives you atomicity, which is very helpful.
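A sketch of what such a compound request might look like through the modern Python bindings, where a write op batches several mutations that the OSD applies as one transaction. The WriteOpCtx and operate_write_op names come from current python-rados, not from the talk, and the object and key names are invented; treat the exact method set as an assumption.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# One request, several mutations: the OSD applies the data write and the
# key/value update as a single transaction, so both commit or neither does.
with rados.WriteOpCtx() as op:
    op.write_full(b'new contents')
    ioctx.set_omap(op, ('state',), (b'clean',))
    ioctx.operate_write_op(op, 'my-object')

ioctx.close()
cluster.shutdown()
```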
You can also do conditional requests. You can send a request, for example, that says: make sure this xattr is equal to 1, and if so, apply this operation, but if it's not, don't do anything. So you can do atomic compare-and-swap type operations, and that's all mediated by the object storage cluster.

Each object also carries key/value data. It's based on Google's leveldb implementation, which is pretty nice; it's the BigTable SSTable design, which gives you efficient range queries, insertion, that type of thing. So you can insert, update and remove keys, and the key thing this allows you to do is efficient read-modify-write type workloads, where you can say, for example, that I just want to remove certain keys, and that operation will happen efficiently on the OSD without having to read the entire object over the wire, make some small change and write it out again.
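A hedged sketch of that key/value interface through current python-rados (these omap calls exist in today's bindings, though not necessarily in this exact shape in 2012; the object and key names are invented):

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# Insert a few key/value pairs into the object's omap.
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('alice', 'bob'), (b'1', b'2'))
    ioctx.operate_write_op(op, 'index-object')

# Range-read the keys back without fetching any object data.
with rados.ReadOpCtx() as op:
    it, ret = ioctx.get_omap_vals(op, "", "", 10)  # start_after, prefix, max
    ioctx.operate_read_op(op, 'index-object')
    for key, value in it:
        print(key, value)

# Remove one key; the mutation runs on the OSD, with no read-modify-write round trip.
with rados.WriteOpCtx() as op:
    ioctx.remove_omap_keys(op, ('alice',))
    ioctx.operate_write_op(op, 'index-object')

ioctx.close()
cluster.shutdown()
```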
One of the other interesting things we can do is what we call watch-notify. Essentially, you can establish an interest, or watch, on a particular object in the object store. Multiple clients can observe a particular object, and then they can send notify messages to each other. So you can use an object as a meeting point and a message communication channel for clients to coordinate, which allows you to do similar things to what you might do with something like Apache ZooKeeper, but using the distributed object store as the basis for that type of coordination.

Here's an example of what this actually looks like. A number of different clients each send a watch on the object in question; they get a commit back that says they've registered that watch interest in the object. Then later, if somebody sends a notify request, that notify gets distributed to all the different clients who have watched it, and when they all acknowledge, the notifier finally gets a notify acknowledgement that says: yes, I've notified everybody who's watching. And so you can use this for a number of different things.
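As a rough sketch, the current Python bindings expose this as Ioctx.watch() and Ioctx.notify(); the callback signature and object names below follow recent python-rados and should be treated as assumptions rather than something shown in the talk.

```python
import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')
ioctx.write_full('rendezvous', b'')  # the object used as a meeting point

def on_notify(notify_id, notifier_id, watch_id, data):
    # Called whenever some other client notifies on the watched object.
    print('got notify:', data)

# Register interest in the object...
watch = ioctx.watch('rendezvous', on_notify)

# ...and (typically from another client) broadcast a message to all watchers.
ioctx.notify('rendezvous', 'cache invalidation, please reload')

time.sleep(1)   # give the callback a moment to fire in this single-process demo
watch.close()
ioctx.close()
cluster.shutdown()
```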
One example of what we use it for is the RADOS Gateway component that implements the S3 gateway: it uses watch-notify to manage its own cache consistency. It sends invalidate messages, essentially, to invalidate entries out of the other RADOS Gateway instances' caches, because you typically deploy, like, a hundred different RADOS Gateways behind a load balancer or something, and they're all caching things, so they use this to keep their caches coherent, which is pretty useful.

One of the more exciting things you can do, though, is actually implement what we call RADOS classes. You can dynamically load a shared object into the OSD that implements new functionality for objects, built on top of the existing functionality. So, for example, you can implement new read methods on these objects, if you will, that will run arbitrary code, essentially, and then for a read request they'll do some transformation and give you a response; or for a write they can do some higher-level mutation on the object and then atomically commit that to disk.
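The class itself is written as a plugin loaded into the OSD (the upstream examples are C++), but from the client side invoking one of its methods looks roughly like the sketch below, assuming the execute() call in current python-rados; the class name 'hello' and method 'say_hello' refer to the sample class shipped with Ceph and are assumptions as far as this talk is concerned.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')
ioctx.write_full('greeting', b'')

# Ask the OSD hosting the object to run a method from a loaded object class.
# The transformation happens server-side, next to the data.
ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'world')
print(ret, out)

ioctx.close()
cluster.shutdown()
```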
Moving on, there's also the radosgw component that's built on top of librados, which gives S3 and Swift compatible API access for people who want a drop-in replacement in their own infrastructure for applications that are targeted at S3 or Swift interfaces. It's built on top of librados and is relatively straightforward.

There's also the RADOS Block Device, which gives you a virtual disk. Essentially, we just take a single disk image and stripe it across lots of objects and then distribute those across the cluster. That's pretty well integrated and works pretty well: you can imagine essentially taking all these OSDs, distributing little blocks across them, aggregating those into a virtual disk, and then attaching that to a computer, or more typically to multiple VMs. There are a number of nice things you can do: you can take snapshots of these virtual disks, and it's linked directly into the QEMU/KVM virtual machine framework, so you can have a virtual machine that's backed by the Ceph cluster without any kernel support.

So again, you're taking lots of objects, linking them through the librbd library, integrating with some virtualization container, and presenting a virtual disk that's consumed by a virtual machine. And of course, because we're dealing with shared storage here that's reliable and so forth, you can also do nice things like live migration of virtual machines on top of that.
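For a feel of the block layer, here is a small sketch using the Python rbd bindings on top of librados; the pool name, image name and size are made-up values, and these are the modern bindings rather than anything specific to 2012.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 1 GiB image; behind the scenes it is striped over many RADOS objects.
rbd.RBD().create(ioctx, 'vm-disk', 1024 ** 3)

# Read and write it like a disk; a snapshot is just another metadata operation.
with rbd.Image(ioctx, 'vm-disk') as image:
    image.write(b'boot sector bytes', 0)
    print(image.read(0, 17))
    image.create_snap('before-upgrade')

ioctx.close()
cluster.shutdown()
```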
But probably the more interesting piece here, as far as complexity goes, is the Ceph distributed file system. It's probably about as many lines of code as everything else combined, because file systems are complicated, even when you start on an object storage foundation. The key idea here is that clients mounting the file system talk to metadata servers to deal with everything related to the file system namespace, so resolving paths, traversing the hierarchy and so forth. But when they actually need to read and write file data, they talk directly to the OSD nodes to read and write the objects that store that file's data. This allows the data path to be highly parallel, distributed, scalable and so forth, and then we just have to deal with having a file system hierarchy that's spread across multiple metadata servers. That's really where the complexity comes in.

So we have these new metadata server components, the Ceph MDSs. They're responsible for managing the POSIX file system hierarchy and dealing with all the file metadata: owner, mode, uid, gid, all that good stuff. They store all of their metadata in RADOS, so again we're leveraging the fact that we already have this magical, reliable, distributed, scalable storage abstraction. These daemons are caching lots of stuff in memory and then writing everything out to RADOS on the back end, and, of course, they're only necessary if you're using the Ceph distributed file system.
If you're only talking to the object store, the metadata servers don't even get involved, because those objects are trivially parallel and all that good stuff. But building the distributed metadata server was an interesting design problem. Part of it was due to the fact that the legacy metadata storage approaches are sort of a disaster. You typically have a file name that maps to an inode, which is stored in some other table, and that inode has a block list, which is all the blocks on disk, so you have to go look those up before you find the data. That's multiple levels of indirection before you actually read a file, which is sort of annoying. And the inodes are stored in a different table; often the locality isn't very good, and the inode table gets fragmented, so even though you're looking at sequential file names, the inodes are scattered in random places on disk. Things get fragmented, it's lots of seeks, and it's difficult to partition.
The first observation is that block lists aren't necessary. We're storing our file data in objects; objects are variable-sized and we can name them, and so we can just name the objects that store the file data using the inode number, maybe with a block number. So we don't have any block metadata at all, and Ceph inodes are very small and relatively packed, because they're basically fixed size.

The other observation is that inode tables are usually useless, because most of the time you only have a single file name linking to a single inode, and so in those cases we can embed the inode directly in the directory entry that refers to that file. That means that in Ceph we just store an object that holds the contents of the directory: it has all the file names and, most of the time, all the inodes that those file names refer to.
So we can do a single I/O to the OSD, read a single object, and get all the file names for a directory and all the inodes, so we can do things like ls -al very quickly. We leverage the key/value objects on the back end to make this all efficient and easy to manage. The real challenge is that we have this one big tree, one big file hierarchy, and we have multiple metadata servers, so how do you make that work?

What Ceph does is dynamically carve up the tree hierarchy. We take big chunks, subtrees of the overall directory hierarchy, and assign them to different metadata servers based on the current workload, based on how busy those subtrees appear to be. And we can do this more or less arbitrarily, so there's all the logical complexity to migrate subtree management between metadata server nodes and arbitrarily partition the hierarchy across metadata servers.
A
We
call
dynamic,
subtree
partitioning,
and
the
nice
thing
about
this
approach
is
that
it's
scalable
we
can
take
a
hierarchy,
sort
of
arbitrarily
carve
it
up
into
little
pieces.
So
that's
that's
nice,
because
we
want
to
be
able
to
have
hundreds
of
metadata
servers.
The
other
nice
thing
is
that
it's
adaptive,
so
the
monitor
the
metadata
servers
are
sort
of
monitoring
how
busy
the
file
hierarchy
it
is
that
they
have
cached
is
at
any
point
time
and
if
they
decide
that
they're
overloaded,
they
can
sort
of
take
about.
A
Twenty
percent
is
about
this
big
sub
tree
over
here
and
they
can
shunt
it
off
to
another
metadata
server,
and
this
is
based
on
the
current
workload.
So
if
your
work
load
shifts
later
and
suddenly
or
have
a
total
different
file
set
that
you're
working
with
the
metadata
server
will
adapt
by
splitting
that
file
set
into
smaller
pieces
and
distributing
across
the
cluster,
so
that
you're
sort
of
always
utilizing
all
available
metadata
server
resources.
It's
efficient,
we
as
a
sub
tree
based
partition,
so
that
we
preserve
locality
within
the
workload.
A
One of the challenges, though, is dealing with metadata I/O. Metadata tends to be very small and it's updated very frequently, and you want to avoid a situation where you have lots of small writes to the object store, because no matter how well you optimize it, that tends to be a nightmare. So the way we approach this is to view the Ceph metadata server as sort of a big cache, with a journal in front.

The idea is that the journal is essentially a large sequential file, or log, whatever you want to call it, that we stripe over objects in the object store. Whenever there's an update to the system, we write it out to the journal, and at that point it's durable and committed and we can move on. So we end up with two tiers: all the recent updates get consumed by writing things out to the journal, and then later, when the journal gets big and we start trimming things off the end of it, we take all those updates and push them out to the long-term storage, which is the per-directory objects that store the file system hierarchy on the back end. The nice thing, of course, with the journal is that you have very fast failure recovery.
If a metadata server crashes, you can just read the journal back in. But the more important thing is that, as the journal grows over time, you notice an interesting effect. The things you write at the beginning of the journal start out dirty; those updates exist only in the journal. But metadata tends to be updated multiple times repeatedly: you might change the same directory many times as you do a compilation or so forth, and so, as the entries in the journal get older and older, they tend to become stale and don't actually contain any useful information anymore. By the time we get to the very end of the journal, which is maybe an hour old, most of the metadata we wrote out there isn't actually even dirty anymore; it's since been updated more recently in the journal. And when we do have a directory that needs to be updated, we can take all the updates that have happened to that directory over the last hour, build them up into one single large transaction, and generate a single I/O that goes out to the object store and updates that directory. So even though our overall write pattern is random, because we're updating these directories, we tend to consolidate the writes to each directory over a long period of time and ship them out efficiently, and the overall aggregate I/O pattern generated by these metadata servers tends to be very good.
One of the big questions is what actually gets put in the journal, and it's a trade-off. There's lots of state in any complicated system like this, and what you actually do with that state is a trade-off. On the one hand, if you journal that state, it's expensive up front, because you actually have to write it to the journal, but it's very cheap to recover: when you restart the metadata server and recover from a crash, you read it in sequentially and you get it all back. So that's expensive up front but cheap to recover. On the other hand, if you don't journal state, you tend to need complicated protocols during recovery to reconstruct that state. Some examples of things that you would journal would be the fact that client sessions are open, which particular clients are accessing the file system, and of course actual modifications to the metadata in the file system; those things have to go in the journal, because they're important and you want to recover them later.

On the other hand, things like cache provenance, you know, the fact that I have a particular piece of metadata in one metadata server's memory but it's replicated in other metadata servers' caches or in client caches: that type of information is very expensive to journal, because there's a lot of it and it's happening all the time, and we don't want to generate all that I/O. That means there's a trade-off when we do recovery: when the metadata server restarts, clients have to reconnect and resynchronize to re-establish the shared state in order to move on. But one of the key things that we do do is flush these updates early: whenever there are client modifications that the client is sending to the metadata server, the metadata server is queuing them up and getting ready to send them out to the journal.
The client protocol, where the Ceph clients talk to the metadata servers, is, generally speaking, highly stateful. We aim for strict POSIX consistency: we would like processes interacting with the file system to behave the same whether they're on the same host or on different hosts, just maybe a little bit slower when they're on different hosts. That's the level of consistency we enforce, in contrast to protocols like NFS, which are notoriously weak in this area. The clients get a seamless handoff between metadata server daemons, because they're using our own protocol rather than a legacy protocol like NFS; they understand the fact that they're talking to lots of different metadata servers and they can behave intelligently as a result. So when a client is traversing the hierarchy, it seamlessly moves over to the different metadata servers that are managing that part of the file tree, and so forth. And when the metadata servers are doing their load balancing and moving things around, they tell the clients about it, so the clients can shift their cache state and so forth to make that work well. Of course, when they're actually reading and writing file data, they talk directly to the OSDs.
So here's an illustrative example of what this interaction looks like. You have a client here, and he's happy; he's going to mount the file system, so he does mount -t ceph with the IP address of one of the monitors; that's how you identify the particular cluster. Initially there are going to be a few round trips to the monitor as he authenticates and gets a ticket that says he's allowed to talk to these metadata server daemons; he's also going to learn who the metadata servers are and what their IP addresses are, and what the OSDs are and what their IPs are. Then there will also be a couple of round trips to the metadata server as he opens up the root directory: he opens up a session and gets a handle, essentially, on the root directory, so he can mount the file system.

The metadata server is going to journal something to the OSDs, because it wants to record the fact that it now has a persistent session open with this particular client. Then, say, the client traverses into a directory: there are going to be a couple of round trips, a pair of round trips to the metadata server, as he looks up /foo and then /bar inside that directory. If the metadata server has a cold cache,
it will load those directories off of disk, so there will be some corresponding I/O requests to the object store to populate the metadata server cache, but that's generally pretty quick. Then, if the client does an ls -al because you want to see what's inside this directory, there'll be an open operation that actually involves no interaction, because he already has a handle and a lease on that directory inode, so there's no MDS interaction necessary there. And then, when he does the readdir to fetch all the directory entries, there's going to be a single round trip to the metadata server to fetch all the directory names; again, if there's a cold cache, it'll load all that stuff off of disk in a single I/O to load that directory in. But the reply is actually going to contain not only the directory names, with leases that say they're valid until otherwise invalidated, but also all the inodes that those names refer to, which we get for free because they're embedded in the directory. So when the client then does a stat on every single file, there's no additional metadata server traffic necessary: he already has it all in his cache, it's all right there in the VFS, and he just plows right through it. And when he closes, that's essentially a no-op as well.
Finally, if the client is going to copy all the data in that directory to somewhere else, he now has all of the inodes for all those files, and leases on those inodes saying that they're not going to be changed, and so he can go directly to the OSDs that store the file data and copy those objects to a local file without any further interaction with the metadata server. So again, this means that the metadata server workload is very low and efficient: the client has these highly stateful leases and all the prefetching and caching and so forth, and when he actually does start to do file I/O, we can spread that across the entire cluster and do it all in parallel, and it's going to be fast and wonderful.

One of the other interesting things the metadata server does is what we call recursive accounting. Because we're essentially implementing a file tree from the ground up, we can do all sorts of interesting things.
For each directory, we keep a summation of all the file sizes nested beneath that point in the hierarchy, stored in that directory's inode. So, for example, when you do an ls -al, the file size you see for a directory, instead of being a sort of meaningless number that's a multiple of 4K or something like it is on ext3, is actually the sum of all the file sizes nested beneath that point. It's essentially what you would get from a du, but it's free; it's accumulated over time, efficiently, by the MDS. We also maintain file and directory counts and the most recent modification time. For example, if you dump the extended attributes on any of these directories, you can see all these different statistics, which is interesting. And the key thing is that it's efficient: whenever there are changes, this information is lazily propagated up the hierarchy by the metadata servers and stored. So it's not one hundred percent accurate at any point in time, but it's way cheaper than doing a du to try to figure out why your disk is filling up, which user is writing data, and so forth. So it's pretty great for system administrators.
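For example, those recursive statistics are exposed as virtual extended attributes on each directory, so reading them from a mounted Ceph file system can look like the sketch below; the mount point path is an assumption.

```python
import os

mount_dir = '/mnt/cephfs/projects'  # assumed CephFS mount point

# Recursive statistics maintained by the MDS, exposed as virtual xattrs:
for name in ('ceph.dir.rbytes',    # total bytes nested beneath this directory
             'ceph.dir.rfiles',    # number of files beneath it
             'ceph.dir.rsubdirs',  # number of subdirectories beneath it
             'ceph.dir.rctime'):   # most recent change time in the subtree
    print(name, os.getxattr(mount_dir, name).decode())
```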
One of the other interesting things we do is snapshots, on a per-directory granularity. One of the problems is that when you have a petabyte-scale file system, you don't necessarily have a single data retention or snapshot policy that makes sense for all the different types of data you're going to store in it, and so instead we empower users to create snapshots on any subdirectory, and that applies recursively to everything nested beneath that point. And we do this with a very simple interface, without any special tools.

There's a hidden .snap directory, and if you want to create a snapshot, you just do a mkdir inside this hidden directory with some name you choose, and that, poof, essentially creates the snapshot. It has sort of the usual semantics: you'll notice, if you look inside a subdirectory, that its .snap directory shows it's also part of that snapshot, although the name is mangled to avoid collisions. And the semantics are what you'd expect: you delete a file, it disappears, but if you look inside the hidden snap directory, it's still there. Then, when you're done with a snapshot and you want to delete it, you can just do an rmdir on that magic directory, and it goes away and is efficiently cleaned up on the back end.
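In other words, snapshot management is just plain directory operations against the hidden .snap directory; a minimal sketch, assuming a CephFS mount at a made-up path:

```python
import os

project = '/mnt/cephfs/projects/alpha'   # assumed directory on a CephFS mount

# Create a snapshot of this subtree: just mkdir inside the hidden .snap dir.
os.mkdir(os.path.join(project, '.snap', 'before-cleanup'))

# Files deleted afterwards remain visible under the snapshot...
print(os.listdir(os.path.join(project, '.snap', 'before-cleanup')))

# ...and removing the snapshot is just an rmdir; cleanup happens on the back end.
os.rmdir(os.path.join(project, '.snap', 'before-cleanup'))
```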
The file system client is implemented in a number of different ways. There's a native client in the Linux kernel that's been upstream for two or three years now, which you mount with mount -t ceph; you can re-export that as NFS, if you want to do it sort of the unusual way. There's also a FUSE version of the client, implemented in user space, that uses the generic FUSE API to mount. And there's a shared library that you can link directly into an application, if you want to build something on top of the Ceph file system but don't actually want to mount it as a native file system in your kernel. There are a number of things we've done with that.

One example is that there are patches to glue libcephfs into the Samba VFS, so you can directly re-export Ceph as CIFS without actually mounting it as a kernel file system. Another example is the Ganesha user-space NFS server; those patches actually support pNFS on top of Ceph, which is sort of an interesting thing. And another is Hadoop, so you can use Ceph in place of HDFS and run all your MapReduce stuff on top of Ceph and still get all the data locality features, where you run the computation on the same node that the data is stored on, and so forth. So it's very easy to consume.
Here's a picture of the current status of the project. RADOS, librados, the RADOS Gateway and RBD are very stable; people are using them in production, and they're generally pretty awesome. The file system is a bit more complicated; it's nearly awesome. It needs a bit more deliberate QA effort, and the story there is that I bit off a lot initially. I was working on this project for a long time, sort of on my own, and implemented all kinds of features in there, with the snapshots and the recursive accounting and the scalability and so forth, and it's been being used for a long time. But it hasn't had the sort of deliberate QA effort that you need, with a real QA team and all sorts of automatic regression testing and failure testing and so forth.
A little bit (I don't know why my animations are all screwed up) about why we work on Ceph. There are limited options for open source, scalable storage: there's Lustre in the HPC space, there's Gluster, and a few other things, but there aren't that many options that really scale big, and that's an emerging requirement for things like public cloud and private cloud infrastructures in particular, and also for big data and so forth. The proprietary solutions that people use instead tend to be very expensive. People want to run it on commodity hardware, so they can choose, you know, the cheapest SATA drives they want, or really expensive Fusion-io drives, and then run a distributed, scale-out system on top of that and grow out from there.

Ceph was originally created at UC Santa Cruz; it grew out of some Department of Energy grants for petascale storage. After I finished my dissertation work, it was developed at DreamHost for several years, sort of as a skunkworks pet project of my own. More recently, we spun out a company called Inktank that's dedicated to supporting Ceph properly as an open source project, so that companies wanting to deploy this as a storage system can actually buy level 2 and level 3 support, consulting, performance tuning, that sort of thing, so they can actually run it in their environment. And there's a growing community: the Linux distros are picking it up, there are lots of users, it's integrated with OpenStack and CloudStack, system integrators are looking at it, and OEMs are looking at it as a basis for their future scale-out storage products.
It's a little complicated, but it works; it ends up meaning that the cost to resolve a remote link in that sense is a little bit more expensive. It's roughly logarithmically expensive to find the inode, versus a typical inode table, which is order one, a fixed cost. So it's not that bad, but it's not quite as good for workloads where you have bazillions of hard links.

How do you deal with running out of space, is the question. The easy answer is: you don't. Generally, as the cluster starts to fill up, you just deploy more storage nodes and things rebalance out of the way. Part of the problem is that we're using a hash-based distribution, so when you're writing a piece of data, you don't get to choose where the data is stored; the hash function does that for you. So the key is to make sure that the variance in the utilization of the different nodes is relatively tight, and there are a number of features to actually make that happen. But essentially, once you start having devices that are approaching full, an OSD map is published that basically says: everybody slow down, switch to synchronous writes; and eventually, when it reaches a certain point, it says: everybody stop writing, because we're full. So, yes.
Yes, so the question is: how do you deal with the semantics of cross-directory renames? There are sort of two cases there. One is when the target directory is on the same metadata server; that's easy, you just journal it and update the trees and so forth. The harder case is when you're renaming across metadata servers, which sounds hard. In reality, the fact that we have this ability to dynamically move subtrees between metadata servers, and that we're already describing the distribution in terms of these trees that are mapped to servers, means we can leverage some of that. When we rename a directory somewhere else, we're actually only moving the inode, and in the new location it appears as if that subtree has been remapped back to the server where it already was. So we're updating the subtree map, moving that one inode and updating the hierarchy, but other than that there's no expensive bulk migration that has to happen to make it work. It is complicated: there are, you know, several messages that go back and forth, and there's a two-phase commit going on in the journal and so forth, but it works; it's just slower. That's one of the reasons why we try to maintain a coarse subtree partition, so that most renames aren't across metadata servers; they tend to be localized in the same part of the hierarchy. Yeah.
Yeah, the question is how you deal with having heterogeneous storage in the same cluster. Yes, so there are a couple of different ways to deal with that. One is that the CRUSH hashing algorithm essentially lets you weight each device, and that determines proportionally how much data each one gets. So if you have drives that are twice as big as other drives, you just set the weight twice as high, and they get twice as much data, and twice as much I/O. So that's the first answer, but that doesn't really deal with different performance characteristics.

That's a bit more tricky. The answer there is that the RADOS object model lets you create different pools of storage, so you might create one pool of storage that's backed by, you know, slow SATA disks and put one type of data there, and you might create another object pool that's backed by flash or something and put other data there. And then, in the file system, you can say this particular directory is mapped to this pool of storage.
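In the current CephFS this directory-to-pool mapping is exposed through layout virtual xattrs; a hedged sketch (the pool and path names are invented, and the xattr mechanism is how it works today rather than necessarily how it was configured in 2012):

```python
import os

# Map everything created under this directory to a specific RADOS pool.
scratch = '/mnt/cephfs/scratch'          # assumed CephFS directory
os.setxattr(scratch, 'ceph.dir.layout.pool', b'sata-pool')

# New files inherit the layout; their objects will be placed by CRUSH into that pool.
print(os.getxattr(scratch, 'ceph.dir.layout.pool').decode())
```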
So, you know, /tmp has one replica and it's over here on this crap hardware, and then /home has four replicas and it's over here on the super-fast storage, or whatever it is. And the final thing you can do is, in the CRUSH rules that specify how your data is distributed, you can say something like: I want three replicas of all my data; I want the first replica to be on this fast tier of storage that's servicing both the reads and the writes, maybe it's, you know, SAS or flash or whatever; and I want the additional replicas to be on this slow tier that's all SATA and is only getting writes, unless one of the front-end nodes fails, in which case we fall back to it, but that's the exceptional case. So there are several different games you can play with it.
There are several different ways to build the CRUSH hierarchy. You can have an administrator that sort of decrees: this is the CRUSH hierarchy that I want; I've meticulously figured out how it should be constructed, mapped it all out, and said this is the map to use. The other way is that you can tell each node, in the ceph.conf file, which rack and which row and which host it is, and then the startup script, when it starts up, will say: okay, I'm starting OSD, you know, 712; update the location for this OSD in the CRUSH map to be in this part of the hierarchy, in this row, rack, host, whatever. So it'll be placed at the right point, and then, when it boots up and starts getting allocated data, it'll be serving from the right location. That's sort of the direction we're moving with that.

There's a lot of work going on right now on improving integration with tools like Chef and Juju and Puppet and all those sort of dev-ops-y deployment tools, to make it extremely painless to deploy this on thousands of servers. And that's one of the things: as long as you can tell each host sort of what its row and location is and so forth, then, when you deploy OSDs, they'll dynamically allocate OSD IDs, put themselves in the hierarchy appropriately, and start up automatically and so forth.
The question is about geographic replication. Yes, there are sort of two different projects there that are both on the roadmap, but they're a bit of a ways out. One of them is sort of disaster-recovery type replication, where you just want to have an asynchronous mirror that's streaming off to another location, so that if the primary cluster fails, you can have, you know, a less-than-five-minute-old copy somewhere else that's in a consistent state. And that's sort of the easier of the two.