From YouTube: 2019-JUN-27 :: Ceph Tech Talk - Intro to Ceph
Why is it built the way it is? How does it work, and what are the core concepts? What makes it different from other systems? We'll focus mostly on RADOS, the underlying storage layer, but we'll also talk about the object, block, and file components that are built on top of it. Then we'll shift gears a little bit and talk about how everything is managed and some of the user-facing features that make it easy to consume.

And finally, we'll talk a bit about the open source community and ecosystem. So what is Ceph? It's been described as software-defined storage, as a unified storage system, and as scalable distributed storage. We've branded it as "the future of storage" (it's on a lot of our t-shirts), and people have also described Ceph as "the Linux of storage". But all of these phrases mean slightly different things to different people, so let's try to get to the crux of it.

I think the first thing to recognize is that Ceph is open source software, with an emphasis on software, that can run on any commodity hardware: commodity servers from any vendor, typical standard IP-based networks, and all the usual standard types of storage devices (hard disks, SSDs, NVMe devices, and so on). And finally, it's important to recognize that Ceph is a unified system, in that you can serve object, block, and file workloads from the same cluster, from the same hardware, using the same software stack.

So Ceph is free and open source software. That means you have the freedom to download and use it. You also have access to the source code: it's open source, so you can introspect it and look at how the system works, you can modify it, and you can share your changes, as long as you conform to the open source software license. This gives you freedom from vendor lock-in.

You can choose from many different companies and organizations that are building products and services based on Ceph, and if you don't like them, you can switch to somebody else without having to, you know, throw out your software stack. And by virtue of the community, you also have the freedom to innovate in the space by integrating Ceph with other software systems and adapting it to your particular use cases and workloads.

Ceph is designed and built to be reliable. Our goal is to create a reliable storage service out of inherently unreliable components. The architecture is designed with no single points of failure. It provides data durability via either replication or erasure coding of your data, and it's designed to be continuously available, so that you have no interruption of service from rolling upgrades, expansion or contraction of your cluster, failures, and so on. And it's also reliable in the sense that we, as a rule, favor consistency and correctness over performance.

Finally, Ceph is scalable. We describe it as an elastic storage infrastructure. That means your storage cluster may grow or shrink over time, as the size of your data sets, your workloads, or your organization's overall requirements change. You can add and remove hardware from the system while it's online, both to deal with failures and hardware refresh, and also to expand capacity, deploy new performance classes, or whatever it is. And we can scale in a number of different ways.

You can scale up by simply using faster, bigger servers and storage devices. You can scale out by adding additional nodes, or racks of nodes, and more storage devices to get more capacity and performance in the system. And you can also federate multiple clusters across multiple sites, using a set of asynchronous replication features, for disaster-recovery type use cases and to provide availability in the event that an entire data center site goes away.

RGW, the RADOS Gateway, provides an S3-compatible object storage API with buckets, using a RESTful get/put type interface. RBD, the RADOS Block Device, provides a virtual block device interface; this is used very widely in public and private cloud deployments and platforms, for virtual disks usually backing virtual machines. And CephFS is a distributed network POSIX file system that allows lots of clients to have shared access to a single file system namespace, with your usual POSIX-like semantics.
So in this talk, I'm going to do sort of a deep dive into how this architecture is put together and how it works, starting with RADOS, the underlying layer, and then moving on to the RADOS Gateway, the RADOS Block Device, and the Ceph file system. But let's start with RADOS. RADOS stands for Reliable Autonomic Distributed Object Store, and this is the common storage layer that underpins all the other services in Ceph. It provides a low-level data object storage service that's reliable and highly available.

It's scalable, both when your cluster is initially deployed on day one (it can be arbitrarily large) and also one, two, three years down the line, when you're refreshing hardware, expanding, deploying more storage, and so on; it's scalable after the fact as well. RADOS's job is to manage all the replication and erasure coding of the data in the system, to figure out where that data should be stored (on what nodes and what storage devices), rebalancing, scrubbing for integrity checks, and repair. All of that is handled by this underlying RADOS storage layer. It's designed to provide a strong level of consistency, so for those familiar with the CAP theorem, RADOS is a CP system, not an AP system. And its purpose within the larger Ceph architecture is to simplify the design and implementation of the higher layers, so that the file, block, and object components can focus on the complexities and intricacies of providing their particular type of API, while RADOS handles the safety and availability of the data.

Ceph and RADOS, as a software system, are comprised of a number of different storage daemons. The first one is the monitor, ceph-mon. These monitors are a central authority for authentication, data placement, and policy in the system; they're sort of the central coordination point that manages all the other daemons that participate in the system.

They protect critical cluster state with an algorithm called Paxos, and there are typically somewhere between three and seven of them per cluster, usually spread across different hosts or different racks, so that you have reliability and availability. There's also a Ceph manager daemon (ceph-mgr) that has two roles. The first is to aggregate real-time metrics about all the daemons participating in the system: things like the current level of throughput, the current disk utilization, and the various internal metrics that all the other Ceph components are reporting. It aggregates all of that, so you have a real-time view of what's happening in Ceph. The second job is to provide a host for pluggable management functions, things like the dashboard, user management, or automated background tasks that are doing optimization and other automated functions. These can all be implemented as Python modules and are hosted inside the ceph-mgr daemon.

There's typically only one (actually, always exactly one) active manager daemon per cluster, but you will usually have a number of standbys, so that if the first one fails, or the host it's running on fails, another one can take over. And finally, we have the Ceph OSDs, the object storage daemons. These are the workhorses of a Ceph cluster, and their job is to store data on a directly attached hard disk or SSD and to service IO requests to that data. But these OSDs...
But all of those design approaches are limiting: they limit the design of the system, and they limit the overall performance, consistency, and behavior. So instead, Ceph is designed around what we call a client/cluster architecture, which basically means that there is an intelligent client library sitting on the application side. It understands that it's not talking to a single server, but is in fact talking to a cluster of cooperating servers, and it can do intelligent things like making sure that IO requests are routed to the correct node, managing the fact that data might be moving around in the background, and providing a seamless experience for the application. And at the end of the day, we're providing the same application API: as far as the application is concerned, it's writing a data object into some logical construct, and it's this library that's handling the internal details of where exactly that request should be routed.
So one of the first questions when building a system like this is: where should you store your data, and how do you know where you put it? If you imagine an application that wants to read or write a data object, it needs to know where to put it. The naive approach would be to have a metadata server that has a big table of all the data objects and which servers, which nodes, they are stored on.

The problem with this is that it involves a separate lookup step: if you're trying to read an object, you have to find out where the object is first, and then go and contact that particular node. That's slow, and it's also hard to scale that metadata service to trillions of objects when you're storing many, many petabytes of data. So what many other distributed systems, and Ceph as well, do is something called calculated placement.

The idea here is that when the library starts up, you get an initial map: a concise description of the structure of the cluster, what servers exist, and how data is supposed to be laid out across them. Then, whenever you want to read or write a particular data object, you do some calculation that's a function of the state of the cluster and the name of the object, and that spits out the location in the system where that data should be stored.

Then you can contact the appropriate node or daemon in the system. If some time goes by (maybe the cluster gets expanded, or a node fails, and data gets moved around), the application can get an updated version of the topology of the cluster. So later on, when it needs to go read back the data that it previously wrote, it can repeat that calculation, possibly getting a different answer this time, and it will go and contact the appropriate node where that data should now be stored.
This avoids the complexity of having that global lookup table, and it tends to scale very well when your clusters are very, very large. But this brings up the question of what these data objects actually are. The fundamental underlying unit of storage in RADOS is an object. Each object has a name, a unique string of characters, usually tens of characters, presumably with some semantic meaning. Each object can have some attributes associated with it; these are sort of analogous to extended attributes in a file system.

You may or may not need to use them, but you can have some lightweight metadata associated with the object. The bulk of the data in the object, though, is really the byte data or the key/value data. The first type of object looks kind of like a file: you can store a bunch of bytes in it. Typically, objects can be...
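(As an editorial illustration, not part of the talk: a minimal sketch of this object model using the Python librados bindings, storing byte data plus a lightweight attribute. The pool name "mypool" and the default ceph.conf path are assumptions for the example.)

```python
# Minimal librados sketch: a RADOS object holds bytes plus lightweight
# attributes, as described above. Assumes a running cluster, a pool
# named "mypool", and credentials in /etc/ceph/ceph.conf.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('mypool')           # handle on one pool
    ioctx.write_full('greeting', b'hello world')   # the byte payload
    ioctx.set_xattr('greeting', 'lang', b'en')     # xattr-like metadata
    print(ioctx.read('greeting'))                  # b'hello world'
    print(ioctx.get_xattr('greeting', 'lang'))     # b'en'
    ioctx.close()
finally:
    cluster.shutdown()
```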
All of these objects exist within a logical grouping called a pool. Pools usually map to some sort of use case or deployment: you might have a pool that contains all the virtual machine images for your cloud hosting infrastructure, and you might have another pool that contains all the data for a file system, that sort of thing. So it's a high-level, large grouping of objects in the system.

So the question is: how do we decide where these objects should be stored across the, you know, hundreds, thousands, or tens of thousands of OSDs in the cluster? Imagine you're storing all kinds of different data in Ceph: disk images, files, video files, pictures. Let's assume as an example that we're storing a big MPEG video. The first thing we would do is break that large video, maybe it's several terabytes, into lots and lots of RADOS objects: a long sequence of 4-megabyte objects, say, all with names that probably end with a number, forming a sequence of objects. And then, of course, all of these objects exist within a pool; we're dumping all this video data into a single RADOS pool. When you do this with lots and lots of videos, you end up with a pool that has bazillions of objects: millions, billions, trillions.

So instead of placing every object individually, the objects of the pool are mapped into placement groups (4096 of them, in this particular example), so that each placement group holds some fraction of the total objects of the pool, and objects map into placement groups in a deterministic way. And finally, each of those placement groups, these fragments of your overall data set, has to be stored on multiple devices for redundancy. So in a 3x replication type scenario, each of these placement groups would be pseudo-randomly assigned to three different OSDs in the system.
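(Editorial aside: a toy Python sketch of this two-step mapping, object to placement group to OSDs. The hashes here are stand-ins; Ceph's real mapping uses its own hash function and CRUSH, not MD5 or this ranking trick.)

```python
# Toy illustration of object -> placement group -> OSD set.
# Not Ceph's actual algorithm; it just shows that both steps are
# deterministic calculations, so any client can repeat them.
import hashlib

PG_NUM = 4096  # placement groups in the pool, as in the example above

def pg_for_object(name: str) -> int:
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % PG_NUM

def osds_for_pg(pgid: int, osd_ids: list, replicas: int = 3) -> list:
    # Deterministically rank OSDs for this PG and take the first three.
    ranked = sorted(osd_ids,
                    key=lambda o: hashlib.md5(f'{pgid}:{o}'.encode()).digest())
    return ranked[:replicas]

pg = pg_for_object('video.0001')
print(pg, osds_for_pg(pg, list(range(12))))
```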
But these placement groups are small: we have a lot more placement groups than we have OSDs. So if you look at this from the other perspective and consider one single storage device, it's actually going to store tens, or maybe around a hundred, different placement groups, which might all be from the same pool or might be from different pools. So each OSD is storing lots of different chunks of the overall data set, across all the pools and placement groups that are stored within the Ceph cluster.

So you might be asking: why do we have this intermediate stage? Why don't we just assign objects to storage devices? There are a couple of reasons for this, but it's helpful to look at what the alternative design options might be. In the simplest approach, you could simply choose to replicate disks in your system.

You could take all of your disks, and if you're doing 3x replication, you'd have a bunch of sets of three disks, where the disks in each set simply replicate the same identical content, sort of a RAID-1 type mirroring configuration. The first limitation you notice is that, in order to do this, all these disks have to be exactly the same size, or at least you can only use the smallest size among the three disks.

That's a bit of a limitation, but maybe you can get past that. If you instead replicate placement groups, things are a bit better, because for each individual placement group you can pseudo-randomly choose which devices it's assigned to, and they're spread around. You can have different-sized devices, because the smaller devices might just hold fewer placement groups than the larger devices, so it's a bit more flexible in that sense.

And finally, you can imagine taking this to an extreme, where you take every single object in the system and randomly map it to different devices. You end up with a situation where every set of OSDs or disks in the system is sharing replicas of data with every other disk, so you have a tightly, fully connected mesh of storage devices.
So let's look at what happens when a disk fails. In the disk replication scenario, if a disk fails, the first thing you notice is that you have to have a spare device, empty and totally unused, in order to do a repair, and that spare also has to be an appropriate size, so that you can make a new copy of the failed data onto the new disk to compensate for the fact that you lost one. So this has a couple of problems. First, you have to keep these spares around.

They have to be the right sizes, and before the failure that idle disk is not being used, so you're essentially wasting that resource. The second problem is that the recovery process is bottlenecked by the throughput of a single disk: you can only recover as quickly as the replacement disk can write its data, or the source disk can read its data. And as we know, hard disks are getting bigger faster than they're getting faster, which means that the recovery time for a single disk is getting longer and longer, which means you have a wider window of vulnerability during which the durability and replication count of that data is somewhat compromised. So that can be problematic.

In the case of placement groups, it's a little bit better, because you notice that when we lose a disk, we have copies of the lost placement groups on lots of different devices, and we can choose new locations for those placement groups that are independent and also pseudo-random, so that, you know, this cream-colored one can replicate to one node and the blue one can replicate to a different node. You suddenly have a parallel recovery process: both of these pieces are recovering in parallel, so it happens twice as fast. In the extreme, if there were a hundred placement groups on the failed disk, they could go to a hundred different disks, recovering in parallel, each taking one one-hundredth of the time. You'll also notice that we didn't need a spare, as we can simply move these placement groups into the remaining empty space on the surviving nodes in the cluster. That means all of our hardware is being utilized at all times.
The larger problem with that strategy comes when you think about what happens when you have concurrent failures. Imagine that you're very unlucky, and not just one device failed but three devices failed at the same time. What happens? In the original scenario, where you have these replica sets of three, it's most likely that if three devices failed, they're not all going to be from the same replica set; they're going to be spread across different replica sets. So you're never, or only very rarely, going to lose all three replicas of the same data. Usually the failures will be spread across different replica sets and you'll be able to recover, so very few triple failures cause data loss.

On the other hand, think about the scenario where we were replicating individual objects: because we have a gazillion different objects and they're all randomly placed, pretty much every set of three devices within the cluster has some data that is replicated on just those three nodes, which means that there's pretty much always going to be some data loss.

It might not be very much, but you're always going to lose some data, and that can be particularly problematic when the integrity of an overall data set depends on having all of the data, not on having some random subset of it disappear and hoping that the rest of it will still hang together. So that's very concerning. If you look at placement groups, they're somewhere in between: some triple failures will lose data, but most will not.

It turns out that the placement group strategy is a balance between these competing extremes. In the academic literature this was described as declustered replica placement, and it's a basic trade-off: if you have more placement groups, you get faster recovery and a more even data distribution; if you have fewer placement groups, you have a lower risk of a concurrent failure leading to a data loss event. Using placement groups is a happy medium, because you can avoid the spare devices, and by adjusting the number of placement groups you can choose where you want to be on that spectrum. Neither extreme is a perfect world, but you can balance the durability in the face of concurrent failures against the recovery time that you want to tolerate. And once you do that, having a complete strategy to keep your data safe is really about...
...avoiding those concurrent failures in the first place, or ensuring that when concurrent failures do happen, they don't lead to data loss. The way to do this is to separate the replicas of your data across failure domains. So, for example, you might have a cluster that's comprised of hosts organized into racks, racks into rows, rows into data centers, and so forth.

By having that infrastructure hierarchy align with the physical placement of those devices in space, you can correlate failures with those failure domains and minimize the risk that devices failing simultaneously in different racks are sharing the same data. So the real question, then, is how we get this magic policy that places all these bazillions of placement groups across devices, respects this desire to have replicas separated across failure domains, and so on, and does all the things you'd want to do in a real storage system. The answer is an algorithm that we call CRUSH.

CRUSH is a pseudo-random placement algorithm. It's a repeatable, deterministic calculation, a function of the state of the cluster and the name of the object, that spits out where the data should be stored. The inputs are the topology of the system (that hierarchy I was talking about: how OSDs are organized into hosts and racks and rows and so on), the pool parameters (like the replication factor and the placement policy), and then the identifier of the placement group that I'm reading or storing. You put all that into CRUSH, it does some calculation, and it spits out, not just a number, but an ordered list of which OSDs that placement group should be stored on, and that's where you're going to put your data. As part of these pool parameters, CRUSH allows you to write rule-based policies that describe how those replicas should be placed.

So you can say things like: I want three replicas in different racks, and maybe I only want to use SSD devices. That's one of the inputs to the function, and it spits out which devices to use at the end. Or you can have something more complicated: if you're using an erasure coding scheme that's six plus two, you have eight shards of your data, and maybe I want two of those shards per rack, spread across four racks, but of the two shards within a particular rack, I want those separated across different hosts, and I only want to use hard disks. Something like that is also possible.

One of the key properties of CRUSH is that it generates what we call a stable mapping. That means that if you have a particular state of the cluster, with some set of devices, and there's some topology change (a node is added, or a device fails, or something like that), then we want the amount of data that has to move in order to rebalance the distribution to be proportional to the size of the change. So, for example, if I have a hundred nodes and one node fails, then roughly one percent of the data is going to move when I repeat my CRUSH calculation for all the existing placement groups and find out where they should now be stored. Given the new state of the system, about 1% of those placement groups will map to different OSDs and will require some data movement. That's a very important property for storage in particular, because moving data around is very expensive. And finally, CRUSH supports varying device sizes: every device in the hierarchy has a weight, and that weight determines the proportional amount of data that will be stored there.
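(Editorial aside: the stable-mapping property can be demonstrated with a toy rendezvous-hashing scheme, which shares the property but is not CRUSH itself; real CRUSH also handles the hierarchy, weights, and placement rules described above.)

```python
# Toy demonstration of stable mapping: adding one device moves only a
# proportional share of placement groups. Not real CRUSH.
import hashlib

def primary_osd(pgid: int, osds: range) -> int:
    # Rendezvous hashing: each PG lands on the OSD with the top score.
    return max(osds,
               key=lambda o: hashlib.md5(f'{pgid}:{o}'.encode()).digest())

before = [primary_osd(pg, range(100)) for pg in range(4096)]
after = [primary_osd(pg, range(101)) for pg in range(4096)]  # add 1 OSD
moved = sum(b != a for b, a in zip(before, after))
print(f'{moved / 4096:.1%} of PGs moved')  # roughly 1/101, about 1%
```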
So that's CRUSH. That's the magic that figures out where all the data in the system should go, and everybody can repeat this calculation and figure out where to read or write data. The challenge, then, is what RADOS should do once it knows where the data should go: how does it actually store it? There are two strategies. All the objects stored in a pool have to be durable; we have to make sure they're safe. Those pools are broken up into placement groups, so each individual placement group, some subset of the overall data, has to be made durable in some way, and we have two strategies for doing that.

In the case of replication, we simply stamp out copies of the placement group. If we imagine this PG has two different objects in it, we just pick three different OSDs and store a copy of the placement group on each of those OSDs.

So if I want to go from three replicas to five replicas, I can just, you know, flip a switch on the pool, and RADOS will go off and start creating new copies of these PGs and finding new places to store them. That's all fine; it's really quite straightforward. Erasure coding is a different reliability strategy, and it works very differently: instead of having identical copies of placement groups, we have different shards, different slices if you will, of the same placement group.

So if the placement group logically contains a number of objects, then in this example, with a four-plus-two scheme, we would have four shards with the data striped across them, and then two additional shards that hold parity and redundancy information. This is really what RAID does; erasure coding is a generalization, a more flexible version, of what RAID does. We have these additional components that provide some redundancy, so that if I lose one or two of these shards, I can always read the surviving pieces, do some calculation, and rebuild the data.

And you'll notice that erasure coding is much more storage-efficient. These first four shards hold a complete copy of the original data, and then I have two additional shards. So I have a 50 percent storage overhead to provide a level of redundancy that allows me to lose two different devices and still have a full copy of my data, or be able to rebuild it. You'll notice that in the 3x replication case I can also only survive two failures (lose two copies and still have a surviving copy), but the overall storage overhead is 3x: one copy of the data plus a 200% overhead, versus a 50% overhead. So erasure coding is much more space-efficient.
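(Editorial aside: the arithmetic, made explicit for n-way replication versus a k+m erasure code.)

```latex
% Raw-to-usable ratios and overheads:
\text{replication: } \frac{\text{raw}}{\text{usable}} = n
  \;\Rightarrow\; \text{overhead} = (n-1)\times 100\%
\qquad
\text{erasure coding: } \frac{\text{raw}}{\text{usable}} = \frac{k+m}{k}
  \;\Rightarrow\; \text{overhead} = \frac{m}{k}\times 100\%
% The talk's examples: n=3 gives 200% overhead; k=4, m=2 gives 50%.
% Both examples tolerate two concurrent device failures.
```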
Unfortunately, erasure coding is less efficient when you're doing recovery, because, as I mentioned, with replication you can just read any surviving copy and then write it again, which is pretty straightforward. In the erasure coding case, if I lose one of these shards, I have to read all of the surviving shards and do some calculation in order to regenerate that one lost shard. So it's significantly more expensive in terms of network bandwidth and storage IO, but it works well, particularly for data sets with large objects that aren't changing very often.

And then RADOS, of course, allows you to store lots of different pools in your cluster, so you can have multiple specialized pools living within the same Ceph cluster with different storage policies: you might have a replicated pool, you might have an erasure-coded pool, some of them using hard disks and maybe some using SSDs, and so on, based on those CRUSH policies.

Now, by default and in most cases, all of these pools in the system will normally just share devices: each pool is broken up into placement groups, and those are all randomly spread across the OSDs in the system, unless you specifically specify a policy that calls for SSDs or hard disks or something like that. But this mapping between the logical pools and the physical storage devices means that you have elastic and scalable provisioning.

A pool can contain either a little bit of data or an effectively infinite amount of data, as long as you can provision OSDs in the background to keep up with your storage demand. As you store data, you can keep expanding the system and you won't run out of space. You don't have to specify the size of a pool up front or anything like that; it's totally virtualized and flexible.

This approach also gives you uniform management of devices: I just deal with deploying the Ceph software on new hardware nodes, I throw them into the cluster, and then Ceph and CRUSH handle remapping data onto them and consuming them. So I have a common workflow for managing the hardware resources, regardless of what is consuming that storage; it might be file storage with CephFS, or objects, or something else. Those are all users of RADOS, and RADOS is just providing that storage via these logical pools.

Another way to think about this is to consider that RADOS is really virtualizing storage. We have these virtualized pool abstractions, which are variably sized and have some policy around what performance you want out of them and what the internal redundancy scheme is, but from the user's perspective they're just a bucket full of objects. Then RADOS and CRUSH do some magic to make sure these things get replicated or erasure coded and distributed, and on the back end you have all these different underlying storage devices and software daemons that are actually making it all work. But as far as somebody consuming the storage is concerned, they don't really know or care. And that turns out to be a very, very powerful thing, in particular because it means that RADOS can be used as a platform for the higher-level services that are built on top of it.
So RADOS provides this highly available, highly durable storage service, and then, on top of that, we can build an object service, a block service, and a file service. So let's move on and talk a bit about the RADOS Gateway, the component that provides object storage services in Ceph. RGW stands for the RADOS Gateway, and, as you might imagine, it's a gateway that provides S3- and Swift-API-compatible object storage. This is an API based on REST, usually tunneled over HTTP, that provides a high-level object storage service.

It's the same type of thing that's often combined with a load balancer and actually exposed to the public Internet. Much like Amazon's S3 service, you can have an encrypted connection to these gateways and you can store and retrieve objects. The data model RGW provides is a little bit different than what RADOS does: the S3 API is built around the idea of having users and buckets (collections of objects), and then objects, which are usually large blobs of data. So there's a whole model around what the structure of the data is, how the permissions work, and which objects users are allowed to access, based around ACLs. All of that is implemented and enforced by the RADOS Gateway, and in fact what it's doing on the back end is mapping all of that into internal storage that it dumps into RADOS.
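(Editorial aside: because the API is S3-compatible, a stock S3 client works against RGW. A minimal boto3 sketch; the endpoint URL and credentials are placeholders for whatever your RGW deployment provides.)

```python
# Talking to RGW through its S3-compatible API with a standard client.
# Endpoint and keys are placeholders, not real values.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',  # your RGW, not AWS
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)
s3.create_bucket(Bucket='demo')
s3.put_object(Bucket='demo', Key='hello.txt', Body=b'hello from rgw')
print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())
```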
One important thing to recognize is that the objects we're talking about with RGW object storage, S3 objects, are not the same thing as the RADOS objects I was talking about a few moments ago, which are stored in pools in RADOS. RADOS objects are small, usually less than 10 megabytes, and they can store key/value data, byte data, and so on; they're a low-level object. RGW objects, S3 objects, are usually pretty big (they can be gigabytes or terabytes), they have ACLs associated with them, and they live in buckets, which is a totally different abstraction.

You could have, you know, millions of buckets, whereas you usually only have a small number of pools in RADOS, and so on. So it's a very different use case, and the RADOS Gateway is the component in the system that's making that remapping. In fact, mostly what's happening is that RGW is taking these big S3-style objects, striping them across a lot of smaller RADOS objects, and then doing the authentication and enforcement.

So let's look at a little bit of detail about how that might work. Imagine we're storing a large video file via an S3 put or post operation into RGW, and it's getting stored into the back-end Ceph cluster. The first thing that's going to happen is that RGW will go look at its metadata about users: what S3 users and buckets are defined in the system. It makes sure that this is a valid request, that it's authenticated, that the bucket we're putting into actually exists, what the policies around it are, and so on. Then it's going to make an update to the bucket index object. The S3 API is defined around the idea of being able to do a sorted, lexicographic enumeration of all the objects in a bucket, so we have to take the S3 names, sort them, and put them in an index, so we can perform that enumeration. So it makes an update there, saying we're in the process of updating this object, and then it takes the data, stripes it across lots and lots of RADOS objects, dumps them all into RADOS, and then, when it's done, it updates the index and says: I'm done.
You can think about this whole picture as being grouped into something called a zone. You have these RADOS pools that hold the actual data you're storing, plus the metadata about it, and then you have some number of RADOS Gateways; you can scale these out horizontally, you know, to tens of them.

The idea here is that you can have multiple zones deployed that are federated together. Each of these zones might live in a completely different Ceph cluster, maybe in different sites, different geographies, different continents, but they're associated in that there's a replication relationship where all of the user and bucket info (which S3 users exist, which S3 buckets exist) is replicated between the zones. So they have a shared view of which buckets they're serving, but they hold different data.

So when you have a request to read a bucket from one RADOS Gateway, if the bucket is stored locally, it can service that request and read it there. If you request a bucket foo that's actually stored in a different zone, bar, then this gateway knows that, because it has the metadata about that bucket, and it can send you a redirect that bounces the client over to the appropriate gateway, so you can read the data from that location instead. This is really very similar to what Amazon's global S3 service provides.

You have a global namespace of buckets and users. When you create a bucket, you create it in a particular region, which is similar to a zone, and you can do reads and writes to that bucket from anywhere in the world; as soon as that access touches the Amazon network, it sends you to the right data center, and so you can read and write that particular data.

Now, in addition to that federation capability, the RADOS Gateway also has a geo-replication capability, which extends the zone concept to multiple zones that replicate the same content. Imagine we had two additional zones, C1 and C2. Again, these are totally separate zones; they might be in different clusters, on different continents even, but we logically group them into a zone group B, and that essentially tells the gateways that all of the bucket data, all the content that's stored in these zones, should be replicated. This can be a unidirectional replication relationship, where all the writes happen in one zone and get sent off to the other one, or it can be bidirectional, active/active, so you could have a bucket that's stored on both of these continents.

So a bucket either exists in zone group C, or in zone group B, or in zone group A, and regardless of which gateway you touch, you get sent to the right zone. The RADOS Gateways for each of these zones scale out horizontally, and they are literally the gateway: all of this replication between zones is tunneled via the gateways, over, you know, TLS-secured channels between them, in a scale-out fashion.
So that's federation. RGW is a robust product with lots of other features as well. First and foremost, we emphasize very strong S3 API compatibility. This is a huge investment for the project and community. We have a test suite we built called s3-tests, a functional test suite that evaluates whether we've correctly implemented the S3 API.

That test suite is in fact used by lots of other projects and products that aim to be S3-API compatible. And we implement lots of other APIs, mostly focusing on things that are in S3. There's STS, the security token service, which allows you, instead of using the native S3 authentication model, to integrate with external authentication frameworks, things like Kerberos and so on. We support encryption; there's a whole set of APIs in S3 around this, with a few different ways you can manage keys and so forth, and we implement several different flavors of that API. There's inline compression, so if you put an object, it can be compressed before it gets written back into RADOS. We support CORS and static website hosting features, which are used by many, many people.

There's a metadata search capability that integrates with Elasticsearch, so you can do queries over the metadata about objects stored in a bucket. And we've recently added a pub/sub event stream capability, which is useful in particular for integrating with a serverless framework like Knative: you can imagine putting an object into a bucket in RGW, and that triggering an event that then causes a serverless function, or lambda, to be called somewhere, which does some processing on that data.

There are also storage classes: you can imagine having some RADOS pools that are, you know, erasure coded on hard disks, and other ones that are replicated on NVMe devices for very high performance. When you put an object, you can specify which storage class that object should be stored in, or you can set policies on buckets so that all the objects in a bucket are either very fast or very slow, something like that.

There's also support for a lifecycle management feature, which allows tiering and retiering between those storage classes to happen automatically on a time basis, so that maybe when an object is two weeks old it gets moved to the slower storage, and maybe when it's six months old it gets deleted automatically, that sort of thing.
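(Editorial aside: both features are driven through standard S3 calls. A boto3 sketch; "COLD" is a hypothetical storage class an administrator would have defined in the RGW zone placement, and the day counts are arbitrary examples.)

```python
# Storage classes and lifecycle rules via the S3 API, as described above.
import boto3

s3 = boto3.client('s3', endpoint_url='http://rgw.example.com:8080',
                  aws_access_key_id='ACCESS_KEY',
                  aws_secret_access_key='SECRET_KEY')

# Direct one object at a specific storage class at write time.
s3.put_object(Bucket='demo', Key='report.pdf', Body=b'...',
              StorageClass='COLD')

# Or tier and expire automatically: to COLD after 14 days, gone at 180.
s3.put_bucket_lifecycle_configuration(
    Bucket='demo',
    LifecycleConfiguration={'Rules': [{
        'ID': 'tier-then-expire',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},
        'Transitions': [{'Days': 14, 'StorageClass': 'COLD'}],
        'Expiration': {'Days': 180},
    }]},
)
```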
There's also the ability to create an archive zone within that federated view, where that particular zone stores a full historical copy of all the data that has ever been written, so even after an object has been overwritten or deleted, it preserves all copies of that object, for compliance and other backup-type use cases. And that's the RADOS Gateway: a complete, robust S3 API implementation that's used in many different private cloud and private infrastructure environments on premises, and is also used by several public cloud companies offering public object storage services on the open Internet.

Next up is RBD, the RADOS Block Device, which is the component that provides block storage. RBD stands for RADOS Block Device, and its purpose is to provide a virtual block device that allows you to store disk images in a Ceph RADOS cluster. At a high level, what we're really doing is taking a big virtual disk, striping it across a bunch of smaller RADOS objects, and dumping them into RADOS. The point of this is usually that you can take the storage and decouple it from the hosts, from the compute side that's actually going to consume that storage. If you imagine it's a virtual machine, for example: if the storage lives in a RADOS cluster attached to the network, you can disaggregate your compute resources from your storage resources and scale them independently.

You can take virtual machines and live-migrate them between compute hosts, fail them over, move them around, or whatever, without ever touching the storage; it's just always available over the network. These are really all the use cases you can imagine, analogous to what the EBS service is in Amazon. And RBD is accessible both via virtualization technologies like KVM and as a raw Linux device on a Linux host.

Some implementations of the RBD client link directly into QEMU; they've been very tightly integrated there for many years, so the virtual machine process talks directly to the storage cluster on the back end. Or, conversely, if you have a regular bare-metal Linux host, you can map a virtual RBD disk using the kernel RBD driver to a raw /dev/rbd block device in Linux, and then you can put whatever you want on top, usually a file system, and consume that storage.
RBD supports snapshots and clones, which are sort of table stakes these days for any virtual disk device. A snapshot is a read-only, point-in-time copy of the state of the disk image, and for any RBD image in a Ceph cluster you can create multiple snapshots at any point in time; each is essentially a view of the state of that image at that point in time. All the snapshots are logically linked to the original image.

So if you have an image called foo, you can enumerate the snapshots of foo, and you can delete them, make copies from them, and so forth. A clone is a little bit different: in contrast to being read-only, it's actually a writable overlay that's created, logically layered, on top of an existing snapshot.

A typical way these are used is that you might have an RBD image that's a base operating system image, you know, a blank, empty install of a particular Linux distribution, for example. You create a snapshot of that, and then every time you create a new virtual machine in your environment, you simply create a writable overlay of that operating system snapshot and start making changes on top of it, and all the copy-on-write behavior kicks in.

So you can more or less instantly start booting up new virtual machines with a full-blown existing OS install image. And because these clones are first-class images, they can themselves be snapshotted, resized, renamed; all the stuff you'd usually do with images you can do with the clones, unlike the snapshots, which are read-only, immutable copies. But in both cases, for both snapshots and clones, all of this is efficient: creating snapshots and clones is an order-one, near-instantaneous process. It doesn't involve copying any data, just dealing with some metadata, and in both cases we leverage copy-on-write support in RADOS, so space is only consumed as data is changed. If you create lots of snapshots but don't modify anything, the snapshots don't consume any space. Same thing with the clones: a clone doesn't occupy any space until you start modifying the data that was in the original snapshot, and only then does it consume storage resources.
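(Editorial aside: the base-image-plus-clones workflow sketched above, via the Python rbd bindings. The pool name "rbd" and the image names are assumptions; the snapshot is protected first because classic clones require a protected parent snapshot.)

```python
# Base image -> snapshot -> writable clone, per the workflow above.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')          # assumed pool name

r = rbd.RBD()
r.create(ioctx, 'base', 10 * 1024**3)      # 10 GiB base OS image
with rbd.Image(ioctx, 'base') as img:
    img.create_snap('golden')              # read-only point-in-time copy
    img.protect_snap('golden')             # required before cloning
r.clone(ioctx, 'base', 'golden', ioctx, 'vm-0001')  # writable overlay

ioctx.close()
cluster.shutdown()
```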
By default these objects are four megabytes, but you can configure whatever striping scheme you want. Notably, the important thing to remember is that the image is sparsely allocated: when you create a new RBD image, it creates the header object, but no data objects are created. So it might be a four-terabyte image, but it consumes almost no space, except for the header. It's not until you actually start writing to a location in the block device that the corresponding object is instantiated and data is written into it. In fact, these objects themselves are also stored in a sparse fashion: if you write 4K in the middle of a four-megabyte object, it still only consumes 4K of storage. And these objects can get dumped into any RADOS pool, so it can be a replicated pool or an erasure-coded pool, depending on what your durability and performance requirements are.
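(Editorial aside: back-of-envelope arithmetic for the default striping just described; which object a block-device offset lands in is a simple division.)

```python
# Which RADOS object, and what offset inside it, a block-device write
# touches under the default 4 MiB object size described above.
OBJECT_SIZE = 4 * 1024**2

def locate(byte_offset: int):
    return byte_offset // OBJECT_SIZE, byte_offset % OBJECT_SIZE

# A 4 KiB write at the 1 GiB mark instantiates only object 256;
# every untouched object before it stays unallocated.
print(locate(1024**3))  # (256, 0)
```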
RBD also has an alternative mode of writing data. Instead of a write just going to the particular location in the block device, mapping to the image, and writing there, we can instead enable a journaling mode. The idea there is that all the writes are first appended to a growing sequence of journal objects. After a write is persisted to the journal and stable, we also write it over into its normal location, and then we can trim the journal. So the journal contains the recent data writes, and it also contains any metadata changes, things like resizing the image, creating snapshots, and so on: a full record of all the things that are happening to the image, or at least a recent history of them, is contained within the write journal.

So, in and of itself, the image is now making writes by appending to the journal and then flushing them to the data pool, as we just described. But then you also have an rbd-mirror daemon, which is essentially watching all the writes that are happening to this image in the journal, reading them, and then applying them to a copy of the image that's stored in a separate cluster, probably in a separate data center. This gives you an asynchronous copy of the image in a different cluster.

Notably, this is a point-in-time, crash-consistent copy. So if the first cluster blows up, even if the latest writes haven't made it across, what we have in cluster B is a point-in-time, crash-consistent view. It's as if the thing writing to the storage had crashed a little bit sooner, because we may have lost the last few writes, but it's fully consistent in that it's point-in-time. In addition, it mirrors not just the data that's stored in the image, but also all the snapshots and, you know, the metadata-type changes. So the copy in the second cluster is a complete clone that has all the properties, not just the actual data. And RBD mirroring supports the full cycle of lifecycle events that can happen: you might lose your first cluster, have to fail over to cluster B and start consuming and updating that image, and then cluster A comes back online...
There's monitoring, so you can see which images are being written to and read from and how much bandwidth they're using. There's a quota mechanism that's enforced at the time devices are provisioned: it will prevent you from creating images and then later filling up the cluster, by enforcing the total size of the image at provisioning time. There's the ability to restrict different clients, via the authentication keys they use to connect to the cluster, to different private namespaces, so they can only see their own RBD images and not access others', even though they might be sharing the same underlying RADOS pool. There's a full set of import/export capabilities, for backups, incremental diffs, migrating data, and all that stuff. And there's a trash capability, so that when you delete an image, it actually gets kept around for some period of time before it's automatically purged, to help avoid fat-finger type situations.

You can do the same kind of mapping using rbd-nbd, which creates an NBD-style device but essentially serves as a pass-through to the user-space implementation, librbd; that's useful for getting the latest user-space features that aren't yet implemented in the kernel client. And there's an iSCSI gateway, so you can create gateways that take RBD images and expose them over the iSCSI protocol.

So that's RBD. RBD is used extensively in the OpenStack community and elsewhere, and it gives you that virtual block device abstraction, which is much more flexible, scalable, and powerful than alternatives like iSCSI. And finally, that brings us to CephFS, the distributed file system. CephFS is a distributed network file system: that means you get files, directories, and things like rename, hard links, and so on.
It gives you concurrent, shared access to the same file system namespace from multiple clients mounting the same file system. CephFS is designed to provide a strong level of consistency in its caching, which means that if you make changes to a file in the file system from one node, and another client looks at that file, it will immediately see those changes. So it's a fully coherent view of the data, achieved through a combination of locks and leases and so forth.

It's important to note that the way this is implemented is by separating data and metadata management. When a client host mounts CephFS and you read and write a file, the client is writing directly to RADOS, directly to the objects that store that file's data. So the IO path goes straight to the OSDs, and it's scalable and high-throughput and so on, because you have lots of different OSDs to use and you're balancing data across them.

Metadata access, on the other hand, goes to a new type of daemon called a metadata server (MDS). Metadata accesses are things like creating files, opening files, listing directories; all of that namespace coordination is handled by communicating with these metadata servers. And this buys us something very important: it means that we can scale metadata and data independently in the system. If we need more storage capacity, or more IO throughput, we can expand the size of the RADOS cluster and just deploy more and more OSDs to get more capacity and throughput.

On the other hand, if I have more files (if I have lots and lots of small files, for example, or I just have a workload that's making lots of metadata changes), then I can deploy more metadata server daemons and scale the metadata capacity of the file system. That happens by deploying these Ceph metadata servers, a new daemon type that complements the core RADOS daemons. The metadata server daemon's job is to manage the file system namespace.

Its primary purpose is to store all that file metadata in RADOS objects. These daemons are stateless: there's no direct-attached storage on the nodes where they run; they actually store all their data back into RADOS, so you can deploy these daemons pretty much anywhere. They just need a bit of CPU and memory.

Their main job, then, is to coordinate file access between clients. They need to make sure that if multiple clients are accessing the same directory or the same file, those clients have a coherent view, and so they have to manage the consistency of the client caches and make sure they're cooperating. So the MDS is managing that cache consistency, locks, leases, and so on; but again, it's not part of the data path.
As I mentioned, the metadata for CephFS is stored back in RADOS pools. The client directs its metadata operations to the metadata server; all of those modifications to the namespace get journaled into a set of objects in a metadata journal, and then, once they're stable there, they get written out to other objects on a per-directory basis. Each directory in the namespace has a RADOS object in the metadata pool that's essentially a map of file names to inode metadata. And when you actually want to read or write data, you go directly to the data pool. So, unlike RBD, you'll notice that each CephFS file system has at least two different pools, one for data and one for metadata, stored in the RADOS cluster, and you can do tricks like putting the metadata pool on SSDs to improve performance, and so on.

One of the key challenges in making CephFS scale is making the metadata scale, and the problem is: how do you take this complicated hierarchical file structure and distribute it across a linear set of servers? The way Ceph does this is with a technique it pioneered called dynamic subtree partitioning. The idea, basically, is that we take the overall file hierarchy, partition it on a subtree basis, and map subtrees to different metadata servers.

But we do that dynamically, based on what the current workload is. If you have a billion files in the file tree and you're not actually looking at them, they might not even be loaded into memory; they may just sit there, untouched by the metadata server. It's not until you actually start accessing data, and loading that metadata into the metadata server's memory, that it starts consuming resources. At that point, the metadata cluster might dynamically take a subtree and pass responsibility for it off to a different metadata server, shipping that metadata to the other metadata server's cache in a coordinated way, so that somebody else is responsible for managing all the updates there. The clients, then, as they're traversing the file system, dynamically explore the file space, and they'll learn that a particular part of the namespace is managed by a different metadata server.

For a single hot directory, CephFS can even fragment that directory into lots of little pieces and then map those fragments to different metadata servers, so you get that scalability in the system. So overall, it's a very flexible approach to managing your metadata. CephFS also implements a number of somewhat unique user-visible features. The first of those is the way that snapshots work.
A
You
can
snapshot
not
just
any
file
system
but
any
directory
within
the
file
system
and
when
you
snapshot
a
particular
directory,
it
applies
not
just
to
that
directory,
but
all
sub
directories
nested
beneath
that
point
in
the
file
tree.
This
is
in
contrast
to
most
file
systems
where
you
have
two
snapshot:
sort
of
a
volume
or
sub
volume
and
that's
sort
of
a
predetermined
unit
of
storage
or
boundary,
though
snapshots
are
point-in-time
consistent
and
its
emphasis
from
the
perspective
of
a
POSIX
API
from
the
actual
assist
calls
at
the
client
side.
A
Unlike
some
manifests
based
file
servers
which
happens,
sort
about
the
client-server
boundary
when
things
are
written
back,
so
the
consistency,
an
ACEF,
a
snapshot
a
little
bit
stronger,
that
it
is
in
many
other
systems
but,
most
importantly,
these
snapshots
are
very
easily
I'm
using
assumed
the
other
file
system.
So
you
can
imagine
moving
into
any
directory
in
the
file
system,
and
you
see
some
files.
There's
no
snapshots
listed
there
vo,
that's,
not
hidden
dot
snap
directory
if
I
want
to
create
a
snapshot.
A
I simply do a mkdir in this hidden magic directory to create the snapshot and, voila, there it is. This is a fast metadata operation, and we get the usual behavior: if we delete a file, it's gone, we don't see it here, but if I look in the snapshot, it's still part of the snapshot, and I can still read those files from the snapshot using normal POSIX operations. And when I'm done later, if I want to delete the snapshot, I can just do an rmdir operation and, poof, the snapshot is gone.
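In shell terms, the whole lifecycle looks something like this minimal sketch (the mount point, directory, and snapshot name are placeholders):

    cd /mnt/cephfs/mydir            # any directory in a mounted CephFS
    mkdir .snap/before-cleanup      # create a snapshot of this subtree
    rm important.dat                # deleting a file in the live tree...
    ls .snap/before-cleanup/        # ...leaves it readable in the snapshot
    rmdir .snap/before-cleanup      # remove the snapshot when done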
A
CephFS snapshots are efficient, too: creation and deletion are sort of fixed, immediate operations, and snapshots only consume space in the system when you actually overwrite or modify data, again by coordinating with some features in the RADOS layer to make sure that copy-on-write works all the way up and down the stack. The other unique feature in CephFS is its support for recursive accounting, and the underlying capability here is that the metadata servers maintain recursive statistics across the entire file hierarchy.
A
A
These stats cover file and directory counts, the sizes of files and summations of those file sizes, and the most recent modification and change time (ctime) on inodes, and these statistics are visible via virtual xattrs. So if you mount the filesystem, go into some arbitrary directory, and dump the extended attributes on a directory, for example, you'll see some information about that particular directory, like how many files and subdirectories are contained within it, and then you also have all these statistics that are recursive stats.
A
So this is the summation of all bytes stored within this directory recursively, not just this directory but all subdirectories nested beneath it, and this is the most recent modification time of any file nested beneath this point within the hierarchy. And in fact, if you mount the filesystem with the rbytes option, then when you do an ls, this recursive bytes value is the value that's reported as the size of a directory, which can be very convenient, right?
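Here is a sketch of what inspecting those recursive statistics looks like; the directory path and credentials are placeholders, and the ceph.dir.* names are the virtual xattrs CephFS exposes:

    # Dump recursive accounting stats on any directory in a mounted CephFS.
    getfattr -n ceph.dir.rfiles /mnt/cephfs/mydir   # files nested beneath
    getfattr -n ceph.dir.rbytes /mnt/cephfs/mydir   # total bytes nested beneath
    getfattr -n ceph.dir.rctime /mnt/cephfs/mydir   # newest ctime nested beneath

    # Report rbytes as directory sizes in ls output (kernel client option).
    mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secret=<key>,rbytes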
A
You can just do an ls -al and immediately see that this directory contains four and a half megs of data, most of it in this nested directory, and I can see that my parent directory is 12 gigs, all that stuff. It's actually the same information that you would get out of a du, but it's sort of immediate and free. Sadly, having these sizes reported here confuses rsync, because sometimes they update a little bit asynchronously, and rsync gets confused when the directory size has changed.
A
It thinks that there's some larger change there and emits a warning message. So, unfortunately, by default this option is turned off; hopefully sometime in the future we can fix rsync so that it doesn't get confused and we can leave it on. CephFS also contains a number of other features. You can create multiple file systems within the same Ceph cluster, and each of those file systems has sort of its own independent, separate set of metadata servers, so you can have multi-tenant or use-case isolation.
A
You can do nested subdirectory mounts, so you can, for example, have a client with an access key that's only allowed to mount its particular subdirectory, like its home directory, for instance, and it's sort of locked out securely from accessing any other files in the system. You can create multiple storage tiers: I mentioned that every file system has at least one data pool, but you can actually have multiple data pools, and you can set policies on directories so that files created in different subdirectories get mapped to different RADOS pools.
A
So you might have, you know, everything in /home mapped to sort of a generic, lower-performance tier of storage that's backed by hard drives, and you might have other directories with a policy set on them so that the files created there are backed by a faster RADOS pool, maybe on NVMe and with different replication, for example.
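A minimal sketch of that kind of layout policy, assuming a hypothetical SSD-backed pool named cephfs_fast that has already been created:

    # Make the extra pool usable by the file system.
    ceph fs add_data_pool myfs cephfs_fast

    # Direct new files under this directory to the faster pool,
    # via the file layout virtual xattr.
    setfattr -n ceph.dir.layout.pool -v cephfs_fast /mnt/cephfs/scratch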
A
There are also some other features. There's this lazy I/O capability, for example, that allows applications to selectively relax some of the strict consistency behaviors of CephFS, and if they sort of know what they're doing, that can be very helpful in letting some high-performance computing applications improve their I/O throughput. And as with RBD, there are lots of different ways you can access CephFS. So, of course, there's the Linux kernel client; that's where we focus most of our efforts, and on any recent Linux kernel you can just do a mount and access CephFS.
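For example, mounting is a one-liner; the monitor address, user name, and key here are placeholders, and the FUSE client mentioned next assumes a local ceph.conf:

    # Kernel client (any recent Linux kernel).
    mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secret=<key>

    # Or the FUSE client, e.g. on older kernels.
    ceph-fuse /mnt/cephfs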
A
That's the best-performing way to access the file system. There's also a FUSE implementation that you can use instead, for example if you have an older kernel or if you want to take advantage of some newer or different feature, and in certain cases you can also use the ceph-fuse client on other Linux hosts to mount CephFS from there. You can also export CephFS via NFS and CIFS via either the Ganesha or Samba projects; these essentially act as gateways.
A
So you have an NFS client talking to a gateway that then talks to Ceph, that is, to the metadata servers and the RADOS OSDs and so on, on the backend, and those gateways actually work by dynamically linking with libcephfs, the userspace client implementation. And if you want, you can directly link libcephfs into your own application, if you have some reason to do so as well.
A
So that sort of completes our overall picture, right: we have RADOS, which provides this highly reliable, highly available storage service; we have an object S3 API service provided by the RADOS Gateway; we have virtual disks provided by the RADOS Block Device; and we have POSIX distributed file access provided by CephFS, all of that within a single cluster running on the same storage hardware.
A
So let's shift gears a little bit and talk a bit about how, as a user, you would actually consume Ceph and how you would manage this type of system. Ceph has an integrated dashboard. This is built into the ceph-mgr daemon, so it comes with every cluster: you just turn it on, set your initial user password, decide what port to run it on, and it'll come up. The dashboard has sort of three basic roles. First, you can do monitoring.
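Turning the dashboard on is only a few commands; this is a rough sketch with placeholder user, password, and port, and the exact subcommand names are as of the Nautilus release:

    ceph mgr module enable dashboard
    ceph dashboard ac-user-create admin <password> administrator
    ceph config set mgr mgr/dashboard/server_port 8080
    ceph config set mgr mgr/dashboard/ssl false   # only for a quick local test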
A
For example, if you go into the dashboard, go into the pools and the OSDs, and click on a specific OSD, you can look at some low-level metrics about the I/O latency on that particular device in this large cluster, pull all that information out of Prometheus, and get these nice, pretty graphs, too. So it's a very deep level of monitoring and metrics that's provided by the dashboard. And finally, the dashboard provides a lot of sort of typical storage-admin day-2 tasks.
A
So, things like modifying the Ceph configuration and the other operations you need to sort of manage the cluster; and you can do provisioning: creating new pools, creating RBD block devices, managing your NFS gateways, all that stuff. A lot of these day-2 operations and provisioning tasks can all be done through the dashboard. Ceph has some other nice management features. So, as I mentioned earlier, Ceph monitors the internal status of the cluster and reports sort of an overall health status, with error and warning states.
A
Every specific alert has sort of a unique ID or type associated with it, with associated documentation, so you can tell what that particular warning means and what steps you might want to take to mitigate it, and so on. All the configuration management for the cluster is integrated into the system, so all the configuration options are self-documenting; the configuration is stored by the monitors and distributed automatically to all the different daemons in the system.
A
So you don't go sort of poking around the system modifying configuration files or anything; you can do that all through the CLI or the GUI, and that includes things like history and the ability to roll back configuration and so on.
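In practice, that centralized configuration looks something like this (the option and value are examples):

    # Set an option centrally; the monitors push it to the right daemons.
    ceph config set osd osd_memory_target 4294967296

    # Inspect current options and the change history.
    ceph config dump
    ceph config log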
Ceph is a software-defined system that consumes block devices for the OSDs, so in principle you can sort of layer the OSDs on top of anything, whether it's a raw device or an LVM volume or even an iSCSI one, whatever; I mean, you can create all sorts of convoluted layerings. Ceph also collects health metrics, like SMART data, from the underlying devices.
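Provisioning an OSD on a device is typically done with the ceph-volume tool; a minimal sketch, with a placeholder device path:

    # Prepare and activate a BlueStore OSD on a raw device (managed via LVM).
    ceph-volume lvm create --data /dev/sdb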
A
So Ceph can look at those metrics and tell you if it thinks that a particular hard disk is going to fail within a couple of weeks, and in fact, by default, it'll just raise a health warning, but if you choose, you can also configure the system to preemptively evacuate data from devices that it thinks are going to fail, to improve the overall reliability of the system.
A
So, instead of waiting for a device to fail and going from three replicas down to two, you can sort of create a fourth replica before that device actually fails and keep your data that much safer.
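The device health machinery is exposed through the ceph device commands; a hedged sketch, where the device ID is a placeholder and the prediction-mode option name is as of the Nautilus release:

    ceph device ls                          # devices and which daemons use them
    ceph device get-health-metrics <dev-id> # raw SMART data for one device

    # Enable the built-in local failure-prediction model.
    ceph config set global device_failure_prediction_mode local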
And finally, there is a telemetry feature. This is the ability for a Ceph cluster to phone home anonymized metrics about the cluster: you know, what version it is, how big the cluster is, what APIs are being used,
A
what features are enabled, that sort of thing. It all gets phoned home in sort of a high-level report back to the Ceph developers, which is very helpful for us to find out things like what versions of the software are deployed and what people are using, and so on. Recently, we've added crash reports to this mix, so Ceph will sort of automatically generate a crash report whenever one of the daemons crashes, either due to a software bug or some other unforeseen event, and those get collected and archived by the ceph-mgr daemon.
A
So you can query any given Ceph cluster to tell when it has crashed and exactly how it crashed: which daemon, at what time, all that good stuff. And these crash reports are very lightweight; they just have some metadata, like what version of the software was running and what functions were executing when it crashed. There's no identifying information in there.
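Querying the archived reports looks roughly like this (the crash ID is a placeholder):

    ceph crash ls               # list archived crash reports
    ceph crash info <crash-id>  # metadata and stack trace for one crash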
A
But if you enable the telemetry feature, then these crash reports also get phoned home to the developers, which means that if we look at the deployed population of Ceph clusters in the community, we can tell what versions of the software people are running, what specific bugs they're hitting, and in which versions, which can help the developers immensely in prioritizing which bugs to fix, and in telling whether things that were happening previously have stopped happening, and when they started happening; all that good stuff. So, obviously, this is opt-in.
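Because it is opt-in, nothing is sent until you enable it, and you can preview exactly what would be reported first:

    ceph telemetry show   # preview the report that would be sent
    ceph telemetry on     # opt in to the periodic phone-home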
A
So there are lots of ways you can install Ceph. Historically, the way to do that was via a tool called ceph-deploy. This is sort of a bare-bones CLI tool, written in Python, that makes it pretty easy to deploy Ceph clusters. It isn't really maintained anymore; in fact, it's mostly, but not quite, deprecated, and most people these days use one of the other techniques to deploy Ceph. There's ceph-ansible, a set of Ansible playbooks that deploy Ceph on bare metal and enable various features, and there's the Rook project, which is an operator for Kubernetes.
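As one concrete example, deploying Ceph with Rook on an existing Kubernetes cluster boils down to applying the operator and cluster manifests; this sketch assumes the example manifest names shipped in the Rook 1.x repository:

    # From the Rook repository's Ceph example manifests:
    kubectl apply -f common.yaml
    kubectl apply -f operator.yaml
    kubectl apply -f cluster.yaml   # declares the desired Ceph cluster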
A
A
There's also work in progress on integrating the orchestration and installation experience more directly into Ceph, and the idea here is to create an integrated orchestration API that allows Ceph to reach out to the tool that was used to deploy it in order to do things like restart daemons, do upgrades, provision new hosts, replace disks, and so on, with the goal of providing a unified CLI and GUI experience.
A
A
So the Ceph cluster will actually reach back out to Rook and tell it to deploy new OSDs and that sort of thing, and in fact there's also going to be some focus on sort of a bare-bones implementation that just does the bare minimum using SSH, so you don't necessarily need one of these more complicated tools; you'd have Ceph sort of mostly orchestrating itself, handling installation and upgrades via containers and so forth. So that's very exciting;
A
stay tuned for more on that in the future. And finally, I'll talk just a little bit about the open source Ceph community and the larger software ecosystem that we exist in. Ceph is open source software, and we have an open development community. Ceph is licensed under the LGPL 2.1 and 3, for the most part; there are little bits and pieces of it that have slightly different licenses, but the bulk of it is LGPL. We do all of our development upstream, in the open, and we collaborate primarily via GitHub.
A
All our code is reviewed and merged via pull requests on GitHub. We have a bug tracker based on Redmine; there's a development email list where most of the discussion takes place, and we also have an IRC channel where most of the sort of real-time chat takes place as well. And we do a lot of meetings over video chat: we have stand-ups for all the different major components, a lot of weekly meetings, and a lot of discussion, planning, and design meetings, and so on.
A
A
The Ceph community has also invested a lot of effort over the last decade to integrate with adjacent communities and to make Ceph work very well as a storage platform with other pieces of software. A lot of that effort has gone into the OpenStack platform, so Ceph is very tightly integrated with OpenStack, and in fact the majority, more than half, of all OpenStack installations in the world use Ceph as a storage back-end, usually for virtual machines, but also frequently for object storage and file storage as well.
A
More recently, there's been a lot of interest in and focus on integrating Ceph with Kubernetes. As Kubernetes has become the container orchestration platform of choice, and it's scale-out, you need scale-out storage to go with it, and Ceph is a natural choice. A lot of that work is via the Rook project, which is a CNCF project
A
that is an operator for Kubernetes that installs, manages, and provisions Ceph inside a Kubernetes cluster. And, of course, there's a lot of work with sort of the underlying communities, with Linux in particular and with the KVM project, to make sure that Ceph works well in those communities and with those software components. There are also a number of Ceph events that happen throughout the year.
A
The first is Ceph Days. These are one-day regional events that happen, you know, maybe ten times a year, spread around the world. Each Ceph Day usually draws between 50 and 200 people, depending on the location. Usually it's a single track of technical talks, and it's mostly user-focused, but these are very popular; they're very cheap, and if they happen in your area they're very easy to attend. For more information, go look at the web page: you can find out all about the upcoming Ceph Days, where they're going to be, and whether the CFP is open,
A
if you want to speak at them as well. We also have, every year, a two-day global event we call Cephalocon, usually in the spring. These are larger events, anywhere from 350 to a thousand people, and the idea here is to have multiple tracks and bring together users, developers, and vendors from all around the world in one location, once a year, to talk about the latest and greatest in the Ceph community.
A
A
Here's a quick snapshot of the current members of the foundation. Again, these are the premier members that are part of the governing board, and you have a full mix of cloud companies, hardware vendors, software vendors, and so on. We also have a number of general members that support the foundation financially, and also nonprofit, academic, and government institutions that are members supporting the project in non-financial ways. And that's what we have. So, for more information:
A
A
there's a users list with a lot of discussions asking questions about how to use Ceph, and it's a very, very active community there. There's also a developer list with all the developers actually building and working on the future versions of Ceph, and both communities also congregate on IRC. All the development is done through GitHub. I also encourage you to check out the Ceph YouTube channel;
A
that's where this talk is going to be posted, along with past Ceph talks. As I mentioned, all of the past talks at Cephalocons were recorded, will continue to be recorded, and are all posted on YouTube. A lot of our weekly meetings also get recorded and posted there, along with lots and lots of other good resources. So if you're looking to learn more, I definitely encourage you to check out YouTube. And that's it: thank you very much for listening, I hope this was helpful, and have a nice day.