Description
"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank
Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.
So Ceph is an open source distributed storage system, and we recently did a rebranding because we launched a new company called Inktank, which is a support and services company around Ceph. Many of you may be familiar with Ceph but not with the visual look we have now. We started the company because Ceph is getting to the point where we need to figure out how to start helping people put it into production. So this is the Ceph stack.

It's got a lot of different parts and a lot of different components, but it's all based on something called RADOS, which is the Reliable Autonomic Distributed Object Store. On top of RADOS we've built a series of applications that allow you to use the distributed cluster in a few different ways, and I'm going to go into more detail on those, but first I think it's probably best to start at the beginning.
At the beginning — and I have this nasty habit when I do presentations of going all the way back to the beginning of, like, humanity and trying to make it mean a lot more than it does, and that's what I'm doing here. At the beginning of information storage, I think the first example of how people stored information was probably cave paintings.

Cave paintings, you know — that's how people recorded history and recorded stories and things like that. We've come a long way since then. At some point we figured out how to write, and writing was kind of like cave painting, but a little bit better: you could put a lot more information inside of a book than you could on a cave painting. So already we've noticed in human history that we've outgrown the cave, and now we're beginning to fill up books — and a book holds, you know, maybe about a thousand cave paintings' worth. There's no definitive number.
I mean, I don't know — it's really difficult to say, but that's about the right ratio, and I'm mentioning it because later on, as I move through the history of storage, we see this sort of ratio happening all the time, where we increase the amount of information we can store by a large factor. So people began writing a lot and capturing a lot of information — so much so. But the problem with writing is that writing is kind of time-consuming. It is what I would call, in modern terms, a really low-bandwidth medium: it takes a lot of time to write, it takes a lot of time to read, and the accuracy is not so great — I mean, I can't read half the words on this page. Some people figured out how to industrialize writing, how they could use machinery and technology to make the written word more effective, more legible and more prolific. But at some point something really magical happened.
This new method of storage was mechanical in nature. It no longer was something you could pick up in your hands and read — just human being to device. You required technology to allow you to interface with the storage that we'd created, and that was kind of the first time we noticed that. We started with a nice, simple interface between human being and rock — you know, I'm being a little bit cute — and then human beings began to interface with ink and paper. But then we saw a huge divide, when human beings began to interface with their information through an intermediary, which is the computer, and that caused a lot of technology to need to be built. In building this technology we get computer programmers, right — computers need people to make them work. For the first time in human history, we require a specific type of person whose entire job is figuring out how to allow humanity to interface with its information.
So we have this thing in the middle now, between the humans and their information, and the other thing that happened at the same time is that we started storing information in zeros and ones — in binary, or in some other non-representational form. So this picture of, I don't know, this frog on the dune buggy or whatever it is, ends up being a series of ones and zeroes which we can't read without the help of computers. About this time is when throughput became really important: when we had computers reading and writing our information for us, suddenly it mattered how quickly that happened. Latency began to matter; the rate of reading and writing began to matter. That sort of thing started to matter, and I don't know if that led to the invention of the hard drive, but I suppose the hard drive must have come around at just about the right time, because we also saw, again, a thousand-to-one jump between magnetic tape and hard drives, and I think that it changed things.
It changed the world, right, because hard drives are a lot better than tape. Today, in modern terms, we'd put the same chart up and say that solid state disks are a lot better than, you know, spinning disks; but at the time, this was just an outrageous invention. Because we invented technology that allowed us to store another level of magnitude of information, we've gone from kilobytes to megabytes to gigabytes.

It became really obvious that we needed some kind of technology that allows us to organize our information, so that's when we saw the emergence of file systems. File systems are a hierarchy of information, with directories and inodes and things like that. They allow you to organize your information and also store a little bit of metadata.
For example, if my frog picture were a file, we'd know a few things about it: it was owned by me, it was created on August 12th, it was last viewed on the 17th, it's about 42 kilobytes, and it's readable by everybody and writable by me — those are the permissions we have on this file, which then gets positioned into a tree, in an absolute way. So this allows us to store even more information than we did before.

Then humanity outgrows the hard drive — we just had too much data. We can't fit everything we need to fit on one hard drive, and I'm not talking about all the information for humanity; I mean my movies won't fit on a single hard drive anymore. There is no hard drive big enough to store all of the data that I personally have collected, so we needed to figure out a way to have storage that was bigger than one hard drive.
So now we're figuring out how to have a computer interface with multiple disks — a human being interfacing with a computer that's interfacing with multiple disks at the same time. But that didn't solve every problem, because it's not just my movies that need to be stored: what if my friends want to watch my movies, or my family wants to watch my movies? We need to figure out how to allow multiple people to interface with multiple disks through a computer.

So computers are getting smarter the whole time; they're getting more intelligent because they're having to deal with more advanced situations. And actually it looks a little bit more like this — it doesn't look like this, it looks a little bit more like this — because you have tons and tons of humans interfacing with tons and tons and tons of disks, and guess what: we've found another bottleneck. This computer in the middle is looking awfully tired at the moment.
And these days you're really looking at powering virtual machines on top of all of this. So, throughout history — sort of concluding my going-back-to-the-beginning-of-time intro that I always do — we started with painting and writing and computers, and we're working all the way forward towards this exponential growth in the amount of data humanity is storing, and we think Ceph is the answer. That's what our perspective is — we think it's one of the answers out there.
So as humans began to interface more and more with these multiple computers accessing multiple disks, it became obvious that this was something that could become a business. What people did is they took these clusters of computers and disks and productized them into appliances, and there are lots of different options for appliances out there. Humans interface with these appliances instead — it's a much simpler way to interface with a complicated network of computers and storage.
Hard drives go down in price and processors go up in price — well, actually they never go up in price, they go down in price too — but there's a certain amount of fluctuation due to the commodity nature of computing, and sometimes you can buy computers cheaper than at other times. But if you buy a hardware appliance — let's say I buy a petabyte of hardware appliance from a proprietary appliance vendor, a really well-known appliance vendor — it's going to cost, I don't even know, like 14 bazillion dollars.
And the truth is that it's not something you can change. There's a bit of a black-box nature to these appliances. There's an advantage to that, in that it's a little bit more convenient, I think — and perhaps back in the day it was a more reliable thing, because you could test that one unit as opposed to testing a thousand different parts — but they're kind of black boxes. So if you, for example, wanted to pull out a compute node and put in a faster, bigger compute node to deal with some sort of hot spot, you could do it, but you couldn't do it with everyday hardware. You couldn't just go to whatever electronics store is around your house and buy a better processor.
It's not that kind of flexibility. And similarly, if you're a human of subtype developer — subclass developer — and you want to interface with this appliance to make it do something that it's not designed to do, that's not good either, and really the only option you have at that point is to go back to the vendor. So the perspective of Ceph, the Ceph project, is that the world needs a storage technology that scales infinitely — absolutely infinitely — because we understand that our requirements for data storage are going to scale infinitely, so we think the technology should scale infinitely as well. We also think that the world needs a storage technology that is software-based, that doesn't require an industrial manufacturing process behind it. We don't think this problem gets solved by putting more technology onto trucks and trains and planes and bolting it into racks — well, we do, but we have a different perspective.
So this is Sage Weil. He's a co-founder of DreamHost, which is a company that has invested in Ceph; he's the inventor of Ceph; and he's the CEO of Inktank, which is the company we spun up to do services and support around Ceph. Sage had a vision, and his vision was that he wanted to build a storage solution. He also believed that we needed to be community focused, and as the community guy I'm very thankful for that — it makes my job kind of nice; I don't have to argue so much with people who don't get community, which is really good. But the real motivation is that all of us are smarter together than we are alone. You've heard that a million monkeys with a million typewriters will eventually produce the complete works of Shakespeare — I think with the internet we're kind of proving that that actually happens.
I think that the more people you get involved in a software project, the better it becomes. And this picture actually is kind of interesting — this is my LinkedIn map, which is a list of all the connections I have on LinkedIn, showing how interconnected we truly are. All these people that I know, who know all of these other people that I know — it's really fascinating.

Sage wanted to make sure that whatever happened with Ceph was done in a way that was community focused, and the other reason it's important to be community focused is that Ceph, the technology, doesn't belong to Sage, it doesn't belong to Inktank, it doesn't belong to anybody — it belongs to everybody. And, kind of like a forest, we all have to take care of it.
We have to pitch in and do our part. So, for example, if Ceph doesn't do something that you want it to do, we encourage people to come and build that thing — it's open source and it belongs to everybody. So that's a really important philosophy as well. On the design side, we wanted it to be scalable: we wanted to make sure that it was appropriate for a world where we have more data than can fit on a cave wall, more data than can fit in a book, on a disk drive, on a single computer — more than will fit in a room. We want to have a technology that can expand beyond that and truly be infinite.
We also felt pretty strongly about having no single point of failure. This is very similar to scalability, but it's a little different: a lot of storage solutions scale, but with a single point of failure. Our philosophy is that if you have a truly infinite storage network of millions of disk drives, something is going to be failing all the time, so you can't afford to have anything that is a single point of failure. And now for a bit of a tangent.
This is kind of a startling picture, I know. This is a banana slug, and the banana slug is the mascot at UC Santa Cruz, which is where Sage got his PhD, studied storage and invented Ceph. The banana slug is also a relative of the cephalopod, which looks like this, and this is the original logo for Ceph that we had before we decided to rebrand. It's a big metaphor, because the octopus has multiple arms — eight arms — so obviously it has replication across its arms. It has two eyes, so you have high availability on the eyes, which is not really in line with our philosophy — you really want everything to be replicated. But we have a big problem with the octopus as a metaphor, because it does have a single point of failure.
That's not why we went away from using the octopus as our logo, but it is important to us that everything we do is completely scalable and has no single point of failure at all. I think a better metaphor for the technology might be a beehive, although there's still a queen bee, so I don't know — perhaps a coral reef or something. I'm still trying to figure out what the good metaphor is, or maybe we don't even need a metaphor.

I think the other thing that was an important design consideration for us was that it's a software-based solution, meaning that if you wanted to change the hardware out — put in faster hardware, slower hardware, solid-state disks or spinning disks, whatever you wanted to do — the technology was interchangeable, because the software is separated from the hardware. It gives people a lot more flexibility, and it also allows people to buy the cheapest hardware available.
You can do that because you can have a heterogeneous technology environment with a software solution, so that's a thumbs up. The other thing that was really important was that the system is self-managing, and this is because hard drives are not going to be the technology for a whole lot longer — spinning drives, I mean, if you look at it in the cave-painting scope of the world. It won't be that much longer, but in the meantime they are basically record players — little tiny record players inside of the computer — and they fail all the time. They will fail; it's guaranteed they'll fail. And if you have a cluster with a million disks, that means a disk could be failing fifty-five times a day. So it's really important that the system is self-healing, so that fifty-five times a day, when something goes wrong, it takes care of itself instead of needing human intervention to move data from one node to another.
So that's where Ceph came from — it came from these ideological principles and these design philosophies. Sage went off to school and built Ceph, came back and decided to continue building it, because he thought he had something. So after the invention of Ceph at UC Santa Cruz, Sage came back to DreamHost, where he was a co-founder — DreamHost is an ISP and hosting company in Los Angeles — and DreamHost decided to continue incubating Ceph, to great results.

The monthly code commits went up very, very noticeably at that point, and Ceph started popping up in other technologies like QEMU and OpenStack, and Ceph is inside the Linux kernel. And if you're on the marketing team at OpenStack, I'm really sorry about what I did to your logo — it's just illustrative; I know that's a bad thing to do. But anyway, the point is, it starts popping up in these places — all these integrations started happening even before there was any kind of commercial effort around Ceph. It's truly community oriented.
So, going back to this architectural diagram, I'm going to talk about each one of these boxes in a little bit of detail, to give everybody an understanding of what Ceph is, how it works, and how the pieces fit together. The first one I'm going to talk about is RADOS, which is what everything else is built on top of. RADOS is fundamentally an object store, and it works kind of like this: let's say I have a node with five disks. I need to put a file system on each of those disks, because Ceph runs on top of a file system. That file system can be btrfs, XFS or ext4. We believe that btrfs is, in the long term, the right file system to run Ceph on top of, but it can be XFS or ext4 as well in the short term, as btrfs continues to increase in stability. Then, on top of each of these file systems, you run a Ceph OSD, which is an object storage daemon.
We suggest one OSD per disk. It could be one per host, or you can have multiple per disk, although I'm not sure why you'd want to do that — we really suggest one per disk — but it's flexible in the way that you deploy it. All of this on one node becomes part of the cluster.
Then there are the monitors. What they do is maintain a map of the cluster: they understand which hosts are in and which hosts are out of the cluster, and which hosts are up or down — which is distinct from in and out, because up or down is a more transient state than in or out. The monitors also provide distributed decision-making, so you need to have an odd number of them, because they talk to each other to figure out who has the correct cluster map; if you have, say, three monitors and one of them disagrees, you need the other two to form a majority. Also, if you have a split-brain situation where the cluster is split in half, with two monitors on one side and one monitor on the other, the side with two monitors is the canonical side of the cluster and will continue to operate as such, because those monitors know they have the majority of monitors. So it's important to have an odd number.
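To make the majority rule concrete, here is a tiny illustrative sketch — just arithmetic, not actual Ceph code — of why an odd number of monitors avoids ties during a split:

    # A side of a network split keeps authority only if it holds a strict
    # majority of the monitors. With 3 monitors a 2/1 split still has a
    # majority on one side; with 4 monitors a 2/2 split has none.
    def has_quorum(monitors_on_this_side, total_monitors):
        return monitors_on_this_side > total_monitors // 2

    print(has_quorum(2, 3))  # True: this side keeps serving the cluster map
    print(has_quorum(2, 4))  # False: neither side of a 2/2 split has quorum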
The monitors do not serve any objects to clients; they don't serve any data to the clients at all. All they do is monitor the cluster. Also, when you mount the file system, you mount it using the hostname of a monitor. So that's what the monitor does. For the OSDs, one per disk is what's recommended, and at least three in a cluster, because otherwise it doesn't really make a whole lot of sense. These actually do the serving of stored objects to clients — these serve the data to whoever needs it.
So, for example, when you put an image into your cluster, an OSD can create a thumbnail automatically. This is a bit of an experimental feature, because we're still figuring out exactly how much computing you can do on each of these OSDs without starting to eat into the processor and memory they need to actually store the data. So it's something that's experimental now, but really, really powerful.
The next component of the architecture I'm going to talk about is librados, and librados is kind of what it sounds like: if you're a UNIX or C person and you're familiar with libraries, it's a library that allows you to build applications that interface with RADOS. So, for example, if I have an application and it's built with librados, I can use that application to store an object into a cluster, and it's going to speak the native protocol, which is a pretty lean protocol — it doesn't have a lot of overhead, it's super fast. So if you're building an application that needs very rapid or very efficient access to a cluster, we suggest you use librados or any of its other language bindings; there are bindings for C and C++, Python, PHP and Java. So that's librados — what it does is really straightforward — and RADOS is the foundation for just about everything else that we've built on top of Ceph.
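As a quick illustration, here's roughly what that looks like with the Python librados binding — a minimal sketch that assumes a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named 'data':

    import rados

    # Connect to the cluster using the local Ceph configuration and keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool (assumed to be called 'data').
    ioctx = cluster.open_ioctx('data')

    # Write an object and read it back; the client speaks natively to the OSDs.
    ioctx.write_full('frog.png', b'...image bytes...')
    print(ioctx.read('frog.png'))

    ioctx.close()
    cluster.shutdown()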
The next component is called radosgw, which is the REST gateway for RADOS, and it's compatible with S3 and Swift. So, for example, if I have an application and I want to store an object into a cluster, I can have that application talk to radosgw, which is built on top of librados and which then accesses the cluster. And I can have multiple of these, right, because everything is distributed and there's no single point of failure anywhere. So with the REST gateway you can have multiple of them, you can put them behind a load balancer, you can do your standard web tricks to make sure that radosgw is highly available — the architecture supports multiple of them. They speak the native protocol to the cluster, but they expose a REST-based protocol to the applications that is compatible with S3 and Swift, which is kind of cool. They also support buckets and accounting. So this is the easiest way to get data into and out of the Ceph cluster if you're looking for application-style storage.
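Because the gateway speaks the S3 dialect, an ordinary S3 client library can talk to it. A rough sketch with the Python boto library — the endpoint and credentials below are placeholders for whatever your radosgw user was created with:

    import boto
    import boto.s3.connection

    # Placeholder credentials and endpoint for a radosgw user.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Buckets and objects behave much as they would against S3 itself.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('stored through the RADOS gateway')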
The next thing I'm going to talk about is RBD. This is a block device built on top of RADOS. So, for example, if I have a cluster, I can have bits spread throughout my cluster that end up becoming a block device, and I can run a virtualization container that has been built with librbd and librados, and that virtualization container will present that information as a single disk to a virtual machine.

So RBD, the RADOS block device, allows you to essentially stripe a virtual machine image across the entire cluster, and it can be a very, very large image or a very small image — usually very large. The virtualization containers that we support I actually cover in later slides, but a virtualization container can be built with librbd, which allows it to access this block device. It also allows some really nice tricks. For example, if I have one virtualization container that's running a VM off of a block device, I can actually move that virtual machine to another virtualization container live — because we're decoupling the storage from the compute infrastructure, live migration becomes something that's feasible and actually possible today. Also, if you don't want to do this in a virtualization context, you can use KRBD, which is a Linux kernel module, to mount a block device out of Ceph — so it's not built into a virtualization container; in that case it's built into the kernel of a client machine.

So the RADOS block device allows you to store virtual disks inside RADOS. It provides live migration because of the decoupling of virtual machines and storage, and it stripes across the cluster so that you get that sort of distributed redundancy and performance. It has boot support for QEMU/KVM and OpenStack — and, actually, as of just a couple of weeks ago, also CloudStack — and it has support in the Linux kernel. So that's the block device.
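To give a feel for it, here's a hedged sketch of creating and writing an RBD image through the Python rbd binding — it assumes the usual ceph.conf is in place and that a pool named 'rbd' exists; a hypervisor built with librbd could then attach an image like this to a VM:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # assumed pool for block device images

    # Create a 10 GiB image; its blocks will be striped across the cluster.
    rbd.RBD().create(ioctx, 'vm-disk-01', 10 * 1024 ** 3)

    # Open the image and write some data at offset 0.
    image = rbd.Image(ioctx, 'vm-disk-01')
    image.write(b'bootloader would go here', 0)
    image.close()

    ioctx.close()
    cluster.shutdown()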
The final thing that we built on top of RADOS is CephFS, and I'll pause here for a second just to reiterate that all of these things together form a unified storage platform. In the same cluster you can store objects, you can access those objects over REST, you can store block devices, and you can have a POSIX-compliant distributed file system — in a single cluster. The same cluster, not separate clusters. And all of this has been built on top of RADOS, so if you want to build another application that uses Ceph to do something for tomorrow's needs, the architecture is there.

CephFS is, as I said, a distributed file system. So, for example, if I want to mount a file system, the first thing that happens is my client has to retrieve metadata from a metadata server. Something has to collect all the metadata that accompanies a file, and it also has to manage locking and permissions and actually manage the hierarchy itself. So a client will access the metadata server first, to make sure that information is available, and then it will get the data from the OSDs directly. The metadata server manages all the metadata for the POSIX-compliant Ceph file system, which means the directory hierarchy, and it also handles all the file metadata. It stores its metadata inside RADOS — the metadata server doesn't store it locally on its own disk, it stores all of its data inside the cluster — so if a metadata server goes down, all of that metadata server's information is available to the other metadata servers.
I was going to put this somewhere different, but it's a good metaphor for how you store information in distributed storage systems. If I'm an application and I want to store an object inside the cluster, or I want to retrieve an object from the cluster, how do I know where to go to get that object? How do I know which part of the cluster to connect to? There are hundreds, maybe thousands, of machines I could potentially connect to. One architecture is to connect to a single host that directs you: files that start with A through G go here, or directories that start with A through G, or it hashes something, or whatever — but in some way you're breaking up your cluster and saying this stuff goes here, that stuff goes over there, so that when you want to store a file it tells you, oh, it starts with F, let's go put it on that box. You know where it is, and when you want it, you go there and you get it. It's the same thing that happens in a lot of distributed file systems organized this way: the client knows where to get the data because that particular piece of data is always located in the same place — or the same kind of place; maybe it's replicated or whatever, it's a little more complicated, but it's always in the same place.
The other approach is to ask a central directory — hang a left, hang a right, it's the machine at the top right — and it tells you where in the cluster to get the information. That requires multiple round trips, and there are concerns around a centralized metadata server being a single point of failure, but this is also the way a lot of systems do it. This is what I call the "dear diary" approach: dear diary, today I put my keys in the kitchen. It's like writing down where you put your keys every time you walk into the house.
The only problem with this is: how do you find your keys when your house is infinitely big and always changing? Imagine having to figure out where your keys go when your house changes every time you walk into it and it is infinitely large. That's the type of problem Ceph ends up solving, and the answer to that problem, we believe, is called CRUSH. CRUSH is an algorithm that is sort of at the core of how RADOS works.
So with CRUSH, let's say I have a bunch of bits that I want to store into my cluster. The first thing that happens is they get hashed into a certain number of placement groups — that's configurable, but in this example there are ten of them. After it's made its ten placement groups, it runs those placement groups through CRUSH, and what you pass CRUSH is the placement group that you want to place, the state of the cluster, and a set of rules; then CRUSH will tell you, based on that input, where in the cluster that data belongs. So it's a deterministic placement algorithm — it's pseudo-random, but it's repeatable. It will take all of these items and spread them across the cluster in a way that is pseudo-random: it's very distributed, there's no pattern to it, and it gives a pretty even data distribution.
So CRUSH is the algorithm. It's pseudo-random — the hashing it uses for placement ensures that objects are evenly distributed across the cluster. It's repeatable and deterministic, so it will always run the same way given the same input, and it's configured by rules. Instead of telling RADOS to always put, say, these ten different pools — each pool being a different storage pool — onto this or that node, the way you configure Ceph is to tell it: here's my general topology, I have this many rooms, this many rows, this many racks, this many switches; and the placement rules are expressed in terms of that topology.
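To illustrate just the idea of deterministic, lookup-free placement — this is a toy hash-based stand-in, not the real CRUSH algorithm, and it ignores the topology and rules described above — a client can compute placement locally from the object name, the cluster map, and a replica count:

    import hashlib

    def placement_group(object_name, pg_count):
        # Hash the object name into one of pg_count placement groups.
        digest = hashlib.md5(object_name.encode()).hexdigest()
        return int(digest, 16) % pg_count

    def choose_osds(pg_id, cluster_map, replicas=3):
        # Toy stand-in for CRUSH: rank the OSDs that are currently 'in' by a
        # hash of (placement group, osd) and take the first few. Deterministic
        # and repeatable for a given cluster map, like CRUSH, but without the
        # hierarchy (rooms, rows, racks) and rules the real algorithm uses.
        def score(osd):
            return hashlib.md5(('%s/%s' % (pg_id, osd)).encode()).hexdigest()
        candidates = [osd for osd, state in cluster_map.items() if state == 'in']
        return sorted(candidates, key=score)[:replicas]

    cluster_map = {'osd.0': 'in', 'osd.1': 'in', 'osd.2': 'out', 'osd.3': 'in'}
    pg = placement_group('frog.png', pg_count=10)
    print(pg, choose_osds(pg, cluster_map))

Any client with the same cluster map computes the same answer, which is what removes the need for a central directory.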
So, for example, when a client wants to store or retrieve an object from a RADOS cluster, it runs CRUSH on that information, and CRUSH will tell it: your information is on this node and that node, that's where it is — and it always comes out the same. But there's kind of a challenge with this, and the challenge is what happens when you lose a node. Let's say I've lost this node — the green one with the red and the yellow squares. These individual OSDs that make up RADOS are intelligent, and they peer with one another. So when that node goes down, the other nodes find out about it, because they're constantly gossiping with one another, and they realize: uh-oh, the cluster map has been updated, there's a new state in the cluster. Each node then recalculates the CRUSH algorithm for all of the data it's currently holding, and they realize: uh-oh, to make CRUSH work in this new cluster map we need to copy this data from here to there — copy the red square over, copy the yellow square over, like that. So the nodes themselves intelligently reposition the data so that the next time somebody runs the CRUSH algorithm, the data is where it's supposed to be. In this case, the CRUSH algorithm will tell the client that the object is located on the new node, because the old one is down.
The next thing that makes Ceph unique and different from a lot of the rest is the way that it stores its block devices. Let's say that I have this virtualization container running a virtual machine out of a block device that's stored inside Ceph. That's almost never the real case, though — you never just want to run one VM, you want to run dozens of VMs, hundreds of VMs — and so the question is: how do you spin up thousands of VMs instantly and efficiently? That's somewhat difficult, and the answer is that with Ceph you can do an instant copy of one block device to as many other block devices as you want. In this case I've created four copies of this block device, and all of them are taking zero space, because the copy is instant, but it's also thinly provisioned. So if I have a 144-unit block device — it's not really blocks like on a disk, just units — a 144-unit block device, and I copy it four times, it still takes 144 units of storage. That's all it takes. Then, when my client begins to write information to a new copy, it begins to fill in the gaps: if I write four block units to my copy, I end up storing 148 total. And when clients go to read, if the data is inside the copy, it will be read from the copy, and if the data is not inside the copy, the read falls through to the original image. This is what we call copy-on-write cloning with thin provisioning. That helps people because, for example, if I want to spin up a thousand VMs, I can copy my VM image a thousand times — it happens instantly — and then I only start taking incremental space when new data gets stored.
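A rough sketch of what that looks like through the Python rbd binding — assuming a parent image named 'golden-image' already exists in the 'rbd' pool as a format-2 image with layering enabled; each clone is copy-on-write and initially consumes no extra space:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # Snapshot the parent image and protect the snapshot so it can be cloned.
    parent = rbd.Image(ioctx, 'golden-image')
    parent.create_snap('base')
    parent.protect_snap('base')
    parent.close()

    # Each clone is created instantly; reads fall through to the parent until
    # the clone is written to, so space is only consumed for new data.
    for i in range(4):
        rbd.RBD().clone(ioctx, 'golden-image', 'base', ioctx, 'vm-%02d' % i)

    ioctx.close()
    cluster.shutdown()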
Now, back to the metadata servers and single points of failure. The challenge is that file systems require lots and lots of metadata — they require you to keep track of lots of information to assemble the hierarchy — and if you remember the diagram of how the metadata system works in CephFS, you'll notice that there are actually three metadata servers; again, with Ceph, nothing is a single point of failure. So there are three metadata servers, which raises the question: how do you have one tree with three metadata servers? And the answer — well, our answer — is that when the first metadata server comes online, it has control of the entire tree; it has authority for the entire tree. When the second metadata server comes online, it takes a portion of that tree, and since all of the actual metadata itself is stored inside Ceph, there's no data to copy when this happens — the other metadata server just assumes that responsibility. As more metadata servers come online, you'll notice that they take more equitable chunks, and this all happens dynamically. So as the load on the system or the data requirements change, the metadata servers adapt and adjust, and they even let you handle hotspots down to the granularity of a single file: you can have one metadata server that is just responsible for managing locking and permissions on a single file, if that's what the cluster needs. We call this dynamic subtree partitioning.
Almost everything I've described works today — almost. Ceph has been in development for about six years, and having just launched the company, Inktank, we're just starting to do some really involved quality assurance and performance testing. But here's where it stands right now: RADOS is awesome. It works very well, and it seems to be very stable.
The other caveat here is that today this is LAN-scale, and the reason it's LAN-scale is that when you write to a Ceph object, the replication is done synchronously. So if you tell it to store ten replicas, it has to go communicate with ten other nodes before it comes back to the client and says: okay, I wrote the file. That's how we keep the system sane, but it means that it wants LAN-scale speeds — or really, really scary-fast long-distance links. Sometimes we talk to people and they say, oh, it only works across the LAN? They ask why, and we say latency, and they say, well, we have, you know, two milliseconds from here to the moon — and we go, oh well, great, Ceph will work. So it's really more about latency than distance.
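A toy model of the point — the numbers are made up purely for illustration, and the real write path goes through a primary OSD, but the idea is that the acknowledgement can't come back faster than the slowest replica:

    # Illustrative only: a synchronously replicated write is acknowledged no
    # sooner than the slowest replica round trip (ignoring the disk writes).
    round_trip_ms = {
        'same-rack peer': 0.5,
        'same-datacenter peer': 2.0,
        'cross-country peer': 70.0,  # hypothetical long-distance replica
    }

    def ack_latency(replica_peers):
        return max(round_trip_ms[peer] for peer in replica_peers)

    print(ack_latency(['same-rack peer', 'same-datacenter peer']))  # ~2 ms
    print(ack_latency(['same-rack peer', 'cross-country peer']))    # ~70 ms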
So I have just a quick little slide on the current status of Ceph and CloudStack, and I chose this image because we really are at the beginning of the road with Ceph and CloudStack — we're just beginning to integrate — but there has been a good amount of work already. This was just announced a couple of weeks ago: RBD support inside CloudStack, which allows storage of virtual disks using RADOS.
B: Thanks, Ross. A high-level question for you: can you explain the difference between Ceph and Hadoop?
A: I can. Ceph actually came out of the HPC world — when it was invented, it was invented primarily for the HPC use case, and that's where CephFS came from, CephFS being the part of the architecture that's not quite ready yet. People have actually used Ceph as a sort of drop-in replacement for HDFS, which gives a little bit more scalability and gets around some single points of failure, but that's kind of experimental today. So I guess that's the major difference.
B: [inaudible question]

A: I think what you want to look at is this message on the mailing list, where a guy in our community named Wido, who's also in the CloudStack community, announces his integration — but I think it's only KVM right now. I know there's been a lot of work with libvirt, but unfortunately that's the information I've got at the moment.
B: How does Ceph compare to GlusterFS?

A: I don't know a whole lot, architecturally, about Gluster — GlusterFS. I know that Ceph has object, block and file on a single unified platform, and I think Gluster specializes mainly in file, and maybe they have block now; unfortunately, I don't know a whole lot about Gluster. I know that Ceph's architecture, and especially CRUSH, was built to get past some of the architectural limitations that we find in Gluster. But I will also say that Gluster has had a lot more time in the market.