Ceph Ceph Meetups, 17 Mar 2014

Previous Meeting

⏯

youtube image

►

From YouTube: Ceph 101 ☷ Unified Storage #OpenStack

Description

Montréal - March 17, 2014 - David Moreau Simard, IT Architecture Specialist at iWeb, presents the basics of Ceph, "a distributed object store and file system designed to provide excellent performance, reliability and scalability."

* About Ceph: http://ceph.com
* About David: http://twitter.com/dmsimard
* OpenStack Montreal: http://montrealopenstack.org
* About iWeb: http://iweb.com

A video produced by Savoir-faire Linux:
https://www.savoirfairelinux.com
Licence: CC BY-SA

A

I'm David I'm from Iowa I'm, an IT architecture specialist. It's a fancy name for a system administrator for I web zone infrastructure, so I do stuff like open stacks, F, Swift, puppets and server, it's fun, stuff, I, love, Python and I work with high led web and database clusters.

A

What is Seth does anyone know. What's? F is okay, less handsome Swift I'm gonna have a tough job. All right, so stuff is a distributed. Object, store and file system designed to provide excellent performance, reliability and scalability. That's right off surfaces website.

A

That's not for me! Okay. um The interesting thing maybe for you guys is that stuff was created by sage whale as part of PhD thesis in computer science. So whenever I have a problem, I try and find. If someone else has solved that problem, this guy had the problem and he fixed it himself by creating a whole storage solution. So it's kind of awesome stage. Welcome, coupon and DreamHost. It's a web hosting company a bit like iWeb. He also founded ink tank.

A

The company that is today behind SEF may be a bit like canonical for Ubuntu. It's a enterprise behind the product in tank is also mentor in the next and the upcoming a google Summer of Code google Summer of Code. The has a wide range of products for students, students to work on so they're gonna have projects for self and their next summer, and before talking about stuff I'm, going to do a little bit of context, around distributed, storage, I think Marcus did a nice job with Swift and the stupid distributed storage.

A

So you have a laptop you're, a human, hey, I hope you have your computer and you have your disk. I mean this is a laptop, a lot of people here with a laptop. So that's local storage right. You have your laptop, you have a disk, and then you have a small business. So maybe have some money, you purchase the server and you have a whole lot of disks and a whole lot of employees.

A

Your company is successful. Okay now with this and that's the problem, and why why that is a problem? It's because this computer? If it fails, your company fails, you don't want done right, and this is the problem that is often tackled by something like more servers right.

A

So then you have something like this. You have a lot of humans. You have a lot of servers. You have a lot of disks, but the problem with that is: how do you address each of these logical computers right? So each of these are different servers. You maybe have different file shares on different servers. It's a mess to deal with, so this is why big corporate corporations have done something they call. Sands sands looks something like this: it's oversimplified, it's a big, expensive, appliance right.

A

There's a lot of computers. There's a lot of disks. The difference is the the black wrapper around. The whole thing is that the sand is one logical unit right. You address that.

A

You don't talk to tens of servers, you talk to one server or one IP address or whatever the configuration may be, but the idea is that you're, just one single device and often are not often but like all the time, there's going to be logic and intelligence in the Sun, so that there's replications and you don't lose your data. If you lose the server, if you lose a disk and so on so now I don't like sands I'm. Sorry, there are very expensive. The license is expensive. The support is expensive.

A

The the software and the hardware is often proprietary and there's a vendor lock-in. So you can't often you can't really mix and match your hardware and software. You go with a solution. You stay with it and you have to deal with that.

A

Stuff stuff is meant just like Swift is meant to run on commodity hardware. That means just like any art. Where you have computers at home. You can rinse off on it. If you wanted to actually, if I didn't have a network attached, storage at home, I would probably run stuff, it's free, free, isn't, candy doesn't cost anything. Ink tank does provide enterprise support and a bit like a canonical does, but otherwise it's free, it's open source. The code is on github, you can look it up and send pull requests the guys over.

A

There are pretty reactive and it's awesome. It's a feature right. No, this soft software stack now we're gonna delve in a bit in the details of what SEF actually looks like at the software level. So you have red O's, which is an acronym for reliable at animate, distributed, object, store and a bit like Swift is an object store and it takes care of replicating stuff. So you don't lose data whenever you lose a computer or disk.

A

The core of stuff is actually ratos because everything is replicated in that object, store and you have liber8 O's that sits on top of Rados and libery dose is a set of bindings and you can talk to Rados through various api's there's bindings for C C++, Java, Python, Ruby, PHP, Python script, for instance, you can import the the Ceph libraries and talk directly to liberators and then there's components that are provided directly within SEF that use libera DOS and allow you to work with SEF so have rattles gateway, I!

A

Think right now, it's it's renamed the SEF gateway or something. But anyway it's a rest. Interface to Rados so allows you to use SEF as an alternative to swift. We can talk with that about that, a bit later on it has. It provides an API that is compatible with swiftest Amazon, its s3 and Swift.

A

You have our BD that stands for Rados block device. It allows you to use SEF for block devices, so a hard drive but hard drive over network. Think I scusi allows you to do that and then there's SEF of s. That is a distributed file system. I heard earlier, someone was talking about glossary FS, that's something a bit similar to Lester a fest. We can talk about that later as well.

A

The stuff demons so and resolution again.

A

Can we okay, I, hope it's not too small for people at the back and the most the the sub daemon that works with our actual hard drives is called OSD. It stands for object, storage, demon and the OSD is essentially a piece of software that more or less takes over an actual hard drive. So what happens? Is that you'll have one demon per disk so for one server, you'll have many disks right, Wow and so you'll have many OS DS on a single server. Typically, so you have one server with eight disks in it.

A

You know: you're gonna have a 2 s, DS. The the cool thing about is these that don't really care about the the hardware they're residing it. So the hardware like, for instance, if if a specific line of server hardware is discontinued well, you can just purchase another kind of server and it's going to work.

A

There's no hard limit to scaling at the amount of OS DS you can have up to thousands. Always these without a problem. Scale is just as well as Swift. The important thing is that it serves data to clients directly. So when I say a client ism is an application or your code when it talks to SEF, and it will talk to monitors and will get two monitors there and.

A

The client will know where the data resides inside the cluster and it's going to be able to talk to the always DS directly and pull the data from there and the USD is the one taking care of the intelligence, peering and replication to make sure that everything stays consistent.

A

So I have this unfortunate little schema here, so you have, for instance, USD the daemon. The filesystem can be anything like btrfs, six, FS or HD for staff currently recommends x, FS, that's like the best middle ground between the btrfs. This isn't quite prediction ready yet and you have your disks and then you have going to have plenty of as these inside a cluster.

A

You have the monitors, you're gonna you're, not going to have as many monitors as you're, going to have a whiz, DS and typically you're, going to have a small, odd number and the reason that you're going to have the odd numbers because you need a quorum. So.

A

It's like I, don't know what you need the quorum. That's a good word, yeah right. um The monitor hang on this slide is ah isn't that better awesome and the monitors take care of maintaining the cluster state, so they're the ones aware of what's going on in the cluster, pretty much they're going to have they're going to know who the cluster members are. So who are the monitors? Who are the aziz and who are the other components in this F cluster they're, aware of the placement groups and the objects?

A

It's um it's a terminology we'll be looking at a bit later, but they also know what's the overall health of the cluster and they're going to ones be telling you a there's a problem, so this is the role of the motor it takes care of managing the crash map as well. The crash map is something maybe it's the equivalent to Swift's ring s3. It's quite different though, but it's like it's the equivalent it's crash.

A

The crash map is the one that is able to it's the one that is able to let you know where the data resides essentially, and it's also the entry point for stuff clients. So you have your application and you have a hostname or IP, perhaps and you're actually going to talk to a monitor, you're going to retrieve the crash map, which tells you what what's the cost, what the cluster looks like and by talking to the monitor the clients will know afterwards where to push and pull the data on.

A

Oh right, you have the meta data server. It's only used for sefa fast, shared and distributed the file system. Unfortunately, it's not really production ready yet and I really wish. It was because it's really awesome, like there's people, an organization that use it in production. I, know I, know a couple. Companies that do, but Inc think is not really yet ready to officially support that product in their enterprise offering. So it's not quite ready. Yet it's maybe a matter of a few months still but anyway, and the metadata server. It manages the file system metadata.

A

So things like time stamps permissions ownership of your files and folders and the actual folder and file Araki, um it's scalable, which means there. You can have as many metadata servers as you want, and the interesting thing is that the meta data is stored in Rados. So this means that you can lose your metadata server, but you're not really going to lose your metadata because it it lives in the SEF cluster. And that means that you can it's it's more or less stateless.

A

So you can provision in the Commission metadata servers as you want, and then there's also the dynamic subtree partition and it's a it's complicated concept, but more or less the metadata server is aware of which files in your file a hierarchy is being hits the most essentially and it's gonna partition your file hierarchy dynamically between the metadata servers, so that it like one metadata server, could perhaps be handling just like a small part of your file system, while the 2/3 or other military servers will be taking care of the rest of the file tree, because this one spot is being hit so much.

A

There's the Rados gateway. It's I talked about earlier. It's one of the demons. It's the rest, based interface, Doritos, it's impossible with s3 and Swift api's.

A

It's cool enough already right. Am I right Yes No, but what really makes stuff unique? It's crush and I talked about a bit earlier. Crush stands for controlled replication under scalable hashing.

A

Crush it's an algorithm and it was the main topic for a sages, PhD thesis. It's it. It pretends it does pseudo-random placement. That means it's going to place data apparently randomly, but not really. It means it's there deterministic algorithm that will furnace this one given operation. It will always do the same thing.

A

The calculations are client-side.

A

If I, if I do a parallel to Swift Swift has these databases for metadata and it has the arraign that is able to compute where the data that lives with crush the clients are able to tell where the data lives without querying a proxy server or a middleware server, it just knows, crush will distribute the data uniformly evenly evenly thanks, it's stable and by stable. It doesn't mean like well, it's it works. Well.

A

It means that if you lose a disk, our server and we'll move data as little as possible and the OS DS will appear between themselves. So if you lose a server- and you have maybe ten other servers, it's at the end of the world- you're not going to be impacted that much it's configurable. That means that it can be infrastructure.

A

Topology aware, and by that it means that you can in safe define, while these servers are living in this rack, and these servers are living in this row in this room in this data center and then that allows you to do something like replication to different failure domains. So you can have, for instance, okay, I'd like to do three applications, but one in another data center. So you can do that with stuff and you can configure replication. So two, three, four none and you can wait you're different with these.

A

So maybe, if you have faster drives, maybe you want data to be cycled in them a bit more. Well, you can wait, those drives a little bit more or you can wait them down. So they're going to be used less.

A

How Crush distributes data and SEF there's pools and pools?

A

There look at the logical container for your data they're like folder, but at the same time they're not like folders and they're.

A

Really, you could work with just one pool and you could work just fine, but the idea is that the crush configuration I talked about earlier they're, usually going to be set on your pools, so perhaps for one set of data you'd like to have two replicas less important data, but for other information you'd like to have three years I guess so we're going to have two different pools and different configuration for each of these pools.

A

These pools contain placement groups I'm going to talked about them later and images. For instance, radios blog device D belong in pools. So when you create a blog device, you're going to use you're going to you're going to have to specify in which pool disk block device will reside.

A

Images are striped to grow across placement groups. That means your blog devices are actually replicated and consistent across the cluster and I talked about replication per pool and custom crest rules, placement groups, the actual objects, the binary here. Your files, there's they're stored in placement groups, and the interesting thing is that the object aren't really replicated it's. The placement groups at are, and there's going to be a ll schema later.

A

That will give you a better idea of what it looks like the rule of thumb is 100 placement groups per OSD and physically, your objects are going to be split amongst your placement groups and the more placement groups you have the more evenly your data will be distributed amongst your cluster, but the more pages you have also the more CPU intensive it becomes. So that's like that's a that's a hard limit, so you have here a little diagram, so you have, for instance, your binary data, an image it goes through crush.

A

It gets split into objects, and then you have this pool like the pink, the pink, a square, a rectangle, and then you have your replacement groups and then each placed in groups. The objects are being stored in the placement groups, so in this example, I have a replication factor of two. So if I look at this little red square, I have this red square here and the second red square. Here, it's not going to be anywhere else so.

A

What happens if you lose in USD?

A

So you have this little diagram here and I lost, allows this placement group here. It's really oversimplified because there's a lot of placement groups on a single Drive, but it gives you an idea. So what happens is that the I lost a red and.

A

Brightness pale yellow square, so the OS DS will peer between themselves and like right. Now the state of the pool is degraded because there's some files that replicated and consistent so the u.s. news, the OS DS, will appear between themselves and replicate the data into other placement groups.

A

Let's talk money because money drives a lot of things right, I, real-life scenario and I can oh I'm not going to go over the server specifications? You can take a picture if you want or something it's something that is a typical for a storage server and something close to what we'd use at iWeb for a big storage solution.

A

The idea and what I'm trying to show you is that in the past, before having before staff was even considered what we do for high availability and failover and data security is. We would have two storage servers in dr BD, which is a more or less raid over network technology. It was that good, yeah right and with the RBD.

A

The thing at the thing is that usually you'd have one active server and one passive server, so basically you're paying twice as much for really just one server worth of data, and the thing with Saif is that you have these two same servers but they're both going to be active at the same time so for numbers, for instance, for $18,000.

A

Let's pretend I have raid 4 performance and data security and to give you an idea, I put 4 dr BD, both raid 10 and read 50 configurations. So for raid 10 there are more or less 32 terabytes worth of data. If I have three replicas, which is the more the most secure well through, because I mean it's pretty safe, right and stuff, so you're gonna have 43 terabyte.

A

So it's more than 11 thorough, bytes worth more of data, and if we go read 50 over the RBD, because we need more space and we don't really care about performance that much we're going to have 57 thorough bytes worth of data. So if we need more space and stuff, we can tone down the replication count to two and where you have sixty five terabytes and also keep in mind that with Ceph. Both the servers are active and that's worth a lot, because there are other bottlenecks involved, such as network throughput and stuff, like that.

A

Stuff with OpenStack, because I got to talk about OpenStack tour right and block storage, as sap is integrated with cinder and Nova, both in grizzly ivana and nice house.

A

It's similar to ice kizi, but better yeah, as in there's a kernel module right in the links right in the Linux kernel, and it's been there for a while for four years now.

A

So if I want to mount a blog device over staff, it's really like four lines: I load, the kernel, module I map, the blog device, the the cell flag device to my server I format, the disk image and I mount it and it's good to go. I can use it. It's over the network, it's replicated and it's highly available images.

A

It's integrated with glance in OpenStack glance is a project that allows you to upload your images, such as I, don't know: QQ cow or qumu images and a glance natively is able to talk with Saif and store and retrieve images from their object. Storage. It's an alternative to Swift and OpenStack. It works, I've tried it and it's incompatible with a API with Swift and Amazon s3.

A

The thing with with object, storage with Saif is that Saif.

A

All right off the bat is a synchronous, replication, so Swift earlier Markos said that it's able to do rites at two places and then it's going to tell you. Okay I've done the right, and then it's gonna in the background replicate a third time. It's that stuff is not going to. Let you do that. It's going to send you the ACK.

A

Only once all the replicas are done and they've done something in the latest release of Ceph I haven't had the time to try it out yet, but they have actually done like several zones and Federation's. So maybe like they've done something with that. That's able to tackle that problem shared in distributed file systems ffs, it's an alternative to glass, very fast, so CFS and stuff, like that.

A

It's supposed X compliant and you can mount it at ffs over kernel module or over fuse soon and, as you see, smells my face because I'm sad I have a really good use case for it, but I'm not able to do it who uses stuff a whole lot of people.

A

I'm sure you recognize some of the companies listed in there. There are either known users of staff to contribute to staff or their actual ink tank customers.

A