From YouTube: Ceph and Cloudstack: Integrating the Ceph Block Device into your Cloudstack Deployment - Ian Colle
Description
CloudStack Collaboration Conference North America 2014
Sorry, did you get a picture of me behind you? I know, I'm being a little frazzled; I walk around a lot while I talk. What I'm going to try to cover quickly, because I had some technical difficulties with the integration between Microsoft and Apple, is integrating Ceph and CloudStack. There's a little bit about me. The slides will be available. If you want to contact me, feel free to tweet me at ircolle; I also sit on IRC as ircolle.
All right, can I just get a show of hands: how many people currently have running implementations of Ceph? Okay, how many people have heard of Ceph? How many people have no idea what you're in for right now? Be honest. It's okay, all right!
So Ceph is unified object, block, and file storage. I'll go into the architecture a little bit, with the time that we have, but what it does is provide you block, which you're extremely interested in in a cloud environment, but also object.
A little bit about Ceph: it was started back in 2004 and open sourced in 2006. Then it really started to accelerate when it was picked up in the Linux kernel in 2010, and then in 2012 it was integrated into both CloudStack and OpenStack, so we've been integrated since 4.0.
And the reason this is so good for an exascale-size project is, when you think about having thousands of terabytes of data, or even thousands of petabytes of storage, just do the MTBF numbers: you're going to get failures all the time. With the underlying architecture of Ceph, that failure is hidden from you. Again, on the community side, it's totally exploded since 2012. But let's talk more about the architecture. We have underlying object storage daemons (OSDs) that are one-to-one with each of your physical devices.
Then we have these monitor nodes that sit out there monitoring the cluster, getting updates from the clients and getting updates from the OSDs saying "I'm alive," "I'm dead," and they help give you a map of your cluster: what's alive, what's dead, what state things are in. All of that then talks to the various levels. It's all based on the object storage, and then everything else goes on top of that; we'll see a better picture of that later.
But first I want to talk about, again, what makes Ceph special, and that's the CRUSH algorithm. CRUSH stands for Controlled Replication Under Scalable Hashing. What it does is: when you've got your object, it hashes it in a pseudo-random manner, but it's deterministic, so you know that every time you run it through the CRUSH algorithm, whether you're on the client or the daemon, you're going to get the same answer. So it's going to take this chunk right here and put it on this OSD.
That's what each of these squares stands for. Given your replication rules, and here I've given it a replication rule of 1, so I've got one original and one copy, it's going to put one of them there and one of them here. Now, what's powerful about that is, whether I'm a client that's looking for the data, looking where to write the data or where to read the data, or I'm the storage device itself, I don't have to go to a lookup table; I'm calculating that locally. Okay, so there's no... I don't have to go, "Hey..."
So this is just showing you what that looks like when I've taken this object and put it across all my OSDs, in a bit more detail about what CRUSH does. Now, what happens when we're doing a read? Again, it doesn't do any sort of lookup; it calculates that locally on the client.
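For reference, a minimal sketch of how that deterministic placement can be inspected from any node; the pool and object names here are hypothetical:

```bash
# Ask where CRUSH places an object; the same calculation runs on any client,
# so there is no central lookup table to consult. Names are placeholders.
ceph osd map cloudstack myobject
# Typical output shape (values depend on your cluster):
#   osdmap e42 pool 'cloudstack' (3) object 'myobject' -> pg 3.9 -> up [4,7] acting [4,7]
```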
But what happens when there's an outage? Before, my rule said I had to have one original and one copy, and that they couldn't be, for example, in the same host. Let's say this is one rack and this is another rack.
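A rough sketch of how that kind of rule can be expressed, assuming a pool named "cloudstack" and a CRUSH hierarchy that already contains racks; the rule name and values are illustrative, not the speaker's exact setup:

```bash
# Two replicas total: one original plus one copy.
ceph osd pool set cloudstack size 2
# A simple rule that places each replica in a different rack under the
# default CRUSH root ("replicated-rack" is just an illustrative name).
ceph osd crush rule create-simple replicated-rack default rack
# Point the pool at that rule (pre-Luminous releases call this crush_ruleset);
# replace <rule-id> with the id reported by "ceph osd crush rule dump".
ceph osd pool set cloudstack crush_ruleset <rule-id>
```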
So what's going to happen is these OSDs talk to each other. They talk to the guy next to them: "Hey, you still there? Hey, you still there?" Then, additionally, the pairs talk to each other, so these two guys say: "Hey, I know you've got my copy, you still there? I know you've got my copy, you still there?" And then also, every once in a while, the monitors will go out and say: "Hey, how's everybody doing? I haven't heard from you in a while." So through one of those mechanisms the system finds out.
This guy's gone, whether that's a network issue or the drive just went bad. So this guy's gone, and right now I only have one copy of my yellow and my red chunks, but immediately, without you doing anything, the system says: okay, you guys start moving stuff around. Again, this is calculated locally. I don't have to go to any sort of lookup and say, "Hey, I lost my copy, where do I put it now?" It knows my rules say that I've got to have another copy that doesn't exist in this row.
Here are the available devices, go put it there. Without any intervention from the user, the data just moves. So let's look at this now. Now, if a client is going to go do a read, it knows that it can find it at this other place, because again the monitor gets that updated map saying, "Hey, the copies are here, I can go there." So what happens when you add more storage? Now we've got the opposite issue: instead of having a failed device, now I add an additional rack of storage.
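A sketch of what adding capacity looked like with the tooling of that era; the hostname and device path are placeholders:

```bash
# Add a new OSD on a new host with ceph-deploy (Firefly-era tooling).
ceph-deploy osd create newhost:/dev/sdd
# Watch the cluster remap and backfill placement groups onto the new OSD.
ceph -w
```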
So why is that important for cloud storage? You've got one VM, the block device, the hypervisor; we'll go more into the architecture here, because with the block device that Ceph gives you, you can spin up hundreds of them with cloning. So this takes a little bit to draw. Here we've got your original block. Okay, now, with our instant copy, go back to that: now I've got my original plus four copies with the same amount of storage; I've only used that original 144.
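The instant copies described here are RBD copy-on-write clones; a minimal sketch with hypothetical pool and image names:

```bash
# Snapshot a "golden" template image, protect the snapshot, then clone it.
rbd snap create cloudstack/template@base
rbd snap protect cloudstack/template@base
# Each clone shares the snapshot's data and only stores blocks that diverge.
rbd clone cloudstack/template@base cloudstack/vm-disk-001
```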
Now, when the client goes to read, it's smart enough to know: well, if this is a part that hasn't changed, I just pass that through; it goes right to the original. But if it's a chunk that's changed, well, yeah, I've got to read from here. But you see, your total storage for this is only the delta plus the original.
If you haven't heard of Wido den Hollander, you probably should have; shame on you. He is active in the CloudStack community, he's a core committer, and I jokingly call this the house that Wido built, but this would not be here but for him. He developed the rados-java bindings, he did the plugins in libvirt, he put the logic into KVM. So all of this is credit to him.
So let's go back to the overall Ceph architecture. We've got this underlying object store; again, everything sits on this object store.
Now, one thing I need to be clear about: I'm not saying that you have unified storage in the sense that I can talk to the exact same object using each of these. It's not that I can access one object using the gateway, using the block device, and using CephFS; those are different pools of data that you're going to be accessing. But what it allows you to do is, using the same underlying infrastructure, the object storage gives you access to each of these features.
So you then talk to librbd: it's the CloudStack manager sending JSON commands out to the CloudStack agent, which then converts those into libvirt calls, and those libvirt calls then talk to librbd, which again, as I said on the previous slide, then talks to that object storage underneath and passes that back up. So what do I need to do to set this up?
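To make the libvirt layer concrete, here is a sketch of the kind of RBD storage pool definition that sits at that level; the monitor host, user name, and secret UUID are placeholders, not the exact objects the CloudStack agent creates:

```bash
# Define and start an RBD-backed storage pool in libvirt.
cat > rbd-pool.xml <<'EOF'
<pool type='rbd'>
  <name>cloudstack</name>
  <source>
    <name>cloudstack</name>
    <host name='mon1.example.com' port='6789'/>
    <auth username='cloudstack' type='ceph'>
      <secret uuid='REPLACE-WITH-LIBVIRT-SECRET-UUID'/>
    </auth>
  </source>
</pool>
EOF
virsh pool-define rbd-pool.xml
virsh pool-start cloudstack
```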
What do you do? The first step is to set up your Ceph cluster, and there's a Quick Start Guide there that will walk you through that: downloading the software, setting up your monitors, setting up your OSDs. Then, once you've got that up and running, install and set up your QEMU, same with libvirt. Then, on Ceph, create the pool that you're going to use for your CloudStack storage. This is just a single command: ceph osd pool create cloudstack.
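The full form of that command also takes a placement-group count; 128 here is only an illustrative value, sized to your OSD count in practice:

```bash
# Create the pool CloudStack will use as primary storage.
ceph osd pool create cloudstack 128 128
```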
It's that simple. Then, just like you would with any other primary storage, go to your CloudStack administrator GUI, add primary storage, and for your protocol select RBD. RBD stands for RADOS Block Device, so we've got acronyms of acronyms. RADOS is your Reliable Autonomic Distributed Object Store; you can say that mouthful. So RADOS is the object storage, and then you take RADOS Block Device, and that's where the RBD comes from.
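The add-primary-storage dialog also needs a monitor address and a cephx user and key for that pool; here is a sketch of creating one, where the capabilities shown are an assumption drawn from common RBD client setups rather than the talk itself:

```bash
# Create a cephx user for CloudStack, limited to the "cloudstack" pool.
ceph auth get-or-create client.cloudstack mon 'allow r' osd 'allow rwx pool=cloudstack'
# Print the key to paste into the CloudStack UI.
ceph auth get-key client.cloudstack
```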
The first two, I think, are really exciting use cases. So, cache tiering with a writeback cache: for your object storage, you can have a front end where, let's say, you want spinning disks for your entire five-petabyte cluster, but you can't afford to have five petabytes of flash in your system.

Although, if you do, I'd like to talk to you afterwards. But five hundred terabytes could be a reasonable front end, and you want to make sure that you're getting good throughput on your writes. So you're going to write to the cache, and then, based upon the cache rules you set up, the cache is going to push that stuff off. Think of this as hot to warm; you're pushing it from hot to warm.
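A minimal sketch of wiring up such a writeback cache tier between two pools; the pool names are hypothetical and the sizing and flushing policy commands are omitted:

```bash
# Put a small flash-backed pool in front of a large spinning-disk pool.
ceph osd tier add cold-pool hot-cache
ceph osd tier cache-mode hot-cache writeback
# Route client I/O through the cache tier.
ceph osd tier set-overlay cold-pool hot-cache
```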
And then this is a feature that is not fully ready, or not fully tested, in Firefly; the capability will probably be there, but we're not sure we've totally satisfied every corner case, so we're not going to say it's ready until probably Giant. This is a read-only cache. You've got the opposite problem here: I want to buffer reads for a customer. Say I've got some sort of a content distribution network, and I want to make sure that they're getting as much read throughput as possible.
So I put my flash cache on the front end, and I can either pre-stage stuff or I can say: hey, once someone has accessed that, pull it up there and leave it there until I start to fill up, and then you can push it back down.
As far as size goes, the feature that is biggest for us in Firefly is erasure coding. Erasure coding is the same math magic that makes RAID 5 and RAID 6 work. One of the biggest digs on Ceph in the past has been: okay, fine, I see I can do it with commodity storage, but if I'm not relying upon the smarts of the RAID and the hardware to save my data, how am I saving it? It's with replication. So, instead of having one copy of my data that's RAIDed, now I've basically got to buy three times the usable storage to ensure that my data is safe. So where's that cost benefit? Maybe it's worth it to me to pay more for a big EMC chassis as opposed to three HPs.
So in this scenario, I've got a ten-meg object, and to ensure my data retention and quality I've got one original and then two copies. We call that three-times replication, so you've got three times the original size: I've had to pay for 30 to really get 10 usable.
Now, with erasure coding, what you do, using Reed-Solomon or other erasure codes, is just split it up into chunks using math, and basically there are various K plus M values that you can set that allow you to say: hey, as long as I've got this many... let's say I chop it up into 14 chunks; as long as I've got ten of those, then I can rebuild my data if anything goes wrong. So I've got my original 10 plus 4 parity, just like your RAID 5 or RAID 6.
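In Firefly terms, that 10 plus 4 example maps to an erasure-code profile; the profile and pool names and the PG count are illustrative:

```bash
# 10 data chunks plus 4 coding chunks: survives the loss of any 4 chunks
# while storing roughly 1.4x the original data instead of 3x.
ceph osd erasure-code-profile set ec-10-4 k=10 m=4
ceph osd pool create ecpool 128 128 erasure ec-10-4
```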
So what's next? If you're interested in finding out more about Ceph, the docs are pretty detailed and pretty good; you can go peruse those. We've got some Juju you can go play with on AWS. The Ansible playbooks are a new thing that we just started with in the last month; we've got a lot of traction on using Ansible to set it up. And then we've got a manual QuickStart guide.
Okay, the questions were: since we have S3, do we have ACLs, access control lists, on it? And then the second question was: can I front Ceph with iSCSI, right? Okay, we do have ACLs; we have bucket-level ACLs for the RADOS Gateway. And as far as fronting it with iSCSI, we have a TGT plug-in, so you can use TGT and then plug in access: iSCSI to TGT to RBD.
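A rough sketch of that iSCSI-to-TGT-to-RBD path, assuming tgt was built with its rbd backing store; the target name, IDs, and image name are placeholders:

```bash
# Create an iSCSI target and back a LUN with an RBD image via tgt.
tgtadm --lld iscsi --mode target --op new --tid 1 \
    --targetname iqn.2014-06.com.example:rbd-demo
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
    --bstype rbd --backing-store cloudstack/iscsi-image
# Allow initiators to connect (restrict this in a real deployment).
tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL
```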
Okay, speaking with my community member hat on, I'd say: yeah, go play with it, it's great. Wearing my director of engineering at Inktank hat: we're not ready to support it. Again, when we say not production ready, that just means that we're not sure of the stability of it and we don't want anyone to lose data. There are people in various places that are running it in production; there's a pretty active group in China that's running it with multiple MDS servers and having no problems. So basically, right now it's bulletproofing corner cases.
Okay, the question was: have we implemented any performance gating, QoS on volumes? We have not implemented any QoS; that's on our roadmap, but we have not done it. Okay, I want to be fair to time, even though I stole a little bit with my Wi-Fi issues. So thanks for your time. If you have questions, I'll be over here, and if you want, I've got just a handful of schwag that they sent me, so feel free to come over and grab a shirt or a sticker or something. Thanks for your time.