From YouTube: RCE 62: Ceph Petabyte Scale Storage
Description
RCE 62: Ceph Petabyte Scale Storage. Download the entire show and subscribe at:
Brock Palen and Jeff Squyres speak with Sage Weil about Ceph, a next-generation distributed storage system and file system for Linux. Sage Weil designed Ceph as part of his PhD research in storage systems at the University of California, Santa Cruz. Since graduating, he has continued to refine the system with the goal of providing a stable, next-generation distributed storage and file system for Linux. Prior to his graduate work, Sage helped found New Dream Network, the company behind DreamHost.com, which now supports a small team of Ceph developers.

A
You can find our old shows there and subscribe by RSS, iTunes, etc. You can also follow me on Twitter, where I've actually been kind of active recently, mentioning some things, and we have a lot of upcoming events, so you can follow me on Twitter at brockpalen, all one word. And again I have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, you just got back from another MPI Forum meeting, so I assume the next revision of MPI is cooking along.

B
It's coming. There's a BoF at Supercomputing about MPI 3.0; we're looking to be done with MPI 3.0 by next Supercomputing, and that date is tentative, but that's the goal we're shooting for. There are a bunch of interesting things coming in there, and the biggest fight right now is about C++, which is really interesting, and which I feel responsible for, because I created the C++ bindings back in 1996.

B
I won't go into the inner machinations of the MPI Forum here, but come to the BoF at Supercomputing to hear all about that. And speaking of BoFs, I have my own BoF as well, with George Bosilca from the University of Tennessee, about the state of the union for Open MPI. Thankfully, the Supercomputing organizers did not put us opposite the MPICH BoF this year. Last year they did, which was kind of disappointing, because you couldn't go to both, and I'd like to hear what those guys have to say too.

A
Yeah, I'll also be at SC; I'll be floating around. I'm not doing any speaking at SC, but I'll be there the entire week. Also, coming up at my home institution, the University of Michigan, right here in Ann Arbor, we have Cyberinfrastructure Days in just a couple of days, November 29th through December 1st, and I'll actually be speaking there on XSEDE, which is the follow-on to TeraGrid, with Phil Blood from Pittsburgh.

B
One throwback to Supercomputing too: we feel like we need to mention the Student Cluster Competition, because it was just so awesome last year when we were a part of it, with all the energy and the excitement around it. You need to go stop by and see it; I think they're on the floor this year, or right adjacent, I'm not sure exactly where they are, but you need to go see them and talk to them, because it was really, really very cool last year.

A
Yeah, I know, that was a great thing last year, and I think Doug and those guys have been putting a lot of neat stuff together with the teams this year. So I'm excited to see; actually, I need to look up what the challenges are this year, what the applications are. I'm curious.

B
All right, I'll throw out one last thing too: my blog and my Twitter are on there. Brock's been more active on Twitter than I have recently, but I've been answering a bunch of MPI-related questions on my blog recently. So if you have any questions about how MPI works, or why we do things the way we do, or anything about the Forum or the standard, please feel free to let me know, either by email or Twitter, and I'll write up a blog post about it.

A
Yes, yes. Our guest today comes from DreamHost. His name is Sage Weil, and I believe he actually started the Ceph distributed file system project. So Sage, why don't you take a moment to introduce yourself.

C
Sure, yeah. My name is Sage Weil. I did my graduate work at the University of California, Santa Cruz, where my thesis was on distributed storage, and out of that grew the Ceph distributed file system. Since finishing, I've sort of continued working on that project to make it a viable, open source solution to the scalability and reliability issues that people have for HPC-type storage, and enterprise storage for that matter.

C
It grew out of a series of grants that the tri-labs (Sandia, Livermore, and Los Alamos) made to Santa Cruz to look at a scalable, petascale, object-based storage system, and so there was some initial research there dealing with low-level object file systems and placement algorithms.

C
But when I joined the project, I was focusing on distributed metadata and sort of how to deal with that issue. At the time, Livermore in particular was just starting to use Lustre, and they were having a lot of pain with the lack of scalability in the metadata server, and so that was sort of the key motivation for that work; but Ceph sort of grew out of that whole effort.

C
I think traditional systems typically talk about storage in terms of blocks, so on a single disk it's block number such-and-such, some large number, or in, I guess, a SAN file system you talk about block offsets within a LUN or something like that. In contrast, the idea with object-based storage is that you name the object in the same way that you name a block by numbering it, but it isn't necessarily a number, and it doesn't have a fixed size.

So the object can be small or large and can have some metadata associated with it as well. Essentially, the key idea is that, while traditionally file systems have to pay attention to data layout and placement and block allocation, and which sectors on the disk are storing what data, an object-based interface lets you push all of that complexity into the lowest levels of the system, where it's more or less hidden from all the distributed, clustered, whatever, higher levels of the file system.

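To make the block-versus-object contrast above concrete, here is a minimal Python sketch of the two interfaces. It is purely illustrative and assumes nothing about Ceph's actual API; the class names, method signatures, and in-memory backing are invented for this example. The point is only that a block device exposes fixed-size, numbered slots whose layout the caller must manage, while an object store exposes variably sized, named objects with attached metadata and hides the allocation details.

    # Illustrative sketch only; not Ceph's API. It contrasts the two storage
    # models described above.

    class BlockDevice:
        """Fixed-size, numbered blocks: the caller manages layout and allocation."""

        def __init__(self, num_blocks, block_size=4096):
            self.block_size = block_size
            self.blocks = [bytes(block_size) for _ in range(num_blocks)]

        def write_block(self, block_number, data):
            if len(data) != self.block_size:
                raise ValueError("data must be exactly one block long")
            self.blocks[block_number] = data

        def read_block(self, block_number):
            return self.blocks[block_number]

    class ObjectStore:
        """Named, variable-size objects with metadata: allocation is hidden."""

        def __init__(self):
            self._objects = {}  # object name -> (data, metadata)

        def put(self, name, data, metadata=None):
            self._objects[name] = (bytes(data), dict(metadata or {}))

        def get(self, name):
            return self._objects[name]

    # Usage: the object-store caller never thinks about sizes or placement.
    dev = BlockDevice(num_blocks=8)
    dev.write_block(3, bytes(4096))

    store = ObjectStore()
    store.put("inode.1000.0", b"hello world", {"owner": "sage"})
    data, metadata = store.get("inode.1000.0")

In a real system the objects would of course live on disks spread across many nodes; the sketch only captures the shape of the interface being described.
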
C
Yeah, so normally we put our OSDs, as we call them (although they have nothing to do with the T10 OSD spec; that's sort of a poor choice of name), but basically the Ceph storage servers that manage the objects sit on top of, normally, a btrfs file system, and where the objects actually show up is just as files, so the low-level file system on each of those nodes handles all those details.

You can also run it on XFS or ext4 or whatever else, but that's the basic idea, and then Ceph itself only has to worry about what's in the objects and where in the cluster they're located; it doesn't have to care about rewriting a tree implementation to track free space or whatever.

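As a rough illustration of that layering, here is a small Python sketch, in the same hypothetical style as the earlier example, of an object server that stores each named object as an ordinary file under a root directory, so that block allocation and free-space tracking are left entirely to the local file system (btrfs, XFS, ext4, or anything else). The directory layout, hashing scheme, and class name are invented for illustration and are not how Ceph's OSD actually organizes its data.

    # Illustrative sketch only; not Ceph's on-disk format or OSD code.
    import hashlib
    import os

    class FileBackedObjectServer:
        """Toy object server: each object is one file under a root directory,
        so the local file system handles allocation and free-space tracking."""

        def __init__(self, root):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def _path(self, name):
            # Hash the object name to get a safe, evenly spread file name;
            # the two-character subdirectory just avoids one huge directory.
            digest = hashlib.sha1(name.encode()).hexdigest()
            subdir = os.path.join(self.root, digest[:2])
            os.makedirs(subdir, exist_ok=True)
            return os.path.join(subdir, digest)

        def write(self, name, data):
            with open(self._path(name), "wb") as f:
                f.write(data)

        def read(self, name):
            with open(self._path(name), "rb") as f:
                return f.read()

    # Usage: the server deals only in object names and bytes.
    osd = FileBackedObjectServer("/tmp/toy-osd")
    osd.write("rbd_data.1234.0", b"some object payload")
    assert osd.read("rbd_data.1234.0") == b"some object payload"

The appeal of this split is exactly what is described above: the object server only needs to know object names and bytes, while the local file system underneath decides where on disk those bytes live.
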
B
Now, you threw out a bunch of alphabet soup there. Can you decrypt some of those things for us? You said things like T10 and whatnot; what are some of these things, for people who aren't familiar with file systems and whatnot?

C
Well, I hesitate a little bit to talk about T10 because it's sort of a red herring, but essentially there was a push, maybe five or ten years ago, for this idea of an object disk, which basically encapsulates this idea of pushing the details of block allocation into the device. The original vision was that you would actually buy a hard drive that you would store objects on, and in the protocol you wouldn't say "store this data in this fixed-size block"; you would actually name the objects and so forth. There's a spec that came out of that, and it never really took off, so now, when people talk about OSDs or object disks, usually that's what they think about, but in reality they don't actually exist in the real world for the most part. There are some people whose products sort of approximate that specification, but nobody actually sells the devices.

So, in contrast to that, Ceph is something completely different. Basically, the idea is that you just take that basic idea of pushing the details of block allocation down into the storage nodes, into the lowest layers of the system possible.

A
So I think you already kind of answered this question a little bit, but let's get it explicitly: there are already a couple of freely licensed parallel file systems out there, you know, Lustre, PVFS2. Why go in the direction of building a completely new implementation, rather than just contributing distributed metadata to Lustre?

C
The real difference, I think, is that those systems aren't really designed with fault tolerance in mind, whereas when we were at the drawing board with Ceph, we sort of realized that, in order to build something that's going to scale to hundreds or thousands of nodes or more, you really have to design for failure from the beginning, and those systems tend to be constructed on.