From YouTube: Sage Weil Presents An Intro to Ceph for HPC
Description
In this video from the Lustre User Group 2013 conference, Sage Weil from Inktank presents: An Intro to Ceph for HPC.
Learn more at:
https://www.gaveledge.com/SFS1301/agenda
and
http://inktank.com
All right. Sage Weil is the creator of the Ceph project. He originally designed Ceph as part of his PhD research in storage systems at the University of California, Santa Cruz. Since graduating he's continued to refine the system with the goal of providing a stable, next-generation, distributed storage file system for Linux. Sage is a co-founder of DreamHost, and as a teenager he created and sold WebRing. I want to make some snarky remark about the fact that you're not a Lustre guy, but welcome.
Thank you. Hi, I just wanted to spend a few minutes talking about Ceph to you guys today, give a little bit of an introduction to how it relates to HPC and, I think, the problems that you all face, and try to frame it in the context of how it is similar to Lustre, how it's different, and why you may or may not be interested. So I guess the question in your mind should be: what is Ceph, at a high level?
C
We
describe
staff
as
a
distributed
storage
system,
I
like
to
contrast
it
with
the
idea
of
a
parallel
file
system.
By
distributed,
we
mean
that
it's
a
system,
that's
built
to
be
reliable,
but
it's
built
out
of
unreliable
components.
So
it's
designed
from
the
ground
up
to
be
fault,
tolerant
with
no
single
points
of
failure.
Part
of
that
means
building
it
out
of
commodity
hardware.
C
So
thank
you
know
regular
rackmount
servers
from
Dell
or
whoever
else
you
can
use
expensive,
arrays
controllers
and
specialized
networks,
but
those
things
aren't
required,
we'll
try
to
make
use
of
them,
but
they're
not
sort
of
an
integral
part
of
the
architecture.
It's
designed
for
very
large
scale,
so
we're
from
tens
of
servers
to
tens
of
thousands
of
nodes
and
at
those
large
scales
systems
by
definition,
or
in
most
cases
at
least
need
to
be
heterogeneous.
C
I'm
usually
buy
you
a
first
pod
bite
of
one
type
of
hardware,
six
months
later
by
another
two
or
three
petabytes,
and
so
you
wanna
be
able
to
buy
the
latest
iteration
as
you
do
that,
so
these
clusters
are
sort
of
inherently
dynamic,
they're
growing
over
time
on
mixed
hardware
and
so
forth.
So
from
the
ground
up,
we
design
stuff
to
be
able
to
have
incremental
expansion
or
contraction,
shahnaz,
II,
sort
of
D,
provisional
breaking
hardware
and
so
forth.
C
So
that's
sort
of
how
how
step
is
built
I
suppose
what
set
provides
is
a
unified
storage
platform
so
from
at
the
lowest
layers.
It
provides
an
object
and
compute
source
platform
based
on
distributed,
replicated
highly
available,
object-based
storage
and
then
on
top
of
that
sort
of
underlying
infrastructure.
That
stuff
provides.
We
provide
a
number
of
different
services,
so
one
of
them
is
a
restful
object,
storage
service
based
on
the
Amazon
s3
and
Swift
api's.
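As a concrete illustration of that RESTful layer, here is a minimal sketch of talking to the Ceph Object Gateway through its S3-compatible API with the boto3 library; the endpoint URL, credentials, and bucket name are placeholders for this example, not anything from the talk.

```python
import boto3

# Point a standard S3 client at a Ceph RADOS Gateway endpoint.
# The endpoint, keys, and bucket name are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from ceph")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```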
There's a block storage component that gives you a reliable virtual disk, similar to what you'd get out of a SAN, and that's integrated with the Linux kernel and with the KVM hypervisor, so people setting up private clouds use this frequently. And then finally, sort of the most exciting piece, is a distributed file system that's designed to give you POSIX semantics and be highly scalable for HPC workloads.
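For the block piece, a minimal sketch of creating and writing a RADOS Block Device image through the python-rbd bindings might look like the following; the pool and image names are assumptions for illustration, and it presumes the rados and rbd Python packages that ship with Ceph are available.

```python
import rados
import rbd

# Connect to the cluster and open an I/O context on a (hypothetical) pool.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")

# Create a 1 GiB virtual disk image and write to its first bytes.
rbd.RBD().create(ioctx, "demo-image", 1 << 30)
image = rbd.Image(ioctx, "demo-image")
image.write(b"boot sector goes here", 0)
image.close()

ioctx.close()
cluster.shutdown()
```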
That's actually where Ceph originated: some research money from the Department of Energy in the mid-2000s to look at, at that time, petabyte-scale storage systems. Then, as the system grew, we architected this entire thing, and it came to include, you know, object-based storage and block-based storage and so forth.
Ceph is open source; the server side is based on the LGPL. The kernel-side components for the block device and the file system are, of course, GPL, because they're in the mainline Linux kernel, where they've been for the last couple of years. So, in a nutshell, that's what Ceph is: it's a storage system that presents file access, block access, and object access.
This is an architecture picture that we use frequently. The key idea with Ceph is that the highly scalable, highly available piece is just the RADOS component at the bottom. That gives you reliable and scalable object-based storage, and then on top of that object substrate we provide a number of different services, be they RESTful object storage, these virtual disks, or the Ceph distributed file system, which has its own metadata service and so forth to build out the namespace. But it's all based on RADOS, the reliable object store. Ceph does a number of things differently, though.
C
So,
looking
specifically
in
the
context
of
how
people
typically
set
up
lustre
systems
versus
house
s,
systems
are
built
in
a
sort
of
a
conventional
H,
a
environment.
You
have
some
sort
of
access
network.
You
typically
have
a
redone
heads.
Oss
is
and
sort
of
lesser
case
that
are
then
talking
to
some
back-end
disk
array.
That's
designed
to
be
highly
reliable.
So
at
a
high
level,
your
striping,
across
reliable
things,
reliable
disk,
arrays
Seph,
is
sort
of
entirely
different
from
that.
C
The
assumption
that
we
come
from
is
that
any
component
in
the
system
can
fail
and
we
don't
want
to
have
to
sort
of
deal
with
the
difficulties
of
configuring,
failover
pairs
and
so
forth.
So
the
idea
here
is
that
we're
striping
over
unreliable
things,
but
those
unreliable
things
are
designed
to
be
intelligent
so
that
they're
handling
the
consistency
and
coordination
and
replication
of
data
across
those
different
storage,
nodes
and
stephs
case
there's,
usually
a
front-end
network.
There can also be a back-end network that handles all the replication and data migration traffic, although that's optional. But the key idea is that the servers are coordinating replication and recovery. A typical deployment will look something like this: you'll have a node that has a whole bunch of disks, and on top of each of those local disks you'll have a local file system, because we don't want to reinvent the wheel with block allocation tables and so forth.
C
So
we
like
to
use
butter
FS,
but
people
usually
actually
use
x
fest
for
stability
reasons
you
can
also
use
x
for
ZFS
and
principle
should
work.
Although
we
haven't
tested
it
recently,
but
typically
have
a
whole
bunch
of
these
things
in
a
single
rackmount
server,
you
know
maybe
15
disks
or
something
like
that,
and
then
you
have
a
whole
bunch
of
these
servers.
Making
up
your
storage
cluster
tens,
hundreds,
thousands
one
of
the
key
problems
in
these
systems
is
highly
distributed
at
ax,
so
at
the
object
layer
for
radius,
one
of
the
basic.
The basic idea is that you take all of your objects and you hash them into logical buckets that we call placement groups, and then each of these placement groups is replicated on multiple servers in the cluster using an algorithm called CRUSH that makes sure your replicas are separated across different racks and so forth. When you distribute all of your different placement groups this way, you get a randomized, uniform distribution of data across all of your storage nodes.
C
They
can
make
sure
that
that
placement
group
is
replicated
to
another
node
in
the
cluster,
redistribute
the
data
using
peer-to-peer
type
protocols
all
in
a
fully
consistent
way.
So
that
later
on,
when
the
client
comes
back
and
says,
I
need
to
read.
You
know
this
object.
Foo
it'll
just
recalculate
the
location
that
data
based
on
the
new
state
of
the
cluster
and
I
don't
get
the
correct
answer.
So
this
is
sort
of
the
key
idea
that
makes
the
staff
object,
storage
layer
scale
to
you
know,
tens
of
thousands
of
nodes
was
very
minimal
central
coordination.
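To make that placement idea concrete, here is a deliberately simplified Python sketch of the hash-into-placement-groups step and a stable, pseudo-random mapping of placement groups onto servers. It is not the real CRUSH algorithm, just an illustration of how every client can compute data locations from the cluster state instead of asking a central directory; all names and counts are made up.

```python
import hashlib

def pg_for_object(name: str, pg_count: int) -> int:
    """Hash an object name into one of pg_count placement groups."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % pg_count

def osds_for_pg(pg: int, osds: list, replicas: int = 3) -> list:
    """Toy stand-in for CRUSH: deterministically rank the live OSDs for a PG."""
    ranked = sorted(osds, key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest())
    return ranked[:replicas]

# Any client with the same OSD list computes the same placement, with no lookup table.
cluster = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
pg = pg_for_object("foo", pg_count=128)
print(pg, osds_for_pg(pg, cluster))
```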
C
There
isn't
somebody,
that's
saying
you
read
this
data
and
moved
over
there.
Instead,
the
central
coordination
is
simply
saying
this
note
is
up
and
this
node
is
down
and
everybody
is
sort
of
responding
by
moving
moving
the
data
around
so
liberate.
Us
is
the
low-level
library
that
lets
you
sort
of
access
this
this
distributed
storage
layer.
That's
you
know
it's
a
standard,
shared
library
of
bindings
in
every
language
you
can
imagine,
but
in
contrast
to
many
other
systems,
it
gives
you
a
very
rich
object,
API.
C
So
in
most
object
systems
an
object
is
just
a
bunch
of
bytes
and
maybe
some
extended
attributes
and
Seth.
You
can
store
a
lot
more
than
that,
so
you
can
soar
keys
and
values
inside
an
object
in
an
efficient
way,
I'm
think
Berkeley,
DB
tables
or
no
sequel
table.
Something
like
that.
We
reach
each
object,
is
a
log
logical
containers
at
keys
and
valleys.
They
can
store
lots
of
them
and
get
efficient
insertion
solutions.
Range
queries
stuff,
like
that.
It
supports
atomic
single
object
transactions,
so
you
can
do
things
like
atomic
compare-and-swap.
C
You
know
update
the
bytes
and
the
keys
and
values
in
an
atomic
fashion
and
they'll
be
consistently
replicated
and
distributed
across
a
cluster
in
a
safe
way.
There's
all
this
infrastructure
to
support
snapshots
and
that's
used
by
the
block
layer
and
the
filesystem
give
you
know
per
disk
image
and
/
directory
snapshots
in
the
system.
That's
all
supported
at
the
object
layer,
but
one
of
the
more
exciting
features
is
that
seff
allows
you
to
embed
code
into
the
object
storage
demon
to
actually
implement
your
own
functionality.
C
So
you
can
imagine
if
you
building,
you
know
the
next
flicker
or
something
you
might
embed
code
in
your
object,
store,
that'll,
manipulate
images
to
generate
thumbnails
and
so
forth.
So
you
can
send
an
object
method,
call
to
the
object,
store
and
I'll
actually
perform
that
computation
with
the
data
without
having
to
read
that
read
and
write
the
data
across
the
network
and
finally,
there's
some
infrastructure
for
inter
client
communication
and
coordination
for
locking
and
so
forth.
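As a rough sketch of the embedded-code idea from the client side: newer python-rados bindings expose an execute() call for invoking object-class methods on the OSDs. The "imaging" class and "thumbnail" method below are purely hypothetical, and the corresponding server-side object class would have to be written and loaded on the OSDs separately.

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("photos")  # hypothetical pool of image objects

# Ask the OSD holding "cat.jpg" to run a (hypothetical) server-side method,
# so the full-size image never crosses the network.
ret, thumb = ioctx.execute("cat.jpg", "imaging", "thumbnail", b"256x256")
ioctx.write_full("cat_thumb.jpg", thumb)

ioctx.close()
cluster.shutdown()
```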
C
So
you
can
do
a
lot
with
this
dis
object
store
and
we
use
a
lot
of
these
features
when
we're
building
these
higher
level
services.
On
top
of
that,
so
one
of
the
more
contentions
contentious
things
I'd
like
to
say
is
that
as
I
think
as
a
community
as
we
move
toward
exascale,
my
assertion
is
that
successful
sale
architectures
are
going
to
need
to
transcend
or
replace
POSIX
I'm
sort
of
the
old
paradigm
of
having
you
know.
The old paradigm of this weird file and directory structure, with these very strange oddities around the semantics of POSIX, is not really going to scale well when you start talking about exascale, simply because the hierarchical model does not distribute well. But further, I think that successful architectures are going to need to blur the line that we currently have between compute and storage.
A lot of the processes that we have are manipulating data locally, operating on a small piece of data, and part of those distributed processes are taking data from multiple locations, comparing them, and doing some sort of higher-level calculation. Currently, all of our distributed architectures sort of ignore this: they assume that our storage is either always far away or always nearby, and they don't really recognize the distinction between those two kinds of processing.
I think that a successful, scalable architecture needs to recognize that distinction, so that you can ship the operations that operate purely on local data to the data and perform them there, and for the processes that need data from multiple locations you can pull the data from those locations and do the work locally. That's something that I think hasn't really been resolved in this area.
Finally, I think that fault tolerance needs to be considered a first-class property of these architectures. As we push the scale of our existing architectures, when we start building things like burst buffers and so forth so we can deal with these huge checkpoints across millions of cores, it doesn't make a whole lot of sense, in my humble opinion. That being said, POSIX is going to be here for some time; it's not actually going anywhere. So we continue to build systems that will support POSIX.
So we can run all these legacy codes and so forth, and to that end, systems like Lustre and Ceph will continue to be distributed file systems that can actually support those applications. CephFS builds a POSIX namespace on top of RADOS. We have a separate cluster of metadata servers that handle the file system namespace in a distributed fashion. We store all the metadata in objects, so we can leverage the fact that we already have reliable, redundant data storage, and we provide strong consistency and a stateful client protocol.
The high-level architecture looks very similar to what Lustre does: the clients are talking to metadata servers to deal with the file system namespace, and they're talking to the object storage nodes to actually read and write file data. The difference is that you have lots and lots of metadata servers. The challenge there is that you have a single hierarchy, a tree, and it's non-trivial to decide how to distribute those directories across multiple servers.
You can't simply hash directories across many nodes and expect to get good performance. So what Ceph does is dynamically monitor the heat map of the file system hierarchy and determine appropriately sized portions of the file system tree so that it can migrate them to different servers, and it does this dynamically over time by periodically doing a load-balance exchange and so forth.
As your workload shifts over time, as a new batch job starts up, it will identify which parts of the tree are popular, take an appropriately sized piece, and move it to a different metadata server, migrating the cache contents from one MDS over to another MDS and letting the clients continue in a totally transparent fashion.
It has a number of other interesting features that you don't find in most other file systems; because we built the file system namespace from the ground up, we could build these into the infrastructure. One of those features is recursive accounting: the metadata servers keep track of recursive directory stats for every directory in the file system.
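That recursive accounting shows up on clients as virtual extended attributes on directories, with names like ceph.dir.rbytes and ceph.dir.rfiles. Here is a small sketch of reading them from a mounted CephFS tree, where the mount path is just a placeholder.

```python
import os

# Hypothetical directory on a CephFS mount.
path = "/mnt/cephfs/projects"

# Recursive byte and file counts maintained by the metadata servers.
rbytes = int(os.getxattr(path, "ceph.dir.rbytes"))
rfiles = int(os.getxattr(path, "ceph.dir.rfiles"))
print(f"{path}: {rbytes} bytes in {rfiles} files (recursive)")
```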
It also supports snapshots. The motivation is that once you start talking about petabytes and exabytes of data, it doesn't really make sense to have a single snapshot and data-retention policy for the entire system; you need to be able to snapshot different directories and different data sets. So in Ceph you can actually create a snapshot on any directory in the system and it will affect just that subtree, and you can create and remove the snapshots using standard bash-type commands.
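Those standard commands work because CephFS exposes snapshots through a hidden .snap directory inside each directory. A short sketch, assuming a CephFS mount at a placeholder path and snapshots enabled on the cluster:

```python
import os

dataset = "/mnt/cephfs/projects/run-42"   # hypothetical directory on a CephFS mount

# Creating a snapshot is just making a named directory under .snap ...
os.mkdir(os.path.join(dataset, ".snap", "before-cleanup"))

# ... the snapshot is then browsable read-only, and removing it is an rmdir.
print(os.listdir(os.path.join(dataset, ".snap")))
os.rmdir(os.path.join(dataset, ".snap", "before-cleanup"))
```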
All right, so the real difference is that Lustre has been tuned heavily and successfully over the last decade to run on high-end disk arrays and high-performance networks. Ceph has not had the luxury of that tuning; it's really designed to run on smaller nodes with directly attached disks that are less reliable. That said, it's possible to run Ceph on Lustre-style hardware.
Typically in a Ceph environment you would actually buy more nodes with more disks and replicate across those, but it would be overall much less expensive because they're commodity pieces instead of what you get from the high-end array vendors. Ceph can also utilize flash and NVRAM directly, whereas usually those components are buried deep within the disk array where you can't access them directly.
We did some tuning as an experiment on some hardware at Oak Ridge National Lab. We basically took some existing OSTs (OSSes, I guess) backed by a DDN disk array; I actually have no idea what kind, except that roughly 12 gigabytes a second was supposed to be the max we could get from it. When we were initially turned over access to the cluster and just ran our naive installation, we got 100 megabytes per second out of it.
C
We
were
getting
five
point
five
days
per
second,
which
was
actually
11
because
of
the
way
that
stuff
was
journaling,
so
we're
roughly
saturating
the
disk
array,
which
was
kind
of
nice.
But
there
were
a
couple
sort
of
caveats.
One
is
that
the
way
that
Steph
is
writing
data
at
the
disks?
It's
actually
doing
double
rights,
because
it
has
a
right
ahead
journal
and
then
actually
writes
the
data
to
the
file
system.
That's
designed
to
use
to
be
used
with
conjunction
with
flasher
and
veeram.
It works the same way a NetApp disk array would do it, but there that's usually buried inside the array, so we were actually writing twice to the array. The other thing is that we were using IP over InfiniBand, because we don't have native InfiniBand support in Ceph yet. And it was a long series of annoying things that we had to change: configuring the InfiniBand network properly, the RAID LUNs on the DDN, choosing which type of disk the journals and the data went to, reconfiguring the LUNs, tuning the stripe ratios, fixing TCP auto-tuning and readahead, and all sorts of annoying things that Mark Nelson can tell you about in much more detail. So the good news is that once we actually worked through all these annoying issues, we could get respectable performance. The bad news is that you can't simply plug it in and expect to get good numbers; but I guess you're probably used to that same issue with Lustre as well.
C
So
that's
that's
mostly.
What
I
want
to
talk
about
a
little
bit
more
information
if
you're
interested
in
trying
staff
or
think
it
might
be
suitable
for
your
use
cases
or
workloads,
whether
it's
HPC
or
distributed
computation
or
whatever
step
calm,
is
all
sorts
of
resources
about
how
you
can
get
involved
in
the
community?
That's
it
out
and
so
forth.
B
D
C
C
C
It wasn't actually a Lustre test; it was testing Ceph on hardware of the sort you would typically use to run Lustre. So it was an OSS server that was bought to run Lustre, a typical Lustre OSS and a DDN array that was usually used to back Lustre. So this is the type of hardware that you'd buy for a Lustre configuration: an expensive, big, fast, awesome disk array and then a bunch of head nodes, which isn't the usual Ceph configuration.
So I think that an exascale architecture shouldn't be based on POSIX. I think if you were to take a clean slate and ask how we would actually build a machine that's big and efficient, it wouldn't look anything like what we have today; that's sort of my contention. All that being said, in the systems that we actually build, because we're migrating all these legacy codes that are, you know, poorly written and so forth and don't actually need POSIX, we're taking a bit...
Whether Ceph would be jitter-free? Yeah, I'm not saying that Ceph is any different with the burst buffer and so forth. I think that a more interesting exascale architecture would be one that is based on objects, where you're storing computation that is run directly on the objects and computation that is aggregating the results from different objects, and it would be some sort of, you know, more cloudy infrastructure that is actually running this computation on those nodes and aggregating results and writing them to new objects and so forth.