From YouTube: RH InkTank Ceph Day Sessions Kamesh Pemmaraju DELL
Description
Ceph Day Boston 2014
http://www.inktank.com/cephdays/boston/
All right, our next speaker up here comes from Dell. We've actually been working with Kamesh and his folks over at Dell for quite a while now, so we've had some really good interactions. What was our first big one, University of Alabama? Yeah. So there have been a lot of really cool things; these guys have been boots on the ground and done some really awesome stuff with customers on Dell hardware. So without further ado, I'll let Kamesh take it away.

Perfect. Thank you very much.
I also host and organize the OpenStack meetup in the Boston area. How many of you are from the Boston area? Local, great, perfect. In fact, every month we meet to talk about all the open source projects, OpenStack being the main focus for us. As a matter of fact, tomorrow is our next OpenStack meetup; we're going to be talking about Hyper-V. Peter from Microsoft is here, and he's going to talk about that.
So what am I going to talk about today? I spend a lot of time talking to customers; that's my job. I talk about open source cloud implementations, open source storage, and scale-out storage. I talk to enterprise IT and to service providers, so I've got a broad background of customer interactions, and based on that I've distilled some of the things I'm seeing in the industry. It'll give you an idea of how Ceph is getting adopted and in what areas.
And what use cases are still not there yet in terms of customer perceptions? The reference architecture work we've been doing with Inktank over the past couple of years helps you narrow down all the different things you can do with Ceph. I like to think of Ceph as a Swiss army knife: you can do a lot of different things with it.
B
It
can
fit
a
lot
of
different
use
cases,
but
then
you
can
quickly
get
lost
in
the
forest,
so
you
need
to
really
zero
in
on
what
use
cases
make
the
most
sense
and
then
build
reference
architectures
for
that
and
that's
kind
of
what
we've
been
doing
with
inktank
and
that's
not
enough.
You
also
need
to
come
up
with
okay.
Is
this
reliable
enough?
Does
it
have
the
performance
characteristics?
You know, we need to be able to use this stuff, right? And then at some point the organization, the CIO or director of IT or whoever, gets involved and says: okay, what's the budget for this? Why do we want to use this? Why do we care? What are the use cases we want to use this for? What applications do we care about? All those questions come up, and at some point you have to answer them.
B
You
have
to
answer
to
a
cio
or
to
your
director
of
I.t
or
your
executive,
whoever
it
is
you
need
to
make
those
cases,
and
I
have
some
advice
for
you
along
those
lines
and
then
there's
a
survey
by
the
way.
Ceph
is
the
second
most
I
believe,
the
most
popular
backend
for
openstack,
according
to
use
user
service,
lvm
being
the
the
number
one,
the
the
linux
volume
manager.
So
again,
what
what
are
you
looking
to
build
out
of
this?
You
know
what
is
the
initial
storage
capacity?
B
Are
you
looking
at
steady
state
data,
or
are
you
looking
at
spike
data
usage
at
certain
times
of
the
year?
What
is
the
expected
growth
rate,
so
I
I
guess
you
have
to
at
some
point
once
you
feel
comfortable
with
the
technology.
you start thinking through your pilot and pre-production environments, and ultimately it's all about workloads, use cases. Is it more capacity focused? Are you looking at archival or backup use cases, or are you looking at high performance? You want to put a database on Ceph; is that a good idea?
Well, I'm going to talk about some of that. What type of data is involved? Is it ephemeral data in a scale-out, multi-tenanted cloud environment, or is it persistent data that you want to keep as VMs come and go? Is it object, block, or file? Lots of considerations, so we're going to talk about some of that. Like I said earlier, Ceph is like a Swiss army knife: it can be tuned to a wide variety of use cases.
Let's look at some of them. This is a breakdown that I use. You have two different target users, if you will: traditional IT, which uses traditional SAN and NAS devices, virtualization, and private clouds built on Microsoft or VMware; and then you have cloud applications. Think of them as legacy and modern, in one way of looking at it. Another way of looking at it: you have XaaS (infrastructure as a service, storage as a service, platform as a service),
or you're looking for a compute cloud like Amazon or OpenStack. So you've got both of those types of users. Now, are you looking at capacity or performance? Those are the two things you need to think through. Ideally, Ceph fits right there in the middle, although I can make a stronger case, because I've seen some very compelling performance numbers where you can even replace a traditional SAN with Ceph. In fact, within Dell this is a big, big argument
going on within our storage group. We have Compellent and EqualLogic, which are traditional SANs, just like NetApp and EMC have. So they're saying: hey, what's going to happen to our technologies? Is Ceph going to replace them in the long term? We're having those discussions today, and it's not an easy answer right now. Ceph is maturing; it's good for certain use cases, like dev/test, and I'll talk about what those are. But I think that middle ground is a nice sweet spot for Ceph.
B
I
wouldn't
really
recommend
ceph
for
very
high
performance.
You
know
emc
and
netapp
type
of
storage
devices
where
you
put
mission,
critical,
high
performance,
databases
and
stuff,
like
that,
not
right
now,
but
that's
sort
of
a
nice
target
because
it
fits
a
lot
of
different
use
cases
all
right.
So
the
use
cases
I'm
going
to
talk
more
about
openstack,
because
that's
kind
of
where
my
my
main
focus
has
been
and
where
I'm
seeing
a
lot
of
customer
demand.
Okay, there we go. If you look at this one, we have the content store. You can look at Ceph as an object store similar to Swift, and you can also look at it as a backend for OpenStack.
So let me talk about some of the specific use cases for Ceph that we're seeing, and some that I think Ceph is good for. The main one is OpenStack, obviously, and Ceph has both Swift API compatibility and Cinder API compatibility. Cinder, by the way... how many of you are familiar with OpenStack?
B
Okay,
quite
a
few
of
you
great.
So
the
sender
is
the
component
for
these
for
up
for
volumes
and
block
storage.
So
the
ceph
block
device
is
what
you
would
use
to
interface
with
with
the
ceph
cluster
through
the
sender
api.
So
we've
got
the
volume
interface
as
well
as
the
swift
object
interface.
So
these
we're
seeing
a
lot
of
object
right
now
with
in
terms
of
customer
traction
and
they're,
using
it
for
both
ephemeral
storage,
as
well
as
snapshots
copy
and
write
and
volumes.
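As a concrete illustration of what the Cinder RBD driver does under the hood, here is a minimal sketch using the python-rbd bindings. The config path, pool name, volume name, and snapshot name are assumptions for illustration, not from the talk:

```python
import rados
import rbd

# Connect to the cluster; config path and 'volumes' pool are assumptions.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('volumes')

# Create a 10 GiB format-2 image with layering enabled, roughly what the
# Cinder RBD driver does for a volume (format 2 is required for cloning).
rbd.RBD().create(ioctx, 'vol-0001', 10 * 1024**3,
                 old_format=False, features=rbd.RBD_FEATURE_LAYERING)

# Snapshot the image and protect the snapshot so it can be cloned later.
image = rbd.Image(ioctx, 'vol-0001')
image.create_snap('snap-0001')
image.protect_snap('snap-0001')
image.close()

ioctx.close()
cluster.shutdown()
```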
Now, an interesting benchmark that our storage team did: it turns out that if you put ephemeral storage on the Ceph cluster instead of on local disk, it's actually faster in terms of throughput, which was kind of a surprise to me. But it turns out it's actually faster if you just use ephemeral storage on Ceph. The other reason you would want to use Ceph for Cinder is that you can also have your images hosted on the Ceph cluster.
B
What
happens
typically
in
in
openstack
is,
if
you
don't
have
a
a
surf
cluster
like
this.
The
image
gets
downloaded
from
glance
onto
the
local
node
before
it
boots
up.
So
there's
a
there's,
a
overhead
in
terms
of
moving
the
image
over,
but
if
it's
on
ceph
it's
right
there
in
the
cef
cluster,
so
your
boot
up
times
are
a
lot
faster,
so
you'll
get
some
benefits
from
from
using
ceph
for
both
glance
and
and
and
nova
as
well
as
cinder.
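The fast-boot behavior comes from copy-on-write cloning: instead of copying the image over, the new boot volume is cloned from a protected snapshot of it. A minimal sketch with python-rbd, where the 'images' and 'volumes' pools and the image/snapshot names are hypothetical:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')    # where Glance keeps images (assumed)
volumes = cluster.open_ioctx('volumes')  # where the bootable volume lands (assumed)

# Clone a new volume from a protected snapshot of the Glance image.
# No data is copied up front, which is why boot times drop so sharply.
rbd.RBD().clone(images, 'fedora-20', 'glance-snap',
                volumes, 'vol-boot-0001',
                features=rbd.RBD_FEATURE_LAYERING)

volumes.close()
images.close()
cluster.shutdown()
```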
In the case of OpenStack, the Ceph platform is obviously certified against the Red Hat stack, which is an advantage for you. Now, the other use case is pure cloud storage. If you want to use Ceph as a pure object store, like a Swift cluster, you can use the object gateway with the RADOS cluster as the backend object storage.
The third use case is web-scale applications, and here's an interesting thing. With the first two, you go through the gateways and the APIs to get to the actual cluster. But if you want very, very high performance (in fact, I think the performance difference between the gateway route and the direct route is an order of magnitude), you go directly to the Ceph RADOS cluster using the native protocol. That gives you an enormous amount of flexibility in terms of performance, scale, and multi-tenancy.
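"Native protocol" here means librados. A minimal sketch with the python-rados bindings, assuming a hypothetical 'app-data' pool and the default config path:

```python
import rados

# Connect directly to RADOS: no gateway or HTTP layer in the data path.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('app-data')

ioctx.write_full('greeting', b'hello ceph')   # store an object
print(ioctx.read('greeting'))                 # read it back
ioctx.set_xattr('greeting', 'lang', b'en')    # attach application metadata

ioctx.close()
cluster.shutdown()
```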
So now we can set up a performance block configuration just like that: you can use a cache pool and a backing pool for your volumes. For read/write workloads you can run the cache in writeback mode, and for reads you can use the cache in read-only mode, so both modes are available for the performance block. And as Sage mentioned earlier, the erasure coding feature, which is already there right now in ICE 1.2, can be used for cold storage. First of all, that'll save you money, because you don't have to spend as much on storage: it cuts down the number of storage nodes you need for the same amount of capacity.
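For reference, here is roughly what wiring a cache pool over an erasure-coded backing pool looks like with the Firefly-era ceph CLI, driven from Python. The pool names, placement-group counts, and k/m values are illustrative assumptions:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command, raising if it fails."""
    subprocess.check_call(('ceph',) + args)

# Erasure-coded backing pool: 4 data chunks + 2 coding chunks per object.
ceph('osd', 'erasure-code-profile', 'set', 'cold-profile', 'k=4', 'm=2')
ceph('osd', 'pool', 'create', 'cold-pool', '128', '128', 'erasure', 'cold-profile')

# Replicated cache pool layered in front of it.
ceph('osd', 'pool', 'create', 'cache-pool', '128')
ceph('osd', 'tier', 'add', 'cold-pool', 'cache-pool')
ceph('osd', 'tier', 'cache-mode', 'cache-pool', 'writeback')  # or 'readonly' for read-mostly data
ceph('osd', 'tier', 'set-overlay', 'cold-pool', 'cache-pool')
```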
And you can use the cache pool to basically increase your read performance, because cold storage is a good read-mostly use case. And then, like I said, for databases, again go native protocol, because the way to get performance is using the Ceph block device and the native protocol.
You can in fact put databases on Ceph. Now, I wouldn't recommend this yet; I would highly recommend you do your own performance testing against your own reference architecture before you go down that road. But the thing I'm trying to point out is that there are ways you can use Ceph for very different use cases, because it supports them through these different protocols. And then finally, Hadoop: as Sage mentioned, once the Ceph file system is ready for production, it's a great replacement for HDFS.
If you have 2x, 3x, 4x redundancy, there's a cost trade-off, so it's all use-case dependent. That's one trade-off you need to look at. The second one is all about failure domains. There are lots of different failure domains to think about: disks can fail, your SSD journals can fail, the entire node can go down, your entire rack can go down, or a complete site can just disappear because of a natural calamity.
Each of these failure zones is something you can map out when you're designing your Ceph cluster, through your CRUSH configs, and that way you can start to place disks and nodes in different racks, so in case one rack goes down, you still have another rack to take care of it.
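As a sketch of what that CRUSH mapping looks like in practice (the bucket, host, and rule names here are assumptions), you declare racks in the hierarchy, move hosts under them, and create a placement rule whose failure domain is the rack:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command, raising if it fails."""
    subprocess.check_call(('ceph',) + args)

# Declare two rack buckets and hang them off the default root.
ceph('osd', 'crush', 'add-bucket', 'rack1', 'rack')
ceph('osd', 'crush', 'add-bucket', 'rack2', 'rack')
ceph('osd', 'crush', 'move', 'rack1', 'root=default')
ceph('osd', 'crush', 'move', 'rack2', 'root=default')

# Move storage hosts under their racks.
ceph('osd', 'crush', 'move', 'node1', 'rack=rack1')
ceph('osd', 'crush', 'move', 'node2', 'rack=rack2')

# Placement rule that spreads replicas across racks, so losing a whole
# rack never takes out every copy of an object.
ceph('osd', 'crush', 'rule', 'create-simple', 'replicate-per-rack', 'default', 'rack')
```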
And then think about the storage pools. Do you want a high-performance SSD pool, which will give you the block performance I mentioned earlier, or is it just capacity you're looking for: large-scale, low-cost? We call it cheap-and-deep storage, just large archival capacity for your object store. Then you need to look at a capacity pool, with erasure coding as a potential way of reducing costs. And the monitor nodes: you have to think about their failure domains too. You don't want them all in the same failure domain, because if that goes down, all your monitor nodes go down, in which case you don't have access to your cluster. So consider all these failure domains.
That's kind of the rule of thumb for that. If you can, use SSDs for journaling, which will speed up your write performance, and then of course you have tiering, which is in Firefly, which allows you to use both SSD pools and backing pools, so you can have both of those things. But think about what happens when your SSD fails. We were having a big, big discussion internally about that.
Do you want to make your SSDs redundant, meaning do you want to use some kind of RAID 1 or mirrored configuration for them? At the end of the day, we came to a nice compromise, in which we said you need a five-to-one ratio between your actual data disks and your SSD journal devices. That way, if you have ten disks, then you effectively need two SSDs in a single node.
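A trivial sketch of that sizing guideline (the function and constant names are mine, not Dell's):

```python
import math

DATA_DISKS_PER_JOURNAL_SSD = 5  # the 5:1 rule of thumb from the talk

def journal_ssds_needed(data_disks: int) -> int:
    """Journal SSDs a node needs under the 5:1 guideline."""
    return math.ceil(data_disks / DATA_DISKS_PER_JOURNAL_SSD)

print(journal_ssds_needed(10))  # -> 2, matching the example above
```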
That gives you enough capacity to take care of those things. With erasure coding, as I mentioned already, you can get a lot of additional storage benefit, but at the expense of compute. Sage had a slide showing what happens to recovery times when you use erasure coding: they go up dramatically, and it has a tremendous impact on compute. So you need to think about that.
For extra capacity, you can think about JBOD expanders. Dell, for example, has these MD3060-series chassis: 90 disks in a chassis. At four terabytes per disk, 90 times four, you're talking about almost half a petabyte in a single chassis full of JBODs, just a bunch of disks, which you can hang off of a storage server to get this extra capacity. Huge capacity.
If all you're interested in is cheap storage, that's the way to go, but you have to be careful, because you will be adding extra latency: you're oversubscribing your SAS lanes. So we have come up with an RA that takes that into consideration; we've done some testing along those lines. Monitor nodes: you need an odd number of them for quorum, and the monitor services can be hosted on a storage node itself.
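The odd-number guideline falls out of the monitors' Paxos quorum, which needs a strict majority to stay up; an even count tolerates no more failures than the odd count just below it. A small sketch:

```python
def tolerable_monitor_failures(n_monitors: int) -> int:
    """Monitors that can fail while a strict majority (quorum) survives."""
    return (n_monitors - 1) // 2

for n in (1, 2, 3, 4, 5):
    print(n, '->', tolerable_monitor_failures(n))
# 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2: even counts add no fault tolerance.
```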
B
You
don't
need
dedicated
nodes
for
for
monitor
nodes
if
it's
not
a
very
large
cluster,
but
a
fairly
you
know,
between
10
to
20
is
sort
of
the
guideline.
If
you
have
20
nodes,
let's
say
each
node
is
about
30
to
40
terabytes,
then
you're,
okay,
you'll
still
be
fine.
Having
your
monitor
nodes
hosted
on
the
storage
node
itself,
but
if
you're
going
beyond
that,
we
we
recommend
having
dedicated
nodes
right.
So
a
dedicated
node
just
for
running,
monitor
services,
keep
in
mind
that
monitors
are
relatively
quiet
for
most
of
the
time.
If it's a stable cluster, your monitor node is not really doing anything; it's sitting there idle. So keep that in mind. And if you're doing geo-distribution and you want replication across sites, then what you want is dedicated RADOS Gateway nodes for large object stores that span multiple sites; you use dedicated RADOS Gateway nodes and federation between the sites to keep them synchronized.
Networking is a huge, huge thing you have to think through right from the beginning, because guess what: we have had several customer installations where we had all these conversations with the customer, we were ready to go deploy the thing, and all of a sudden, at the last minute, the networking and security teams come up and say: what are you guys doing? So we had to go back to the drawing board, effectively, and say: okay, now we're going to connect this to your back end.
B
Do
you
want
network
redundancy,
meaning
multiple
switches,
multiple
links
going
to
your
servers,
or
are
you
okay,
with
with
one
single
switch?
So
that's
another
thing
to
think
about.
Do
you
want
dedicated
client
nodes,
client
networks
and
dedicated
data
networks?
Again,
it
depends
on
your
use
case
how
much
traffic
you
think
you're
going
to
push
through
your
data
network
so
think
through
that
one
gig
versus
10
gig
versus
40
gig,
some
customers,
in
fact
a
big
telecom
customer
based
in
the
boston
area.
They have a video-on-demand application, and they were talking about tens of thousands of users on all kinds of different form factors of end devices streaming video, with the system doing translation and transcoding and all kinds of things on the fly. They were doing some performance testing of that. It just goes to show that Ceph can be used in all kinds of different environments.
Now, the first use case, the case study with UAB: this was back before Dumpling, by the way. Their main data center was in Birmingham (the University of Alabama at Birmingham), and they wanted to have a backup facility in Huntsville, which is about 100 miles from there. They had a dedicated WAN link, but when we did the actual testing we found the latency was not good enough, and at the time Inktank didn't have the replication feature. Because, as you know, Ceph is consistent; it's not eventually consistent.
B
It
is
actually
consistent.
So
it
requires
your
all
of
your
notes
to
be
on
a
pretty
high
network
with
low
latency.
So
keep
that
in
mind.
I
think
the
newer
things
that
that
are
coming
out
in
ice,
2.0
and
3.0
will
solve
some
of
these
problems.
But,
right
now
our
recommendation
is
stay
within
a
data
center.
B
So
what
we've
been
doing?
At
dell
we've
been
working
with
a
number
of
distributions.
As
you
probably
know,
we
started
off
with
canonical
distribution.
We've
been
working
with
susa
and,
more
recently,
with
with
red
hat
we've,
been
building
reference.
Architectures
testing
out
these
solutions
on
on
these
different
distributions.
For
a
while.
This
is
the
latest
version.
That's
coming
out.
In
fact,
our
first
dell
red
hat
openstack
solution
was
announced
at
the
red
hat
summit
in
april,
so
this
one's
coming
out
in
in
summer
so
effectively.
Effectively, what it is is a pilot configuration. If you want to start with OpenStack and Ceph, this is the best way to get started, because it's effectively OpenStack and Ceph in a box, completely pre-configured and pre-tested. It's got all the servers you need: Dell PowerEdge R720xd machines as the storage servers, 10-gig networking, Red Hat Enterprise Linux OpenStack Platform, and ICE 1.2, all built in, ready to go, pre-tested. You get up to 36 terabytes of raw storage in this box.
B
In
this
it's
it's
actually
three
quarters
of
a
rack.
It's
not
even
a
full
rack.
So
if
you,
if
you
want
more
storage,
you
can
throw
in
more
more
of
these
storage
bundles,
which
is
kind
of
an
expansion
feature
for
for
this
pilot
bundle.
So
this
gives
you
228
virtual
machines,
36
terabytes
of
raw
storage,
which
you
can
use
as
either
as
volumes
back-end
volumes
to
openstack
vms
or
just
as
a
object
store.
You
can
do
that.
B
It's
got
an
openstack
manager
and
openstack
controller,
all
built
in
with
so
two
additional
nodes
for
so
this
is
coming
out
in
summer.
This
comes
with
a
reference
architecture,
so
all
the
considerations
that
I
just
mentioned
are
have
we've
been
having
lots
of
internal
debates
and
discussions
both
with
inktank
and
red
hat
and
and
the
dell
storage
team
and
we've
finally
come
up
with
a
solution
that
we
think
is
is
optimal
for
a
wide
variety
of
use
cases,
and
that's
what
that
is
right.
If you're looking at a performance use case, say you want to put a database on this, or you want to use it as a test/dev environment for builds, testing, and scale-out testing, the performance server would be your best choice. It's got SSDs. SSD pooling isn't introduced yet (that's coming in version two), but this gives you SSD journals and 40 terabytes of data in a single server, 120 terabytes across three servers. So it's pretty dense; you get a lot of stuff there.
B
If
you
want
to
go
capacity,
then
we
have
these
md
j
bar
chassis.
I
was
talking
about
so
this
md
3060
or
md
1200
they're.
Basically,
these
large
j
bar
chassis
with
lots
of
discs
in
it
right.
So
you
can,
you
can
build
them
out
again
think
through
and
the
the
configurations
will
make
sure
that
you're
not
over
subscribing
your
sas
lanes
or
your
latencies
are
getting
affected,
etc.
But
we
are,
we
have
done
some
internal
testing
to
make
sure
it
fits
the
needs
of
that
particular
use
case.
B
So
what
are
we
doing
to
enable
dell
and
red
hat
specifically?
This
is
the
red
hat
solution.
We
have
a
similar
solution
with
susa
and
susa
has
been
testing
ceph,
as
well
with
with
our
solution
with
dell
hardware.
So
that's
also
available,
but
specifically
with
red
hat.
It's
a
validated
ra
co-engineered
between
red
hat
inktank,
all
three
parties
effectively
we've
been
working
with
inktank
for
over
two
years
and
then
the
red
hat
acquisition
happened
and
we've
been
working
with
red
hat
too
for
the
last
six
months.
So
it's
a
nice.
B
You
know
marriage
made
in
heaven
so
to
speak,
so
we
have
pre-configured
storage,
bundles
storage,
enhancements,
certification,
professional
services
all
nicely
bundled
together.
So
you
get
everything
out
of
the
box
ready
to
go
so
with
that,
I'm
going
to
jump
into
the
case
study
so
university
of
alabama.
What
was
their
biggest
challenge?
So
they
do
a
lot
of
great
research
in
cancer
and
genomic
research
right.
This
is
kind
of
their
focus.
They
have
close
to
900
researchers.
B
Their
biggest
challenge
was
data
sets
right.
So
they
had
this
research
data
effectively
scattered
across
all
kinds
of
devices.
Usb
sticks,
you
know
hard
disks,
desktops
laptops.
It
was
all
over
the
place,
they
had
a
compliance
issue
and
they
had
this
problem
of
managing
the
demand
of
these
researchers
right.
So
so
the
data
was
at
risk.
Productivity
was
going
down
because
people
are
not
finding
the
data
that
they
were
looking
for.
B
They
badly
and
desperately
needed
a
centralized
repository
for
the
data,
mainly
for
compliance
and
a
for
the
demand
that
was
coming
up,
and
this
is
how
their
system
looked
like.
It
was
a
mess
right.
They
had
all
kinds
of
grids,
they
had
local
servers,
they
had
prototype
cloud
set
up
with
all
kinds
of
you
know:
open
source
clouds.
They
were
using
cloud
stack,
they
had
an
openstack
thing
going
on,
they
had
virtualization.
B
I
mean
this,
I'm
sure
this
a
lot
of
typical
enterprise
data.
Centers,
look
somewhat
like
this.
It's
not
very
unusual,
in
fact,
when
we
talk
to
the
other
other
universities,
I
I
can
draw
exactly
the
same
picture
and
it'll
look
the
same
across
the
board
right.
So
this
was
their
situation
and
they
were
using
hpc.
I
mean
at
the
end
of
the
day,
this
is
an
hpc
application.
B
They
were
doing
genomics,
you
know
large
compute
performance
type
stuff
and
they
were
pushing
all
the
stuff
into
the
hpc
cluster,
which
is
of
course
connected
on
infiniband
and
they've
got
all
that
stuff.
They
do
all
their
processing,
put
all
the
storage
in
hpc
storage
and
then
pull
it
back
into
their
laptops
for
doing
some
local
processing.
So
this
is
kind
of
how
it
looked,
and
it
was
all
on
one
gig
network
right.
The
the
interface
to
the
hpc
was
one
gig
network,
which
is
slowing
them
down
lots
of
challenges.
B
So
the
solution
was
a
scale
out
storage
cloud
and
openstack
ceph
and
crowbar,
which
was
our,
which
was
dell's.
Deployment
tool
really
came
to
the
rescue,
so
they
were
able
to
house
and
manage
a
centrally
accessible
across
the
campus
network
storage
cloud
file
system
clusters
can
grow
as
big
as
they
can,
because
it's
all
centralized
it
can
be
provisioned
from
this
one
single
massive
pool,
as
opposed
to
being
distributed
all
over
the
place.
So
it
helps
a
lot
with
compliance.
B
400
terabytes
they
put
400
terabytes
into
production
when
they
started
off
and
they
did
the
math
and
it
came
out
to
be
41
cents
per
gigabyte
per
month,
which
is
actually
pretty
darn
good,
because
if
you
look
at
it's
comparable-
and
I
don't
know
if
it's
comparable
with
amazon,
but
it's
it's
pretty
darn
good,
very
nice
cost
structure
there
and
they're
looking
to
scale
up
to
five
petabytes
over
the
next
year
or
two.
So
the
researchers
were
very
happy
because
now
they
have
much,
they
can
work
with
much
bigger
data
sets.
At the end of the day, they were so happy with this initial deployment that they're now looking to expand it. They were using it for other things too, like research storage and CrashPlan backups, and they were doing GitHub hosting on POCs, etc.
So this is how it looks today. There's a cloud services layer; it's all Ceph right now. They have this virtualized server and storage computing cloud based on OpenStack, Ceph, and Crowbar, which is a deployment tool (we can talk offline about that) that helps you deploy these massive clusters automatically. Doing all this by hand is not easy, trust me, because there are a number of considerations, all the way from configuring
B
Your
servers,
your
networks,
getting
all
your
openstack
services
up
and
running
all
your
ceph
services
up
and
running
tying
it
all
together.
It's
it's
not
an
easy
thing
to
do
so.
Crowbar
is
incredible.
Other
tools
too,
like
foreman
and
a
number
of
other
tools
can
do
similar
stuff,
but
that's
a
key
component
of
any
deployment
infrastructure.
So
this
is
what
it
looks
like
they
have
a
number
of
ceph
nodes.
So that's what it looks like today, and they want to expand this beyond just data management. They want to do scientific computing; in fact, they want to host HPC on OpenStack, which is awesome, but I'm kind of scared of that. We'll see how it goes; we're working with them on that. They do want individualized, customized dev/test environments on OpenStack, which is a great use case.
B
It
works
well
all
of
most
of
our
customers
do
that
and
they
want
to
start
integrating
shareware,
open
source
and
other
commercial
software
into
this
it'll
it'll,
be
it's
an
ongoing
exercise.
It's
probably
going
to
take
them
two
or
three
years
before
they
get
all
this
integrated.
But
their
vision
looks
somewhat
like
this,
so
eventually
they
want
to
have
this
cloud
services,
layer,
openstack
and
openstack,
doing
all
their
infrastructure
as
a
service
in
terms
of
providing
virtualized
server
and
server
resources
to
their
researchers,
as
well
as
an
enterprise.
B
I
t
and
then
the
ceph
nodes
effectively
playing
two
roles.
One
is
the
storage
as
a
service.
This
is
the
first
thing
they
started
off
with
and
then
being
a
backup
store
or
a
volume
store
for
for
the
openstack
nodes,
so
both
of
those
and
eventually
also
connecting
all
to
their
hpc
cluster.
So
this
is
the
vision
it's
going
to
take
them
a
couple
of
years,
we're
working
very
closely
with,
with
the
customer
over
the
last
two
years,
they're
very
happy.
In
fact,
we
did
a
couple
of
sessions
at
the
openstack
summit.