Description
Speaker: Ben Vanberg, Senior Software Engineer at FullContact
Here at FullContact we have lots and lots of contact data. In particular we have more than a billion profiles over which we would like to perform ad hoc data analysis. Much of this data resides in Cassandra, and we have many analytics MapReduce jobs that require us to iterate across terabytes of Cassandra data. To solve this problem we've implemented our own splittable input format which allows us to quickly process large SSTables for downstream analytics.
If you're interested in that kind of thing, or have problems similar to what a lot of people do with contacts, keep an eye on us — there's going to be some really cool stuff coming out in the near future. And if you need to contact me about any of this, you can find me on Twitter. So with that, I'm going to go ahead and get started. This slide — actually, this is cool — actually changed just last night; I changed some of these just last night.
So this was a typical introduction yesterday, and now that I've thought about it more, what I'm really trying to explain here is our journey in implementing a specific use case. It's a story about how we went from A to B and solved the problem we had. So I'm going to talk about that use case, then we'll talk about how we implemented it, and then at the end I'll go through some example code.
Those individual search targets then get rolled up into what we call a profile — the entire profile, which could contain many social profiles for an individual — and that gets funneled back as the response to the user, as what we call the full contact: basically the enriched data. So, email address in, rich contact out. This profiles Cassandra database is what I want to talk about today, and it's what this work is centered around. So with that, let's talk about the goal we set out to solve on our journey.
Basically, we have all this data in Cassandra and we want to perform analytics on it. That's a little hard to do with straight-up queries against your production Cassandra, so we thought about how we could do it other ways. The kinds of things we wanted to accomplish were, for example, counting how many of each profile type we have — the types I talked about, like the email queries to get a profile, or the Twitter queries to get a profile. So: what kinds of queries are people doing?
And what are we returning? Additionally, how many of those profiles have social data associated with them, and of what type, and so on — whatever we want to look at. That's the cool thing about what we wanted to solve: basically ad hoc data analytics, so we could do whatever kind of analysis on this data we wanted in the future.
Some key factors about our use case that I'm going to keep in mind for this talk, and for the implementation we built: we use Netflix's Priam for backups at FullContact. We really like Netflix's open source stuff, and this was a really good fit for us — we use a lot of their things, but Priam for backups. I should point out that those backups are both snapshotted and compressed.
Additionally, in Cassandra, our tables — those profile tables I mentioned — use size-tiered compaction, and we end up with SSTables on the order of 200 gigabytes right now. Those will continue to grow as time goes on; they're actually bigger than they were when we started this project. If you're familiar with size-tiered compaction, you basically have a certain number of files that grow in magnitude, and that biggest file keeps growing — it kind of tapers off, but that's what we're up against.
That's what allows us to do a lot of the stuff we do — it makes Cassandra really easy, MapReduce really easy, all those things. Before I move on, I'd just like to ask: how many people in the crowd have MapReduce experience or know how it works? Awesome — a good number, so I won't dive too deep into those details.

So here's where we started. We had a system that accomplished this goal — kind of. We would generate queries for our Cassandra database, and I guess I should touch briefly on what that means.
Basically, we had a data store full of all these profiles, and we don't really know which ones we want to query. We happen to have another system — which I won't talk about too much — that knows what we'd be interested in within that data store. So we would ask that system to generate the queries against the Cassandra data for us, and that could take a long time: MapReduce jobs that took days, because I think we're talking on the order of billions of contacts that we're going after here.
So that's why this takes such a long time. Moving on from there: taking the queries we generated for Cassandra, we would bring up a total mirror of our production cluster, because we didn't want to hit the production cluster with a thousand reads per second for days and days. I'm going to highlight some of the costs on the right-hand side that I want to keep an eye on as we go through this — they're not super high costs, but they're not small either.
The next thing we would do is a step that could take days as well: process that data, and the final result is what came out. Typically for us this was based around an export, or almost a match test, for a customer — customers are interested in that information before they come on board as a real customer.
So in this picture we have a MapReduce cluster up in AWS for days and days, we have this Cassandra cluster up for days and days, and all it's doing is spitting out this final report, essentially. That's where we started. I've already touched on the limiting factors of this implementation: three to ten days of total time, and around twenty-seven hundred dollars in extra AWS cost for us.
There wasn't a lot of flexibility there for ad hoc analytics, but the biggest thing to me is the engineering time. You can imagine this process: you build some MapReduce jobs, run them for three to ten days, and find out in your final output that you screwed something up — that sucks. So there's a lot of babysitting for this stuff, a lot of rerunning, and it's just painful. So, thinking about moving forward: how could we do this better?
We knew that querying Cassandra didn't scale at all — we'd already seen that. We knew about the Cassandra SSTables, and it would be really cool if we could somehow MapReduce across that data. As it turns out, other people have this same problem; maybe some of you do as well. A couple of key things we needed to have — and you're probably familiar with this, so I won't go too deeply into it — is that we really need those SSTables to be directly available on HDFS for MapReduce to read.
Like I said, these are quite sizable files for us. Using Priam, they reside in S3, so we can pull them from S3 into HDFS, but that still takes quite a bit of time — I think we have a nine-node cluster, so you get a 200-gig max SSTable size on each node, and that takes a good chunk of time. So, in addition to just having them on HDFS, we need to make them available as input to the MapReduce jobs, right?
So how could we do that nicely? When we first set out, we figured somebody must have built this already — so let's just use that. Like I said, we're big fans of Netflix OSS and we're using Netflix Priam, and this thing called Netflix Aegisthus solves exactly this problem, right? So we set out to test it and see how it would work for us. We ran into a couple of snags right from the get-go: as it turns out, it works really well for Netflix's use case, which is roughly Cassandra 1.0 and no compression.
We did take a look at it, and it worked pretty well for us, but I think what it came down to was compression — compression was the sticking point for us. If you're familiar with MapReduce, you know that it's really good at big data but it sucks at big files — really big files that you can't split into small chunks — which is exactly what we were dealing with at this point.
We had 200-gig SSTables being churned through on a single thread, and that took a long time. So we took a look at another solution, a really cool piece of software called Cassandra MR Helper. It does exactly what we want to do, including support for 1.2, which is where we currently reside.
These guys have written really good I/O code that knows how to read these tables really well — so why do it yourself? The limiting factor is that it doesn't support HDFS; you can't run it against an HDFS filesystem. So what Cassandra MR Helper would do is copy the SSTables out of HDFS: you've already copied them from S3 to HDFS, and then you copy them out of HDFS to the local file system — this is from our tests.
I don't know if those guys ran it exactly the same way we did, and the machines we were using don't have SSDs, I believe, so it's even a little harder. And this is the same problem you have with size-tiered compaction across the board: you need double the disk space to hold these things. In this case, because Priam compresses the backups as well — and ours were compressed — it would copy them off to the local filesystem and decompress them.
So now you have the decompressed version side by side with the original version; then you can nuke the old ones. At that point you need double your disk space, right? We kept running out of disk space and quickly decided this might not scale for us. So we took out that Priam decompression step and actually wrote a custom distributed copy that would bring those files from S3 and de-Snappy them right into HDFS.
That let us avoid that bit, but you still end up copying these files to the local file system. The biggest thing with Cassandra MR Helper, though, was that its input format was not splittable at all, so that wasn't going to work for us — we still had a single thread processing these big tables. Granted, we have a lot of different tables: with size-tiered compaction you've got small ones ramping up to really big ones, and those small ones would get chewed through pretty quickly.

So say you have a 24-node cluster — we had a nine-node Cassandra cluster and this 24-node MapReduce cluster. You'd chew through the small files really fast, but then you'd have these nine giant files being chewed by single threads, and at that point you're really not leveraging MapReduce for what it's good at — it's kind of a waste. But we did get the job done, and it took us only 60 hours; some of those single-threaded processes took a really long time.
So basically it starts out by just reading the SSTables — and I'm kind of glazing over the fact that we already have to stream these files into HDFS, that whole thing I talked about where we decompress them on the fly into HDFS. That's a whole other thing to think about, but I'm trying to simplify here. This takes many, many hours — most of the 60 hours I was talking about — and it's all done on HDFS; then it processes the data in hours. So we've actually made really good progress here.
I mean, if you think about it, we went from many days to 60 hours. That's still over two days, but it's a lot better, and our average cost on a MapReduce cluster was 350 bucks — pretty awesome. But we still thought we could do better. So we set out to do the same thing as Aegisthus and Cassandra MR Helper, but we wanted to handle those compressed SSTables — the ones with internal compression, not just the Priam compression.
We set out to make those splittable. I'll talk about this in a second, but we needed to make these splittable — compression being enabled obviously made that more difficult — and we needed the Cassandra I/O code to run on HDFS, which was really the big one, and then we needed a way to define the splits. Our approach was to leverage the SSTable metadata: there are lots of files that come along with SSTables, if you've seen them on the file system.
All of this basically ties into a MapReduce input format that we could just plug in and run with existing MapReduce code. We call it the Index Index, because you're basically creating an index over the Cassandra index, which is itself an index into the data file. A little confusing — it's one of those things where you come across the code and think somebody typed "index index" twice, why not delete it? And then you realize: oh, it really is an index of an index. I don't know if you're familiar with Hadoop-LZO, but for what it's worth, this implementation is very similar. The thing to point out is that LZO compression is similar to Cassandra's compressed tables, in that neither is a single giant compressed stream — the data is compressed in discrete blocks.
Then there's the index file — that's what we're going to use to index into the data file, to create our index and define our splits for Hadoop and MapReduce. And the compression info file that comes along with it — I'll just mention it really quickly — points to those compressed blocks within your data file that I was mentioning.
We didn't really have to worry about that too much, because the really good Cassandra I/O library uses it for us and abstracts away all the detail we didn't really care about — well, didn't want to care about; somebody's already written it really well, so why redo it? So here it is: we had to take that Cassandra I/O library and port it to HDFS. It's not too bad, but there are some tricky things about it — this was probably the hardest part.
Actually writing the code, and finding the bugs in it, can be tricky: they're using byte buffers and all kinds of stuff, and when you get giant ByteBuffer exceptions in your MapReduce logs, it gets kind of ugly. But anyway, this port allows random access — random reads into the data file leveraging that compression info file, so it can find the right blocks, scan to wherever it wants, and read the data, which is key for what we're trying to do.
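To make the role of that compression info file concrete, here is a minimal sketch of the lookup it enables — mapping a position in the uncompressed data to the compressed chunk that holds it. The class and field names are illustrative, not Cassandra's actual internals:

    // Conceptual sketch: Cassandra compresses the data file in fixed-size
    // chunks, and the compression info file records where each compressed
    // chunk starts. Names here are illustrative.
    final class ChunkLocator {
        private final long[] chunkOffsets;      // compressed offset of each chunk
        private final int uncompressedChunkLen; // uncompressed bytes per chunk

        ChunkLocator(long[] chunkOffsets, int uncompressedChunkLen) {
            this.chunkOffsets = chunkOffsets;
            this.uncompressedChunkLen = uncompressedChunkLen;
        }

        // Compressed-file offset of the chunk containing an uncompressed position.
        long chunkFor(long uncompressedPos) {
            return chunkOffsets[(int) (uncompressedPos / uncompressedChunkLen)];
        }
    }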
This slide shows, at the top, a couple of splits — obviously a simplified version. Within a split we have the index file, which is a sequence of key-and-offset pairs: a key and its offset into the data file, over and over. And I'll point out something that gets a little confusing to think about: the index offset into the data file is an offset into the uncompressed data, not into the compressed blocks.
So our splits just become a start offset and an end offset, and we can configure the splits to be as big as we want for Hadoop. It's a little fuzzy, because you're talking about SSTables and not really blocks of data in HDFS — so we kind of match the HDFS block size, but we kind of don't. It's close, but fuzzy, and it allows us to solve the problem, if that makes sense.
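As a rough sketch of that idea — assuming nothing about the real hadoop-sstable classes — splits can be built by walking the row offsets from the index and cutting a new split every time the configured size is reached:

    // Illustrative only: group consecutive row offsets (taken from the
    // SSTable's index, addressing the *uncompressed* data) into splits of
    // roughly splitSize bytes, always cutting on a row boundary.
    import java.util.ArrayList;
    import java.util.List;

    final class SSTableSplitSketch {
        final long start; // offset of the first row in this split
        final long end;   // offset just past the last row

        SSTableSplitSketch(long start, long end) {
            this.start = start;
            this.end = end;
        }

        static List<SSTableSplitSketch> build(long[] rowOffsets, long dataLength,
                                              long splitSize) {
            List<SSTableSplitSketch> splits = new ArrayList<>();
            long start = 0;
            for (long offset : rowOffsets) {
                if (offset - start >= splitSize) {
                    splits.add(new SSTableSplitSketch(start, offset));
                    start = offset; // next split begins at a row boundary
                }
            }
            splits.add(new SSTableSplitSketch(start, dataLength)); // tail split
            return splits;
        }
    }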
So I just want to really quickly go over the original solution again: we started with generating queries, which took days, and processing those queries, which took many days — and I should point out that we were limited to a thousand queries per second on that data store, I don't know if I said that before, and you're talking about millions, maybe billions of queries over many days. Then we'd process that data, which could take some days and cost us some money.
The final solution: first step, we index those SSTables. We have to run that Index Index, because the reader requires those indices to split the data across the MapReduce cluster and chew through it in parallel. That takes a couple of hours to run on those big giant files. Right now we have it implemented as a multi-threaded Java executable; we'll probably rewrite it as a MapReduce job itself so it can go even faster, because at that point you're just chewing through some text data — pretty easy to do.
Then we read the SSTables: we have the index, we can split the data across our cluster, and we read it all up. All of this is on HDFS — pretty awesome. For what it's worth, we use Elastic MapReduce in AWS. Then we process that data, which takes some hours, and our overall average cost — these numbers are just estimates — is about 165 to 200 bucks for a given run.
One thing I wanted to point out, which I probably didn't make clear from the beginning: when we had those nine files being processed single-threaded on a MapReduce cluster for a long, long period of time, you could add as many machines as you want and you're not going to go any faster. Now, with the splittable format, we can add machines and go faster. I think we're doing 48 machines now, and we go through this in 10 hours.
We could double that and probably go faster, but 10 hours is pretty good for us at this point. I'd also point out that we haven't invested a lot of time in further tuning yet, because we've decreased it so much that we're kind of like: all right, let's focus on other stuff for a while and leverage the benefits we've gained so far.
First thing to point out is this key type. Notice that in the mapper you get a ByteBuffer — that's your key — and you get an SSTableIdentityIterator — that's your row; it allows you to iterate over the columns. That guy comes straight from Cassandra. The key type is also straight from Cassandra, an AbstractType: we have a composite key, and both of the values in it, our columns, are UTF-8 types. That's all we're saying here, but it is a little goofy-looking and deserves explaining.
So here we use that key type, which lets us deserialize our ByteBuffer key and put it into a Text object that we can pass along to our downstream reducer. We go ahead and do the JSON column parsing, and we write that out — pretty straightforward stuff. At this point all we're doing is processing that SSTable data; you could process it however you like.
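Pulling those pieces together, here is a minimal mapper sketch under the assumptions above — a composite key of two UTF-8 components, deserialized via Cassandra's type system, with the row iterated as columns. The class name and output layout are hypothetical; the Cassandra types (CompositeType, UTF8Type, SSTableIdentityIterator, OnDiskAtom) are real, from the 1.2-era API:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Arrays;

    import org.apache.cassandra.db.OnDiskAtom;
    import org.apache.cassandra.db.marshal.AbstractType;
    import org.apache.cassandra.db.marshal.CompositeType;
    import org.apache.cassandra.db.marshal.UTF8Type;
    import org.apache.cassandra.io.sstable.SSTableIdentityIterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ProfileMapper
            extends Mapper<ByteBuffer, SSTableIdentityIterator, Text, Text> {

        // Composite row key with two UTF-8 components, as described in the talk.
        private static final AbstractType<?> KEY_TYPE = CompositeType.getInstance(
                Arrays.<AbstractType<?>>asList(UTF8Type.instance, UTF8Type.instance));

        @Override
        protected void map(ByteBuffer key, SSTableIdentityIterator row, Context ctx)
                throws IOException, InterruptedException {
            Text outKey = new Text(KEY_TYPE.getString(key)); // readable form of the key
            StringBuilder columns = new StringBuilder();
            while (row.hasNext()) {
                OnDiskAtom atom = row.next();
                // Real code would parse the column's JSON value here; this just
                // records the column name for illustration.
                columns.append(UTF8Type.instance.getString(atom.name())).append(' ');
            }
            ctx.write(outKey, new Text(columns.toString().trim()));
        }
    }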
It's worth pointing out that Netflix Aegisthus does this exact same thing, more or less. So let's look at the reducer. This one is really "do what you need to do here" — I didn't include the details, other than a comment that this is where you piece things together. Because we're reading with the Cassandra I/O libraries, we'll get every copy of a row that exists in the entire cluster.
So if you have three copies, you'll get three, and you need to look at the timestamps to determine which one is actually the valid row. You can deal with tombstones here as well: if you have tombstones in your data, you can iterate over your columns and nuke those. We don't need to deal with that in our use case, because we do a lot of writes and a bunch of reads but we don't really delete things — so it's something we kind of do, but we don't really need to.
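A minimal reducer sketch of that replica-merging step — assuming, purely for illustration, that each row copy arrives as "<write timestamp><TAB><serialized columns>" (the mapper sketch above would need to prepend the row's newest column timestamp for this layout):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ProfileReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // With replication factor 3 the same row key shows up to three
            // times; keep the copy with the newest write timestamp.
            long newest = Long.MIN_VALUE;
            String winner = null;
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                long timestamp = Long.parseLong(parts[0]);
                if (timestamp > newest) {
                    newest = timestamp;
                    winner = parts[1];
                }
            }
            if (winner != null) {
                // Tombstone filtering would also go here if your data has deletes.
                ctx.write(key, new Text(winner));
            }
        }
    }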
And if you want more detailed examples, I'd love some feedback on the open source stuff on GitHub — we're always looking for feedback and ways to make this thing better.
The only other thing I'm going to point out here is the MapReduce config — you're probably familiar with it: pass in our mapper, pass in our reducer. Then there's this SSTableRowInputFormat. That's the input format that lets you get a row, and it's very similar to what the Cassandra MR Helper does.

That's what allows us to plug in our new stuff. The cool thing is, if you happen to have used Cassandra MR Helper already, it's really easy to change your code to use this input format — your code should change very minimally. In our case we had a bunch of test code from the journey we went through, and once we had this built, it was just: switch out the config and go, and it works.
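A sketch of that job setup, wiring the input format to the hypothetical mapper and reducer sketched above — SSTableRowInputFormat is the class named in the talk (from FullContact's hadoop-sstable project; check it for the exact package), everything else is illustrative:

    // import the project's input format here, e.g. something like
    // com.fullcontact.sstable...SSTableRowInputFormat — see the hadoop-sstable repo.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SSTableJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sstable-profile-analytics");
            job.setJarByClass(SSTableJobDriver.class);
            job.setMapperClass(ProfileMapper.class);
            job.setReducerClass(ProfileReducer.class);
            // The splittable input format: hands each mapper a row key plus
            // an SSTableIdentityIterator over that row's columns.
            job.setInputFormatClass(SSTableRowInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // SSTable root in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }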
Two steps. First, run the indexer — really simple: hadoop jar, pass it in, run the indexer, and point it at the root where you've stored your SSTables in HDFS. Additionally, you can specify how big you want your splits to be, and the indexer will create them at the appropriate size. The default is 1024, which is what we use — kind of an arbitrary default, but it works well for us. Then you run the job from the command line.
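Something like the following — the jar and class names here are illustrative, so check the hadoop-sstable README for the exact invocation:

    # Step 1: index the SSTables already sitting in HDFS (class name illustrative).
    hadoop jar hadoop-sstable.jar com.fullcontact.sstable.index.SSTableIndexIndexer \
        hdfs:///data/sstables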
hadoop jar again, a simple example. The key thing to point out here is the create-table argument: you need to pass in your full CREATE TABLE CQL statement. That's so we can tell the Cassandra I/O code to build the column family metadata from that create statement, plug it into the random access reader, and read the SSTable. So that's it — very important.
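The invocation looks roughly like this — again, the property name and jar are illustrative rather than the project's exact CLI:

    # Step 2: run the job, passing the full CREATE TABLE CQL statement so the
    # Cassandra I/O code can build the column family metadata.
    hadoop jar my-analytics-job.jar SSTableJobDriver \
        -D hadoop.sstable.cql="CREATE TABLE profiles (id text, ...)" \
        hdfs:///data/sstables hdfs:///output/profiles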
A couple of tunings that we do: speculative execution off — that's nice, because some of these jobs get big. And this one's really important: mapred.job.reuse.jvm.num.tasks, which says how many times to reuse a JVM for running mappers. We set it to one, because the Cassandra I/O code uses a lot of off-heap memory — you've heard the Cassandra guys talking about how they leverage off-heap memory to keep garbage collection noise down, it's just more performant for them — and that off-heap memory can accumulate if you don't restart with a new JVM. We also did some io.sort tunings, pretty standard stuff for dealing with giant files.
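In configuration terms, that amounts to something like this (Hadoop 1.x-era property names; the io.sort values are illustrative, not the talk's exact numbers):

    import org.apache.hadoop.conf.Configuration;

    public final class SSTableJobTuning {
        /** Applies the tunings discussed above to the job configuration. */
        public static void apply(Configuration conf) {
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            // One task per JVM: Cassandra's I/O code allocates off-heap memory
            // that accumulates if task JVMs are reused.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);
            // Bigger sort buffer for the large rows coming out of the SSTables.
            conf.setInt("io.sort.mb", 256);
            conf.setInt("io.sort.factor", 100);
        }

        private SSTableJobTuning() {}
    }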
So it's really simple to put this stuff together — granted, with a lot of hand-waving and simple examples. But generally: you write your MapReduce job, you run the indexer across your SSTables, and then you run the SSTable reader. Then you have data you can process downstream, or you could even write your MapReduce jobs to process that stuff inline, however you like. It opens the door to a lot of ad hoc analytics, which is cool. So: goal accomplished.
These numbers are actually super cool, because I hadn't calculated the percentages until I put this slide deck together: a ninety-six percent decrease in processing time — awesome — and a ninety-four percent decrease in resource costs. Super cool; I need a raise. Reduced engineering time is, to me, the biggest one, because we're not spinning our wheels maintaining this, mucking with it, watching it for days and days, dealing with the issues we come across. For me that's really the big one — I mean, these other costs...
...they're relatively small — I showed you the numbers, they're not huge — but we did run these things monthly, and we're a small startup, so we're sensitive to cost. Those things are important. So we open sourced this thing, hoping some people can benefit from it. You can grab this slide deck and go look at hadoop-sstable — check it out right now. Our plans for the future: we're doing some cool stuff to bring it up to speed with 2.0.