From YouTube: FullContact: Reading Cassandra SSTables Directly for Offline Data Analysis

Description

Speaker: Ben Vanberg, Senior Software Engineer at FullContact

Here at FullContact we have lots and lots of contact data. In particular, we have more than a billion profiles over which we would like to perform ad hoc data analysis. Much of this data resides in Cassandra, and many of our analytics MapReduce jobs need to iterate across terabytes of Cassandra data. To solve this problem, we've implemented our own splittable input format, which allows us to quickly process large SSTables for downstream analytics.