ABOUT DATA COUNCIL:
Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
FOLLOW DATA COUNCIL:
Twitter: https://twitter.com/DataCouncilAI
LinkedIn: https://www.linkedin.com/company/datacouncil-ai
Facebook: https://www.facebook.com/datacouncilai
Eventbrite: https://www.eventbrite.com/o/data-council-30357384520
And a little demo. By the way, I presented this talk at the Cassandra Summit, and that's available online as well if you're interested in the slides. There are a few updates here, but it's mostly the same. So who are we? Our motto is: we power personalized video experiences across all screens. If you watch a video on ESPN.com, Victoria's Secret, or REI, you've probably used our technology without realizing it.
We're a fairly large Cassandra user. We have 12 clusters, ranging from 3 nodes to 107 nodes, something like 28 terabytes under management, and easily over 2 billion Cassandra column writes per day. It powers most of our analytics infrastructure, from the standard slice-and-dice aggregates that customers use, to discovery, recommendations, and trend analysis.
What we have today is pre-computed aggregates. We have video metrics that are computed along several dimensions, such as geography (country, city), device type, browser, platform, and so on. So this is fine. If I'm ESPN and I want to look up, say, what my top videos in the US are, or what my top videos by device are, whether it's mobile or a tablet, whatever...
That's very easy: we just pull a key out of Cassandra; it's pre-computed. So this is nice, and it's very fast. The downside is that it's very inflexible, it's very difficult to change, and most of the aggregates and indexes are in fact never read. For example, we store a sorted list of, say, videos by city.
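To make that pattern concrete, here is a minimal sketch of a precomputed-aggregate lookup; the key layout and the `fetchRow` client function are hypothetical stand-ins, not our actual code:

```scala
// Hypothetical key scheme for a precomputed aggregate: one Cassandra row
// per (account, metric, dimension-combination), with columns pre-sorted.
case class AggregateKey(account: String, metric: String, dims: Map[String, String]) {
  def rowKey: String =
    s"$account/$metric/" +
      dims.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString("/")
}

// Reading is a single row fetch: O(1) and fast, but only combinations that
// were pre-built at write time can ever be served.
def topVideos(fetchRow: String => Seq[(String, Long)],
              key: AggregateKey): Seq[(String, Long)] =
  fetchRow(key.rowKey)
```

So "top videos in the US on mobile for ESPN" resolves to one key, something like espn/top_videos/country=US/device=mobile, and one read.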
Most people are only interested in the top few cities; very few people are interested in the tail end of things. And what if we need more dynamic queries, where we mix several dimensions together, like top content for mobile users in France? I want to do data mining, a little bit more interesting stuff. When I start to move to the more advanced use cases... excuse me for a minute.
There we go, okay. So in the more advanced use cases, this model of pre-computing these aggregates starts to break down. For example, let's say you have a hundred thousand cities, and let's say you have a thousand browser combinations. Just doing the cross product of those two dimensions alone gets you a hundred million combinations, which is way, way too big.
A
So,
basically,
imagine
that
there's
a
continuum
a
scale
between
static
between
pre-computation
and
dynamic
queries
right,
so
you
can
think
of
it
as
on
one
end,
you
can
pre-compute
everything
and
and
that's
what
your
customers
who
look
up.
So
this
is
nice,
but
it's
inflexible.
On
the
other
hand,
you
can
do
dynamic
queries
over
your
data.
This
might
be
like
you
know
slow,
but
this
is
very
flexible
but
might
be
very
slow,
and
what
we
really
want
is
to
kind
of
like
have
a
a
medium
where
you
can
pre-compute
your
common
queries.
You can have flexible and fast dynamic queries, and easily generate a lot of materialized views to suit your different needs. So what you really want is something in between. Now, if we look at some industry trends, there are new technologies coming out for fast query execution: Cloudera Impala, Drill as the new incubating project, and you've probably all heard about Presto. You have in-memory databases, you have a lot of streaming and real-time frameworks coming out, and even for traditional Hadoop you're starting to see faster execution engines.
So let's recap what the Hadoop paradigm is. You have a distributed file system, HDFS, or any number of replacement DFSs. When you have a MapReduce phase, you read data from HDFS, then after you shuffle the data you do a reduce, and you write the data back to HDFS. Then you read it again. Now, for simple jobs that might be okay.
The more complex a job is, the more stages you have, and so you find that you're reading and writing from HDFS a lot, and that quickly starts to dominate your I/O. The Spark paradigm is: let's say I have two data sources. The first thing you notice is that you're not limited to just map and reduce; you have higher-level operators, such as join, so I can take...
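As an illustration of those higher-level operators, here is a minimal Spark join in Scala; the file paths and record layouts are made up for the example:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in pair-RDD operators such as join

val sc = new SparkContext("local[4]", "join-demo")

// Two sources: "videoId,userId" view records and "videoId,title" metadata
val views  = sc.textFile("hdfs://nn/path/views").map(_.split(',')).map(a => (a(0), a(1)))
val titles = sc.textFile("hdfs://nn/path/titles").map(_.split(',')).map(a => (a(0), a(1)))

// One high-level operator instead of a hand-built MapReduce stage; the
// shuffled intermediates never round-trip through HDFS.
val joined = views.join(titles)   // RDD[(videoId, (userId, title))]
joined.take(5).foreach(println)
```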
This is a dramatic increase, maybe 8x or 10x at least, so using your cache effectively helps you a lot. Cassandra also has a lot of caching options, as you probably know; there's the row cache and different kinds of things. But if I were to put this data in Spark, in an in-memory data set, that is another one or two orders of magnitude faster.
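A sketch of that difference using Spark's cache(); the exact timings obviously depend on your data and cluster:

```scala
// Pull a data set into Spark's in-memory cache: the first action pays the
// cost of reading from the source, later ones are served from memory.
val events = sc.textFile("hdfs://nn/path/events").cache()

events.count()   // cold: reads from the source and populates the cache
events.count()   // warm: memory-speed, typically orders of magnitude faster
```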
A
We
found
that
developers
with
lexpark
one
of
our
engineers
looked
at
it,
for
30
minutes
was
able
to
like
get
up
and
running
and
write
jobs
right
away.
The
collections
API
is
very
high,
very
high
level,
so
it
allows
you
to
think
in
terms
of
high-level
transformations
and
you
have
three
development:
three
languages
for
you
to
implement
SPARC
jobs.
You have an interactive shell, which basically lets you play with your data: you can just type, let me map my data and transform it in a certain way, and get results out pretty quickly for testing and a little sanity checking.
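For example, a quick session in the Spark shell might look like this (illustrative only):

```scala
scala> val nums = sc.parallelize(1 to 1000000)
scala> val squares = nums.map(n => n.toLong * n)
scala> squares.filter(_ % 2 == 0).count()   // squares of the 500,000 even inputs
res0: Long = 500000
```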
That leads to quick development cycles, and these days, oftentimes the only advantage that a lot of startups or companies have is developer velocity. So that's something we find very important.
So Shark is 100% HiveQL compatible, because it actually embeds Hive, but it is much faster. It can actually reuse all of your Hive storage handlers, SerDes, and things like that; I'll touch on this in a minute. And if you're running DSE, you can still use CassandraFS for your metastore.
One thing is that Shark has integration with Spark. For the things that SQL is not good at expressing, you can run a query in HiveQL and then use Scala or Java code to transform your data in ways that aren't possible with HiveQL. So that's an alternative to writing UDFs.
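A sketch of that handoff, based on the Shark-era sql2rdd API as I recall it; the table, columns, and row accessors are illustrative:

```scala
// Run HiveQL, but get the result back as an RDD instead of printed output.
val rows = sc.sql2rdd("SELECT user_id, video_id FROM events WHERE dt = '2013-06-01'")

// From here it is ordinary Scala: logic that would be painful as a Hive UDF,
// such as distinct videos per user, is a couple of RDD transformations.
val videosPerUser = rows
  .map(r => (r.getString(0), r.getString(1)))   // accessor names illustrative
  .groupByKey()
  .mapValues(_.toSet.size)
```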
We'll talk about our new architecture a little bit, and how we integrate Cassandra and Spark and Shark. On the left are what we call raw events. Basically, every time you watch a video on one of our customers' websites, like ESPN, there's a little JavaScript or ActionScript module which is sending data: when a video displays, when you play, or when you skip forward, it sends events to our servers.
So that's what I mean by raw events. From here we have an ingestion pipeline that writes them into Cassandra, into what we call an event store. From there, we use Spark to create various materialized views, and the way we then utilize those materialized views is that we can use Spark to do quick, predefined queries, or we can run ad hoc HiveQL queries in Shark. Both are available.
A little bit about how we physically organize our clusters. We start with Cassandra on the nodes; on top of Cassandra we put an InputFormat. We basically wrote our own InputFormat, and I'll go into detail in just a second about why we did that. Sitting on top of the InputFormat is the Spark worker, co-located with Cassandra.
Our schema is mostly a variation on a standard time series. Basically, you have an ID, or a user, and you have events at various times. We use the time as the column key. However, we actually have a separate column family called the attributes column family.
The reason we do this is that there are a lot of attributes, such as a user's IP address or user agent, which repeat again and again, so we keep them in a separate column family. One, it saves a huge amount of space and I/O; and two, it allows us to easily generate indexes and to annotate attributes after the fact if we want to.
And the way that we unpack this: in Cassandra you have rows and N columns in a time series, and what we want to do is extract what you can think of as JSON blobs and expand them into an SQL-like table for data warehousing. Each cell maps to one row. This is actually very similar to how CQL reads out data: your user ID, which is essentially part of the row key, gets one column.
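A hypothetical sketch of that expansion in Spark; the types and the JSON parsing are stand-ins:

```scala
import org.apache.spark.rdd.RDD

case class WideRow(userId: String, cells: Seq[(Long, String)])   // (timestamp, JSON blob)
case class FlatRow(userId: String, ts: Long, eventType: String)  // one SQL-style row per cell

def eventTypeOf(json: String): String = ???   // real code would parse the JSON blob

// Each wide Cassandra row fans out into one flat row per cell, CQL-style.
def unpack(rows: RDD[WideRow]): RDD[FlatRow] =
  rows.flatMap { row =>
    row.cells.map { case (ts, json) => FlatRow(row.userId, ts, eventTypeOf(json)) }
  }
```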
Alright, this slide, by the way, is not in the Cassandra Summit presentation: different options for integrating Spark with Cassandra. The first is that you can use a Hadoop InputFormat; Spark will use any existing Hadoop InputFormat out of the box. So the choices are: you have the ColumnFamilyInputFormat built into Cassandra. This was really the only built-in option before, I think, 1.2; I'm not quite sure.
In more recent versions, there is now something called the CqlPagingInputFormat, which is pretty nice. You can actually give it a CQL query and it will read out the relevant data for you, and it pages, so it will also handle wide rows pretty well, and it supports secondary indexes.
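Here is roughly what wiring that InputFormat into Spark looks like; the class and config-helper names are from the Cassandra 1.2-era API, quoted from memory, so treat this as a sketch:

```scala
import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat

val conf = new Configuration()
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1")
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
ConfigHelper.setInputColumnFamily(conf, "analytics", "events")   // keyspace, column family

// Spark accepts any Hadoop InputFormat out of the box.
val rows = sc.newAPIHadoopRDD(
  conf,
  classOf[CqlPagingInputFormat],
  classOf[java.util.Map[String, ByteBuffer]],   // key columns
  classOf[java.util.Map[String, ByteBuffer]])   // regular columns
```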
You can also read data from any source in Spark without an InputFormat. The big disadvantage is that without an InputFormat you can't write standard Hadoop jobs, like standard MapReduce jobs. However, it is actually much easier to develop reading from Cassandra without an InputFormat: you can essentially do it in one line, that line right there that says sc.parallelize. What I'm doing is: Spark has a parallelize operator that will take any list.
Let's say I have a list of a thousand row keys and I want to read those rows from Cassandra into Spark. What parallelize does is take the thousand list elements and distribute them across the cluster. Then what the flatMap does is take each row key, and you can give it something that reads all the columns for that row key, and so you end up, in one line, with something that can read a whole bunch of rows from Cassandra in a distributed manner, as sketched below.
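Spelled out, the one-liner looks like this; readColumns is a hypothetical stand-in for whatever client call you use against Cassandra:

```scala
// A stand-in for a Thrift/CQL client read; implementation elided.
def readColumns(rowKey: String): Seq[(String, Array[Byte])] = ???

val rowKeys = (1 to 1000).map(i => s"user-$i")

// Distribute the keys across the cluster, then have each worker read the
// columns for its share of keys: a distributed Cassandra read in one line.
val columns = sc.parallelize(rowKeys).flatMap(key => readColumns(key))
```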
So if you don't care about Hadoop compatibility, or rather MapReduce compatibility, you can actually save yourself probably, I don't know, two weeks or a month of developing an InputFormat. There's also a JDBC RDD. Sorry, RDD is a Spark term that stands for Resilient Distributed Dataset; this is sort of the base...
...base abstraction in Spark. It's basically a data set that is stored in memory and can be transformed. You can also use the Cassandra JDBC driver with the JdbcRDD support and do things that way. So there's actually a fair number of options. Lastly, there are companies that are starting to open-source some Spark and Cassandra integration, so I have one link there that you might want to check out.
There's a new Hadoop API and an old API. Plan carefully, because they are incompatible, but similar enough to be very confusing, and some things build on top of one or the other. For example, Cascading... I'm not sure about Cascading, but Hive, for example, up until very recent versions would only work with the old Hadoop API.
You need to be prepared to spend time tuning your split computation. I'll give you an example: if you have, say, indexes in Cassandra, those can be very long rows. You don't want to spend a lot of time reading one of those rows out in your InputFormat, because when you're computing splits, that's all done on one node; it's not distributed.
So this is sort of like the earlier graph I showed you, for reading from Cassandra. With a cold cache it takes, I don't know, maybe two minutes to read one and a half billion events... oops, wait a minute. When we read it again and again, the cache warms up and it drops by a factor of three or four. When we process the data in memory, we can get it out in well under a second.
Here's the workflow for Spark and how we do our processing. First, we start an aggregation job in Spark through our Java server. This basically reads the raw events from Cassandra, processes them, and creates some aggregates. The aggregates are stored as a data set in memory, distributed amongst all the different Spark worker nodes.
So Spark has something called lineage, which is that it remembers, for every computation, the exact transformations that you ran. It's designed so that, in the case of failure, it can go back and rerun your source computation, so that you end up with the same aggregates. Now, obviously this is rather expensive: in the case of Cassandra, you'd need to read from Cassandra again and redo the aggregates, which might not be what you want, it turns out.
It also allows us to easily inspect our materialized views in Cassandra. Let me see... There's another option now: Spark's Project Tachyon basically allows you to use it as a write-through cache to HDFS, and if you use TachyonFS, or Cassandra paths for example, it will cache your data in memory so that you don't have to read it back again; essentially, the data is already stored in memory.
Okay, I'll take this one; it's about Storm versus Spark. I would say Storm and Spark are designed to address very different problems. Storm is for streaming: if you want to stream from a never-ending data source, Storm is very good for that. I guess we look at this a little differently, in that we are reading chunks of data, so we think of this use case as more like Spark.
As for Spark itself: we still have Storm systems in production, so this doesn't replace Storm. I would say that if you want to use something like Spark Streaming, that might replace Storm, but this use of Spark is more like a replacement for Hadoop, enabling extremely fast batch jobs.
Once you have created a table, that actually doesn't read the data right away; that's just a mapping. Now, when I do CREATE TABLE events_cached, that actually caches those events in memory as a cached table, and that is what actually reads data from Cassandra. And now we can query the cached table.
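In Shark this is just a CTAS statement; by Shark's convention, a table whose name ends in _cached is materialized in cluster memory (the table and column names here are illustrative):

```sql
-- The mapping alone reads nothing; this CTAS is what actually scans
-- Cassandra and pins the result in memory.
CREATE TABLE events_cached AS SELECT * FROM events;
```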
For example, let's start with something really simple: I'm just going to SELECT COUNT(1) FROM the cached table. You know, 1.4 million rows, and it comes back fast, in half a second. But the real power comes when we do something a little bit more complicated. So I'm going to get a count across various providers, or videos, and you get this long list again. This is, you know...
...extremely fast; it's a fast GROUP BY operation. Now let's say that what we really want is a top-K, or top-10, list of videos by count, so we're going to group, and order by the count descending, LIMIT 20. This again returns pretty fast. So this allows you to do the kinds of queries that would be pretty slow if done from disk.
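Illustrative versions of those queries against the cached table (the schema is assumed):

```sql
-- Simple full count over ~1.4 million in-memory rows
SELECT COUNT(1) FROM events_cached;

-- Top 20 videos by event count: group, order descending, limit
SELECT video_id, COUNT(1) AS cnt
FROM events_cached
GROUP BY video_id
ORDER BY cnt DESC
LIMIT 20;
```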