From YouTube: C* Summit 2013: High Throughput Analytics with Cassandra
Description
Speaker: Aaron Stannard, Founder and CEO at Marked Up Analytics
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-high-throughput-analytics-with-cassandra-by-aaron-stannard
Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and a feat made even more challenging when analytic results have to be produced in real-time. In this presentation the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.
Okay, hello everyone. My name is Aaron Stannard and I'm the CEO of a company called MarkedUp. We're an early-stage technology company, and we do in-app analytics for native software, right now primarily on the Windows platform, although we're adding support for other technologies too. I'm ex-Microsoft, and so are a number of the other people on our team. So before I really get into the substance of our talk, which is really about building a real-time analytics cluster using Cassandra from the ground up, I want to talk a little bit about what our company does.
We help developers learn three types of things about their software after it gets deployed to the marketplace and is running on their end users' computers.
The first is we help them learn who their audience is: who's actually using the app every day, what types of devices they run it on, how often they come back and use the app over and over again. These are things you need in order to get a sense of who your audience is and what you should be testing your product on in terms of QA and hardware and so forth. So that's the first thing we do.
The second is we capture diagnostic information about our customers' apps: crashes, exceptions, performance monitoring, and so forth. We're trying to help the operations and development teams understand how well their app is actually running on other people's computers.
And the last and most important thing we do is help our customers make more money. We help them identify which customers have intent to buy versus which ones don't. So, you know, we work with some pretty big video game companies, and some of them will say, "Oh well, I think American kids between the ages of 18 and 25 will be great users for this app," and we'll say, "Actually, 13-year-old Japanese women are probably who you should be targeting with this application."
And we have the ability to back that up with data about our worldwide install base for Windows 8. So that's what we do. You've probably heard a number of talks today about companies like Netflix and KISSmetrics and others who are dealing with petabyte scale and all these sort of crazy scalability problems.
We're not there yet. We're a very early-stage technology company; we deal with millions of data points a day. But the types of problems I'm going to be talking about today are for people who are looking to build their first real-time analytics cluster, where you need to be able to measure some things in real time and want to be able to integrate Hive and Solr into it and so forth.
So I'm going to work from the bottom up and help give you a picture of how to get started with Cassandra and DataStax Enterprise in production.
It all begins with one question. Oh, this is our product. It all begins with one question: do you really need real-time analytics? I'm going to shatter your world right now and tell you that real-time analytics is a developer buzzword. It's the engineering equivalent of "big data."
You don't need real-time analytics for everything, and in fact there are a lot of use cases where it's a bad thing, where you shouldn't conflate the operational metrics you need to keep something running with the strategic metrics you need to decide how to do something in the future. So let me give you an example.
The way we break down analytics is really into three families. We have real-time, retrospective, and a third type that's not on the board here called predictive analytics, which is about what's going to happen in the future. Real-time analytics is designed to help you keep a pulse on things that are happening as they happen, and the only analytics that really need to be real time are the ones that you can respond to in real time.
All of you who are developers have probably used error-monitoring software at some point in your lives to keep tabs on the health of your servers in production. That's the perfect use case for real-time analytics, because if something goes down, you have a business obligation to respond to it in real time and try to fix the problem as it happens. Likewise, if you're a stock trader, you need to get pricing information about stocks in real time so you can make business decisions based off of it.
These are examples of the types of things you can respond to as they happen. But what about cases where you need retrospective or historical data? Imagine you're a scientist studying changes in solar flare intensity over the past 10 years, trying to determine if there's some change in the sun's behavior. You don't want that in real time, because you don't want a report every day, whenever there's a fluctuation or an outlier, telling everyone that we're all going to die from UV exposure one day and then, the next day, "oh, our bad, we had a problem with the server, silly us."
It's much more important to actually take a statistical sample over a period of days, months, or even years and produce a consistent result, using a tool like Hadoop for something like that. So what we're going to talk about today is how to build an analytics system that leverages both of these types of techniques to really drive valuable business insights,
whether for your company internally or for your customers, if you're building an external-facing product. I'm really going to try to help you think about it from both a technology and a business point of view. The point being: real-time analytics isn't inherently better than analytics that aren't real time. It's just that different types of metrics can and should be responded to in real time.
So here's how we look at real-time analytics at MarkedUp. I talked a little bit about what our product does.
Most of our metrics are operational. These are things like install rates for applications, the number of people who've used an app every day, error rates, custom events, and so forth. These are things our customers can respond to in real time.
One thing we can do that the Windows Store doesn't do very well for our customers is let them know exactly when their app gets approved in the store and how many people are installing it in different countries around the world. Because what if you're trying to time PR and marketing around that,
and you want to be able to hit the date exactly right? That's something that we can do, and our customers can respond to it. Likewise, what happens if your error rate suddenly spikes up in your application?
A lot of our developers are, you know, loyal Microsoft developers, and about six months ago, I think it was, Windows Azure had an SSL issue with storage, and we noticed about a 15 percent climb in the error rate across all of our applications.
When this happened, we didn't know what was going on. It turns out that a lot of our apps depended on Azure for storing images and other static content that they pulled down, so had it been their own back end, they could have done something about it; but since they were talking to Azure directly, they were kind of screwed for the most part. So these are the types of analytics that we measure and report on in real time, as they happen.
Otherwise, we have some metrics that are retrospective, like user retention: how long do you retain users after you've gotten them to install your app? That's something where we need to measure multiple distinct events over a window of 30, 60, or even 180 days, so we do that with Hive and Hadoop after we gather all of this raw origin data inside Cassandra.
So let's talk about how we actually build a system that's capable of doing both of these, retrospective and real-time analytics, really well.
We use DataStax Enterprise really heavily. Even though we're an early-stage technology company (we only have five full-time people at the moment, and I went full-time on it in August), we've been able to partner with DataStax, and they've been a really great business partner for us, so we've had a tremendous experience with their products.
We originally prototyped on RavenDB, which has Lucene under the hood and a bunch of really cool built-in indexing capabilities. It was a great tool for prototyping, but as soon as Thanksgiving rolled around, we went live with two of the largest apps in the Windows Store, including a number one or two video game, and it completely pegged our servers to the point where reports were running behind by three or four days. Reports that are supposed to be real time, mind you.
So we had a major problem on our hands. It took us about two months to move everything off of RavenDB and onto Cassandra, and we were able to do that because of DataStax specifically, so I can't say enough good things about their technology and what they were able to do for us. So, getting down to nuts and bolts:
let's talk about what you need to do to get a basic analytics cluster up and running on Amazon EC2. If you're a small team, or if you're doing this with your own resources, that's probably the first natural place for you to go and look: how to do it on Amazon Web Services.
From a VM perspective, DataStax actually has a really convenient auto-clustering AMI for DataStax Enterprise. What this will do is, you bring up four or five or six nodes and it will go and negotiate: okay, this node runs OpsCenter, this node has Hive, this node has Solr. It'll set up all the configuration settings for you and automatically form the ring, so a lot of the sysadmin work that would probably take you a week or two on your own is just done for you automatically. That's what I recommend using as your base AMI when you're just getting started. Now, there are some limitations to be aware of.
If you need to do multi-availability-zone or multi-region replication on Amazon, you're not going to be able to get that out of the box with their built-in AMI, so you'll have to roll your own eventually; but for just getting up and running, this takes like 30 minutes to set up, so it's really convenient. In terms of the VMs themselves, we highly recommend using Ubuntu 12.04 LTS as your operating system, which is what will usually ship by default in the Amazon AMI anyway.
This is a really good basic cluster setup, and the way you should design your application around it is to have it talk to the four writable nodes at any given time, and let the Hive and Solr nodes sit off on the side and do their own thing. Then, for actually setting up your first keyspace in Cassandra, these are a lot of the settings that we use in production today, with all due credit to Jay Patel from eBay:
I stole a lot of this from his talk last year, when I was trying to figure out how to set up my cluster for the first time. For consistency, we recommend setting writes to a consistency level of one: just hand it off to the server and bail. If you know that your write load is going to be really high, set it to two.
Three is a good replication factor to have, and then, if you're using the network topology placement strategy, what DataStax Enterprise will allow you to do is add new analytics and Solr nodes somewhat independently of your Cassandra ring. So if you want to distribute your Hive or Hadoop workload among multiple workers, this will allow you to scale that going forward; you sort of have mini clusters going on within your Cassandra ring, if you think about it that way.
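For illustration (this isn't from the talk, and the keyspace and data center names are assumptions), a minimal CQL 3 sketch of a keyspace created with the network topology placement strategy and a replication factor of three:

```sql
-- Minimal sketch; 'markedup_analytics' and 'DC1' are hypothetical names.
CREATE KEYSPACE markedup_analytics
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3
  };

-- In cqlsh, writes can then be issued at consistency level ONE,
-- per the write-consistency advice above.
CONSISTENCY ONE;
```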
And then partitioners: if you're using anything other than the random partitioner, you're probably way too sophisticated for this talk. Ordered partitioners are expert-mode-only sorts of features, and I haven't actually seen very many of them in production, ever, but they're there if you do need them. So let's go ahead and talk next about how we actually work with Cassandra in production; this is from the application's point of view.
With analytics, the write-to-read ratio is going to be astronomically high: you're going to have a thousand, ten thousand, maybe even a hundred thousand writes for every read. So what we tend to do in our setup is take advantage of the fact that Cassandra is, generally speaking, much more performant at handling writes than it is at handling lots of reads. We denormalize our data heavily at the API level before we write to Cassandra, and we use a batch mutation to go and change Cassandra all at once.
So here we have three column families. We have this logs column family on the bottom; think of that as the origin, raw data that we're going to process through Hadoop a little bit later. That's the raw object as it's being sent to the API. And then we have some counters which just roll up daily totals for each of these.
We have one counter that shows the total number of logs for each application on our platform, and then another set of counters that show the number of logs at the different log levels that we support. So how many of these logs represent fatal errors? How many represent just normal trace versus debug info? That sort of thing; those are the normal tracing levels.
So what we do is, a new log hits our API, we denormalize it and put it into a batch mutation, and that batch mutation will atomically make all of these changes throughout Cassandra at once. And the difference in the amount of time it takes to modify one column family versus modifying 40 is actually not that much to the client; it's actually been difficult for us to measure the difference in speed. So you're not really giving anything up by using batch mutations.
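For illustration only (the talk describes using the Thrift batch_mutate API; the CQL below is just a sketch of the same idea, and all table and column names are hypothetical): write the raw log once, then bump the denormalized daily counters together in a counter batch.

```sql
-- Raw origin data: one row per log entry (hypothetical schema).
INSERT INTO logs (log_id, app_id, log_level, message, logged_at)
VALUES (now(), 42, 'Error', 'NullReferenceException at Foo.Bar()', '2013-06-11 17:03:00');

-- Denormalized daily roll-ups. In CQL, counter updates go in their own
-- counter batch, separate from regular writes.
BEGIN COUNTER BATCH
  UPDATE daily_logs_by_app
    SET log_count = log_count + 1
    WHERE app_id = 42 AND log_date = '2013-06-11';
  UPDATE daily_logs_by_level
    SET log_count = log_count + 1
    WHERE app_id = 42 AND log_level = 'Error' AND log_date = '2013-06-11';
APPLY BATCH;
```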
That's the strategy we strongly recommend when you're just getting started. Now, for read strategy, interesting story: most of our developers are .NET guys, which means SQL is, like, in their blood. I actually had to go in and try to disable the CQL 3 driver when we were first getting started, because they kept wanting to gravitate toward their SQL habits while we were first getting used to Cassandra.
This was before we went live with it in production. So what we do, and what I strongly recommend for working with time series data, is use the Thrift APIs. You can use a tool that's called a slice range. Let's say we have this column family here, where we're counting the total number of logs by level every day: how many crashes versus errors versus traces do we have in this 30-day window for this application? What we can do is (is there a laser pointer on here?) grab that whole window at once. There's a value for each day, and if there isn't, we'll go and substitute a zero for it on the chart so it doesn't look weird. This is the sort of read strategy we use, and it's very fast.
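For illustration (not from the talk: the team used Thrift slice ranges, and the table and column names below are hypothetical), here is the equivalent read expressed as a CQL 3 range query that pulls a 30-day window of per-level counts in one round trip:

```sql
-- Hypothetical counter table: one partition per (app, level), one row per day.
SELECT log_date, log_count
FROM   daily_logs_by_level
WHERE  app_id    = 42
  AND  log_level = 'Error'
  AND  log_date >= '2013-05-12'
  AND  log_date <  '2013-06-11';
```

Missing days still need the zero-fill described above, since Cassandra only returns the columns that actually exist.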
So now we're going to talk a little bit about how to design a schema to support this, and what some of the principles and rules are that you need to think about. In terms of a schema strategy, the things I recommend doing: one is to make sure that your row keys are always predictable, i.e. you don't need a separate lookup table to figure out which rows you need to fetch in order to satisfy a read query. When you're trying to get some data out of Cassandra, design your keys in a way where you can always predict what the values are going to be; I'll show you a little bit more about that on the next slide. Then, on top of that, make sure you're leveraging the physical sortability of columns, particularly if you're managing time series data.
This will make your life so much easier. You'll essentially be able to look anything up in constant time, in terms of "I want to start from this column and go to that one," where column A is your start date and column B is your end date, and it'll grab everything in between. If you leverage that property, it makes it really easy to manage time series data inside your Cassandra cluster.
One other sort of gotcha: we totally redesigned our schema to make sure that all the column names are always a really simple type that sorts well. We'll talk about some of the other things on the right-hand side as we go. One recommendation I also make is that when you're just getting started with Cassandra, stick with distributed counters for your real-time analytics. As you get more sophisticated, you'll find that there are some issues with distributed counters; for instance, there's not really any good support for retry logic on them.
Counters are essentially atomic values that you can increment or decrement with individual commands: you can say "increment by this much" or "decrement by this much." But unlike the rest of Cassandra, where you can go and overwrite a value (you can say "I'm just going to overwrite the value of this row" and it'll reset everything, so the operation is fully idempotent), with counters you lose that ability: the counter will get incremented again if you retry an operation that had actually already gone through.
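To make the distinction concrete, a minimal CQL sketch with hypothetical table names, contrasting an overwritable (idempotent) write with a counter increment that is not safe to blindly retry:

```sql
-- Regular column: re-running this statement just overwrites the same value.
UPDATE app_settings
  SET display_name = 'My Game'
  WHERE app_id = 42;

-- Counter column: re-running this statement adds 1 again, so retrying after
-- a timeout can double-count if the original write actually landed.
UPDATE daily_logs_by_app
  SET log_count = log_count + 1
  WHERE app_id = 42 AND log_date = '2013-06-11';
```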
So there are some downsides to using counters, but for the most part they're really easy to set up and actually pretty easy to work with. It depends a lot on the types of data you're measuring; I would never use counters to measure, say, the outcome of a financial transaction. Five minutes left? Wow, I'd better step on it. So, for the first schema here: I'll go ahead and leave this up for you on SlideShare, so you can take a look at the notes.
This is a schema that you'd use for a totally predictable data structure, in this case daily app logs per log level. We know the ID of the app that we're looking for, because that's contained in the request we get from one of our customers, and we know the different log levels that we want, so our row key is totally predictable. We also know the date range: they want everything from this date to this date.
So our Cassandra column family has all the data it needs to satisfy the query sitting in a single row. This row will grow over time, and it could eventually become a really wide row, potentially, if we have millions of hours' worth of data on here; but there's also a naturally limiting factor to that, right? There are only so many days or hours per year, so you can capacity-plan around that really easily.
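For illustration (the actual schema is on the slides; the names below are hypothetical), a CQL 3 sketch of this "predictable row key" pattern, where the partition key is built entirely from values the API request already contains and the days sort as clustering columns:

```sql
-- One partition per (app, log level); one counter per UTC day within it.
CREATE TABLE daily_logs_by_level (
    app_id    int,
    log_level text,       -- 'Fatal', 'Error', 'Warn', 'Trace', ...
    log_date  timestamp,  -- truncated to the UTC day
    log_count counter,
    PRIMARY KEY ((app_id, log_level), log_date)
);
```

A 30-day chart then becomes the single-partition range query sketched earlier.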
The next sort of schema type that we use is what's called a bounded number of unknowns. The scenario we have in this case is, let's say one of our customers wants to know the number of users by version, or in this case the number of error logs by exception type. We have no way of knowing what the exception types are going to be, but we know that the number of different types of exceptions will be relatively low,
you know, maybe under a dozen usually, unless you have a really bad developer, in which case we probably don't want them as a customer. So you're not going to have more than, say, 10 types of exceptions: stack overflow, out of memory, etc. So what we tend to do for this is flip the schema on its side, where the known properties, like the app ID and the error log level, go into the key.
We also have the date as part of the composite key here, so we'll be fetching a much larger number of rows this time around, but all of the unknown values are contained as columns, and we just say, okay, go ahead and grab at least 200 columns for each row, and then we transform it back into a time series at the application level. This is a great technique when you have a relatively small number of unknowns to work with, because it's simple and doesn't require multiple round trips.
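Again for illustration only (hypothetical names), a CQL 3 sketch of that "flipped" layout: the knowns (app, level, day) form the partition key, and the unknown exception types become the sorted columns within each row:

```sql
-- One partition per (app, level, day); one counter per exception type seen that day.
CREATE TABLE daily_errors_by_exception (
    app_id         int,
    log_level      text,
    log_date       timestamp,   -- truncated to the UTC day
    exception_type text,        -- unknown ahead of time, but bounded in practice
    error_count    counter,
    PRIMARY KEY ((app_id, log_level, log_date), exception_type)
);

-- One read per day in the window; each returns every exception type seen that day.
SELECT exception_type, error_count
FROM   daily_errors_by_exception
WHERE  app_id = 42 AND log_level = 'Error' AND log_date = '2013-06-11'
LIMIT  200;
```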
Now, if you have a totally unbounded number of unknown things, we recommend using an index column family for that. So in this case, in our batch mutation, whenever a developer sends us a custom event (usually they have hundreds of these), we go and keep the names of the custom events in a separate column family. We're using the null value pattern for this, where the actual value of the column is null and the column name itself is the value that we want. Then we'll take that data and go and query against the actual column family that has all the time series data for each custom event in it. So it takes two network round trips, which is the one disadvantage of this pattern, but it really limits the number of rows we have to query.
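A minimal CQL 3 sketch of that index column family, with hypothetical names: the event names live as clustering columns with no separate value (the "column name is the value" pattern), and the second round trip hits the time-series data only for the names that were found:

```sql
-- Index column family: one partition per app, one column per custom event name.
CREATE TABLE custom_event_names (
    app_id     int,
    event_name text,
    PRIMARY KEY (app_id, event_name)
);

-- Round trip 1: which custom events does this app have?
SELECT event_name FROM custom_event_names WHERE app_id = 42;

-- Round trip 2: fetch the daily counters for each event name returned above
-- (hypothetical time-series table keyed by app, event name, and day).
SELECT log_date, event_count
FROM   daily_custom_events
WHERE  app_id = 42 AND event_name = 'level_completed' AND log_date >= '2013-05-12';
```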
Each one of these patterns I've shown you takes under 100 milliseconds for us to run in production, in terms of the amount of time it takes for the HTTP request to hit our API and for us to get a response back. Most of the time our clients spend waiting on data is actually spent downloading JSON objects from our server, more than anything else. So it's a really good way to curb some of the complexity.
You guys can look at this on the site. Now, adding Hive and Hadoop to the mix: I want to touch on this before we end our talk. When is Hadoop necessary? That's a question I found myself asking right when we were getting MarkedUp started, not having had a lot of experience with it before, and my answer is this: when you start getting into the 100-gig data set range, Hadoop becomes a more and more valuable tool over time.
There's sort of a minimum data set size you need before you really get any value out of Hadoop, and 100 gigs has been the threshold for us. But there are other things you should think about, from a requirements point of view, when deciding whether you need Hadoop. If consistency is really important to you, Hadoop is a great tool for the job, provided speed isn't a requirement. Hadoop is slow. It is really slow; you're not going to get real-time results from it.
It might take, you know, 30 minutes for it to go and produce a query result for you, but the results will be consistent and will touch all the data you needed. And the last thing is, if you have really complex query pipelines (for instance, counting the number of distinct items that fall under these different cohorts, and so on), Hadoop is actually a great tool for doing that. So if you need a really good MapReduce pipeline, Hadoop's the perfect tool. But we're lazy,
so we like to put Hadoop on easy mode, and we call that Hive. DataStax Enterprise Edition has bindings for Pig and Hive; Pig scared us, so we decided just to go with Hive instead, since we're all SQL developers. Hive was developed by Facebook originally, just like Cassandra, and it was meant to be a data warehousing technology that allows some of their non-technical people to go and get information about Facebook's users.
What's convenient about it is that there's actually a lot less deployment overhead that goes into running MapReduce queries if you're using Hive or Pig instead of raw Hadoop. You don't deploy any code to your analytics nodes, and if you want to set up a recurring Hive job, it's as easy as running a cron job that just invokes something on the command line. So it's really easy to deploy and get up and running.
To give you a sample of what the workflow looks like: let's say you have this column family, the logs column family, in Cassandra. If you want to start analyzing it in Hive, what you do is go ahead and create what's called an external table. I'm not going to read all the syntax on there, but that's actually the legit syntax for mapping a Cassandra column family with dynamic columns into a Hive table.
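The slide itself isn't reproduced in this transcript. As a rough, hedged approximation only (the storage handler class, SERDEPROPERTIES, and names below are assumptions that varied across DataStax Enterprise versions; check the linked slides or the DSE documentation for the exact syntax), an external-table mapping of that era looked something like this:

```sql
-- Approximate sketch; handler class, properties, and names are assumptions.
CREATE EXTERNAL TABLE logs (
    row_key     string,
    column_name string,
    value       string
)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    'cassandra.ks.name'         = 'markedup_analytics',
    'cassandra.cf.name'         = 'logs',
    'cassandra.columns.mapping' = ':key,:column,:value'
);
```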
A Hive table is very much a one-to-one mapping with what's in Cassandra, but it has different data types, so you have to account for some translation there; for all intents and purposes, though, it's pretty easy to work with. And you know what the best part is? Hive automatically fetches data back from Cassandra as it updates; that mapping is sort of perpetual. In other words, you don't have to go and run jobs to re-insert new data or anything else. It just runs and works. And then here's sort of what the query syntax looks like when we're getting data back out.
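The query on the slide isn't in the transcript. As a generic stand-in only (hypothetical table and column names, assuming the column family was mapped into Hive with named columns), a HiveQL query over such a table might look like:

```sql
-- Count error logs per app over the Hive-mapped Cassandra data.
SELECT app_id, COUNT(*) AS error_count
FROM   logs
WHERE  log_level = 'Error'
GROUP BY app_id
ORDER BY error_count DESC
LIMIT  20;
```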
As you can see, it's virtually identical to what you'd do at, like, the MySQL command line if you wanted to run it there. So, just some final tips and tricks about Hive. We're going to have to skip the Solr part, unfortunately. I'm sorry, guys; I let you down.
So, if you're reading and writing between Hive and Cassandra, make sure that only one of those two is doing the writing to any given column family. In other words, if you have a column family that's being updated by your application server, which writes directly to that Cassandra column family, don't let Hive write to it as well; otherwise, bad things will happen. So, for instance, our user retention and our average-time-spent-in-app reports: all of that data gets written back to Cassandra by Hive, but into dedicated column families in Cassandra that our application never writes to under any circumstances. We found that's the best way to make sure you don't have one service overwrite the other. And then the second tip is,
if you're trying to test Hive queries, you can use sampling. So, for instance, instead of looking at the entire data set, you can look at just the past 30 days' worth of data. That allows your Hive job to complete in, like, 10 minutes instead of two hours, depending on how big your data set is.
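For illustration (hypothetical column names, and assuming the timestamp is stored in a string-comparable ISO format), the simplest form of that trick is just constraining the query to a recent window so the job scans far less data:

```sql
-- Test the query shape against roughly 30 days of data instead of the full history.
SELECT app_id, COUNT(*) AS error_count
FROM   logs
WHERE  log_level = 'Error'
  AND  logged_at >= '2013-05-12'
GROUP BY app_id;
```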
So those are our tips and tricks for working with it. All right.
[Answering an audience question:] So, it depends on what roll-ups you want. Right now, the way we do date columns in our data set is that they're actually timestamps that are marshaled to the nearest UTC day. You could do it hourly, or even down to the minute if you wanted to, and that technique will still work. The thing to bear in mind is that when you're doing that column slice from one day to another, it will potentially get everything in the middle, so what I recommend is having different column families for different granularities: have one that's a daily roll-up, one that's hourly, one that's by the minute if you need it down to that level. That way you have an expectation of what the volume of data is going to be.
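A minimal CQL sketch of that suggestion (hypothetical names), adding an hourly column family alongside the daily one sketched earlier so that each query's data volume stays predictable:

```sql
-- Hourly roll-up: same shape as the daily table, keyed down to the hour.
CREATE TABLE hourly_logs_by_level (
    app_id    int,
    log_level text,
    log_hour  timestamp,   -- truncated to the hour instead of the day
    log_count counter,
    PRIMARY KEY ((app_id, log_level), log_hour)
);
```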
Any other questions? Interesting anecdotes? Yes. That's totally not fair; the answer is to use a faceted search in Solr. So, how do you count a million distinct items in real time? The answer is: use Solr.
Actually, I'm going to cheat on this one. Basically, you go ahead and define a Solr index (I'm not going to get through the syntax here), but essentially you can use faceted search, and one of the things Solr outputs is the number of records that match, and it's able to do this very quickly, in memory, if your indexed documents are really small. We actually used this technique on RavenDB to count millions of things (that uses Lucene under the hood) and it worked really well, and we're getting the same thing set up on Cassandra in production today, actually, so I'm excited about it.
But yeah, in development it's worked great. So that's how we do that. Any other questions? Yes: how do I deal with wide rows in Solr and Cassandra? Okay. The answer is, with Solr, you don't; you're supposed to run away screaming from wide rows if you're using Solr indexing. With Cassandra, wide rows are pretty straightforward: we basically know exactly which column we want to start with, and we just grab the slice that we need.