From YouTube: C* Summit 2013: Real World, Real Time Data Modeling
Description
Speaker: Tim Moreton, CTO at Acunu Ltd
Slides: http://www.slideshare.net/planetcassandra/tim-moreton
Data modeling for Cassandra presents a new set of challenges, especially for developers with a background in relational data modeling. And there are added complexities in modeling for analytic applications which need to enable statistical functions over the data, but a good data model, exploiting Cassandra's strengths, can make all the difference to a successful project. This tutorial will examine a number of real-world customer data modeling examples and draw out some hints and tips that will benefit not just the Cassandra newbie, but also the more experienced data modeler.
Okay, good afternoon everybody. I'm Tim Moreton, I'm the founder and CTO of Acunu. I'm going to talk to you this afternoon about real-world, real-time data modelling. So I know it's the last session of the day and I'm only a small way between you and Cassandra-branded ale, so I'll try and keep it to the point.
But what I'm going to talk about, I guess, is the story of a couple of customers that we worked with, who were building real-time analytical applications on Cassandra, and some of the data modeling techniques that they used and that we helped them with to build their applications. I think they're pretty generally applicable and will be valuable to you guys as well.
So to start with: who the hell are Acunu? Well, we love Cassandra; we've been using Cassandra for a long time, since 2010. In fact, we've been working on and using Cassandra and helping customers with it ever since. We contributed a lot of the design and implementation work around virtual nodes. We originally built that for one of the customers that we were working with in the UK, Telefónica Digital, sorry, Telefónica O2, one of the UK's largest mobile providers.
They used Cassandra and Acunu together to collect, at peak, between 300 million and a billion event detail records a day; these are routing messages from SMS messages. They had a large amount of data on each of their machines, and they had a particular problem with that. And that sort of approach, where we work with customers and end up building product on the back of it, is exactly the same story I'll talk about for the rest of this presentation, mostly focused on data modeling. We also did a lot of work on the Cassandra Query Language in its earlier incarnations, and it's great to see how far CQL3 has come.
But as I say, this talk is really looking at the experiences that a few of the customers we engaged with on a support arrangement had, and the lessons that we learned through that from a data modeling perspective. So how do you build applications on top of Cassandra, and how do you get to grips, coming from a relational world, with the fact that CQL and the Cassandra engine are just a fundamentally different beast? So before I dive into that,
I think it's worth pulling out a bit of the experience that we have about how people are using Cassandra, just to set the context here. So on the one hand, one broad category of use case that we see is around session storage.
So that's where you perhaps have user profile information; perhaps you're in a social gaming organization and you're keeping the state associated with each user, and maybe you have 50 million of these. You push the state into Cassandra, you pull it out, you look at it, maybe you do an update on it, and you push it back. That's a common use case that we see, certainly not unique to Cassandra, but it's certainly a major aspect of usage there. And then the other side, I think, is real-time analytics.
So this is certainly a different beast in terms of the workload and the fit of Cassandra to this problem space. Here you're dealing with a stream of events being collected by Cassandra, and you're looking to summarize or aggregate or in some way pull information out of Cassandra that is a reflection of the data being put in, but isn't the data itself. You very rarely do updates in this situation.
So the characteristics of these two data sets are pretty different, and I think it's worth looking at that a little. If you're looking to build an application on Cassandra, or perhaps looking at various NoSQL or NewSQL technologies and thinking about whether Cassandra is the right fit for you, it's good to look at these characteristics and fit your problem into this space.
So with session storage, quite often your workload is going to be very read-heavy. You often want atomicity, or some reasonably strong guarantees about what is going to happen to the updates that you're pushing into Cassandra, or about the situation in which you're making those updates.
Also, if you look at it: even if you have 50 million users, if you have a kilobyte of data for each of those users, you're still only talking about less than the memory capacity of a single modern machine. "Big data" is a term that gets thrown around quite loosely, but I would describe many of these session storage use cases as probably not big data. With real-time analytics it's often quite different.
We've seen workloads where people are collecting clickstream data, telemetry data from telcos, infrastructure data, log messages, financial market data, and a range of other streams of data, typically coming at you with high velocity, and there the balance of reads to writes is often very, very different.
We typically see a hundred times more writes than reads, and all the reads are to the summaries or the results or the analytics that you're getting out the other side. You rarely want to read the original events that have actually gone in, and for any real system this really isn't going to fit in RAM.
So it's worth looking at this. Jonathan set out this morning a very similar list of Cassandra's strong points, and I absolutely agree: you've got scalability; high performance, especially write performance; and high availability, taking advantage of the fact that a cluster can span multiple data centers or racks, and be aware of that topology and work around it.
Those three facets are really strong points of Cassandra, and one of the challenges that still remains, certainly common to many distributed systems, is being able to get strong transactional semantics.
So for session storage, if you look at it, the scalability is not so much of an issue; the write performance is not particularly helpful for this workload; and often high availability can be useful, but you're still missing some of the ACID semantics. It's worth noting that the compare-and-swap operations coming in Cassandra 2.0 will greatly ameliorate this, that's for sure.
On the real-time analytics side, however, the picture's quite different. Scalability is really important here: you're typically going to be collecting large volumes of data. I gave you one example with Telefónica, but earlier this week I was working with a customer who was collecting one and a half billion events a day, and they want to keep all of these; and that's far from the highest that we see.
The imbalance in reads to writes means that write performance is a really useful characteristic of Cassandra, and high availability, similarly, is still useful here; but you don't need strong transactional semantics in this workload, and I'll come on to show you why.
So one of the, I guess, slightly contentious things I'm saying here is that there are many systems you can go to for session storage; lots of users use a number of different solutions here. If you want high availability, Cassandra is pretty much the only solution out there. But for real-time analytics it is a very, very good fit, and building applications that need to get insight out of high-velocity streams of events is a really common and natural use case for Cassandra, and I suspect that many of you guys in this room actually have that problem.
So the rest of the talk is going to focus on data modeling around real-time analytics applications, but these techniques are probably going to be useful as well if you have a session storage application. In particular, though, I'm going to focus on one example use case, which is not Telefónica O2 but another telco, where we were asked to help with a system that was collecting tens of thousands of call detail records a second, and they wanted to do network monitoring. There were several different use cases here: looking at drop rates, being able to track outages as they emerged, and understanding how, and which, customers were involved in and implicated in these problems.
So what we're really talking about here is operational analytics. It's not needle-in-a-haystack type processing, where you're collecting different data sets together and hoping to come up with some insight which is going to transform your business in some abstract way. It's very much to the point: you use your domain knowledge to apply the metrics that are important to your business, and then you want to be able to know what is happening to those metrics within a small time period.
There are many use cases to which this applies. If you're doing advertising analytics, clearly it doesn't take a Hadoop cluster to tell you that click-through rate is the metric you should be tracking; likewise latency on API requests. So these guys were looking to turn call detail records into real-time dashboards, so they could take corrective action when issues arose, and just check that the changes they were making to the setup were working fine; and they were using Cassandra to do this.
So I'm going to talk about how they did that. To start with, I'm going to just quickly summarize some basics of Cassandra data modeling. This may have been covered in a couple of other talks today, but I think it's well worth driving home.
The first thing is that you need to denormalize. Cassandra is not Oracle; it's not a relational database. You need to insert data in every arrangement in which you wish to read it back.
That gives you a lot of power, but with great power comes responsibility, and it's also a bit of a challenge, as we'll see later, to maintain agility when this is the case.
The reason you can do this is that Cassandra has a great storage layout, which means that random writes (or, more correctly, writes to random keys) are unlikely to incur disk seeks.
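To make that tenet concrete, here's a minimal CQL sketch; the table and column names are invented for illustration, not taken from the talk. The same event is written once per read pattern, which is affordable precisely because writes to random keys are cheap:

```sql
-- Two copies of the same events, one per query you want to serve.
CREATE TABLE events_by_subscriber (
    subscriber_id text,
    event_time    timestamp,
    payload       text,
    PRIMARY KEY (subscriber_id, event_time)
);

CREATE TABLE events_by_cell (
    cell_id    text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (cell_id, event_time)
);

-- Every incoming event is inserted into both tables at write time.
INSERT INTO events_by_subscriber (subscriber_id, event_time, payload)
VALUES ('sub-42', '2013-06-11 14:23:00', '...');

INSERT INTO events_by_cell (cell_id, event_time, payload)
VALUES ('cell-7', '2013-06-11 14:23:00', '...');
```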
The converse is also true, right? Sets of items that you're not likely to access together you should not put in the same row; there's no point incurring the cost of having to maintain them in sorted order. In particular, you're probably going to end up with enough stuff in each row that you need to somehow distribute your load across your cluster. And the final basic tenet here, really, is that atomic counters are a really useful building block for real-time applications.
A
So
cassandra
can
allow
you
to
insert
what
are
essentially
plus
ones
into
the
system
and
when
you
read
them
back,
you
get
the
actual
value,
and
this
is
this
is
a
great
building
block
for
not
all
real-time
analytics
use
cases,
but
pretty
much
pretty
much
most
so
just
to
highlight
this
one
event
being
one
event,
update
is
likely
to
result
in
potentially
many
updates
across
rows.
Don't
worry
about
that?
It's
absolutely
fine!
That's
that's!
To
be
expected.
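As a rough illustration of that building block (again a sketch; the table and metric names are made up), a counter column lets every writer blindly send a "plus one" without a read-modify-write cycle:

```sql
-- A counter table: the counter is the only non-key column.
CREATE TABLE metric_counts (
    metric text PRIMARY KEY,
    total  counter
);

-- Each incoming event contributes a "plus one".
UPDATE metric_counts SET total = total + 1
 WHERE metric = 'dropped_calls';

-- Reading back returns the accumulated value.
SELECT total FROM metric_counts WHERE metric = 'dropped_calls';
```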
So what follows, I guess, is a cookbook of several techniques that we helped this customer adopt, and that's only the first part of the story; I'll tell you what happens afterwards.
The first thing they wanted to do was to be able to track the occurrences of certain metrics, and to be able to do so over a variety of different time hierarchies. So they wanted to be able to count occurrences by day, hour, minute and second, and while that's not rocket science, it does require a little bit of thinking. What they do here is use a row for each level in the hierarchy, and within that row use the columns to maintain the subcomponents at that level.
So for every day you maintain a row, for every hour you maintain a row, and for every minute you maintain a row; and inside each of those rows, say inside an hour row, the actual minutes are encoded as columns. This is pretty convenient, because you'll be able to pre-build what's basically a graph of these occurrences over time, at any granularity, by just doing a single slice operation on a particular row.
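A hedged sketch of what one level of that hierarchy could look like in CQL3; the names are illustrative and this is a reconstruction, not the customer's actual schema:

```sql
-- One row per (metric, hour); the minutes inside that hour are the
-- columns of the row.
CREATE TABLE counts_by_hour (
    metric text,
    hour   timestamp,   -- truncated to the hour: identifies the row
    minute int,         -- 0-59: the column within the row
    count  counter,
    PRIMARY KEY ((metric, hour), minute)
);

-- Every incoming event does a "plus one" at each granularity.
UPDATE counts_by_hour SET count = count + 1
 WHERE metric = 'dropped_calls'
   AND hour = '2013-06-11 14:00:00' AND minute = 23;

-- A single slice of one row reads back a ready-made minute-by-minute graph.
SELECT minute, count FROM counts_by_hour
 WHERE metric = 'dropped_calls' AND hour = '2013-06-11 14:00:00';
```

The day and second levels follow the same pattern, each with its own table (or a granularity component folded into the row key).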
The second thing was: how do you add WHEREs to this? They were looking to filter these call detail records, these alerts and these occurrences, by a number of different fields: things like device type, model, carrier, and particular characteristics of the network.
Now, there are a couple of different ways of encoding WHEREs, but basically what you want to do is include the field in your row key, your partition key in CQL terms. The reason for doing that is that when you're filtering, you're trying to do something like "select the number of occurrences, grouped by time, where this thing happened", so you're unlikely to need to touch multiple different WHERE values at once.
If you are, fine, there are potentially other ways of doing it, but usually it's a filtering-type operation. So here you're augmenting the row key, and you have to augment every single combination of the row key that you've already got with each respective value of that filtering field.
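For instance, sticking with the same invented schema, the filter field becomes part of the partition key, and each event increments one counter row per filter-value combination:

```sql
-- device_type is baked into the row key, so a WHERE on it is just
-- a choice of partition, not a scan.
CREATE TABLE counts_by_hour_device (
    metric      text,
    device_type text,
    hour        timestamp,
    minute      int,
    count       counter,
    PRIMARY KEY ((metric, device_type, hour), minute)
);

UPDATE counts_by_hour_device SET count = count + 1
 WHERE metric = 'dropped_calls' AND device_type = 'handset-x'
   AND hour = '2013-06-11 14:00:00' AND minute = 23;
```

The cost, as the talk notes, is combinatorial: every filterable field multiplies the number of rows you have to update on ingest.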
GROUP BY is a sort of similar operation. What you're doing there is augmenting the column key that you have in the data model with the actual value of the GROUP BY field; then you can slice across multiple different components and pull out all those values at once. Remember, each row is going to be co-located on disk on a single machine, or a small number of machines, in the cluster.
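A sketch of that encoding, again with invented names: the grouping field goes into the column key (the clustering key, in CQL3 terms), so one slice of a single row returns the whole per-group breakdown:

```sql
-- carrier is part of the column key, so each minute fans out into
-- one column per carrier within the same row.
CREATE TABLE counts_by_hour_per_carrier (
    metric  text,
    hour    timestamp,
    minute  int,
    carrier text,
    count   counter,
    PRIMARY KEY ((metric, hour), minute, carrier)
);

-- One contiguous read returns minute-by-minute counts, broken out by carrier.
SELECT minute, carrier, count FROM counts_by_hour_per_carrier
 WHERE metric = 'dropped_calls' AND hour = '2013-06-11 14:00:00';
```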
So to go and get that data back, you're only going to need to go to one place on disk, and that's what you're looking to do: you're looking to minimize the work for every operation that you do. Then the next challenge these guys set us was: okay, we've got all this, this is nice, but we've seen some anomalies.
So: we'd like to find out what the constituent events were. We want to find out why, in this latency histogram, say, that you've just built using the techniques you outlined, there are these outliers. Why are there twice as many dropped calls in this particular area, or for this particular device type, at this time? I want to understand what those call detail records were; I want to understand what subscribers they affected.
What you're doing here is storing the original event. Some of that data is not going to be useful for grouping, some of it is not going to be useful for filtering; it's just going to be useful for finding out more detail about the underlying causes of a particular piece of analytics. So we maintain basically an ID mapping.
It's like a sort of manual secondary index, if you like, but managed by yourselves inside the Cassandra data model. So you see now that what you're getting is quite a lot of different colors in this data model, and remember these diagrams are simplified, because for every WHERE or GROUP BY you add, and every drill-down you maintain, you're going to need to manage all of this through CQL.
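A sketch of that ID mapping, once more with invented names: raw events are stored once by ID, and each aggregate bucket keeps the IDs of the events that fed it, so an anomalous point on a graph can be drilled down to its constituent records:

```sql
-- Raw events, stored once, keyed by a unique id.
CREATE TABLE raw_events (
    event_id timeuuid PRIMARY KEY,
    payload  text        -- the full call detail record
);

-- Manual "secondary index": which events contributed to each bucket.
CREATE TABLE events_by_bucket (
    metric   text,
    hour     timestamp,
    minute   int,
    event_id timeuuid,
    PRIMARY KEY ((metric, hour), minute, event_id)
);

-- Drill down from one anomalous minute to its constituent events.
SELECT event_id FROM events_by_bucket
 WHERE metric = 'dropped_calls'
   AND hour = '2013-06-11 14:00:00' AND minute = 23;
```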
So we got to a point where these guys were asking us to do more and more, and we had ended up with quite a lot of color in the data model; this was getting pretty tricky. At this point the system was working nicely, but the customer wanted to change things: they demoed it to their managers.
The managers saw how much potential it had and said: actually, please can we filter by these other characteristics? Please can we do this other form of analytics here? And I'd like a drill-down from this particular aggregate to this particular event. And then, of course, they also asked, very reasonably:
"Actually, could you just hand the system over to me, so I can do that myself?" And suddenly what you're looking at is needing to offer what you're currently doing in CQL, or in Thrift, through to your business owners, through to managers. In fact, the big problem really is that you can never anticipate the complete requirements, and this is a big part of the challenge.
A
So,
after
a
little
bit
of
iteration
on
this
a
little
bit
of
the
customer,
having
learned
some
of
these
techniques
and
built
this,
they
said
actually.
Is
there
a
better
way,
and
so,
from
that
experience,
what
we
ended
up
doing
was
productizing
some
of
this.
So
we
we.
We
noticed
that
there
was
much
commonality
in
the
data
models
that
people
were
building
and
wanted
to
help
people
build
something
that
was
a
level
higher.
So
we
put
together
a
framework
which
we
call
kuno
analytics
and
it's
for
cassandra
users.
A
A
I'll talk to you a little bit more about those things; all of this really started out as the data model that we worked on with a number of customers.
Acunu Analytics allows you to collect data via a RESTful HTTP API: you can just fire JSON objects or log lines at it. We have Flume integration, Storm integration and various message queue integrations. What it's doing is basically building continuous, OLAP-style cubes on ingest: rather than waiting overnight and coming back in the morning for your data store to have built aggregate cubes across a range of dimensions (which is essentially what you're doing in Cassandra anyway),
it builds them in real time. We allow you to do that using a high-level language, using concepts familiar from SQL, like WHEREs and GROUP BYs, LIMITs, HAVINGs, JOINs and so on, but without having to go into the detail or the depth of managing CQL.
So data comes in, we use Cassandra to store the raw events and the aggregates, and you can issue queries through this JSON API, which actually touch the aggregates in Cassandra.
So what I'm going to do here is prove Eric right, and see whether we can make something happen. What I'm going to show you is basically how you can achieve all of those Cassandra data modeling techniques in about three minutes, hopefully before my time runs out.
So the first thing I'm going to show you is that we have a UI here.
Basically, Acunu Analytics maintains a set of dimensions and a set of cubes on top of a table, and that table has an endpoint which you can fire JSON events at. So I'm just going to kick off a load generator, which is firing latitude, longitude, timestamp, duration and other data at it.
What you can see here is that we've basically said: treat the timestamp as a time and aggregate it up this hierarchy; treat latitude and longitude as a hierarchy aggregated at these levels; and maintain these cubes for me, because basically that was what I was doing in my Cassandra data model. So you can head over to here and say
something like: show me the count between five minutes ago and now, grouped by second, and you get a bunch of results. I don't think this would be classified as big data either, because we're only doing about 40 a second, since I set it going particularly slowly. But then, when your boss asks you to also compute the average duration of the calls coming in from those call detail records, I can do that, and then you can go and add that to a dashboard
which I will call "demo", and then you have real-time analytics powered by Cassandra. We've deployed this on use cases with, I think, up to about 5 billion events a day, so Analytics scales out linearly over Cassandra. And you can then also go and run queries like this one that I made earlier
to do exactly the sort of analytics that this customer was aiming to do with Cassandra directly; you can see all the SQL-like concepts there. So I'll make myself a new dashboard.
So there I have a real-time geo heat map, updating and aggregating call detail records, that's basically powered by Cassandra, and you didn't need to write a single line of CQL to do this.
In fact, it's pretty much all just set up through the UI. These queries are at a significantly higher level than what you would have to write if you were using Cassandra directly. So it gives you the power of Cassandra, but also helps you get to value more quickly.
What I did just there was change the latitude and longitude granularity, because we have those two buckets, and now we're going to aggregate at a different level of the hierarchy. So you can see I'm now collecting results at a different granularity.
But this isn't just a standalone tool: there's an API behind this. It's designed to help you build real-time analytics applications, and one of the things you can do is very easily embed these widgets in your own sites. So I think that's all I had to say. Thank you.
Yep, so: how does this relate to Storm? Where would you use this alongside Storm? Well, we actually have a number of customers who use Storm at the front end of what we do. Storm, for those who don't know, is a sort of distributed stream computing setup where you fire streams of events through and do processing on them; Storm is like a CEP system. We see Storm as a sort of upstream setup for processing data on the way into Analytics.
So quite a lot of our users do things like raising an alert when the volume of trades for a particular symbol has fallen below the mean minus one standard deviation of the usual level that that stock trades at, for ten o'clock to eleven o'clock on Tuesdays, historically over the last six weeks; or: just show me that graph, and show me another line of what happened last week.