From YouTube: Burt: Using Apache Cassandra For All The Things
Description
Speaker: Theo Hultberg, Chief Architect at Burt
At Burt we use Cassandra for a little bit of everything. We have a graph database, a tracing system, a stream processing engine and a document store that uses it for storage, and of course, we use it for time series too – but with a twist. Cassandra works great for all of these use cases, but not out of the box. We've learned the hard way what not to do, and what to do instead.
Hi everyone, and thanks for coming. My name is Theo. I work for a Swedish company called Burt. We do business intelligence for publishers and help them understand their revenues and their online audiences through analytics and other things. That means, basically, that we track a lot of website visits, and we bring in data from our customers' ad servers and ERP systems and CRMs and that kind of thing, and try to bring that all together to make their businesses more understandable. I'm also a Cassandra MVP, two years running now, and the original author of what's now the DataStax Ruby driver for Cassandra. So, are there any Ruby developers in the room, or are you all Java? Oh, that's great! Are you using the Ruby driver? That's good, that's really good. I'm not responsible for any of the bugs that have been introduced in the last few months, but I'm glad you're using it.
What I wanted to share with you here is that we're using Cassandra for wildly different things, and I thought it would be fun to show a few examples of where we use it. If you're not already using Cassandra, I'm sure you've read all about how Cassandra is great for time series, and great for this thing and that thing, but those write-ups tend to be quite similar, and I just want to show you a few very different things that all fit onto the same platform that we're running. Everyone has a favorite database that they feel they can use to solve basically any problem, and I don't want to stand here and say that you should use Cassandra for everything because it solves every problem; that's not it at all. But for us, Cassandra has been a really, really great tool, because if you take scale and performance out of the equation, anyone's favorite database can be used to solve any problem. We've managed to use Cassandra for things that are not just anything.
We have systems that are doing tens of thousands of writes per second. For example, we have a graph system that's tracking a graph with hundreds of millions of nodes. We have another system that's doing millions of increments per second, and another one that's storing time series with billions of values. Those are the kinds of things that you would be hard-pressed to solve with just any database; there aren't that many databases that can solve these problems for you.
So for us, Cassandra has really been a big win, and for me it all comes down to one of the first descriptions of Cassandra that I read: the Bigtable data model, with Dynamo's distribution model and log-structured storage. Those three components make a very, very good package for the kind of things that we're doing. To go through what I mean by that: the Bigtable data model is the idea of having rows, or partitions, that essentially map to a sorted map. You can build lots of complex things on top of that; CQL, for example, is a great example of a very complex data model that has been built on top of the quite basic tools that you get from the Bigtable model.
The Dynamo distribution model gives us a very simple system to run. You know you need to upgrade to the highest version in every release, so we've upgraded probably three times in a row in one cluster, all while the application was running, writing thousands of operations per second. It's like changing the engines of a flying airplane; I think it's just magic that it works. And the log-structured storage enables a massive write capacity.
When I talk a bit more about what we do, you will see that we tend to do much, much more writing than reading. Sometimes we only write; we never really read the data back. Having a database that can just swallow all of those writes is fantastic. And what looks kind of like fireworks here in the periphery of the slide is actually a visualization of our platform.
There's the production system down in this corner and the staging system in the other corner. The nodes are colored by their roles, and it's a force-directed graph that shows which nodes have network connections to other nodes in the system. You can see some of the Cassandra clusters; they have a very particular shape in a force-directed graph, because all of the nodes are connected to all of the other nodes.
The first thing I wanted to talk about is one of the first applications that I built on top of Cassandra: a conversion tracking application, or actually a way to model attribution. If you're an advertiser and you buy advertising, you might buy it on a lot of different websites, and you want to know what was effective when people buy something in your store, for example, or sign up for a newsletter.
You want to know what advertising actually influenced that. A very simple model might be: the last ad that someone saw was probably the ad that influenced the purchase. But you can also have something more complex, like the last ad that was visible for more than 20 seconds was the one that influenced the purchase. You can make just about any model you'd like that looks at the visit history and sees what ads were displayed.
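The two attribution models just described can be sketched as plain functions over a visitor's history. This is a hypothetical illustration, not Burt's code, and the `AdView` fields are assumptions; the talk only says the history records which ads were displayed and for how long:

```python
from dataclasses import dataclass

@dataclass
class AdView:
    ad_id: str
    timestamp: int          # seconds since epoch (assumed field)
    visible_seconds: float  # how long the ad was on screen (assumed field)

def last_ad_attribution(history):
    """Simplest model: credit the last ad the visitor saw."""
    return history[-1].ad_id if history else None

def last_visible_attribution(history, min_visible=20.0):
    """More complex model: credit the last ad visible for more
    than `min_visible` seconds."""
    for view in reversed(history):
        if view.visible_seconds > min_visible:
            return view.ad_id
    return None
```

The two models disagree exactly when the most recent ad was only glanced at, which is the point of letting the model be pluggable.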
So when someone purchases something, you want to be able to go back and look at their history and see what they did up until the point when they bought something, so you can figure out why people are actually coming to your store and buying things. I'm sure that some of you in the audience with Cassandra experience have already started thinking about it.
How would you implement this in Cassandra? Obviously, it's a very simple thing: you've got your partitions, and this is the way I'm going to show you a few examples of schemas. I might be saying rows and columns, because I started using Cassandra when that's what we called them, but now it would be partitions and cells and clustering keys.
So the partition key, the thing that decides where the data lives in the cluster, is the big box; the clustering key is the row on top, and the value is the row at the bottom. This schema is very simple: we have some kind of ID for the visitor, maybe a cookie, maybe some other identity.
So: what was it that influenced this decision? It's a very, very simple model, but Cassandra is really great for it, because not very many people actually buy things online. The rate might be one in 10,000, and that would probably be a very, very good conversion rate. So for each history that we actually want to look at, we've written 10,000 histories that we basically don't care about.
So you need something that can handle that write volume, because most of the data is just going to be thrown away. And speaking of throwing data away: being able to tell Cassandra in advance that you don't want to keep the data around is a really, really nice feature. In this example, we wrote the data with a TTL of a few days and said: after a few days, just get rid of it, because it's not interesting anymore.
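In CQL, that expiry is expressed at write time with a `USING TTL` clause. A minimal sketch of building such a statement (the table and column names are assumptions, not the actual schema from the talk):

```python
def visit_insert_cql(ttl_days):
    """Build an INSERT that asks Cassandra to expire the row
    after `ttl_days` days; TTL is given in seconds."""
    ttl_seconds = ttl_days * 24 * 3600
    return (
        "INSERT INTO visit_history (visitor_id, seen_at, ad_event) "
        "VALUES (?, ?, ?) "
        f"USING TTL {ttl_seconds}"
    )
```

Because the expiry rides along with each write, no separate cleanup job is needed, which is exactly the windowing effect described next.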
Something that someone did a few days ago is not going to be the trigger for what happens today. So we get an automatic windowing of all the queries in the system; we know that we will only ever be looking at recent data. We don't need any complex system running in parallel with this, figuring out what to delete, and we won't accidentally run out of disk space because we forgot to run that cleanup process.
We also do a lot of analytics, obviously, since we're an analytics company. I'm sure that most of you have read about doing time series with Cassandra; you might even have been to one or two presentations today about time series in Cassandra, and Cassandra is really good at that. Most of the time when people talk about time series, though, they talk about sensor-data-style time series.
So if you've got a data warehouse where you collect all of your events (in this case page views, which is something we work a lot with), you might store every single page view in one big table, and then you run SQL queries on that. That's a very, very expensive way of not figuring out in advance what kind of reports you want to run.
So instead, we needed to find a cheaper way of doing this. The interesting thing here, compared to the sensor-data time series I mentioned, is that we don't really know the cardinality of these dimensions. Take the number of categories: I've got no idea how many categories there might be on a website that we track.
It could be tens, or it could be hundreds, or it could be thousands; who knows. And there could be new ones coming in every day. There could be new authors, and old authors moving on to other publications. Device types might not be something that changes very often, but, you know, just a few months ago Apple announced a new kind of device, so maybe we'll add watch to the list of possible device types in half a year or so. So we have these unbounded dimensions that we need to track.
That means we can't really just write the metrics, because if we do that, we can't find them again; we need to have a way of searching for them, and I'll get back to that. In analytics people talk a bit about cubes, and about slices, and there's lots of nomenclature like that. I'm not really a data warehousing expert.
I sort of learned how you do this when you're not using data warehousing tools, so I've borrowed some of the terms, and I just thought I'd mention them so that you know what I'm talking about. If you do a report like this, you might want to have a few metrics: those are the things that are numbers. And then you have a few dimensions: those are the things that you group by, or that you want to report on.
So you might want to know the number of page views by category, author, and device type. Sometimes you want more dimensions, and sometimes you want fewer. This is called a cube, because it's got dimensions; in this case I chose three dimensions so that it actually was a cube, but sometimes it's two dimensions and sometimes it's five or more, and it's still called a cube.
This table actually serves as a staging area for another batch-type application that reads all of these metrics and then does filtering and sorting, and rolls up the metrics by day and by week and by month and a few other things. So when it says time here, I actually mean an hour: we've got a date and an hour, and we've got one cell here for each slice per hour. Let me reiterate what I mean by cube and slice in this model.
The cube is actually just a string that contains the names of the dimensions of the cube, and the slice is also just a string, containing the values of those dimensions. So we might be tracking a site with a category called sports, an author called Sue, and a device type called tablet. The metrics for that slice would be the number of page views in the sports section where Sue was the author and people used a tablet to read the article.
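That encoding of cubes and slices as strings can be sketched like this (the comma separator is an assumption; the talk only says both are strings):

```python
def cube_key(dimensions):
    """A cube is just the ordered dimension names joined into one string."""
    return ",".join(name for name, _ in dimensions)

def slice_key(dimensions):
    """A slice is the corresponding dimension values joined the same way."""
    return ",".join(value for _, value in dimensions)

# The sports/Sue/tablet example from the talk:
dims = [("category", "sports"), ("author", "sue"), ("device_type", "tablet")]
```

Keeping the dimension order identical in both strings is what lets a slice be matched back to its cube.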
This system is actually doing around two million increments per second, but it's not doing it with counters. As you saw, we're storing all the metrics together in a blob, so it would be hard to use Cassandra's counters, and we also have a few floating-point metrics, which would also be a bit awkward. It could be done with counters, but it's a bit easier if you're not using them. And if you're doing two million operations per second, even if Cassandra can absolutely do it, you should really try it out before you assume that it works for your use case.
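Keeping all the metrics for a slice together in one blob cell, rather than one counter column each, might look like this with Python's `struct`; the particular metric set and byte layout are assumptions for illustration:

```python
import struct

# Assumed layout: two integer counts and one floating-point metric,
# packed big-endian into a single cell value.
METRICS_FORMAT = ">qqd"  # page_views, ad_views, engagement_seconds

def pack_metrics(page_views, ad_views, engagement_seconds):
    return struct.pack(METRICS_FORMAT, page_views, ad_views, engagement_seconds)

def increment_metrics(blob, d_page_views, d_ad_views, d_engagement):
    """Read-modify-write in the application: unpack the blob,
    add the deltas, and pack it again before writing it back."""
    page_views, ad_views, engagement = struct.unpack(METRICS_FORMAT, blob)
    return pack_metrics(page_views + d_page_views,
                        ad_views + d_ad_views,
                        engagement + d_engagement)
```

This sidesteps counters entirely, including the floating-point case that counters don't cover, at the cost of doing the increment in the application.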
When you do it like this, as I said, the dimensions are essentially unbounded; we don't know how many different values there will be in a dimension. So you can end up with a graph that looks like this, which is showing the queue depth in the queuing system that's feeding this analytics engine. The peak there is a few hundred million messages, and that's the situation we were in.
What happens is that when you're tracking websites, you're tracking, for example, recipe sites. Recipe sites are usually not huge websites, but before Christmas and before New Year's they just blow up completely; they get an order of magnitude, if not several orders of magnitude, more traffic. And when people browse recipe sites, they tend to look at a lot of different things. So if you have a dimension like URL, suddenly you have this jump in cardinality, from a few tens of thousands of different URLs, for example, to hundreds of thousands, or maybe even millions, and the rows get quite long if you actually have all of them in a single row. We solved that by sharding the rows so that each slice went into a different row; we just used a simple hashing function to introduce an extra level of splitting things up into rows.
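That extra sharding level can be a stable hash of the slice value taken modulo a shard count. A sketch; the shard count and the choice of hash are assumptions, since the talk only says "a simple hashing function":

```python
import zlib

NUM_SHARDS = 16  # assumed; the talk doesn't say how many shards were used

def shard_for(slice_value, num_shards=NUM_SHARDS):
    """Derive a stable shard number so one logical wide row is split
    across num_shards physical rows. zlib.crc32 is stable across
    processes and runs, unlike Python's built-in hash()."""
    return zlib.crc32(slice_value.encode("utf-8")) % num_shards
```

A reader then has to query all shards for a slice, but each physical row stays bounded.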
That is something to keep an eye on. Wide rows are really, really nice, but if you have something that's unbounded and it's also a lot of data, then watch out, because it's not really how wide the rows are, it's how big they are: how much Cassandra can sort in memory at the same time. That was basically the problem for us.
We also have another, more time-series-like database that's doing essentially the same thing as the system I just showed you. This is sort of the next generation of that, and it will be replacing the old one. It also serves data right out of itself, so it doesn't need that batch layer doing the secondary aggregation. The schema is quite similar, but instead of having the slice in the clustering key, the slices and the metric name are actually in the partition key, so each metric in each slice gets its own partition. And to make it possible to find all of the metrics, which is the problem you get with unbounded dimensions, we have an extra index table where we keep track of all the slices for a cube. Again, this is a simplification.
There are lots of extra things here too, to make sure that we can find things, and different customers get different partitions. This gives us narrow rows. We make sure that we get narrow rows by introducing an epoch, which is basically just the time divided by some constant.
I also see that I forgot to remove "version" from the clustering key; it's just in there. That's a complication I hoped I'd have time to talk about, but I won't; I'd love to talk more about it after the presentation, though. So the epoch is a way of dividing up all these long time series into shorter series, so that we can be sure that we won't get enormously long rows.
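The epoch idea, the time index integer-divided by a constant so it can join the partition key, can be sketched like this; hourly resolution and 1,000-unit buckets follow the talk, while the exact key layout is an assumption:

```python
BUCKET_SIZE = 1000  # time units (here: hours) per partition

def epoch_for(hour_index, bucket_size=BUCKET_SIZE):
    """Integer-divide the time index so each partition holds at most
    bucket_size consecutive values."""
    return hour_index // bucket_size

def partition_key(slice_value, metric_name, hour_index):
    # Slice and metric name in the partition key, plus the epoch,
    # so no single partition grows without bound.
    return (slice_value, metric_name, epoch_for(hour_index))
```

Consecutive hours land in the same partition until the bucket rolls over, so reads for a time range touch a predictable, small set of partitions.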
We just divide them up into about a thousand time units each, so a row will never be longer than about a thousand cells. A thousand is probably too small; we could have gone with 10,000 or 100,000 and probably be fine, and we might be tweaking that in the future, but it felt quite good at the time. The narrow rows also give us another benefit, which is that we can do parallel reads. Instead of reading a hundred thousand things sequentially ("just read this row and give me a hundred thousand items"), we can read a thousand items each from a hundred rows in parallel, and that helps with latency and that kind of thing. That's an added benefit that we didn't really consider before doing it, so it was actually a bonus.
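The parallel-read pattern can be sketched with a thread pool; here `fetch_row` is a stub standing in for a real per-partition Cassandra query:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_row(epoch):
    """Stub for 'read one partition'; a real version would issue a
    Cassandra query for the given epoch's row."""
    return [f"value-{epoch}-{i}" for i in range(1000)]

def read_series(epochs, concurrency=32):
    """Fan the per-epoch reads out over a thread pool and flatten
    the results, preserving epoch order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        rows = pool.map(fetch_row, epochs)
        return [value for row in rows for value in row]
```

A hundred reads of a thousand cells each complete in roughly the time of the slowest one, instead of summing sequentially.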
But as I mentioned, one of the big problems in doing time series is finding the series. It's fine if you've got an enumerable set of sensors or an enumerable set of metrics, but we've got dimensions where we don't even know what the values are in advance. So we need an index to make it possible to actually find anything. If you want to find the number of page views per device type in the category sports, for example, you need a way of finding it. And again, this is SQL, not CQL, just to show how you would do it if you didn't pre-compute all the metrics and instead had a table with all of these events.
Instead, you write everything multiple times. Disk is very cheap and Cassandra can take lots and lots of writes, so why not just save every permutation of your slices to a Cassandra table, so that there will be one partition able to support any query that you might have. The one at the top will support a wildcard query on device type: you can say sports and any device type. The one at the bottom will support the reverse: you can say computer and any category. We do this for all the possible cubes, and that ends up as a huge table, probably the biggest table in this system. But as I said, disk is pretty cheap, and writing things to Cassandra is also really cheap. So it's a way of making this possible by optimizing for the reads rather than optimizing for the writes, and I really like being able to do that.
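Generating every wildcard permutation of a slice for that index table can be done with `itertools.product`; the `*` wildcard marker is an assumption for illustration:

```python
from itertools import product

WILDCARD = "*"

def index_rows(dimensions):
    """For each dimension, emit either its concrete value or a wildcard;
    one index row per combination, so any mix of fixed and 'any'
    dimensions can be looked up directly."""
    choices = [(value, WILDCARD) for value in dimensions.values()]
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*choices)]
```

A slice with n dimensions fans out into 2^n index rows, which is exactly the write amplification the talk accepts in exchange for cheap reads.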
We have another case where we're using Cassandra for essentially transient state. It's a system where we see all these small fragments about page views: we know when the page view started, we know when you scroll, we know when there's an ad being displayed, and we know when you leave the computer and go to lunch, because we notice that you're inactive and that becomes an event. We want to wait for all of these events and assemble one single object that represents the whole page view: a timeline of what happened, the total duration of the page view, and, if there were any ads on the page, the duration they were visible, and that kind of thing. This system is actually doing all of this in memory.
It never reads anything from Cassandra unless it needs to be restarted for some reason. If it crashes, or if we redeploy, or if it restarts for some other reason, it uses Cassandra to replay, to reload, its current state. So we continuously write all of the fragments that we get in, and we write an index of all the active sessions. Then, when a session is no longer active and we've assembled the complete page view, we delete all of its data from Cassandra and we send the page view off. So we basically only write. We do a few reads every other week in this system, but apart from that it's basically just writes, which sounds a bit odd, but Cassandra here is basically a way to be able to restart: these nodes can just crash at any point, and we can restart and get back to exactly where we were. Sorry, could we take questions afterwards?
I think that would be easier. So, the index: the index is just sharded; there's a number of shards just to spread out the writes for that index. The top table here is quite similar to the table in the first example, the attribution example. It's tracking something else, but it's essentially the same thing.
This also has quite a wide-row failure story in it; we're apparently very bad at doing things right with wide rows. What happens with the session index is that we write "now we're tracking this session", and then, when we stop tracking it, we just delete the entry. So we end up with a long, long row for each shard that essentially just contains deleted data. We sort of knew this going in, but we figured that Cassandra would eventually compact away all of these tombstones; it goes through the tables and compacts them away, and it's not that important if it takes a few seconds to read this, since it only happens when we restart. So, not a huge problem. But what we didn't know was that Cassandra needs to see the whole row before it actually removes tombstones, and it can't really do that here, because these rows are spread out across all of the SSTables, because each row has been around forever.
So what happened when we restarted, eventually, was that we trawled through a hundred thousand tombstones for every actual live session that we read, which is really, really bad for performance. Eventually it took 45 minutes to restart the system, and we decided to do something about that. So instead of doing this, we introduced an extra piece in the session index table that was essentially just the day of the year modulo 30.
Every day we would use a new partition, until 30 days had passed and we started using the first one again. The reason we did this cyclic thing is that at any point when we need to restart, we might actually have to read a few entries from the last day's partition as well as from the current partition. So we start reading at the current day's partition, and then we read backwards until we don't find anything.
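The cyclic day partitioning and the read-backwards-until-empty replay can be sketched like this; the 30-day cycle is from the talk, while the in-memory `index` dict stands in for the Cassandra table:

```python
def day_partition(day_of_year, cycle=30):
    """Each calendar day maps to one of `cycle` reusable partitions."""
    return day_of_year % cycle

def replay_sessions(index, day_of_year, cycle=30):
    """Start at today's partition and walk backwards until an empty
    partition is found; sessions can straddle midnight, so yesterday's
    partition may still hold live entries."""
    sessions = []
    for back in range(cycle):
        partition = day_partition(day_of_year - back, cycle)
        entries = index.get(partition, [])
        if not entries:
            break
        sessions.extend(entries)
    return sessions
```

Each day's tombstones are confined to that day's partition, so a restart never has to wade through a month of deletions to find the live sessions.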
Reusing partitions is a bit of an ugly solution, and actually we're not even still doing this. We've done another thing instead: we're running major compactions on this cluster continuously, because the amount of active data that we're actually tracking is really, really small, so we can afford to just run compactions all of the time. There's only a couple of hundred gigs of data that's active, so Cassandra will manage that nicely, keeping it in one big, or at least not huge, SSTable, and reads will be reasonably fast, while we still just throw tens of thousands of writes per second at this. I've got lots more examples of what we've done with Cassandra, but I won't be able to share all of them with you, because I don't have all afternoon to do it.
As I mentioned, we have a graph database. It's not a general graph database; it's tracking a bipartite graph, where nodes of one type can connect to one or more nodes of another type. We're using that to track identities across websites. We've got a tracing system that keeps track of what happens in our whole platform, and we can go in there and see what's slow and what's not, kind of like Zipkin, very much inspired by Zipkin, but very tailored for our system.
We've got a key-value database that's not just keys and values; it uses Cassandra's rows again to enable a few things. Since we're integrating with lots of different systems, our customers might have the data about, for example, an ad campaign spread over a few different systems, some data in one system and some in another. Instead of having to integrate with all of those, assemble one object, and write that to the database, we can have different integrations that work at different times.
I think Cassandra has really helped us to do these things, and I can't think of another database that would have made them possible. I'm really, really grateful that we have Cassandra, and for the great guys and gals at DataStax who have created it for us. So thanks a lot, and catch me afterwards or tonight; I'd love to answer any questions.
Yeah, so I'll try to repeat the question for everyone. When we're partitioning the data, do we map that to how it would be partitioned in Cassandra? That's sort of the question. Yes and no, actually. We tried, but since we're pre-computing cubes, you have one stream of objects that you could route on some particular key, but then you need to explode them into the increment operations that will happen.