Description
Speaker: Andrew Noonan, Developer at Gnip
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-dude-wheres-my-tweet-taming-the-twitter-firehose-by-andrew-noonan
Gnip ingests and must serve out hundreds of millions of social activities every day, and social platforms are only growing. This makes the scalability of applications essential for Gnip. Enter Cassandra. Problem solved, right? Not exactly: Gnip's relationship with Cassandra was not all rainbows and unicorns. In this session we will walk you through why we began looking at Cassandra as a data store in the first place and the valuable lessons we learned with Cassandra that have made it an invaluable part of our infrastructure.
Welcome, everybody, to "Dude, Where's My Tweet: Taming the Twitter Firehose." I'm Andrew Noonan, a software engineer with Gnip. I'm controlling two laptops, so bear with me.
So first I'll get into who I am and what I'm going to talk about today. My name is Andrew Noonan and I'm a software engineer at Gnip. I'll talk a little bit about Gnip, what we do, what our business model is, and how Cassandra fits into all that; then about how Gnip and Cassandra first got together and what problem we used it to solve; a little bit about rainbows and unicorns; and hopefully there will be some time for questions at the end of it all. So who is Gnip? Well, Gnip is "ping" spelled backwards.
That has a lot to do with how the company started: we tried to reverse the pinging of social media APIs. We provide a centralized streaming HTTP connection for social media APIs. We're in beautiful Boulder, Colorado. It's a pretty awesome place with a lot of technical companies around.
All these different publishers present us with a unique problem. Twitter is a large publisher in terms of the number of activities we see, but each new publisher we bring on might have a different format or large payloads that we have to deal with, so there's a lot of variability in the scale we handle. And social media publishers are only growing. Social networks are only growing, so we as a business have a large need for scalable solutions.
Speaking to Twitter specifically, the Twitter firehose as it stands today is about five to seven thousand tweets a second on average, and we regularly see spikes of up to 20,000 tweets a second. This right here is a spike we saw recently that was up to, I think, fifteen thousand messages a second, completely unplanned; we didn't know it was coming.
Twitter just gets customer queues or user queues backed up somewhere in their system, and all of a sudden they flush and we see a downstream event. We need to have our systems planned so we can handle these events smoothly and not drop any messages. There are planned events that we see all the time. Michael Jackson's death was not very planned, but it was a social media event. The Super Bowl, for instance: last year that was something like twenty thousand messages a second sustained for several minutes, so we definitely have to build scalable solutions that can handle all of that.

We actually hold the entire Twitter archive in S3, and we save all of these messages into S3 as they come in. We ingest something like two terabytes of data a day; I think we have a full petabyte up in S3 at this point. We've got a mix of cloud and dedicated hardware. The cloud is a great solution for these scalable apps we're writing, but every now and then you need a little more horsepower out of a single machine.
So we've actually made some investments in that area. We serve over ninety-five percent of Fortune 500 companies with social data, so these are enterprise customers paying a lot of money for this data. They need reliability, they need availability, and they don't really tolerate systems going down, that kind of thing. And speaking to our scale some more, we deliver 120 billion social activities every month, and it's just growing every day.
We also require redundancy and reliability. Our customers are paying quite a lot of money for this, they've got big dollars riding on it, and so they require a reliable solution with redundancy behind it. There are always going to be failures, so we build redundancy and reliability into everything we look into.
Obviously that relates to Cassandra, and availability is very important to us. As we encounter real-time problems we need available apps, so we began searching for different solutions to our data problems, and we were looking at Cassandra. I was on the team tasked with coming up with the new technology we were going to use, something scalable that would meet all these requirements, so I'll tell you about how Gnip and Cassandra first met.
The system in question was seeing a regular, sustained write throughput of about 500 to 700 messages a second, and we were starting to see regular spikes up to 1,500 messages a second. We were also building a new product on top of this that was definitely going to increase the read requirements on the system. We were thinking about 250 clients at any given time; that has now scaled to more like 500 clients.
So it was very clear to us that the solution we had in place right then, which was not a scalable solution, wasn't going to work in the long term. We needed to come up with something else. These were the specs we were looking at. This is actually a graph from back then, and just as a for instance, we saw this random spike of 3,300 messages a second and we actually didn't know what the hell was going on upstream.
At the time we talked to the publisher and finally found out that there was a concerted effort upstream, batch processes that were causing these bursts of messages to come down every now and then, and they told us it was only going to keep growing. So we were like, all right, we should probably do something about this. So, enter Cassandra. Cassandra obviously seems to fit a lot of the things I've been talking about.
It definitely has high write throughput. We were going to continue to see growing volumes, both in this first use case, where we were really just inserting metadata, and later, ideally, when we would be shoving every activity we saw into Cassandra, so high write throughput was definitely covered. It's definitely a scalable solution, so as we grew out this new product that was going to throw a big read load onto the system, we would be able to scale out as we saw more customers come on.
So that was definitely a check for us. It's highly available, so as we see the network blips that regularly happen between our own nodes, our customers, and our system, we would be able to stay available and serve any requests coming in. And it's persistent, which is very important to us. We hold historical records of social activities, and it's very important that we persist those, because if we don't persist them, no one else really is, and we can't really get them back.
It's very hard for us to go back to publishers and say, hey, we were down for a little bit, can we get that back? But that regularly happens with our customers: they get disconnected from a real-time stream and they want that data back. They come back to us and say, hey, can we get it back? And we're in the position to say, yeah, you can; you'll have to pay us for it, though. So, Cassandra: is it all rainbows and unicorns? It fit these specifications perfectly.
It was a brand new technology that was everything we wanted it to be. We stood it up, starting with, I think, two nodes, maybe, in AWS. I set it up locally first and tested it, and it seemed cool by all accounts, so we put it in AWS at two nodes and started writing to it. The write throughput was pretty cool, but we figured we should grow to four nodes, so we built it out to four nodes, wrote a little test app, and threw as much mock data at it as we could.
So
let's
talk
about
the
road
bumps
that
we
did
see
so
first
we
stood
it
up.
Like
I
said
I
did
some
initial
testing.
It
was
mostly
with
mock
data
of
some
apps
that
we
just
kind
of
wrote
up
to
test
Cassandra,
but
then
we
actually
threw
it
into
our
real
system.
You
know
first
in
staging,
obviously
not
in
production
but
threw
it
up
there
and
started
writing
to
it.
Things
looked
alright.
We
did
not
have
any
maintenance
put
in
place.
This
is
a
bad
idea.
Because we started to load test the hell out of it, we started seeing dropped mutation messages. We read up on that, and it was like, yeah, you should probably have maintenance in place to make sure consistency actually happens in case messages are dropped between your nodes. So we put maintenance in place, and all of a sudden we saw a 2x growth in data on disk and had no idea why.
So we started getting kind of concerned about that and looking into why our data might have all of a sudden ballooned on disk; we had totally thought that, if anything, it was going to shrink. And obviously Cassandra is scalable, right? We just add more nodes, it's going to split the key space in half, and we'll have half the data on the disks. Obviously that's not really true either; we started seeing lots of data being streamed between the new nodes and the old nodes.
It wasn't really working out exactly as we planned, so we started to panic and freak out. Then we moved beyond that and started to think about whether we had made the wrong choice. Was Cassandra really the right choice for our datastore? Should we have gone with a different technology? And the answer is no. We had made a good choice; we just had to fully understand what we were dealing with. So we started to turn to the community.
A
It
was
around
us,
you
know,
ask
other
companies
in
the
area
that
it
we
know
use
Cassandra.
We
have
friends
in
the
tech
community,
so
talk
to
them.
Had
they
seen
these
problems
that
we
were
having
some
of
them
had
we
started
talking
to
datastax.
You
know
we
were
getting
a
little
bit
of
support
from
them.
We
started
to
really
hammer
them
pretty
hard
with
questions.
A
So
we
took
a
step
back
and
started
to
to
really
analyze
what
was
going
on.
We
found
out
that
you're
right
pattern
matters
just
as
much
as
your
read
pattern.
So
you
know
when
we
first
started
looking
at
at
Cassandra,
we
were,
we
heard
you
know
design
your
data
schema
around
how
you're
going
to
access
it
right.
It's not a relational database, so you want to be able to access it in a predictable manner if you want performance out of it, which was very important to us. But we found out later that your write pattern matters quite a bit because of your compaction strategy. We were using size-tiered compaction, and it turns out we were updating rows all across the cluster, all across the key space, on a regular basis.
If you keep updating a row, you will start to see fragments of that row across several SSTables, and then when you need to reconcile that, even just to do a read, you might have to read from several SSTables. And if you need to stream that data, or you want to do a compaction, with size-tiered compaction you need up to double the size of your largest SSTable free on disk. If those SSTables are starting to get fairly large, you need quite a bit of scratch space.
We learned that that was the problem we were seeing: when we put maintenance in place, a lot of compactions were taking place, and our data was essentially doubling on disk because it was trying to reconcile all of it; there were updates all across it. We also found out that how much data you store per node is extremely important when you want to grow your cluster or do any of these repairs or maintenance.
The amount of data you have on one node is going to greatly affect the amount of time it takes for that to happen. When we put maintenance in place, we were seeing that it could take up to a week to run maintenance around the entire cluster, because we needed performance out of the cluster at all times; we couldn't just take the whole cluster down to do repairs or whatever else we wanted to do.
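As a rough sketch of what that maintenance looks like in practice, a rolling repair runs against one node at a time so the cluster keeps serving traffic while each node's primary range is reconciled. The node addresses here are made up for illustration, and the script just shells out to nodetool.

```python
#!/usr/bin/env python3
"""Rolling repair sketch: repair one node at a time so the cluster stays
available throughout. Host addresses are hypothetical."""
import subprocess

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

for node in NODES:
    # -pr repairs only this node's primary range, so one pass over every
    # node covers the whole ring once instead of replication-factor times.
    print(f"repairing {node} ...")
    subprocess.run(["nodetool", "-h", node, "repair", "-pr"], check=True)
```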
So, the knobs we realized we could start tuning: one big one was the compaction strategy, now that we had learned what was going on with the one we were using. We decided not to go with leveled compaction, because we were worried it would cost us throughput; if that's a concern for you at all, it can put extra strain on your cluster. We decided not to switch; just understanding what was happening underneath was good enough for us.
The compaction throughput in megabytes per second was another knob we realized could really speed things up for us. The compactions were taking quite a while. It's recommended that it be 16 to 32 times your write throughput; out of the box I think it's 16 megabytes per second, and we were seeing our writes at somewhere around eight megabytes a second, so we definitely had to up it. We upped it to, I think, 150, and we saw compactions going smoothly.
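For reference, that knob can be changed on a live node without a restart; the persistent setting is compaction_throughput_mb_per_sec in cassandra.yaml. A minimal sketch, assuming nodetool is on the PATH and using the 150 MB/s figure mentioned above:

```python
#!/usr/bin/env python3
"""Raise compaction throughput on a running node, then check the backlog."""
import subprocess

# Default is 16 MB/s; with ~8 MB/s of sustained writes, compactions were
# falling behind, so the limit gets raised to roughly 150 MB/s here.
subprocess.run(["nodetool", "setcompactionthroughput", "150"], check=True)

# Show pending compactions to see whether the backlog is draining.
subprocess.run(["nodetool", "compactionstats"], check=True)
```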
Our reads were not getting too greatly affected, so we were happy, it made our cluster happy, and things were humming along. Later down the line we came up with another use case for Cassandra: an n-day archive of Twitter data. We were literally planning to throw the Twitter firehose at it, and like I said, we regularly see about five to seven thousand messages a second in the firehose, but with huge spikes up to 20,000 messages a second.
We decided we would do dynamic column families. Dynamic column families just means creating and expiring column families as you go. We bucketed them into basically three-hour chunks, and we wouldn't use the, I forget what the term is, the TTL: you can set a lifetime on your data in Cassandra and it will basically handle removing that data at some point, but your storage reclamation might take some time.
A
We
would
actually
simply
we
would
simply
go
ahead
and
just
remove
the
column,
family
and
then
three
hours
later
delete
the
data
on
disk.
We
add
an
ex
FS
have
an
ex
FS
file
system,
so
the
deletes
on
disk
or
blazingly
fast,
and
so
things
seem
to
be
looking
pretty
good
here.
You
know
we
weren't
doing
we
weren't
updating,
rose.
Ninety-Five
percent
of
our
rows
were
written
to
once,
and
that
was
it
and
we
were
going
to
be
rolling
rolling
these
column,
families
off,
and
so
that
would
save
us.
A
You
know
essentially
a
constant,
a
constant
amount
of
space
taken
up
on
disk.
Our
reads
were
going
to
be
much
lower
than
our
rights,
for
you
know
we're
throwing
the
fire
hose
at
it,
but
the
the
usage
you
know
at
the
time
of
design
was
kind
of
undefined,
but
we
were
able
to
scale
that
we
knew
it
would
be
kind
of
one-off
reads
at
user
requested
rather
than
anything
streaming
at
it.
So
so
things
were
looking
perfect
for
us,
everything
looked
fantastic
and
we
thought
we
had
already
been
through
the
trenches
with
Cassandra.
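A minimal sketch of that bucketing idea, written against the DataStax Python driver with CQL tables standing in for the Thrift-era column families; the keyspace, table names, schema, and retention numbers are illustrative rather than Gnip's actual setup:

```python
#!/usr/bin/env python3
"""Time-bucketed tables: write into a table per three-hour window and drop
whole buckets as they age out, instead of relying on per-row TTLs."""
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster  # DataStax Python driver

BUCKET_HOURS = 3
RETENTION_DAYS = 20  # n-day archive; 8 three-hour buckets per day

session = Cluster(["127.0.0.1"]).connect("archive")  # hypothetical keyspace

def bucket_name(ts: datetime) -> str:
    """Table name for the bucket containing ts, e.g. tweets_2013061512."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    start -= timedelta(hours=start.hour % BUCKET_HOURS)
    return "tweets_" + start.strftime("%Y%m%d%H")

now = datetime.now(timezone.utc)

# Make sure the current bucket exists before writing into it.
session.execute(
    f"CREATE TABLE IF NOT EXISTS {bucket_name(now)} "
    "(tweet_id bigint PRIMARY KEY, payload text)"
)

# Drop the bucket that just fell out of the retention window; dropping a
# whole table reclaims disk much faster than waiting on row-level expiry.
expired = bucket_name(now - timedelta(days=RETENTION_DAYS))
session.execute(f"DROP TABLE IF EXISTS {expired}")
```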
So, a little bit about the current cluster we have backing this product. We've got 16 nodes in the cluster, a replication factor of three, and each node holds about 2.5 terabytes of data; I'm trying to remember if that includes the replication factor or not. But we had 40 billion keys, which turned out to be an extremely important number that we hadn't really looked into, because as we were growing it, like I said, it was an n-day archive: at first we scaled it up to 15 days, then to 20 days.
Everything was looking fine, and then every time we tried to get past 20 days we'd start seeing GC thrashing. We were trying to figure out why, so we started to look into the memory consumption aspects of Cassandra. I don't know if you guys were at the keynote earlier when Jonathan spoke.
He spoke directly to these things actually being addressed in future versions of Cassandra, but there are a couple of key components. When a read happens, it first checks a bloom filter, and the bloom filter is basically a piece of memory that will grow based on the data you have on disk. The partition key cache is a constant-sized set of keys that lives in memory, and then the partition index is another piece of memory that will grow with your data on disk.
We knew about our key cache and our row cache, and we were tuning those to appropriate degrees, but we really hadn't taken much of a look at the partition index or the bloom filter, specifically the bloom filter; we weren't very concerned about it. The bloom filter false-positive chance, I think it's called, is a setting, and out of the box I think it's like .007 percent or something, so we hadn't really read into changing it. But it turns out that with 40 billion keys, and I think Jonathan actually had the stat earlier today, it's something like 12 gigabytes of memory per billion keys, so 40 billion keys was a significant amount of memory we were holding up just for the bloom filter. The next one is the index interval. Basically, on startup,
Cassandra will look through your keys, grab every nth key, and pull that index into memory for each SSTable. That way, when you go to do a read, it can basically say: okay, you're looking for key 10, and I know where key 5 is in this SSTable, so it can seek directly there on disk and then scan forward from there for your key. Out of the box I think that one is 128, so every 128th key is brought up into memory. Again, we had 40 billion keys.
We were looking at 315 million keys in memory, so we bumped that up to every 512th key and we were down to 78 million, which was a lot better on our memory. With the bloom filter, we brought it up to a .05 percent chance of a false positive, which is fine for us, because we can determine on the client side whether or not something looks like a valid key; we can tell whether it's a well-formed tweet ID.
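For concreteness, this is roughly how those two knobs get set. The bloom filter false-positive chance is a per-column-family property; the index (sampling) interval was a cassandra.yaml setting in the releases of that era and later became a per-table property, so both forms are shown. Table name and values are illustrative, not the exact production settings:

```python
#!/usr/bin/env python3
"""Sketch: loosen the bloom filter and sample fewer index entries to cut
heap usage. Table name and values are illustrative only."""
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("archive")  # hypothetical keyspace

# Accept more bloom-filter false positives in exchange for less memory;
# a false positive only costs an extra disk seek, and the client already
# rejects keys that are not well-formed tweet IDs.
session.execute(
    "ALTER TABLE tweets_2013061512 WITH bloom_filter_fp_chance = 0.05"
)

# Older releases: sample every 512th key instead of every 128th by setting
#   index_interval: 512
# in cassandra.yaml. Newer releases expose it per table instead:
session.execute(
    "ALTER TABLE tweets_2013061512 "
    "WITH min_index_interval = 512 AND max_index_interval = 2048"
)
```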
Given our write pattern, updating rows all over the cluster, it wasn't a perfect use of Cassandra, and our read pattern was fairly intensive; we were throwing a lot of clients at it. We thought it would scale out, and it has; it definitely solved our problems once we were able to tune it enough. But you're not always going to fit the perfect description of how Cassandra should be used.
But if you learn how it works underneath the covers, and you use the, you know, 400 configuration options it has to tune it to exactly your use case, you can really get a lot of horsepower out of it. So explore those options: there's that huge configuration file, there are a lot of startup options for Cassandra, and you can set a lot of properties on your column families as you create them. After we figured out what those settings really meant and what we could do with them,
we were able to leverage the power of Cassandra more than we ever thought we could. Also, understand the consequences of your choices. When we decided to just start running repairs and throwing new nodes at the cluster, we didn't quite understand why it was trying to stream almost all of the data from the first node when we thought it was just splitting the key space; it turns out it was just trying to reconcile the data after it computed the Merkle tree.
We didn't really realize why at the time, though. Another big one we hit was: keep your staging environment and your production environment identical if at all possible. There are a lot of things you might not think matter. You might think, if I have half the nodes and I keep just half the data of my production environment, I should see the same things happen at the same times, and that's not always the case.
Keeping the staging environment and the production environment identical will help you run stress tests that exactly mimic what you'll see in production when your customer, or your publisher, or something like that stress-tests your system one day. And that kind of wraps it up. If we have any questions, I can answer them.