From YouTube: C* Summit 2013: Time is Money
Description
Speakers: Jake Luciani and Carl Yeksigian, Quantitative Strategists at BlueMountain Capital Management
Slides: http://www.slideshare.net/planetcassandra/jake-luciani-and-carl-yeksigian
This session will focus on our approach to building a scalable TimeSeries database for financial data using Cassandra 1.2 and CQL3. We will discuss how we deal with a heavy mix of reads and writes as well as how we monitor and track performance of the system.
To get started: my name is Jake Luciani, and this is Carl Yeksigian. We're from BlueMountain Capital; I'm an Apache Cassandra committer, and we're going to talk to you about how we're using Cassandra to build our time series database. Before we begin, though, I came up with a joke about Cassandra and I couldn't not share it. My joke is: what did the one Cassandra node say to the other Cassandra node after the network split?
I'm just going to go quickly over our use cases and architecture, and then I'm going to pass it over to Carl, who is going to take a deep dive into how we've tuned our Cassandra cluster and how we've dealt with certain problems, specifically compaction. So the talk is called "Time is Money," and there are really two aspects to that: time is money in terms of latency, and time is money because we have a time series database.
The main idea we want to start with is to describe our problem. We have a compute cluster of a thousand nodes which are constantly running calculations off of real-time market data, so it basically reads and writes as fast as it possibly can, and all of those reads have to be consistent across all readers. We also have a UI so that our team can run ad hoc queries, and everything has to run quickly. Oh, and we have multiple data centers.
So it's a really hard problem to handle: it's very read/write intensive, and Cassandra is really the only option for us. As a quick time series intro: I'm sure everyone's seen stock prices. That's the simplest example, the change of a price over time. But if you think about it, most things are time series, and databases like Cassandra do really well at dealing with time-ordered data.
You need to always get back the same exact data, so we never update data; sorry, we never overwrite data, we're always appending on the end. Our cross-sectional data is random: when you think about searching across any given fund or any given portfolio, you can't really partition that problem, because you're never going to know all the possible combinations of that data. That's difficult to do, which is why you need something that can scale horizontally.
So you kind of have to fan out to get that data. We also support this thing called bitemporality, which means there are two dimensions of time: the time at which something happened, and the time at which you learned about it. Say, for example, you're fixing a bad data point after two weeks of it being in the system; you want to be able to go back and run something with the old data point and with the new data point.
So our system optimizes for time series, and we're using Cassandra 1.2 and CQL3. With CQL3 it's pretty straightforward to define this type of table. We've simplified things a lot here, but the main table looks something like this: you have an identifier, you have a type (for example, price), and you have the time for the point. You give it a knowledge-time cutoff, asking for the last point before this knowledge time, and you get it with LIMIT 1. And you just run this in parallel, right: you send all of these off in parallel into the cluster.
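The actual table definition was on a slide; as a rough sketch, a simplified bitemporal table and knowledge-time query in CQL3 could look like the following (all names here are illustrative assumptions, not the real schema):

```sql
-- Hypothetical simplified schema: one wide row per (id, type),
-- clustered by observation time, then knowledge time.
CREATE TABLE timeseries (
    id             text,       -- series identifier
    type           text,       -- e.g. 'price'
    time           timestamp,  -- when the observation happened
    knowledge_time timestamp,  -- when we learned about it
    value          blob,       -- serialized payload
    PRIMARY KEY ((id, type), time, knowledge_time)
);

-- Last version of a point before a knowledge-time cutoff:
-- reverse the full clustering order and take LIMIT 1.
SELECT value FROM timeseries
WHERE id = 'IBM' AND type = 'price'
  AND time = '2013-06-11 00:00:00'
  AND knowledge_time < '2013-06-12 00:00:00'
ORDER BY time DESC, knowledge_time DESC
LIMIT 1;
```

Each such query hits a single partition, which is what makes the parallel fan-out across the cluster cheap.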
So what we built: none of our clients talk directly to Cassandra. We've built a Thrift service on top of it. Cassandra has this thing called the fat client, and what it allows you to do is sort of act like a Cassandra node without being responsible for any data; if you look in the examples directory, there's an example of this. What that allows us to do is cut down on the amount of serialization: our Thrift service is running Cassandra internally, but it isn't dealing with any data. It can just use the StorageProxy directly, so it can use the same internal protocol that Cassandra uses to talk node to node.
This makes it simple for us: we don't have to worry about connection pooling, token awareness, the dynamic snitch, any of that stuff. It's all built in, because it's running the same code as what's running in Cassandra, and we're passing the CQL through that layer. So our applications talk to our service, which is called Olympus. The other thing is, we don't store just price data; there's lots of different financial data.
There's reference data, and when you're running models you want to track the metadata about your model over time, so that information needs to be stored too. So we use Thrift structures, similar to how Google uses protocol buffers, to keep things on disk, and by using the Thrift union type we can easily reflect on what type of data we're pulling back. So our users will create a Thrift structure.
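The structure definitions live with the users; as an illustrative sketch (made-up types, not the actual BlueMountain definitions), a Thrift union over a couple of point types might look like:

```thrift
// Each kind of financial data gets its own struct; the union
// tag lets a reader reflect on which type it is pulling back.
struct PricePoint {
  1: double price
}

struct ModelMetadata {
  1: string modelName,
  2: map<string, string> parameters
}

union DataPoint {
  1: PricePoint price,
  2: ModelMetadata model
}
```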
So that's the easy part: that's our data model. Now I'm going to talk about how we actually scale this thing to handle all the reads and writes simultaneously. The first rule about scaling is that you don't just throw everything at it and let it run as fast as it can; you don't just turn it up to 11. That was actually Jake's slide, so I'm just going to throw him under the bus for that. OK, so, scaling: you've got to get fast machines, and you have to avoid GC pauses.
You have to tune Cassandra, and you have to prefetch and cache inside of your application. But before we get there: you can't fix what you can't measure. So we use Riemann, which is a really great stream-based monitoring system; you can easily push application- and system-level metrics, and Cassandra has a plug-in for this. Currently we push about 6,000 metrics per second into a single Riemann instance.
This is the little snippet of code that's required to actually hook Riemann up into Cassandra. There are two components to a Riemann setup.
One is the dashboard, which shows us the live stats that we currently have, and one is Graphite, which shows us historical stats. Using both of them gives us a good picture of what's wrong now versus how it has been acting in the past. And then VisualVM is a tool for the JVM which shows you a bunch of information about it: what the threads have been working on, what's going on with your GC. It also has an MBeans plugin.
So there are all sorts of great tools for it. Again, the right hardware is really important for us. We got SSDs for all of our hot data, and we're going to get spinning disks for all of our cold data. We use a JBOD configuration; that way we can use as many disks as possible, and we don't have to go through a RAID controller or do software RAID.
You want to get as many cores as possible; we have machines with 16 cores, and we're looking to get machines with something like 32 cores in the future. We currently have a 10-gig network, and we actually use bonded network cards, so we get 20 gigs on a single machine. And then, in order to actually take advantage of a 10-gig network, you have to use jumbo frames, so jumbo frames really helped us out.
Actually configuring Cassandra is another big part of making your application go quickly. There are three components that we went through. One was the configuration of Cassandra itself; I'm not going to talk about that at this one, because we only have about five more minutes, but we have a slideshow that we did in New York that has all the configuration that we did.
The next thing is compaction; that's actually a big part of the CPU time that we spend. And then compression: we actually switched to a faster compression algorithm called LZ4. So, we use leveled compaction. The good part about leveled is that it limits the number of SSTables you have to read.
Even though you have all these levels, you know exactly which SSTables will actually contain the data that you're reading, so it puts an upper bound on the number of SSTables you actually have to read. We use wide rows, which is why this is actually an issue for us. But under high write load, as the memtables get full, all of these SSTables get flushed out to disk.
So in your level 0, which is completely random data, you have to read every single SSTable in order to figure out what data you actually need to respond with. Our original attempt at fixing this, which is actually included in Cassandra 2.0, is this thing called hybrid compaction; it was just rolled into regular leveled compaction. What it does is, in the first level, rather than just doing the regular compactions up to level 1,
it takes a bunch of the SSTables and does the old size-tiered compaction. That way, you have fewer SSTables that you have to read when you're under this heavy write load. Now we have a new idea that I'm currently working on; it's called overlapped compaction. Basically, instead of taking your level 0 files, combining them with level 1, and then writing all new files into level 1, it just takes your level 0.
It does the regular compaction that it would, up to level 1, but it doesn't take the level 1 files, and at some point it switches back into the regular mode of taking all of your files at a level and combining them with the level above. But this way you don't have the bottleneck of having to take these files plus the files in the next level to do the compaction.
So this will allow us to get higher throughput, because we'll be able to do more parallel compactions. Another thing that we're working on is a C-optimized library. There's actually a lot of time spent on the read path doing comparisons and doing the CRC check for compression. The composite comparison actually eats a lot of cycles, and the CRC check is implemented on the chip for a lot of architectures.
So right now we have 16 nodes across two data centers. Our replication factor is six, but that's three in each data center, so we can lose a node in either data center, or we can lose an entire data center, and continue operating.
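In CQL terms, a keyspace with that topology could be declared something like this (the keyspace and data-center names are illustrative assumptions):

```sql
-- Replication factor 6 total: 3 replicas in each of two data centers.
-- With 3 replicas per DC, LOCAL_QUORUM needs 2 acks in the local DC,
-- while EACH_QUORUM needs 2 acks in every DC.
CREATE KEYSPACE olympus
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
  };
```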
We get 200,000 writes per second at EACH_QUORUM and 150,000 reads per second at LOCAL_QUORUM. The difference there is that writes have to be acknowledged by both data centers, while reads only have to be acknowledged by the data center that you're in. We currently store over 30 million time series, we have over 15 billion data points, and that takes up about 10 terabytes of data on disk. Read latency for the 50th percentile and the 95th percentile is one millisecond and five milliseconds, respectively.
[Audience question about jumbo frames, inaudible]
We didn't really measure the performance of jumbo frames; that's just our network configuration, in order to really utilize the 10-gig network. Without jumbo frames you can't actually utilize the whole 10-gig network that you get, so you're back to basically having a one-gig network.
We actually had a lot of problems with jumbo frames. Our original structure was that clients would talk to the Cassandra cluster directly, and there were a lot of packets that would get dropped between them. We had some machines that just couldn't talk to the Cassandra cluster at all. I don't actually remember how that got resolved, but the...