From YouTube: FamilySearch: Performance Tuning Apache Cassandra in AWS
Description
Speaker: Michael Nelson, Development Manager at FamilySearch
A recent research project at FamilySearch.org pushed Cassandra to very high scale and performance limits in AWS using a real application. Come see how we achieved 250K reads/sec with latencies under 5 milliseconds on a 400-core cluster holding 6 TB of data while maintaining transactional consistency for users. We'll cover tuning of Cassandra's caches, other server-side settings, client driver, AWS cluster placement and instance types, and the tradeoffs between regular & SSD storage.
Let's see. So I would be remiss if I didn't make just a short plug for why we do this. The Mormons have a doctrine that's a little bit different from many churches: we believe that all of mankind, even those who did not have the opportunity to hear about Jesus Christ, can be saved. And I'll say one more thing on that: if you have ever been up in the middle of the night and wondered what the purpose of life is, why we're here, whether there really is a God, whether he knows you and cares, I assure you that the answer to all of those is yes, and that there are logical, rational answers to those questions that bring peace. If you want more on that: mormon.org, or come see me afterwards.
Okay, so the Family Tree application currently has just over 900 million persons in it, with about 500 million relationships. We record things by changes, and I'll talk about the data model here in just a moment. There are 8.4 billion change-log entries, and about a million changes a day come into those. All of that equates to currently about seven terabytes in Cassandra, 13 terabytes in Oracle.
Our current system is on Oracle, and we have had an intimate and painful relationship with Oracle for quite a while now. When you're in a pinch and you need to scale Oracle, it is possible to fork over lots and lots of money for hardware and scale up, and that's what we have done for a number of years. So my group is a research project to get off of Oracle onto Cassandra.
So we went through a progression here. We first did a bunch of testing, decided we needed to do a side-by-side with Cassandra, and did that. We went with Cassandra for two main reasons: better performance, and administratively it was better; it was more stable.
So we are an online transaction processing (OLTP) system. Users expect their changes to show up right away, at least for them, so there's a transactional side that we have to maintain, and we have some interesting data-related performance issues. You might think you could assume that people usually don't have more than, say, 20 children.
However, everybody here probably has stories about crazy data that you've encountered. Well, we have people with thousands of children, and yes, there are interesting histories on how that happens, but there you go. Similarly, people have one set of biological parents, but again, through oddities in how people use the system, we come up with dozens of parents. So we have these issues that have been painful to resolve, because people still want the page to render and at least show them what's going on.
So, a couple of screenshots. This is what we call a fan chart. You see, there's me down there in the middle; that's the circle. I have two parents; that's the next ring. They each have two, and this goes back nine generations, so there are 511 slots on there. We want it to render like that, everybody's used to that, and you have to walk those relationships to generate that chart.
This is actually done through a partner; we provide APIs for everything we do, so we have many partners that offer things based on our data. Here's another screenshot. This is the interactive pedigree; the other one kind of gives you the forest, and now you drop down to the trees. There are two red rectangles here: the one on the left is kind of the default view when you come on, and then you can expand.
We currently have big performance issues trying to render this, just in pulling enough records out of Oracle and walking them to render it, and that's what we are out to fix, and scale at the same time. Here's a shot of the person page; again, those families can be an issue. It's not uncommon to have to load dozens of people just to show the spouse, children, parents, and siblings of a person on there. And then we've had some real issues with the change history.
Okay, the data model we use. This is our new data model, and it's event sourced. That means there are actually two data models: one for writing and one for reading. The journal, pictured here on the left, is a series of journal entries, and we always write to that. An example of a journal entry would be a change like "change the name of this person from Mike Nelson to Michael Nelson," or "add this relationship to a person."
Any kind of change always gets written to the journal as journal entries. They are immutable; they never change. So if you want to fix the record, you always append another journal entry. That does a couple of things: we're using time-based UUIDs as the key on those, and that means you can be inserting into the journal.
Multiple application servers can be inserting at the same time, and the entries will always resolve to a coherent view, so there's no locking at all necessary. So that's the writing. When you go to read, that always comes out of what we call views; there are two of them shown here for person one, views A and B. We actually have three views currently that we compute. One is a full person view.
That's the current snapshot of the person: all the information about them and everyone they're related to. There's a summary view, which is just basic information (name, birth, death, and the places for those things) and, again, who they're related to. And there's a change history view that we compute. So what that does is put all the business logic into when we write and when we compute the views; then, when there's a read, Cassandra has exactly what we need, and it's a simple lookup-and-return in every case. So we get the benefits of the denormalization there: the view has exactly what we need.
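As a rough illustration, here is a minimal sketch of that journal-plus-views shape using the DataStax Java driver. The keyspace, tables, and columns are invented for the example; they are not FamilySearch's actual schema.

```java
// Minimal sketch of the event-sourced model described above (illustrative
// schema, not FamilySearch's). Requires the DataStax Java driver 2.x.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class JournalSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("tree"); // hypothetical keyspace

        // Writes always append an immutable journal entry. The time-based
        // UUID key lets many app servers insert concurrently, no locking.
        session.execute(
            "INSERT INTO journal (person_id, entry_id, change_json) VALUES (?, ?, ?)",
            "person1", UUIDs.timeBased(),
            "{\"op\":\"rename\",\"from\":\"mike nelson\",\"to\":\"michael nelson\"}");

        // Reads never replay the journal; they hit a precomputed view row,
        // so each read is a single lookup-and-return.
        Row summary = session.execute(
            "SELECT summary_json FROM person_summary_view WHERE person_id = ?",
            "person1").one();
        System.out.println(summary.getString("summary_json"));

        cluster.close();
    }
}
```

The time-based UUID (`UUIDs.timeBased()`) is what lets concurrent application servers append without coordinating.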
Okay, so we're talking performance, so you need to know some basic information about us. Seventy-seven percent of our traffic is reads and 23 percent is writes, so that's pretty typical; we're very read-heavy. Our reads are almost always at consistency level ONE; there are a few cases where we're not. And our reads are very simple, while the writes are much more complicated: we're using atomic batches, and QUORUM as the consistency level for the writes.
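A hedged sketch of what that write path might look like with the DataStax Java driver: a logged ("atomic") batch executed at QUORUM. The tables and the JSON payload are illustrative assumptions, not the real schema.

```java
// Sketch (not FamilySearch's actual code) of the write path: several
// related rows updated in one logged, "atomic" batch at QUORUM.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class AtomicWriteSketch {
    static void addParent(Session session, String childId, String parentId) {
        PreparedStatement journal = session.prepare(
            "INSERT INTO journal (person_id, entry_id, change_json) VALUES (?, ?, ?)");
        PreparedStatement views = session.prepare(
            "UPDATE person_view SET parents = parents + ? WHERE person_id = ?");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        // One logical change appends a journal entry and touches the views
        // of every affected person; just the child is shown, for brevity.
        batch.add(journal.bind(childId, UUIDs.timeBased(),
            "{\"op\":\"addParent\",\"parent\":\"" + parentId + "\"}"));
        batch.add(views.bind(java.util.Collections.singletonList(parentId), childId));
        batch.setConsistencyLevel(ConsistencyLevel.QUORUM); // QUORUM writes, CL.ONE reads
        session.execute(batch);
    }
}
```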
So it's not uncommon for some operations to involve a dozen rows. For example, if you're adding a parent on a person, it's going to affect multiple people and multiple relationships, and the views for all of them, so there are a number of tables involved. And then there's business logic: there are some pretty basic things that we have to enforce, and they can be quite complicated to actually enforce. For example, we will not allow you to enter a relationship that will make you your own grandparent; that's obviously incorrect. We won't allow that.
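That rule is essentially cycle detection over the ancestor graph. Here is a minimal sketch of the kind of check it implies; `parentsOf` is a hypothetical lookup standing in for the real relationship views, and the actual business logic is richer than this.

```java
// Sketch of the "you can't be your own ancestor" rule: before linking
// childId -> newParentId, walk up from the proposed parent and reject the
// edge if the walk ever reaches the child.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class AncestryCheck {
    static boolean wouldCreateCycle(String childId, String newParentId,
                                    Function<String, List<String>> parentsOf) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        toVisit.push(newParentId);
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            if (current.equals(childId)) {
                return true; // the child would become its own ancestor
            }
            if (seen.add(current)) { // dozens of parents happen; visit each once
                toVisit.addAll(parentsOf.apply(current));
            }
        }
        return false;
    }
}
```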
Okay, so we've done a series of tests earlier this year on a 28-node cluster, learned a lot of things, and did a lot of optimizations on the application, and I'm just going to talk here about the optimizations that were Cassandra-related. Overall, we ended up with about a 10x improvement in what we could scale to, but I'm not going to share (maybe you'd be interested, but I'm not going to share) the application-specific optimizations.
So the tests earlier this year were on a big cluster: 28 nodes on Amazon, all of them hi1.4xlarge instances, so that's 450 cores in that cluster, and we got to 250,000 ops per second. Then I went back and did a series of tests in preparation for this presentation on an eight-node cluster, with all of the optimizations that I'm going to show you here.
At least most of them; most of them were not application-level. I was able to get up to 200,000 ops per second with just the eight nodes, so a little optimization goes a long way here, and those hi1.4xlarge instances are very expensive, so there's a lot of cost savings there. Okay, so here's the test system that I used.
I put the app servers up here, and purposely used more app servers and more load agents than were required, because the purpose was not to test the app here; I was trying to push Cassandra, so I wanted to make sure there was plenty of CPU and network for the app. And I should mention we're using Borland SilkPerformer to generate the load; we've also done some stuff with JMeter.
What I found overall: the two big gains for us were the row cache and making the driver token aware. And overall, Cassandra has been pretty stable for us. We've had a few incidents; we have our first Cassandra cluster in production right now (that is not Family Tree, not this app, it's a different one) and no big incidents yet. Overall, I've been pretty impressed with the stability. We're using SSDs for everything here, and the big advantage that we've found with SSDs is that repairs become a non-event.
So overall, it's about a 2x throughput increase. Okay, so: row cache. Yeah? I'm sorry? It's about seven terabytes on here, and replication factor three.
Yes, yes. So, the row cache. We have a good-sized data set, 900 million persons, but the working set is not nearly that size, and so on an eight-node cluster we can fit the working set in RAM; that's why I believe the row cache was such a big benefit for us. We put the views, which are small, in the row cache, not the journals themselves, and the reads are consistency level ONE. So that means all of that can live in the row cache, and we get hits and a nice bump from the row cache.
Latency went from 11 milliseconds down to about seven, with a 30-35 percent bump in how much the cluster could handle. And I should mention I was pushing the cluster as hard as I could in every case here: I took the load agents and said, push it just as hard as you possibly can. So all the nodes go yellow and OpsCenter thinks we're sick, but we get as much out as we can.
How many columns? In the views there are typically only five columns, I think. The journals are different: they have a column for every journal entry, so there can be thousands in those. What was your first question? Row size? For the views, it depends on the column family.
Okay, so the row cache has worked. The question was: DataStax has discouraged using the row cache, so what's our experience? Our experience has been great: a 35 percent bump in throughput, latency is down, and we haven't experienced any problems with it. A lot of that is that (they talked about this this morning, Jonathan did) it'll bring the whole row in, so if you have really big rows, you have an issue; we only put our views, which are small and contained, in there.
So it's worked great for us; I've been surprised to hear them talk it down. What was that? Yeah. Yes, we are very fortunate that we fit into it well. And the underlying theme, I think, of this whole thing ought to be: you need to test your app. Every app is different, and you need to go try it.
So here's how to configure the row cache; there are two sides to it. In cassandra.yaml you need to enable it: you have to tell each node how much memory to set aside for the row cache, and it keeps it off-heap. We did 32 gigs, and so at eight nodes times 32 gigs we had roughly 256 gigabytes of cache to use. And then you have to turn it on for the individual tables, and so, as I mentioned, we did it on our tables that have small rows, not on the big ones.
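Concretely, the two sides map to settings like these. The yaml key is the standard Cassandra 2.0-era setting; the table name is an illustrative stand-in, and note that the per-table syntax changed in Cassandra 2.1.

```
# cassandra.yaml, per node: memory set aside for the off-heap row cache
row_cache_size_in_mb: 32768

-- and per table, in CQL (Cassandra 2.0-era syntax; illustrative table name):
ALTER TABLE person_summary_view WITH caching = 'rows_only';
```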
And then here's a screenshot from OpsCenter when we were running. Okay, so one point to make here: you see we're getting about a 90 percent hit rate on the row cache, and the disk utilization, over on the top left, is zero, because everything fit in memory. Now, the interesting thing is that everything fit in memory before we turned on the row cache as well, so it was all coming out of the buffer cache.
Linux had cached all of the files into memory, so before we turned on the row cache, every read was hitting the key cache; Cassandra would know where the row was and go straight to the right spot in the files, and those were already in memory because Linux had cached them. So everything was in memory both before and after, and yet we got a 30-35 percent increase in throughput and a big reduction in latency with the row cache.
Okay, so that's the row cache; that was the first big win on Cassandra tuning. The next one was when we went token aware. This is another one where, although I've had DataStax people talk down going token aware, we saw a big improvement from it. With Cassandra, as I'm sure you know, when you go to do a read, the default load-balancing strategy is round robin: the client will pick a node, contact that node, and give it its query.
That node will then act as a coordinator: it contacts the replicas of that data, gathers the results, and returns them to the client. That network hop out to the replicas, and just the coordination of talking to multiple nodes, means you're adding to the load of those machines and introducing network latency. So cutting that out gave, again, a big reduction in latency, from seven-millisecond reads down to two-millisecond reads, and fifty percent more throughput.
That was when we went token aware. With token awareness, the client gets from the cluster which ranges of the keys are in which vnodes and where those live, and then directs the query to one of the replicas. That means the node it contacts has the data and, if it's consistency level ONE, can answer the question without consulting any other nodes. So you get a big gain on that one. So here's how to configure it.
You wrap the token-aware policy around a round-robin policy: the outer wrapper provides the token awareness that's needed, and then the round robin inside of that means round-robining between those three replicas. We've had a little bit of back and forth with the other team that's using Cassandra internally around round robin versus latency aware, where, with that one, the client will keep track of the latency to each node and direct the query to the one that has been responding the fastest.
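In the DataStax Java driver, that wiring looks roughly like the sketch below. The contact point is a placeholder, and the commented-out line is the latency-aware alternative he mentions.

```java
// Sketch of the client-side change: token-aware routing wrapped around
// round robin, so queries go straight to a replica for the partition key.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("10.0.0.1") // placeholder seed node
            .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
            // The debated alternative, which tracks per-node response times:
            // .withLoadBalancingPolicy(
            //     LatencyAwarePolicy.builder(new RoundRobinPolicy()).build())
            .build();
        // With token awareness, a CL.ONE read can be answered by the replica
        // it lands on, skipping the extra coordinator hop entirely.
        System.out.println(cluster.getMetadata().getClusterName());
        cluster.close();
    }
}
```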
Okay, finally, the last thing, and this was an incremental gain, a five percent throughput improvement: when we bumped up the concurrent reads. This one really surprised me; I thought for sure we would need to bump these up earlier. We went from 32 to 250 (I'm sorry, to 128; nope, 256) concurrent reads and concurrent writes allowed, and bumped up the native transport max threads to 256.
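In cassandra.yaml terms, the knobs he's describing are these; the values are the ones from the talk, and concurrent_reads and concurrent_writes default to 32.

```
# cassandra.yaml: concurrency settings raised for this workload
concurrent_reads: 256
concurrent_writes: 256
native_transport_max_threads: 256
```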
We have not changed that number; nope, we haven't. I was talking to the Sony guys earlier, and they've had some real issues with vnodes and lots of stuff. Are you from Sony? You're smiling.
Okay, so this has led to a little bit of a dilemma now, in that we're getting adequate throughput and we're probably not going to push a whole lot harder on this. But still a mystery for me is: where is the bottleneck? I haven't gotten to the bottom of it.
The network side has been a bit of a mystery. The highest I have ever gotten Cassandra to use in these clusters has been 800 megabits of the network, and hi1.4xlarge instances have a 10-gigabit network. They were deployed in the same, hmm, deployment request? That's not the right term. Placement group, there we go: the same placement group, so we really can get high throughput there. I've done iperf between all the different nodes and measured it.
Okay, so a little more on the network; and yeah, the network seems suspicious to me. There's this five-second cycle that I keep seeing in netstat, and I have tried, you know, they talk about tweaking these parameters, and I've tweaked them and have not been able to measure any impact from any of those.
And the client machines seem okay at the same time. So I'll show you that here. Let's see: the top of this is from tc, the traffic control utility, and you can see right here, this means that at this moment there are 870 packets in the queue waiting to be processed (that's the backlog) and two and a half megabytes.
The send queues are almost empty, and the receive queues have a bunch backed up in them. A couple of seconds later, it's looking fairly normal; this is kind of how you'd expect a healthy running network connection to look. I wish I could explain it; I can't. Amazon has their new enhanced networking, and I haven't played with that yet, so maybe that will help with this type of thing.
This is a hi1.4xlarge, so there are 16 CPUs, 60 gigs of RAM, and, oh yeah, the 10-gigabit network. The application machines are all moderate-network machines, but there are plenty of those, so there should be enough, and I can put iperf between them and the cluster and get lots of bandwidth. So I do not believe it's a simple lack of bandwidth.
Okay, so that is pretty much the results of what we have found as far as the Cassandra optimizations. We tried some other things that didn't get us anything, so I didn't include those. And I would emphasize again that your mileage will vary. Your application is different: the mix of queries, how many writes, the types of writes, the complexity of the writes, the types of data, the access patterns on the data. All of those things vary, and so you've really got to try your app and see what you find. Any questions?
Does any part of the application use a blob? Yes; almost all of our data is blob-like. We're storing it in text columns, but it is JSON: hex-encoded, compressed JSON that we're putting in there.
One at a time? We have found good success doing many parallel asynchronous reads; that's what we've done.
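A minimal sketch of that pattern with the DataStax Java driver: fire all the view lookups at once with executeAsync and collect the futures, instead of loading the dozens of related persons one at a time. The table and column names are illustrative assumptions.

```java
// Sketch of "many parallel asynchronous reads" against the person views.
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class ParallelReadsSketch {
    static List<Row> loadPersons(Session session, List<String> personIds) {
        PreparedStatement stmt = session.prepare(
            "SELECT person_json FROM person_view WHERE person_id = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (String id : personIds) {
            futures.add(session.executeAsync(stmt.bind(id))); // non-blocking
        }
        List<Row> rows = new ArrayList<>();
        for (ResultSetFuture future : futures) {
            rows.add(future.getUninterruptibly().one()); // block only to collect
        }
        return rows;
    }
}
```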