Apache Cassandra Cassandra Summit 2014, 10 Oct 2014

From YouTube: FamilySearch: Huge Online Genealogical Database Driven by Cassandra

Description

Speaker: John Sumsion, Software Developer at FamilySearch

FamilySearch hosts a collaborative family tree with over a billion editable records. The tree currently serves as many as 10,000 concurrent users at peak weekly load. These users come from across the globe and collectively maintain and enhance the tree around the clock. Recent efforts to port the tree from a relational database to Cassandra have resulted in drastically improved performance and scalability. The database consists of more than 5 billion records in journaled form, and we anticipate having over 10TB of live data available for user view & edit, with that data size growing significantly as our user base grows. The dataset has resisted sharding in the past, so the port involved rethinking the core data model. The model we chose retains the consistency that our users demand, and is able to be implemented without requiring ACID transactions. Specifically, the consistency model we chose combined a Convergent and Commutative Replicated Data Type (CvRDT and CmRDT) with Cassandra's atomic batch implementation to form the basis for a consistency model that met the demanding needs of the family tree application.

A

My name is John Sumsion. I work for FamilySearch, if you've ever heard of that. We are a genealogical database and website where we publish a lot of records and run a global family tree for the whole world.

A

This is the outline of the talk. I'm going to talk a little bit about the FamilySearch Family Tree so that you understand the data model and why it is the way that it is, and then we'll get into some details.

A

We are running right now on Oracle. We're using Oracle as a blob store, and we have persons and relationships, and we'll get into the detail on the scale in a minute. We have a very large single pedigree: in the family tree there are between eight and nine hundred million persons. They are lineage-linked in that pedigree; how they are related to each other is recorded.

A

We also host a very large collection of source document records, indexed and scanned in various countries. Records are kept of when people were born, when they were married, that kind of thing, and we digitize that, index it with the help of a large volunteer group of people, and publish it. We're in a genealogical library in Utah. FamilySearch is the family history department of The Church of Jesus Christ of Latter-day Saints, otherwise known as Mormons.

A

So if you haven't ever seen a Mormon before, here, I'm one. All right. If you want to know why the Mormons care about family history, you can visit this link, mormon.org/familyhistory, and I'll just let you do that based on your interest. The records, indexing, community, memories: all of that feeds into this family tree. Records die all the time. You would be very surprised: archives burn down, there's mass record destruction that happens all the time, there are floods, and people die.

A

The records that they collected during their life are not of value to anyone anymore and just get thrown in the garbage. So we're motivated to try to capture those records in a timely way and make them accessible.

A

We capture those records. Microfilm was a technology that we used a while back, but we're now capturing digital. We store those records also: three and a half billion indexed records. That's where we were at when I checked the site. We index about 30 or 35 million records a month.

A

We'll get to the Cassandra stuff; this part is very short. Memories: we let people put in stories about their ancestors and about their family. We also run a wiki site that gives research guidance. All of this leads to the family tree project that I'm going to talk about. All of that data ends up being accessible and useful to people when they're putting their stuff in the family tree. So, 500 million person records, and it's an open-edit system.

A

So if you want to change information about a person, similar to the way you would edit a Wikipedia article, you would go and say the person was born in 1908, not 1907. You found a new record, and you'd say why you think that way. The relationships between those persons, 500 million of them, are couple relationships and parent-child relationships. Those relationships are also open edit. In the history of the family tree since 2012, there are almost eight and a half billion change entries. We track changes, and right now it's live. You make an edit and you see the result.

A

There's no batch update process before you see the result, and we have some really interesting data-dependent performance issues, like persons who have a thousand sets of parents, or a person who has 500 spouses because there were 500 different copies of that person, similar to the last talk.

A

I was nodding my head the whole way through about fuzzy matching, because we do a lot of that. We then take that pedigree and, in interesting ways, and in ways that are really hard to do in one thread on an app server, we produce displays similar to this. This is a nine-generation pedigree.

A

It takes about 30 seconds to compute this and render it. There are 500 persons on that one thing, and we're running it out of Oracle right now, so that's at least 500 primary-key fetches and at least 500 global index scans to find related persons. This is an example of what the family tree looks like: the different persons are linked together by their lineage.

A

This page right here happens thousands and thousands of times every day, but it's really expensive to compute. There's also a screen called the ancestor page, where it shows related persons and the latest changes that have happened on that person. That's hard to render right now.

A

This change history page takes 2 or 3 seconds, which is horrible, so we're looking for a more performant solution. We're also looking to pay less in database license costs, and for something that meets the reality of our system but solves the performance problems for large families, for large change histories, and for large pedigrees. So I want to illustrate the index problem. In order to traverse from one person to another person in the current system, there's a relationship table that sits in between.

A

To query that relationship table, we have to have a great big global index. That global index is 300 million rows big, and a range scan on it, now that we've tuned it, costs somewhere around five milliseconds. So we do a primary key read on the person, a global index scan on the relationship, and another primary key read on the person. For the change history, there are eight and a half billion rows in that global index.

A

That costs a lot. So, taking a big step back, we were looking for other alternatives when we saw Cassandra and said, hey, let's do it. So we did a full-data-scale proof of concept. We imported all eight and a half billion change entries and looked at it, and we could not believe the result. It was just amazing. We'll get to the stats at the end, but we had to reinvent the data model and we had to reinvent a consistency model.

A

We were using optimistic concurrency, an update where the ID is this and the version is X, and that didn't work anymore with the distributed database. So let's talk about the reimplementation. We picked an event-sourced data model: the truth is in the journal and the view is computed from the journal. We keep that up to date incrementally, and we don't have to have any indexes. At the moment we're most of the way through the reimplementation.

A

So this is kind of a work in progress for us, but primary key lookups are the only thing that we do, and we were able to satisfy the consistency. This is the core of what I'm trying to present: the consistency and the data model. So take person 1 and person 2.

A

They share some of the same relationships. Let's say person 1 and person 2 are spouses, so relationship 2 would be a couple relationship. And let's say that person 1 and person 2 are parents of the same child, so relationship 3 would be that parent-child relationship.

A

We store the relationship in as many different person records as it is present in, and it's stored the same way in all of the persons. That is what let us get rid of the index range scans and the wide queries, and just go from one person to another person with primary key fetches.
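The denormalization described here can be sketched in a few lines of Python. This is an illustrative model only, not FamilySearch's code: the `store` dictionary stands in for a table keyed by person ID, and the names `Relationship`, `put_relationship`, and `neighbors` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    rel_id: str
    kind: str        # e.g. "couple" or "parent-child"
    persons: tuple   # IDs of every person the relationship touches


# Hypothetical stand-in for a table keyed by person ID:
# person_id -> {rel_id: Relationship}
store = {}


def put_relationship(rel):
    """Write the identical relationship under every person it touches."""
    for person_id in rel.persons:
        store.setdefault(person_id, {})[rel.rel_id] = rel


def neighbors(person_id):
    """Find related persons with a single keyed fetch, no index scan."""
    related = set()
    for rel in store[person_id].values():
        related.update(p for p in rel.persons if p != person_id)
    return sorted(related)


put_relationship(Relationship("R2", "couple", ("P1", "P2")))
put_relationship(Relationship("R3", "parent-child", ("P1", "P2", "C1")))
print(neighbors("P1"))
```

The cost of this layout is that every relationship edit has to be written once per affected person, which is the problem the consistency model in the rest of the talk addresses.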

A

We had to denormalize the relationships in order to be able to do that, but that created an additional problem: how do you keep that relationship up to date if it's stored in two different places? Who owns the truth for that relationship? I'll get to that. Also, we had to reimplement how we did change history. We have two different views for a person: the person view and the change history view.

A

The journal is the truth, but the person view is stored, and the change history view is stored in a single cell. That lets us have very quick change history responses: no global scan, no parsing the journal. It's just prepared and ready to go. It's stored in a ring buffer, so as new changes come in, old changes drop off.
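A ring-buffer change history like the one described can be modeled with a bounded deque. This is a minimal sketch; the class name `ChangeHistoryView` and the capacity of 3 are assumptions for illustration, since the talk does not say how many changes the real view retains.

```python
from collections import deque


class ChangeHistoryView:
    """Keeps only the most recent changes; older ones drop off."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards the oldest item once it is full.
        self.entries = deque(maxlen=capacity)

    def record(self, change):
        self.entries.append(change)

    def latest(self):
        """Return retained changes, newest first."""
        return list(reversed(self.entries))


history = ChangeHistoryView(capacity=3)
for i in range(5):
    history.record(f"change-{i}")
print(history.latest())
```

Because the whole buffer is one value, it can be serialized into a single cell and fetched with one keyed read.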

A

So now let's talk about the consistency model. There are three parts to it. First, we capture commands. Then we record those commands in a journal. Then we compute one or more views for that record. The command is there, even though Cassandra has atomic batches, because we wanted to capture what the user said as closely as possible to the web request when it came in, when they said, I want to change the name John to Johnny, or something like that.

A

The journal is the truth, though; the command is not the truth. Once it's recorded in the journal, you can reconstruct everything from the journal, and the views are computed from the journal. Let's go into detail on the command. It's write-once, written with quorum, different kinds of quorum for different edits, and there are two or three different command tables.

A

There's a pending, a completed, and an aborted table. If a command gets written into pending and the database crashes or whatever, and we haven't got it into the journal yet, we have a process that will scan pending and pick it up where it got left off, similar to how atomic batches work, but at our app level. And the way that a command gets written to the journal is always the same.

A

So you could take a command and write it to the journal any number of times and it would go to exactly the same key and exactly the same row, and that's how we pick it back up if it gets interrupted.
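The pending/completed flow and the idempotent journal write can be sketched as follows. This is an in-memory illustration under stated assumptions: the dictionaries stand in for tables, string IDs stand in for timeuuids, and the function names (`submit`, `write_to_journal`, `recover`) are hypothetical.

```python
pending = {}     # command_id -> command
completed = {}   # command_id -> command
journal = {}     # (entity_id, command_id) -> payload


def journal_rows(command):
    """A command always maps to the same keys, so rewrites are harmless."""
    return {(entity_id, command["id"]): command["payload"]
            for entity_id in command["entities"]}


def write_to_journal(command):
    journal.update(journal_rows(command))
    completed[command["id"]] = command
    pending.pop(command["id"], None)


def submit(command, crash_before_journal=False):
    pending[command["id"]] = command      # capture the command first
    if not crash_before_journal:          # flag simulates an interruption
        write_to_journal(command)


def recover():
    """Scan pending and finish any command that was interrupted."""
    for command in list(pending.values()):
        write_to_journal(command)


cmd = {"id": "c-1", "entities": ["P1", "P2"], "payload": b"marriage=1888"}
submit(cmd, crash_before_journal=True)
recover()
recover()  # running recovery again changes nothing
```

Because the journal key is derived only from the command, replaying a command any number of times leaves exactly the same rows behind.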

A

The schema is pretty simple: just a key, a timeuuid, and a binary blob. We ended up using a binary form of JSON, not MessagePack; my mind just went blank on the name. So, the journal: it's write-once. You don't ever update the journal. If you have any additional updates, you append a new journal entry.

A

We write journal entries in a batch. For that relationship case, with person 1, person 2, and R2 in the middle, R2 is stored under both person 1 and person 2, so one command will write two journal entries.

A

One journal entry, on one side, says the marriage date is 1888 instead of 1887, and over here we'll write the exact same journal entry, just with a different row key, a different partition. That journal entry is then able to bring both of the persons up to date with the new relationship.

A

So the journal itself is denormalized across persons. Each journal entry is stored as a separate cell, and the partition key is the person ID or the relationship ID. Because each entry is stored as a separate cell and the journal is size-tiered compacted, the cells get scattered across SSTables. We rely on compaction to try to get as much of the journal into one SSTable as possible for journal read performance, but journal read is actually not in the critical path.

A

If you're familiar with CRDTs: I forgot what the C stands for, but there's the commutative replicated data type, and the journal is commutative. That means if you have a cluster partition and you write one journal entry on one side and another journal entry on the other side, then when the partition heals they come in and just interleave together, because the primary key includes a timeuuid, and so they are unique.
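The commutative merge described here can be illustrated with plain dictionaries. This is a sketch under assumptions: `uuid.uuid1()` stands in for a Cassandra timeuuid, the two dictionaries stand in for the two sides of a network partition, and the helper names are hypothetical.

```python
import uuid


def new_entry(journal, person_id, payload):
    """Append-only write; the time-based UUID makes each key unique."""
    key = (person_id, uuid.uuid1())
    journal[key] = payload
    return key


def merge(side_a, side_b):
    """Union of entries; keys never collide, so nothing is overwritten."""
    merged = dict(side_a)
    merged.update(side_b)
    return merged


def replay_order(journal):
    """Deterministic order: by person, then by UUID timestamp."""
    keys = sorted(journal, key=lambda k: (k[0], k[1].time, k[1].bytes))
    return [journal[k] for k in keys]


east, west = {}, {}
new_entry(east, "P1", "name: John -> Johnny")
new_entry(west, "P1", "birth: 1907 -> 1908")

healed = merge(east, west)
print(len(healed))
```

Merging in either order, or merging the same side twice, yields the same set of entries, which is why the healed journal needs no conflict resolution.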

A

Partitions converge without conflict because of the timeuuid. So here's an example: here's a person.

A

The person is KWZ3-P71, and the command UUID is the journal entry ID, basically. The content is just a binary blob. Not big: journal entries are usually pretty small, somewhere around 500 bytes or something like that. So now, the view. There are multiple views for different uses. This is the key to how we were able to get the consistency model to work. Scanning the journal all the time and re-running the journal for every single page view would be horrible.

A

So we store views. When the journal is there and there's no view, we compute the view one time and then store it. Every other time that somebody comes back to read, they'll just read the view. The view is stored in a single cell, looked up by primary key. There's only ever one cell in that partition, and it's really fast to read.

A

But then you say, well, what happens when a new edit comes in for that person? That edit is incrementally updated into the view: the old view is deleted and the new view is written. There's a visualization of this. But the view is not canonical data. If, for whatever reason, there's a partition affecting the view table, perhaps the view gets written over here and also over here on the other side of the partition, and then they converge.

A

Well, now there are two views. In that case we rebuild a canonical view from the journal by rolling it up all the way from the bottom. So here's a visualization of that. Here's person 1, and there are two views for person 1. A journal entry gets written to all three tables: when a journal entry comes in, it doesn't just get written to the journal; it also gets written to all of the view tables.

A

That's so that the view can see what's new and incrementally refresh. When a new reader comes in, he sees the view is there and a new journal entry is there. That reader does the incremental update at read time, writes the result back out, returns that result to the user, and deletes the old journal entry and the old view from the view table. Now you're left with the journal and just a single view.
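The read-time incremental refresh can be sketched like this. It is a simplified model, not the production logic: integer sequence numbers stand in for timeuuids, the blob is a plain dictionary with a hypothetical `type` field, and `view_table` stands in for one view table.

```python
# entity_id -> {seq: blob}; a blob is either a stored view or a journal
# entry that was copied into the view table at write time.
view_table = {}


def append_entry(entity_id, seq, field, value):
    """Writers drop the new journal entry next to the stored view."""
    view_table.setdefault(entity_id, {})[seq] = {
        "type": "entry", "field": field, "value": value}


def read_view(entity_id):
    """Fold pending entries into the view, then store a single new view."""
    rows = view_table.get(entity_id, {})
    state, folded = {}, False
    for seq in sorted(rows):
        blob = rows[seq]
        if blob["type"] == "view":
            state = dict(blob["state"])
        else:
            state[blob["field"]] = blob["value"]
            folded = True
    if folded:
        # Replace the old view and the folded entries with one new view.
        newest = max(rows)
        view_table[entity_id] = {newest: {"type": "view", "state": state}}
    return state


append_entry("P1", 1, "name", "John")
append_entry("P1", 2, "birth", "1907")
first_read = read_view("P1")
append_entry("P1", 3, "birth", "1908")
second_read = read_view("P1")
```

After each read the partition holds exactly one row, the current view, so later readers pay for a single keyed fetch until the next edit arrives.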

A

In order to do this, we had to make the view table have exactly the same schema as the journal table.

A

ID, command UUID, and blob, so it's a pretty simple schema. All of what we do is inside the blob. That allows the journal entries to be written to the view table so that the incremental refresh can happen. That one thing, the views having the same schema, is the core of how we were able to get the consistency that we needed. In CRDT terms, the view is a convergent replicated data type: it takes app reconciliation to resolve partitions.
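The app-level reconciliation mentioned here, rebuilding one canonical view when a healed partition leaves two, might look like the following sketch. Integer sequence numbers again stand in for timeuuids, and `fold` and `reconcile` are hypothetical names.

```python
def fold(journal_entries):
    """Roll the journal up from the bottom into a current state."""
    state = {}
    for seq in sorted(journal_entries):
        field, value = journal_entries[seq]
        state[field] = value
    return state


def reconcile(view_rows, journal_entries):
    """If more than one stored view exists, rebuild from the journal."""
    views = [seq for seq, blob in view_rows.items() if blob["type"] == "view"]
    if len(views) <= 1:
        return view_rows  # steady state: nothing to do
    canonical = {"type": "view", "state": fold(journal_entries)}
    return {max(journal_entries): canonical}


journal_entries = {1: ("name", "John"), 2: ("name", "Johnny"),
                   3: ("birth", "1908")}
# Two views written on opposite sides of a partition, each missing an edit.
diverged = {
    2: {"type": "view", "state": {"name": "Johnny"}},
    3: {"type": "view", "state": {"name": "John", "birth": "1908"}},
}
healed = reconcile(diverged, journal_entries)
```

Neither diverged view is trusted; the journal alone decides the result, which is what makes the views disposable.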

A

The steady state, though, is one view of a given type per entity.

A

So, let's see I might have time.

A

In order to get the performance and scale we wanted, we needed to be able to look up by the partition key only, with no indexes other than the primary key index, and any cross-entity change had to happen in duplicate on all affected entities. That's basically the way we talk about it. And we had to store current-state views in order to have fast reads.

A

You can imagine that we have more data than in Oracle. We did a dump of the Oracle data and it's somewhere around five terabytes, I think. When we import it into Cassandra, it's going to end up being bigger.

A

We haven't written the migration code for all of the types yet, so right now we're running at around 7 to 8 terabytes. So there is definite data expansion from denormalizing the relationships, writing journal entries to not just persons but also relationships, and duplicating the journal entries. But it's not that bad; it's not like 10x of what it was in Oracle. And with the strength of Cassandra, just being able to throw disks at the problem, it's just amazing.

A

The other thing is that we wanted to have some flexibility in the schema.

A

In our Oracle situation, if we tweak what's included in the blob, oh no, we have to do this massive fix-up over 900 million persons and 500 million relationships, and we do those fix-ups like every month: there's a new fix-up that the data quality team has to go run against all of those records. Well, now, if we want to change how the view looks, it's easy. We invalidate the views and let the refresh just roll it up from the journal. And we don't have just one view, which is the situation in Oracle.

A

We can have multiple views. There's the big view with all the person data and all the relationships. There's also a different view for traversing the pedigree. This one is much lighter weight: it has a subset of the data and it's a total duplicate of the other one, but it lets us do the pedigree traversal much, much faster. That's the flexibility of having multiple views of the same journal data, of the same truth.

A

That lets us have greater agility in our development process. The views are disposable if we want to tweak them over time. So now, what we also had to solve was a pretty interesting set of things. We have business rules that disallow certain kinds of edits, like saying a person is a parent of themselves. We wanted to be able to say, no, that's not a valid edit. But this is a distributed database.

A

We didn't want to introduce any kind of locking mechanism to enforce those kinds of business rules, and we also wanted to make it so that two concurrent writers couldn't produce an invalid state. So we ended up with this journal entry writing mechanism: a read first, to double-check that the edit is going to be okay, then a write, and then a read after.

A

This is like the bad anti-pattern, you know, for a system, but it's what we had to do in order to get the business logic to apply. We do these reads at quorum, local quorum, or each quorum for the distributed case. There are only certain kinds of edits for which we have to do each quorum; we can survive with local quorum for the most common kinds of edits. So we take the consistency level and we do a read of all of the affected entities.

A

We apply our business rules before we do the write, then we do the write, and then we read again to see if a concurrent writer created an invalid state along with us. If there is an invalid state, we assume we were the problem, and we go back into the journal and write a new journal entry that says that the last journal entry I wrote, journal entry A, is no longer valid.

A

That causes a full refresh to happen, where A is the journal entry that was bad and B is the journal entry that says A is bad. We strip those out and come up with our prior state, so the view has the state from before that bad write. We also have this concept of appropriate quorum. We support merging persons together; that's a really complicated journal entry that has a lot of different things going on inside it in order to ensure that the resulting state is okay.
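The read, write, read sequence with a compensating journal entry can be sketched as below. This is an in-memory illustration only: the list `journal` stands in for the journal table, the rule checked (nobody is their own parent, and no two persons are each other's parent) is a simplified example rule, and the `interleave` hook is a hypothetical device for simulating a concurrent writer.

```python
journal = []  # append-only list of journal entries


def current_state(entries):
    """Parent links that have not been retracted by a later entry."""
    retracted = {e["target"] for e in entries if e["type"] == "retract"}
    return {(e["parent"], e["child"]) for e in entries
            if e["type"] == "add_parent" and e["id"] not in retracted}


def is_valid(parents):
    """Example rule: no self-parent and no mutual parenthood."""
    return not any(p == c or (c, p) in parents for p, c in parents)


def add_parent(entry_id, parent, child, interleave=None):
    # 1. Read before: refuse edits that would already be invalid.
    if not is_valid(current_state(journal) | {(parent, child)}):
        return "rejected"
    if interleave:
        interleave()  # simulate another writer slipping in here
    # 2. Write the journal entry.
    journal.append({"type": "add_parent", "id": entry_id,
                    "parent": parent, "child": child})
    # 3. Read after: if the state is invalid, assume we caused it and
    #    append an entry that marks our own entry as no longer valid.
    if not is_valid(current_state(journal)):
        journal.append({"type": "retract", "id": entry_id + "-retract",
                        "target": entry_id})
        return "retracted"
    return "ok"


def concurrent_writer():
    journal.append({"type": "add_parent", "id": "e2",
                    "parent": "B", "child": "A"})


outcome = add_parent("e1", "A", "B", interleave=concurrent_writer)
```

Nothing is ever deleted from the journal; the bad entry and its retraction are both skipped when the state is rolled up, which restores the prior state without locks.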

A

We were planning on a multi-region deployment, and so we had to do each quorum for that in order to make sure that it didn't leave things in an inconsistent state. Question?

A

The question is: there are two persons, A and B, and we're creating a relationship between A and B saying that A and B are married.

A

What if, right after we write the journal entry to A, our app goes down and we never get the write to B? In that case, when we're writing journal entries, that write in the middle between the reads, we do the write with atomic batches, so logged batches. Cassandra has that feature; I don't remember when it showed up.

A

I think it was in 1.2. We basically take all of the inserts for A and B, put them in a batch, hand that off, and say, make sure that both of these eventually get written.

A

We leave it up to the database to ensure that they're both written. Another question?

A

So let me try to restate the question: is there a piece of software that makes sure that all of the views are up to date after a write? Did I capture that right?

A

Yep, and that also applies the business rules. Yes, that piece of software does exist. It doesn't all happen at write time, though. When we write to person A, let's say there are three different views. We actually use one of those views during this read-write-read business logic enforcement, but we don't use the other two views. The journal entry is still left sitting there in the view table, not yet rolled up. It's the first reader that comes in that's responsible for doing the incremental refresh.

A

Yes. The logic on the app server is the thing that notices a new journal entry, notices that an incremental update has to happen, does that, and writes the result back out. Did that answer your question?

A

Yep. This is not a data-level replacement; it's an app-level replacement.

A

Yes, another question.

A

The question is: do you always have to read the journal to know that the view is correct? The answer is no, you don't have to, because when we wrote the journal entry we didn't write it only to the journal. We also wrote the journal entry to the view table itself, and so it's enough to just read the view table to say, what's my current view? Then, if I notice that there are newer journal entries, I can incorporate those into the view without having to go back and read the journal. Did that answer it? Great.

A

Let's see. I think I have 15 more minutes, and I want to save some time for questions at the end, but I wanted to show some of the experience that we had with Cassandra. We built the full prototype on 1.2. We switched over to 2.0 late in the game and didn't have any problems with that. Excellent performance, easy cloud setup.

A

We got some great response from DataStax developers when we had trouble with the DataStax AMI, and we ended up doing our bulk load straight through CQL 3. We didn't do anything with computing SSTables in Hadoop beforehand and loading them in that way; CQL 3 was fast enough for us. We are going to deploy inside an AWS VPC, and the DataStax AMI didn't work for us for that, so it cost us a little bit more to figure out our own deployment.

A

So, the bulk load experience. We had eight and a half billion change log records in Oracle. We wrote a migration piece of logic that turned that into 5.8 billion journal entries, and that ended up being two and a half terabytes, LZO-compressed.

A

The cluster that we imported into was hi1.4xlarge on AWS; each one of those nodes had two one-terabyte SSDs. It took 11 hours to import into a five-node cluster. It took five hours to import into a 30-node cluster.
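As a back-of-the-envelope check, the import times quoted here imply the following average rates. The calculation assumes, which the talk does not state explicitly, that each run loaded all 5.8 billion journal entries.

```python
JOURNAL_ENTRIES = 5_800_000_000  # figure quoted in the talk


def average_rate(hours, nodes):
    """Return (cluster-wide, per-node) entries per second."""
    total = JOURNAL_ENTRIES / (hours * 3600)
    return total, total / nodes


five_total, five_per_node = average_rate(hours=11, nodes=5)
thirty_total, thirty_per_node = average_rate(hours=5, nodes=30)

print(f"5 nodes:  {five_total:,.0f}/s total, {five_per_node:,.0f}/s per node")
print(f"30 nodes: {thirty_total:,.0f}/s total, {thirty_per_node:,.0f}/s per node")
```

Under that assumption, six times the nodes gave roughly 2.2 times the cluster-wide rate, which would suggest the loader or the network, rather than the cluster, was the limit on the larger run.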

A

We didn't really tune the database very much, but we got a hundred and forty writes a second over 128 writer threads, and we ended up getting the best performance by using unlogged batches for our inserts. Instead of inserting each one in a separate insert statement and a separate round trip, we just used an unlogged batch of 20 inserts and said, hey, do this. We inserted with consistency level ONE, and the data came in just fine.
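Grouping inserts into batches of 20 is simple to sketch. The helper below only does the chunking; actually sending each chunk as an unlogged batch would need a Cassandra driver and a running cluster, so that part is left out. The name `batches` is hypothetical.

```python
from itertools import islice


def batches(rows, size=20):
    """Yield consecutive chunks of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


rows = [("P1", n, b"blob") for n in range(45)]
chunks = list(batches(rows))
print(len(rows), "single inserts become", len(chunks), "round trips")
```

Each chunk would travel in one round trip instead of twenty, which is where the throughput gain described in the talk comes from.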

A

It took about an hour after the import was done for the size-tiered compaction to catch up on the SSDs on the journal: four terabytes of disk after importing the journal. With OpsCenter Community, we had trouble getting the repairs automated; I think that's just a pain spot for everybody.

A

So after the data was imported, we ran something to compute all the views, and we had calibrated our load tests based on the logs of what was going on in production, and it just totally blew our minds. We were able to get 25 times our production load. On the production system right now, we have only a little bit of headroom.

A

Let's say 10,000 more users joined the site and started being active users of it on a weekly basis: we would have some serious issues.

A

Well, I guess I can't really say that it's that bad, but we know we have a limited amount of headroom; we don't know how much it is exactly. Seeing that we could go to 25 times our peak load on commodity hardware out in AWS, the proof of concept was a success, let's just say it that way. People talk about the row cache not helping. Well, it certainly helped for us.

A

There's another talk at two o'clock, I think, by somebody who's going to present on tuning on AWS, and he's going to talk about the row cache. We ended up running the active data set with mostly no disk access. We ended up getting bottlenecked on the interconnect between the Cassandra nodes with just a round-robin client. We ended up doing token-aware round robin and got 50 percent more throughput.
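The idea behind token-aware round robin can be shown with a toy hash ring. This is a simplified sketch, not how any particular driver implements it: MD5 stands in for Cassandra's partitioner, each node owns a single token, and `TokenAwareRouter` is a hypothetical name.

```python
import hashlib
from bisect import bisect_left


class TokenAwareRouter:
    """Send each request to a replica that owns the partition key."""

    def __init__(self, nodes, replication_factor=3):
        self.ring = sorted((self._token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]
        self.rf = replication_factor
        self.counter = 0

    @staticmethod
    def _token(value):
        # Stand-in hash; a real cluster uses its configured partitioner.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def replicas(self, partition_key):
        """Walk the ring clockwise from the key's token."""
        start = bisect_left(self.tokens, self._token(partition_key))
        size = len(self.ring)
        return [self.ring[(start + i) % size][1] for i in range(self.rf)]

    def pick(self, partition_key):
        """Round robin, but only among the replicas for this key."""
        owners = self.replicas(partition_key)
        node = owners[self.counter % len(owners)]
        self.counter += 1
        return node


router = TokenAwareRouter([f"node{i}" for i in range(6)])
owners = router.replicas("P1")
picks = [router.pick("P1") for _ in range(6)]
```

A plain round-robin client would often hit a node that does not hold the data, forcing an extra hop between nodes; choosing among the owners avoids that interconnect traffic.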

A

OpsCenter was great for visibility into what was going on, and SSDs made it so that we could do a full repair, a rolling repair, while our load tests were running at at least 1x production load. So we're definitely going to deploy on SSDs. This is just a visualization that's probably hard to see from the back, but there are four sets of numbers.

A

The bottom tick is the 50th percentile, the middle tick is the 90th, and the top tick is the 95th percentile. This is from the perspective of a service-level request. This is not a web page; this is just the kind of service-level request we need in order to support the features. At the bottom is where we ended up; it's a log-scale graph.

A

Our current system is at the top. For the person and relationships for the ancestor page, the 50th percentile was somewhere under a second and the 95th percentile was somewhere under 10 seconds.

A

These three graphs are 1x, 10x, and 20x. We saw a little bit of degradation as we went up, but very close clustering, so we were very, very satisfied with the 20x numbers.

A

We're working on the implementation and rollout. We're basically just working on the migration pipeline and reconciling that with the existing truth, figuring out a reconciliation strategy so that we can have incremental update. We're going to flip the switch, not trying to do everything bulk loaded; we're going to practice incremental as we go, and then there will be a write outage while we flip the switch. The consistency model that I talked about is separate code.

A

I haven't got permission to open source it yet, but if there are people who are interested, please come up and talk to me after, and I'd be happy to try to run that up the pole. Right now FamilySearch isn't an open-source-first kind of place. I wish it were, but I would be more than happy to make the attempt if you're interested in the code. So let's see, I think I have two or three minutes for questions. Any other questions?

B

With the number of journal entries that you have for an entity in the system, do you worry about rows getting too wide?

A

B

Getting too wide and what are you, the.

A

The widest row is about 10,000 cells.

A

We haven't seen a situation where the number of journal entries would go into the hundred thousands in the foreseeable future, but if it gets to twenty or thirty thousand, we're going to chop it off, summarize the old history, and throw it away or archive it. Yep?
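One way to "chop off and summarize" a long journal is to replace the oldest entries with a single summary entry that carries their rolled-up state. The sketch below is an assumption about how that could work, not the team's design: the `limit` of 20,000 echoes the threshold mentioned, while `keep` and the entry layout are hypothetical.

```python
def fold(entries):
    """Roll entries up into a state; a summary entry seeds the state."""
    state = {}
    for entry in entries:
        if "summary" in entry:
            state.update(entry["summary"])
        else:
            state[entry["field"]] = entry["value"]
    return state


def trim(entries, limit=20_000, keep=1_000):
    """Past `limit`, collapse all but the newest `keep` entries."""
    if len(entries) <= limit:
        return entries
    old, recent = entries[:-keep], entries[-keep:]
    return [{"summary": fold(old)}] + recent


entries = [{"field": f"f{i % 7}", "value": i} for i in range(25_000)]
trimmed = trim(entries)
print(len(entries), "->", len(trimmed))
```

The rolled-up state is unchanged by trimming; what is lost is the per-edit history of the summarized entries, which is why archiving them first may be preferable to discarding them.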

C

I could see, I think, a cool feature on the platform would be, as people enter data, to have a timeline of who's edited the data in some of these records over time. [partly unintelligible] The archive is always some snapshot in time. It'd be cool.

A

Yeah, the timeline of edits. We're exposing a timeline for the person, but a timeline for edits is also a great thing. Thank you for coming.