From YouTube: C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to Cassandra
Description
Speaker: Mohit Anchlia, Architect at Intuit
Slides: http://www.slideshare.net/planetcassandra/3-mohit-anchlia
This session covers Intuit's journey with its Consumer Financial Platform, which is built to scale to petabytes of data. The original system used a major RDBMS; from there, we redesigned it to use the distributed nature of Cassandra. This talk walks through our transition, including the data model used for the final product. As with any large system transition, many hard lessons were learned, and we will discuss those and share our experiences.
I'll give some background on the products that Intuit develops, then we'll go into the problem statement, and we'll also talk about why we developed the Consumer Financial Platform and why we chose Cassandra. We'll look at the CFP stack — CFP is basically the Consumer Financial Platform — and I'll briefly go over the CFP data model. We'll also look at some of the learnings we had in production, and then we'll open up for Q&A. However, if you have any questions in between, just raise your hand and I can answer them.
So, to give you some background on Intuit: it is the maker of TurboTax, and we make QuickBooks and other small-business products. I'll pause here and ask a question: how many of you here file your own taxes? Okay. And how many of you here use TurboTax? Good. And how many of you here like TurboTax — say yes! Okay. So basically we have various services that work behind the scenes. They're obviously not visible to the end user, and they work together to deliver an awesome product experience.
But what has happened over the years? About two years back, we started to look at Intuit overall and took an account of all the services. What we found was that we had many services built over the years, and as these services came along, they created their own technology stacks. Basically, every service that got built had its own databases, its own operations team, its own machines and hosts, and all of those things; and as new sites came along, they also coded directly to these services. That led to some of the problems we started seeing. One of them is code duplication: there are certain core services that our products need, and mostly any new product, or even an existing product, would go to these services, and we had code duplication across all the sites.
This also created data silos, and it became really difficult. If you have to analyze massive amounts of data, how do you do that? Now you have to coordinate between different teams. You have to ask for a certain set of queries, then you have developers go and run those queries, and you have to coordinate across various teams. That's very difficult to do. Other problems arose if and when we had schema design changes — say, an XML schema change.
One of the other things we noticed was the time it took for new sites to come on board and build new products. Basically, there was added overhead: if you are trying to put something out in production, you first have to build against these core services. So the prototyping itself wasn't that fast, and we wanted to eliminate that as well. [Audience question] Go ahead — oops, I'm sorry! These are websites — so "sites" means websites, for instance TurboTax Online.
Okay, so looking at that, we knew we had to do something here. We had to make some changes so that it's easy for client teams, easy for sites, easy for products to integrate into the platform. So we said, why not introduce a platform? This platform is going to abstract most of the services, most of the functionality, out from the sites and from the products. This is similar to any other service-oriented architecture concept — I mean, we were using that concept somewhat before, but this one really integrates all the services and provides a services tier and a data tier, so sites and products and apps don't have to worry about creating their own services. They don't have to worry about having their own operations team. They don't have to worry about managing data, and so on. Developers just focus on writing business logic, and it also speeds up the process of developing new products.
So by doing that, the benefits we get are: now, since you have your data in one place and your logic in one place, you can provide highly personalized services to end users. You understand your end users' behavior even better, which helps in giving more personalized services, and it also helps in delivering secure services to them.
It also helps in bringing new sites and new concepts into production quickly. You don't need an operations team to do that; it's all managed by the platform. So overall, the reason we developed this platform is so that we can abstract out a lot of the services; the functionality that deals with orchestrating the services over the data, as well as some of the functionality between the site and the data, is completely abstracted out in this manner.
[In response to an audience question] Yes, we do have guiding principles for those, and depending on what kind of data it is, we follow different security rules for that type of data, and access is defined based on that. That's why I mentioned that by having this data together, you can do it much more effectively. You don't have to worry about coordinating between various services to make sure they are doing the right thing.
Yeah — so the platform's users are websites and products, but obviously our end users are the people filing taxes using those products. So what it really means is that, since it's a platform, we manage the entire infrastructure — we manage the infrastructure for our sites and for our products. That's what that means.
So this platform is kind of a mix of platform-as-a-service and software-as-a-service. I'll talk about the data platform tier, which is more relevant for this session. Some of the principles we laid out for the data platform tier were that it needs to be highly available, highly scalable, fast, easy to operate, and a software-only solution. Software-only is really important, because you might be operating in your own data center, or you might go out to Amazon or any other data center.
So appliances were out of the question, and we also needed to support structured and unstructured data — and these are massive amounts of data. The projection for just this platform alone, for the next two to three years, is multiple petabytes, and we are about to hit that in the first year. So it's massive amounts of data, and we have to support five nines of an SLA. Easy, right? You can do all that.
So we said, okay, let's look at a traditional RDBMS and see if we can find a solution there, because that's something everyone is familiar with. It provides a SQL interface, it has all the common tools, and it has been used for several years. So we looked at it. But based on our experience, we know there are challenges with a traditional RDBMS. When your database starts out new, all is good; as you add more data, your database grows, your indexes grow, and you start to have slowness. There are various reasons why you could have slowness: volume is one, locking is another, and you might have a slow I/O subsystem. In traditional databases, you would generally use either a NAS or a SAN solution, but that is still shared with other applications as well. And then you have a big day, and there's a crash.
Sharding is good, it works well, but there are some problems. When you have huge amounts of data, you obviously need a lot of instances, but that's not the problem. The problem is that you have to manage all the sharding logic in your code. Then you have to worry about how to avoid a single point of failure, so you add slaves. Once you add slaves, your operations change dramatically: you are just multiplying the same number of nodes, and you have slave nodes you're not even writing to — which itself is not a good pattern. And then, if you have a failure, you have to worry about failing that node over; once it recovers, you have to worry about failing it back; and then you have to worry about syncing it, not to mention everything else.
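The sharding burden described above — routing logic living in application code rather than the database — can be illustrated with a minimal sketch. The modulo routing and the in-memory "shards" here are hypothetical stand-ins, not Intuit's actual code:

```python
# Minimal illustration of application-managed sharding: the application,
# not the database, decides which shard holds a given user's rows.

def shard_for(user_id: int, num_shards: int) -> int:
    """Hash-style routing the application must maintain itself."""
    return user_id % num_shards

class ShardedStore:
    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        # Dicts stand in for connections to separate database instances.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, user_id: int, row: dict) -> None:
        self.shards[shard_for(user_id, self.num_shards)][user_id] = row

    def get(self, user_id: int):
        return self.shards[shard_for(user_id, self.num_shards)].get(user_id)

store = ShardedStore(num_shards=4)
store.put(42, {"name": "alice"})
print(store.get(42))  # → {'name': 'alice'}
```

Note that changing `num_shards` silently invalidates every routing decision already made — one concrete form of the operational pain the talk describes.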
So NoSQL — easy? Not that easy. You look at the NoSQL definition and, first of all, when we looked at it, we cringed. We said: how is this going to work? It doesn't have any ACID guarantees, right — atomicity, consistency, isolation, and durability. It's something that takes time to get over. Transactions have been used historically, but there are ways you can do the same thing without having transactions, and it works really well if your workload is a one-transaction, single-row-update type of workload, because in that case you can manage it one way or the other. It still is not the same as taking a lock and having a transaction, because it doesn't provide any rollback capabilities in Cassandra.
As you know, atomicity in Cassandra is limited to a given row. If you are making any changes within a single row, Cassandra makes sure it's atomic, but you cannot combine multiple inserts into one transaction — that's not there, and for good reason: transactions are inherently one of those things that limit your scalability. There are ways you can do transactions built on top of Cassandra — there are some tricks you can follow — but essentially it's something you should try to keep away from.
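The distinction above — one row updates atomically, multiple rows do not — can be sketched with a toy in-memory model. This is an illustration of the semantics, not the Cassandra API:

```python
# Toy model of Cassandra-style atomicity: all columns written to ONE row
# apply together, but writes spanning two rows are independent operations
# with no transaction (and no rollback) tying them to each other.

table = {}

def write_row(key: str, columns: dict) -> None:
    """Atomic per row: the row is swapped in whole, never half-updated."""
    new_row = dict(table.get(key, {}))
    new_row.update(columns)
    table[key] = new_row  # single assignment = all-or-nothing for this row

# One row, several columns: applied atomically.
write_row("user:1", {"balance": 100, "updated": "2013-06-11"})

# Two rows: two separate writes. If the process died between them,
# the first write would remain -- nothing rolls it back.
write_row("acct:a", {"balance": 50})
write_row("acct:b", {"balance": 150})
```

This is why workloads that touch one row at a time (as the talk describes next) fit Cassandra naturally, while cross-row invariants need client-side handling.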
So we looked at our core use cases, and what we found was that we actually don't really need transactions, because we have an OLTP type of workload: it's mostly single-row inserts and updates, and in most cases we don't actually need a transaction. What we did is put logic on the client side.
So, for instance, if we had a failure — if a change was made to the database that was invisible to the client, and there was a failure on the client — then on the next read we'd detect that. Basically, we check for it and throw an error back to our clients.
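One hedged reading of that client-side check: tag each write with a client-generated version, and on the next read compare what the store holds against what the client believes it wrote. The class and field names here are illustrative, not the actual CFP code:

```python
# Illustrative read-time validation: the client remembers the version it
# intended to write; a later read detects a mismatched write and surfaces
# an error instead of silently returning unexpected data.

class StaleWriteError(Exception):
    pass

class CheckedClient:
    def __init__(self, store: dict):
        self.store = store                 # stand-in for the database
        self.expected_versions = {}        # client-side bookkeeping

    def write(self, key: str, value, version: int) -> None:
        self.expected_versions[key] = version
        self.store[key] = {"value": value, "version": version}

    def read(self, key: str):
        record = self.store[key]
        expected = self.expected_versions.get(key)
        if expected is not None and record["version"] != expected:
            raise StaleWriteError(
                f"{key}: saw v{record['version']}, expected v{expected}")
        return record["value"]
```

A client that reads back a version it did not expect raises `StaleWriteError`, which is then propagated to the caller — mirroring the "detect on the second read and throw the error back" behavior described above.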
MongoDB is really good when you are starting up. It's easy to set up and pretty straightforward: you just have a simple JSON schema that you use to ingest data. But the problem really occurs when you have huge amounts of data and you have to scale your nodes; at that point you start to see issues with MongoDB.
The way MongoDB handles certain things — like setting up replication and sharding data — is, I would say, not ideal. Going back to our principles, we wanted something that's easy for our operations team to manage, so it's something we couldn't use because of that. HBase, again, is really good. It's equally fast — we did benchmarking between HBase and Cassandra, and they were pretty comparable — except that HBase favors consistency over availability, and one of our principles obviously was to have a highly available system.
So HBase is something we decided not to use, but we do use it for our analytics platform. Cassandra — so why Cassandra? You can scale it horizontally and it's highly available. You can design it such that it has no single point of failure, and it's easy to set up clusters between data centers.
You can just add nodes. All you have to do is define the configuration in a properties file and throw it out there; as long as you have the proper ACLs open, your cluster is up and running. It provides fast snapshots, and one of the other things is that you're able to do rolling upgrades — from, I think, Cassandra 1.0 onwards, you can do rolling upgrades without worrying about backward compatibility — and operationally, this is really key.
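The "define the configuration and throw it out there" step boils down to a handful of settings per node. A minimal sketch in the style of `cassandra.yaml` (the values are placeholders; the key names follow the standard Cassandra configuration file, which may differ from the properties file the talk refers to):

```yaml
# Minimal per-node settings, cassandra.yaml style (placeholder values).
cluster_name: 'cfp_cluster'
listen_address: 10.0.0.12            # this node's own address
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.10,10.0.0.11" # well-known nodes to bootstrap from
endpoint_snitch: PropertyFileSnitch  # maps nodes to data centers and racks
```

A new node pointed at the seed list discovers the rest of the ring by gossip, which is what makes "just add nodes" work.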
Once you are in production, it's really important that your operations team is able to look at Cassandra's performance, is able to do operations on Cassandra, is able to install and upgrade Cassandra quickly. And if you give it the right hardware, you can get really good low-latency response times out of Cassandra.
It helps us orchestrate business logic and business rules between services, so you can basically create a flow between various services. For instance, if a user is logging in, you might have to go to a different service to create a ticket, then you might have to log it in the database. All those services are orchestrated in Mule ESB. The queue service provides queuing of messages, and the cache service
basically caches static data. In the services platform, what we did is provide a framework. The idea was that at some point we'd open it up, so that anyone in Intuit — and maybe at some point even outside Intuit — is able to just drop in their Java business logic, just the business logic applicable to their services, without having to worry about writing the entire framework around orchestrating services. So developers just concentrate on writing their business logic and dropping the code into the framework.
There is an activity that goes on in the background called compaction. What happens is: as you're writing data, it's written in memory, then flushed out, creating SSTables, and at some point these SSTables merge — and that merge causes a lot of I/O. Now, if you look at the SSTables, they hold huge amounts of data, and compaction still has to traverse through the entire data set to merge the files, and that causes a huge I/O spike — that's an additional I/O spike.
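That merge step can be sketched as a multi-way merge over immutable runs where the newest timestamp wins — which makes clear why every compaction has to read through the full contents of the SSTables involved. This is a toy model, not Cassandra's implementation:

```python
# Toy compaction: merge several "SSTables", each a dict of
# key -> (timestamp, value). Every entry of every input must be read,
# which is the source of the I/O spike described above.

def compact(sstables):
    merged = {}
    for sstable in sstables:                  # full pass over each input
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)     # newest write wins
    return merged

sst1 = {"a": (1, "old"), "b": (2, "x")}
sst2 = {"a": (5, "new"), "c": (3, "y")}
print(compact([sst1, sst2]))
# → {'a': (5, 'new'), 'b': (2, 'x'), 'c': (3, 'y')}
```

Even though only `"a"` was overwritten, the merge still touched every entry in both inputs — at production scale, that full traversal is what shows up as the I/O spike.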
Well — so Cassandra is used for metadata storage. What we do is, say, for instance, there is a request to create a record in the data platform. With that request, we get a certain set of attributes that define that particular record, and we also get a document. The document goes into a separate document store, the metadata lives in Cassandra, and there's a pointer to the document in Cassandra.
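The metadata-plus-pointer split can be sketched as follows. Both stores are stand-in dicts here; in the real system the metadata rows live in Cassandra and the documents in a separate document store:

```python
import uuid

doc_store = {}   # stand-in for the separate document store
metadata = {}    # stand-in for the Cassandra metadata table

def create_record(attributes: dict, document: bytes) -> str:
    """Store the document blob separately; keep attributes + pointer as the record."""
    doc_id = str(uuid.uuid4())
    doc_store[doc_id] = document                              # blob lives outside
    record_id = str(uuid.uuid4())
    metadata[record_id] = {**attributes, "doc_ptr": doc_id}   # row + pointer
    return record_id

def fetch_document(record_id: str) -> bytes:
    """Resolve a record to its document by following the stored pointer."""
    return doc_store[metadata[record_id]["doc_ptr"]]

rid = create_record({"user": "u1", "type": "tax_return"}, b"%PDF...")
assert fetch_document(rid) == b"%PDF..."
```

Keeping only small metadata rows (plus a pointer) in Cassandra keeps the hot path fast while large blobs live where bulk storage is cheaper.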
So this is our stack, our architecture, for multiple data centers. We use active-active multi-data-center deployment, and the reason we do that is that we generally try to keep away from active-passive architectures: we generally scale in both data centers to support our peak activity, and if you have a passive data center that is not doing anything, you're just wasting resources. So we have active-active data centers.
There, Cassandra replication is really helpful, because replication at the Cassandra tier is really fast: as soon as you write, it replicates right away across data centers. And to avoid any consistency issues, we have a global load balancer and we provide 30-minute sticky sessions, so the user actually stays on one side for at least 30 minutes, and that gives us enough time to make sure the data is replicated between data centers.
So, some of our learnings in production. One of the biggest learnings we had in production was: monitor your heap closely and pay attention to your CPUs. You will see uneven CPU spikes, and when you see uneven CPU spikes, what it really means — if you can correlate it with the volume — is that it's most likely garbage collection. And as you can see here, we had that problem.
We had this problem in production, where we ran into garbage collection issues. At the time, as you can see, we were seeing something like 400 CPU ticks, and then, after we tuned garbage collection and added more nodes, it dropped down consistently and significantly, to 200 CPU ticks — and this is during peak. One of the other things that is really important is to monitor your disk. It's something that causes, I think, most of the pain, so keeping a close eye on it is really important.
Yeah, so bloom filters: there is a setting in Cassandra that you can put in place to tune the bloom filters. One of the important things is to look at your cache misses and then determine what the right ratio is for the bloom filters. Turning down the bloom filters is something that has other performance implications as well, so you have to be really careful, and it also obviously depends on what your problem is — like for us.
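The ratio being tuned here is the standard bloom-filter tradeoff: a lower false-positive chance (fewer wasted disk reads on cache misses) costs more memory per key. A quick sketch of the textbook sizing formula — this is the general result, not Cassandra's internal accounting:

```python
import math

def bits_per_key(fp_chance: float) -> float:
    """Optimal bloom-filter size for a target false-positive rate p:
    bits per key = -ln(p) / (ln 2)^2."""
    return -math.log(fp_chance) / (math.log(2) ** 2)

# Memory grows roughly linearly as the target false-positive rate
# drops by orders of magnitude:
for p in (0.1, 0.01, 0.001):
    print(f"fp={p}: {bits_per_key(p):.1f} bits/key")
```

So going from a 10% to a 1% false-positive target roughly doubles the filter's memory footprint per key — which is why the talk stresses checking cache-miss behavior before turning the ratio down.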
Yep, that's right — and that's exactly what we see. What you see on the left side is basically the first peak, which happens around the January time frame. What happened there was that we had a spike in memory — a spike in heap usage — and that's something you generally don't see; you only see it when you have high traffic. That's what we saw, and we had to immediately take some steps.
Oh yes, yeah — so one of the things this introduces, going back here, if we look at it, is that it's possible your data is not replicated at the Cassandra tier, right? So one of the things you have to do is run repairs continuously, but even then you might still run into such scenarios.
So we are entirely using Cassandra. However, Cassandra is good for direct lookups: if you have a key-based lookup, you can look it up really quickly. But if you want to do range queries, or some kind of fuzzy queries, it's difficult to do. So the thing we are looking at is adding a search tier in there, so that it helps our operations team and also helps us analyze that data. That's something we're getting to work on this year.
Yeah, so we have a queuing service in between: data gets written there, and the analytics platform fetches from it and writes to Hadoop — we use Flume there. There are some data sets that we get directly from sites, like clickstream-type data, which goes through our real-time processing engine, through Flume, into Hadoop; and then there is data that comes to us asynchronously via messaging.