From YouTube: C* Summit 2013: Big Architectures for Big Data
Description
Speaker: Eric Lubow, CTO and Co-founder at SimpleReach
Slides: http://www.slideshare.net/planetcassandra/2-eric-lubow
Having many different technologies within an organization can be problematic for developers and operations alike. Structuring those systems into discrete modules not only abstracts away a lot of the complexity of a heterogeneous architecture, it also allows the evolution of systems using common access and storage patterns. This session will discuss how to think about, architect, and maintain a service architecture for a big data system.
All right guys, let's get started. So yeah, there's that. My name is Eric Lubow, I'm the CTO of SimpleReach, and I'm going to talk to you today a little bit about how we built our architecture to process the amount of data that we see.
These are the things I'm going to talk to you about: I'll tell you a little bit about the company, I'm going to talk about what the goals of having that architecture are and what the tools that we use to build it are, and then I'm hopefully going to have time for some questions, but there's no guarantee there.
So, first of all, one of the things that we learned when we were starting to work with really large amounts of data is that most of it is absolutely useless, and you have to get through the useless stuff in order to get to the good stuff. But the actual way to phrase that, which I prefer a little bit more, is the Borat version: even with the right tools. Okay.
So how did we do it? We use all of those technologies in order to be able to deliver to our customers, and deliver internally to the team, all the data that they need in a timely fashion, presented to them in the fashion that they need. We use Vertica, Redis, Cassandra, and Solr, and in terms of programming languages we use Ruby, Ember.js, Node, Python, and Go. You'll notice Java is not on there, and I'm super thrilled about that.
So, in order to build this system to process all this data, we had a couple of goals in mind. We want consistent, non-data-storage-layer access patterns, and I'll show you what that means in a little bit. We want data accuracy across storage engines: if you're storing the same data in one store as you are in Cassandra, you want to make sure that the numbers match, and there are really interesting ways that it appears when they don't. You want to minimize downtime, or minimize the cost of downtime, because there will invariably be downtime.
So you want to minimize its impact. You want highly available systems. You want to allow access to many tool sets: if you're using all these systems, and you're using Cassandra, you want access to the toolkits and tool sets that each one of those systems will provide you, because there are just some cool tools that only work on Cassandra, and some cool tools that are only backed by Redis, so you want to be able to take advantage of all those things. And all clients that need to talk to your architecture (that's not people clients, that's consumers of data, algorithmic clients, system-type clients) should have minimal knowledge of the underlying architecture.
You want authentication, tracking, and throttling, and I'll show you why that's important in a minute, and you want to control your data flow patterns. These are the tenets that were very important to us. So, consistent data access patterns: for us, we have something called the real-time score.
In Cassandra it gets stored as a composite column, and it can be stored as real_time_score, real_time, or srt, depending on whether it's a short document or a long document. So rather than having the client ask for which one it is, you just have a consistent access pattern and always call it the real-time score. So one is good, one is bad, and this will become important later on.
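To make that concrete, here is a minimal sketch of what that kind of consistent access pattern can look like. The column names (real_time_score, real_time, srt) come from the talk, but the helper function and the dict-shaped rows are hypothetical stand-ins for the real Cassandra client, just to show the idea: callers always ask for the real-time score and never care which physical column it lives in.

```python
# Hypothetical sketch: clients always ask for the "real-time score"; the
# storage layer decides which physical composite-column name holds it.

# Physical column name per document type (the mapping itself is an
# assumption for illustration).
_RT_COLUMN_BY_DOC_TYPE = {
    "short": "srt",
    "long": "real_time_score",
}

def get_realtime_score(row, doc_type):
    """Return the real-time score regardless of how it is physically stored.

    `row` stands in for a row fetched from Cassandra (here just a dict),
    so callers never need to know the underlying column name.
    """
    column = _RT_COLUMN_BY_DOC_TYPE.get(doc_type, "real_time")
    return row.get(column)

# Example usage with fake rows shaped like the stored data.
short_doc = {"srt": 0.82}
long_doc = {"real_time_score": 0.37}

print(get_realtime_score(short_doc, "short"))  # 0.82
print(get_realtime_score(long_doc, "long"))    # 0.37
```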
So why are authentication, tracking, and throttling important? Well, it's really easy to have services run amok when you're deploying all the time: different consumers, different data producers, you're ingesting different types of events. It's really easy to have these things just over-process and ask too many questions, or you can just end up writing some bad code. Not that anyone here writes bad code, but just in case someone did, the services could, you know, DoS us internally. So you want per-service access keys; every single one of our services has its own internal access key.
You want to track call volume, because you need to know: do I need more API endpoints? Do I need more capacity to handle certain types of requests? Do I need more capacity in the data storage layer for this type of request than, say, another request? Again, you want to prevent internal denial-of-service attacks. These are unintentional, but they can happen, and when they do happen it really sucks to have to blame yourself. And you also want to monitor availability and performance of those calls. If you typically see that your call for a bit of account data takes 10 milliseconds and all of a sudden it's taking 40, well, that might only be a 30 millisecond difference, but that's kind of a big deal.
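As a rough illustration of what per-service keys, call tracking, and throttling can look like together, here is a minimal in-memory sketch. The key registry, the per-minute budgets, and the metric bookkeeping are all assumptions for illustration; in practice this sits in the service layer and feeds the monitoring stack rather than a Python dict.

```python
import time
from collections import defaultdict

# Hypothetical registry of internal services and their per-minute call budgets.
SERVICE_KEYS = {
    "social-consumer-key": {"service": "social_consumer", "limit_per_min": 6000},
    "score-writer-key":    {"service": "score_writer",    "limit_per_min": 1200},
}

# service name -> list of (timestamp, latency_ms) for recent calls
_calls = defaultdict(list)

def authorize_and_track(api_key, latency_ms):
    """Check the per-service key, record the call, and throttle if over budget."""
    entry = SERVICE_KEYS.get(api_key)
    if entry is None:
        raise PermissionError("unknown service key")

    now = time.time()
    recent = [t for t, _ in _calls[entry["service"]] if now - t < 60]
    if len(recent) >= entry["limit_per_min"]:
        raise RuntimeError("throttled: %s is over its per-minute budget" % entry["service"])

    _calls[entry["service"]].append((now, latency_ms))
    return entry["service"]

# Example: a consumer identifies itself on every internal API call.
service = authorize_and_track("social-consumer-key", latency_ms=12)
print("call accepted for", service)
```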
So how do we control our data flow? We use something called NSQ, and I'll show you what NSQ is on the next slide. But for us, NSQ is a really interesting piece of software because it allows a couple of things.
One, it allows us to queue up all of our requests, all of our events, and all of our to-dos at every stage, and it allows us to do multicasting of those requests. So, for instance, say a new tweet comes in. It gets stuck at the edge, our social data consumer will pick that up, and we'll multicast it to three different consumers. The first one will pick up the batch-and-write data: it'll just take a whole bunch of tweets, put them together, process them, update a total count, and write it to disk. This way you're not doing one write for every tweet; you can group them up, and in Cassandra you get to take advantage of things like batch mutate, for instance. Raw data, same thing: you just take the raw data, put it in a batch mutate, and have Cassandra write it.
This way, you turn many writes into one write. And on our end, we do something where every time we see a social event, we like to calculate a new score, because that obviously updates the value of a particular piece of content if it was tweeted about. So we then update the score, which will create a new NSQ job for writing the score to the various data stores that it needs to go to.
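Here is a rough sketch of that batch-and-write consumer pattern using the pynsq client. The topic and channel names, the batch size, and the flush_batch function (which in production would be the single Cassandra batch mutation) are assumptions for illustration, not the production code.

```python
import json
import nsq  # pip install pynsq

BATCH_SIZE = 100
_pending = []

def flush_batch(events):
    # Placeholder for the real write: in production this would be one
    # Cassandra batch mutation covering all of the buffered events.
    print("writing %d events in one batch" % len(events))

def handle_message(message):
    """Buffer each social event; flush to storage once the batch is full."""
    _pending.append(json.loads(message.body))
    if len(_pending) >= BATCH_SIZE:
        flush_batch(_pending)
        del _pending[:]
    return True  # ack the message

reader = nsq.Reader(
    message_handler=handle_message,
    lookupd_http_addresses=["http://127.0.0.1:4161"],  # assumed local nsqlookupd
    topic="social_events",   # assumed topic name
    channel="batch_writer",  # one of the multicast consumers
    max_in_flight=BATCH_SIZE,
)

if __name__ == "__main__":
    nsq.run()
```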
Controlling the data flow allows us to create maintenance windows with no downtime. So, for example, where we were writing that raw data, we can just let the queues back up while we do, say, a Cassandra upgrade, run through all of the upgrade nodes, and then let that stuff process. And the processing will be quick, because we're not writing single events; we're batching them up into groups of, say, you know, a hundred or a thousand, and letting them write in groups.
It also has a really cool feature, which is the ephemeral channel. So if you want to just take a look at what's coming down those message queues, you can look at it, and you won't lose those messages and won't ack them; you can take a look at what's going on, those messages still get processed by your actual system, and you don't have to worry about the delivery.
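In NSQ that peek-without-disturbing behavior comes from channels whose names end in #ephemeral: every channel gets its own copy of each message, and an ephemeral channel is not buffered to disk and disappears when its last client goes away, so tapping the stream this way does not affect the real consumers. A minimal sketch (the topic name is an assumption):

```python
import nsq  # pip install pynsq

def peek(message):
    """Print what's flowing through the topic. Acking here only affects this
    throwaway channel, not the channels the real consumers read from."""
    print(message.body)
    return True

nsq.Reader(
    message_handler=peek,
    lookupd_http_addresses=["http://127.0.0.1:4161"],
    topic="social_events",     # assumed topic name
    channel="peek#ephemeral",  # ephemeral: not persisted, vanishes on disconnect
)
nsq.run()
```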
So what does that look like when it comes to our actual system? We have a bunch of different ways that we collect data; the specifics are not really important, but from the internet we make sure that all data does what we like to call flowing downhill. It comes from the internet to our edge collectors and hits a set of queues, and the consumers pull from those queues. The consumers will then go to the internal service architecture to get any data that they need to process their jobs; that internal service architecture is responsible for talking to the individual data storage layers and then passing the information back up, and then the consumers will write, again going downhill. I'll show you what the internal service architecture looks like in a minute. So how did we get here? I'll tell you.
It was everything from how often we access each type of data store, to the types of messages we produce, to the types of messages that come in. Everything revolved around knowing where we could be accepting of read latency, knowing where we could be accepting of write latency, knowing where the real-time patterns were important, and knowing where the sort of offline jobs would be important. We built the service-oriented architecture, and we also had to do a lot of data accuracy checks.
So again, if you have data stored in Cassandra and in Vertica, like in our case, and actually Redis as well, and those things are not all identical, then when you pull from one and you pull from another and they don't match, and you can show them both on the same page, for instance, a customer is going to look at that and be like, well, how come this thing says 20 tweets and this thing says 25? There's clearly a discrepancy.
So one of the things that we had to build, going down through the line, is that all the consumers needed to be aware of what was being written to the other data stores, so that we could ensure accuracy in one data store versus the other data store. And we built out a framework for testing different engines. What this means is that the service architecture sits in front of all the data stores.
So, as we were trying to decide which data store was going to be best for a particular feature or feature set, we needed to see, you know: did we want to use, for instance, Vertica? Did we want to use Infobright? Did we want to use InfiniDB? All of these are column storage engines, and for us to find the right one that fit our business and fit those features, we had to figure out a way to do it.
So what we were able to do was just plug and play. We put all the storage engines behind that service architecture and ran the query against the service architecture. The architecture would then say, okay, I know I need to run this query against Vertica and InfiniDB and Infobright, and the results should all be identical. It'll log the response times and write them off to a different place for us to look at later, and when we decided which one we wanted to go with, which was ultimately Vertica, we just pulled InfiniDB and Infobright out. Zero downtime, nobody was the wiser, we made our decision, and we had an entire testing system. And we can do the same thing with queueing engines, if we wanted to try Resque versus NSQ or RabbitMQ, because that service architecture gives us the ability to have a consistent access pattern and put space and time in front of requests.
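A toy version of that plug-and-play comparison might look like the sketch below: fan the same logical query out to every candidate engine, time each one, and flag mismatches. The engine callables here are hypothetical stand-ins for real Vertica, Infobright, and InfiniDB clients.

```python
import time

def run_against_engines(query_fn_by_engine, *args):
    """Run the same logical query against every candidate engine,
    record response times, and report any result mismatches."""
    results, timings = {}, {}
    for name, query_fn in query_fn_by_engine.items():
        start = time.time()
        results[name] = query_fn(*args)
        timings[name] = (time.time() - start) * 1000.0  # milliseconds

    baseline = next(iter(results.values()))
    mismatched = [name for name, result in results.items() if result != baseline]
    return results, timings, mismatched

# Hypothetical engine clients standing in for Vertica / Infobright / InfiniDB.
engines = {
    "vertica":    lambda account: 25,
    "infobright": lambda account: 25,
    "infinidb":   lambda account: 24,  # pretend this one disagrees
}

results, timings, mismatched = run_against_engines(engines, "account-123")
print("timings (ms):", timings)
print("engines disagreeing with the baseline:", mismatched)
```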
The other thing that we did that gave us the ability to build this architecture out is we made sure everything looked the same, every chance we get. So the base image starts out with an Amazon AMI. On top of that we have our organizational information, which, as you can see, is users and the application-specific configuration, and then application groups, meaning: does this thing need to be on there for sharding?
Does the nsqd client need to be on there for consuming or for producing, or is it a lookup? And then we have whatever application sits on top. This provided us the ability to say, any time we need to launch a new image, we just launch that base image, we put the organizational-specific stuff on there, and then beyond that we decide what application group it fits in. Is it a database? Is it an application? Is it a web server? And we put the appropriate things on there.
This is all because we have a great systems guy. I'd love to say team, but we're a small company, so it's really just one guy with a lot of headaches, and I can't get him to wear the hat; I've tried. So we make extensive use of AWS. AWS has some great stuff like OpsWorks, which is a Chef-based system for configuring machines and application types. We also monitor everything very heavily, and anybody who's had any level of monitoring experience knows that no matter how much you do, it never seems to be enough.
So we've got Nagios for the base monitoring and StatsD for application-specific instrumentation, and I would love a replacement for Graphite, so if anybody knows one, please tell me, because it's not as awesome as we'd like it to be. So again: Chef, OpsWorks, and Vagrant, which, for anybody who does systems in the room, allows you to spin up a small machine setup on your local machine, so you can have a production-like scenario or production-like setup locally for development.
cssh is a cluster SSH client, which is basically what we used before we had the ability to use Chef and deploy across everything: we would just spin up cluster SSH sessions, SSH into a hundred machines at a time, run the command, and then close that out. That is exactly as painful as you would imagine. Deployment we all do now with Chef.
So I put this slide up here because anybody working in AWS typically sticks to the standard, EC2 and EBS volumes, and that's good if you're a small shop, but it's also not good. And the reason it's not good is because there are all these other features that make your life significantly easier, and we were kind of hesitant to use some of them.
You know, putting our Vertica cluster inside of the Virtual Private Cloud meant we were able to reduce latency between machines. Understanding what external tools are available: so, for instance, we run a lot of offline jobs, we run a lot of MapReduce stuff, and Elastic MapReduce is good, but you have to have everything on S3. So what we did to get around that was we found a company called Mortar Data, which uses Elastic MapReduce under the hood.
You give them access to some of your S3 buckets, and even other methods of access, and you get to take advantage of the AWS services without even having to, you know, understand EMR. Elastic Beanstalk: every time we're doing new Rails development or testing new apps, we just spin up an Elastic Beanstalk app, which basically comes with all the Rails pieces built in, and it's so much less work for us to do on the system side.
So all these things are tool sets that we get to take advantage of, because you can just plug and play right into our architecture, and the developers do not need to become aware of additional things in the architecture. So this is actually what it looks like in a very superficial sense; I promise there are more machines than that, though. So what does the service architecture look like? Again, we start with our base image layout.
I can't stress how important having that base image layout is, especially when it comes to the monitoring and instrumentation. So the proxy machine, and you'll see on the next slide what the proxy machines are for, sits in front of any storage machine, and it tells the requesting app to hold on a second while it gets the information.
It knows, I need to get this type of data from Vertica and this type of data from Cassandra, and it packages it all up and sends it back in JSON format to the querying machines. The reason that this is a very good methodology is because it does not force the existence of a giant monolithic service architecture, where every time you do a code deploy, for instance, if somebody makes a typo, it takes down your entire internal service architecture. That's, for obvious reasons, a problem.
So what we did was we broke it all down into little chunks, so each API endpoint is its own tiny system: ten-minute content is its own little Python app, hourly content is its own little Python app, account is its own little Python app, and so on. And it's not just Python; I was using Python as the example, but we use whatever language is actually best for the data storage layer. So in some cases we use Go, in some cases we use Python, and in some cases we use Node.js. We did this for availability.
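For a sense of scale, one of those little per-endpoint apps can be as small as the sketch below (Flask is used here purely as an example micro-framework; the route, port, and data lookup are assumptions). Each endpoint being this small is what keeps the one-app-at-a-time deploys and restarts cheap.

```python
from flask import Flask, jsonify  # pip install flask

app = Flask(__name__)

def fetch_ten_minute_counts(content_id):
    # Stand-in for the real storage call (a read through the service layer
    # to Cassandra, for example); returns fake data for illustration.
    return {"content_id": content_id, "tweets": 25, "interval": "10m"}

@app.route("/content/<content_id>/ten_minute")
def ten_minute_content(content_id):
    """One endpoint, one tiny app: look the data up and hand back JSON."""
    return jsonify(fetch_ten_minute_counts(content_id))

if __name__ == "__main__":
    app.run(port=8301)  # assumed port; each endpoint app gets its own
```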
We did it again for the consistent access patterns, which is another feature that I can't stress enough, and again for minimal downtime on your changes. If you want to deploy just a change to, say, one endpoint, there's no need to deploy a giant, again, monolithic service architecture WAR file or whatever it might be. You just deploy that one little app change and restart that one app. It becomes unavailable for, you know, however long it takes to restart that app, and if it's a small one, it'll probably take a second, maybe, if that, and you've got yourself a newer version of the endpoint. Smaller code deploys: clearly I like that word.
So how do we keep ourselves available? We made sure that, even though we spin up and auto-scale quite frequently, the distribution within Amazon, at least in us-east, is such that we can lose an entire data center. Like, us-east-1a could just drop out of existence and we'll still be good, because every time a new machine comes up, it checks the other availability zones to make sure that we're doing evenly balanced machine deployments. And here's an interesting fact that we learned the hard way last week.
If you do not have an even number of machines across availability zones, Elastic Load Balancers will push a disproportionate amount of your traffic into the availability zones with more instances. So, for instance, if we have four instances of an endpoint in us-east-1b and three in 1a, then 1a will get, like, ten percent less of the traffic than the other.
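Checking that instances stay evenly spread across availability zones is easy to script; here is a rough sketch with boto3 (the tag filter and region are assumptions, and in practice a check like this would run when a new machine comes up or inside the auto-scaling tooling).

```python
from collections import Counter

import boto3  # pip install boto3

def instances_per_az(endpoint_tag):
    """Count running instances per availability zone for one endpoint group."""
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:endpoint", "Values": [endpoint_tag]},  # assumed tagging scheme
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    counts = Counter()
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts

counts = instances_per_az("ten-minute-content")
if counts and max(counts.values()) != min(counts.values()):
    print("unbalanced across availability zones:", dict(counts))
```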
That might not seem like a big deal, but when you've got millions of events, it's pretty easy to take down a small number of machines; hence that internal denial-of-service attack, you know, accidental stuff. So you're always going to run into problems. And anybody who's seen any of my talks knows I just love this unicorn, so I put them in everything. It really has no meaning, but you're always going to end up with problems, and this guy clearly has them.
So the reason we built that architecture the way we did is because we've been in scenarios where us-east-1a, for instance, has gone down and we haven't. In fact, when I was creating this slide, I guess about six months ago, prior to another conference, us-east-1a did go down, which is why I actually used this one in particular, and we did not go down with it, because of our distribution.
Every time we want to create a new service, we have a very specific subset of questions that we have to ask, and we ask ourselves the same set of questions along with a few others. First: can the host of the service be completely homogeneous, meaning does it fit with our pattern? Does it fit with that architecture pattern of the Chef base, the organizational users, then the application group, and then the application itself? Can it accept downtime, and what does downtime look like?
Can you create a scenario where you can let the producing of messages back up in a queue with minimal impact? Because that's always our goal: to have the minimal amount of impact from downtime. Does it fit into an existing service? In other words, would it be better to lump it in with another service, would it create a large code base for that service, or does it really need to be its own? Does it require data center distribution? The answer to that is almost always
yes, but again, you just have to ask, because you need to know what trade-offs you may need to make when creating a new service. How should it be instrumented or monitored? Again, this is also very critical because, as I said, denial-of-service attacks, whether internal or external, you're going to want to be aware of, and you're going to need to know what normal usage patterns look like.
So, just to tell you everything that I already told you again, only in a few short bullet points: you need to know what you're looking at, and you need to know what you're working with, and that's sort of an evolution. We made a ton of mistakes over the past few years to get here. We're quite happy with where we're at, but we're still working towards, you know, smaller, more efficient code bases and deployments. So build, use, and integrate the external tools, again.
Those are the big things that really need to be thought about. This is the new thank-you slide, I've decided, so if any of this sounds even remotely interesting, feel free to come find me. The last thing I have is just a little announcement: myself and this gentleman right here are writing a Cassandra book, which is hopefully going to be published in September, called Practical Cassandra, so be on the lookout for it. If you have any questions, I've got two or three minutes, so I can probably take a few.
So he asked: in the service architecture, and in the storage layers under the service architecture, are we storing different information? And the answer is yes. We find one store is much better at storing aggregates or counts; its incrementers are faster and way more reliable than counters in Cassandra, which we try to avoid (we can't entirely, but you know). And we use another, for instance, for handling our users and our accounts, because it's got a great object layer, an ORM layer, that Rails plugs into very easily. So it's just different.
Yes, yes, we do have processes to reconcile the differences at the end of every hour, at the top of every hour. We try to make everything within our system trigger-based, so every time a new event comes in or a new message is created, that will, you know, kick off n number of other events, but there are just certain things that that really won't work with.
So the question was: how do we know that the URLs... yeah, actually, we don't. The question was: how do we know that a URL has become inaccessible? Actually, we don't really care. All we care about is knowing how many events we've seen for it to date. If it doesn't exist anymore, it's kind of not our problem.