Description
In this session, you'll learn how Apache Cassandra is used with Python in the NY Times ⨍aбrik messaging platform. Michael will start his talk off by diving into an overview of the NYT⨍aбrik global message bus platform and its "memory" features, and then discuss their use of the open source Apache Cassandra Python driver by DataStax. A progressive benchmark to test features and performance will be presented: from naive and synchronous to asynchronous with multiple IO loops, with the benchmarks tailored to usage at the NY Times. Code snippets, followed by beer, for those who survive. All code available on GitHub!
I'm the token New York Times guy at this event, but we do use Cassandra, and I'm going to talk a bit about that and about this project I'm working on. We're primarily a Python shop these days, and this is mostly new Python. OK, so I'll try not to be too boring to the Java guys. I've been in computing a long time; this is actually my 50th year of work.
That would be radically simpler than the enormous hodgepodge of stuff that we have, because we're an old company, and an old company in the tech space might be five years old. We're considerably older than that, and we have a lot of vestiges of ancient technologies, from Linotype to you name it. They all show up in our systems, and they all show up in one way or another in the stuff we have to maintain. So could we really make it simple?
Could we really make it flat, and could we really make it global, all at the same time? I think of our systems as trying to build a system for the ninety-nine percent. There's the Googles and the Facebooks and those guys, but we aren't those guys. We don't have the money and we don't have the manpower to do that.
So we started to think first about communications, about building a communications matrix, and I think of the fabric as primarily this communications matrix that extends through data centers around the world to support our communications. As we thought about it, we were doing the DevOps side too, so we were going to build the infrastructure, the application, everything, all from the ground up. By the way, when I say "we," I'm talking about me and my partner, so it's not a big crew.
So we thought, well, what if we just sort of scatter nodes out there? These are small nodes, and we're going to spread them through this matrix, and what we want to do is arm them with a way to self-organize and to register their capabilities, because they have different capabilities. Typically we run two or three different sizes of nodes, just along that simplification line; we run with, like, one AMI (we're running on Amazon, by the way).
Really, the computing problems and network problems are not the big problems. We run into the same problems you run into, which is that people make mistakes. One of our team (the team has grown to three by now) introduced a little bit of buggy code into Oregon, and because things are tightly connected, it ended up bringing down every node in Oregon: all the Cassandra nodes, all the core nodes, all of our brokers, all of our gateways.
As for the zones, we spread everything out across those zones pretty evenly, and we auto-scale the gateway nodes. Just to dive down into a zone, this is kind of what it looks like inside a region. This is really just a subset, and I'm not going to go through what these things are, but just to give you an idea of the connections and the potential connections: wherever it's marked that way, it's a potential connection.
We're scaling up and down; we're always adding and subtracting to do that. We really only speak WebSocket and AMQP. SockJS is a transition protocol for us that makes use of XHR streaming and so forth to emulate a WebSocket interface, so we can support virtually every browser, every device that connects to us for the news.
We picked up some things that are similar to Cassandra, which is interesting to me coming from a messaging background, namely that important things are replicated. We replicate our messages and we let them race to resolution. So, for example, if there's breaking news, it gets injected at one point in fabric. The first thing we do is replicate it to every region and zone; we process it in parallel, and then we exchange and resolve the duplicates and try to send only a single copy to the consumer.
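A minimal sketch of that race-to-resolution idea, with invented names (the talk doesn't show this code): replicas of a message carry the same producer-generated ID, and the consumer-facing edge only delivers the first copy that wins the race.

```python
import time

class Deduplicator:
    """Drop duplicate copies of a replicated message before delivery.

    Each message carries a producer-generated message ID; replicas of the
    same message race in from several regions, and only the first one
    through is forwarded to the consumer.
    """

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._seen = {}  # message_id -> time first seen

    def first_copy(self, message_id):
        now = time.time()
        # Forget IDs older than the TTL so the set stays bounded.
        self._seen = {m: t for m, t in self._seen.items() if now - t < self.ttl}
        if message_id in self._seen:
            return False          # a replica already won the race
        self._seen[message_id] = now
        return True

dedupe = Deduplicator()
for msg_id in ["breaking-123", "breaking-123", "sports-456"]:
    if dedupe.first_copy(msg_id):
        print("deliver", msg_id)  # only one copy reaches the consumer
```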
We give back a message ID that's been generated by our message producers so that they can check; they can do an end-to-end check to see that everything got there, and when in doubt, they resend. If you get disconnected, you reconnect and we'll redirect you to something live. I'm not going to go into it today, but I'm really proud of our WebSocket server.
We think we can go substantially higher than that, and we'll actually stay with small machines. I've been really impressed here at the Cassandra conference by the size of the machines people run. We try to stay small, so we're trying to run Cassandra on the smallest machines we can, and we try to wring everything we can out of them. And that kind of brings me to the Python stuff.
Okay, so this is a snapshot I took this morning, just to see what yesterday looked like, and you see that big ugly spike there in the middle? That's what a breaking news alert looks like. If you're familiar at all with RabbitMQ, which is what we use for our message broker, you can see these rabbits are just loafing here, and we like to keep the rabbits loafing, because we are using clusters.
A cluster in the RabbitMQ world is different from a Cassandra cluster in that consistency is at the top of the list, so the nodes need to be chatting with each other all the time. The other thing we do is that we process everything in memory; we never persist. The reason we never have to persist in our message brokers is that we use Cassandra for that, so we've really separated the problem in a way that takes advantage of both of these platforms. Okay, so we do messages, and we have a very simple data model. This is actually a little too simple, but pretty much all my message is, is some metadata and a blob.
We don't care what's in the blob, because we're a platform, so we just push blobs around. We do care about the metadata, and in fact we have a very nice cube, as was discussed in a previous talk, of about eight items that we use in our metadata. We track everything by product, by project, and so on, so we can do internal accounting, keep track of our users that way, and provide them with nice reports and so forth. So that's what a message looks like; that is, of course, the Cassandra version of a message.
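As a hedged illustration of that data model (the keyspace, table, and column names here are assumptions, not the actual NYT⨍aбrik schema), a Cassandra message table created through the DataStax Python driver might look like:

```python
from cassandra.cluster import Cluster

# Connect to a local node; in production you'd list real contact points.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS fabrik
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'us-west-2': 3, 'eu-west-1': 3}
""")

# A message is just some metadata plus an opaque blob payload.
session.execute("""
    CREATE TABLE IF NOT EXISTS fabrik.messages (
        message_id text PRIMARY KEY,   -- producer-generated ID
        product    text,               -- metadata used for accounting
        project    text,
        created_at timestamp,
        body       blob                -- the platform never looks inside
    )
""")
```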
This is what a message looks like in Rabbit, and you can see how the metadata has kind of migrated down into the headers, and then some of those items have been promoted, in a way that is useful for the rabbit, up into the properties. The idea is that the message structure is general enough that we can represent it in Amazon S3, and we can represent it in a lot of different environments.
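A sketch of the Rabbit-side shape using the pika client, with invented exchange and field names: the metadata rides in the AMQP headers, and a few items are promoted into the standard properties.

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="fabrik", exchange_type="topic")

metadata = {"product": "news-alerts", "project": "breaking", "region": "us-west-2"}
body = json.dumps({"headline": "..."}).encode()

channel.basic_publish(
    exchange="fabrik",
    routing_key="breaking.news",
    body=body,  # the opaque blob
    properties=pika.BasicProperties(
        message_id="breaking-123",     # promoted into a standard property
        content_type="application/json",
        headers=metadata,              # the rest rides in the headers
    ),
)
connection.close()
```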
So let's think a little bit about the benchmarks I'm going to tell you about. I've got some timelines: the rabbit, the client, and Cassandra. Of course there's more than one rabbit, there's more than one client, and there's more than one Cassandra, but we're going to simplify a little bit, and this is a pretty simple diagram of how the benchmarks work. The client subscribes to the message broker and gets an event which has data, a message, and then does something with Cassandra. Cassandra acknowledges it, and we acknowledge back to the rabbit.
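A naive, synchronous sketch of that loop (queue, keyspace, and table names are assumptions carried over from the earlier sketches): consume from Rabbit, write to Cassandra, and acknowledge back to the broker only after Cassandra acknowledges the write.

```python
import pika
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("fabrik")
insert = session.prepare(
    "INSERT INTO messages (message_id, body) VALUES (?, ?)"
)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="messages", durable=True)

def on_message(ch, method, properties, body):
    # Naive/synchronous version: block until Cassandra acknowledges
    # the write, then ack back to the rabbit.
    session.execute(insert, (properties.message_id, body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="messages", on_message_callback=on_message)
channel.start_consuming()
```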
Okay, so we're using CQL native now, and we use the libev event loop. Actually, a lot of what I do is write the drivers, so I wrote the driver to Rabbit and got it to work together with the driver to Cassandra. I don't know if you've ever worked with an event loop like libev, but it's like having a wild animal in a cage: you feed raw meat in one end, stand back, and it spits it out the other end. It's really fast, but it's not that easy to control, and when you have two of them running and you're trying to run them synchronously together, you can run into problems. So Tyler Hobbs and I have gone back and forth a little bit, making the changes to both drivers that actually made that possible.
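Selecting the libev event loop in the DataStax Python driver is done through the connection class; this uses the driver's actual LibevConnection, which requires the libev library and the driver's C extension to be installed:

```python
from cassandra.cluster import Cluster
from cassandra.io.libevreactor import LibevConnection

# Use the libev event loop instead of the driver's default reactor.
cluster = Cluster(
    ["127.0.0.1"],
    connection_class=LibevConnection,
)
session = cluster.connect()
```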
So this is concurrency degree 3, and we're getting some overlap in processing. That's really the key thing to note, but we're doing it through event processing; we're not using threads here, and we're getting more concurrency. Now, this is a little misleading, because all those arrows are so neat. In fact, the order of things is not deterministic, and they will come back in orders that you don't expect. They will all eventually come back, but that means you have to kind of keep track, and the driver helps you do that.
So let's build a message. This is just some simple Python code to build a message, and it creates messages that have roughly the dimensions typical of our application. Then, when you push, you just keep pushing into the pipeline; we've got a pipeline going out and a pipe coming back. So I push up to my level of concurrency, then I'm done, and then basically this program waits around until something happens. The thing that happens is that something comes back in the pipeline.
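The slide's code isn't reproduced in this transcript, but a hedged reconstruction of a message builder with roughly those characteristics (the payload size and metadata fields are assumptions) might be:

```python
import os
import uuid

def build_message(payload_bytes=1024):
    """Build one test message with dimensions roughly typical of the app:
    a small metadata dict plus an opaque blob of about a kilobyte."""
    metadata = {
        "message_id": str(uuid.uuid4()),  # producer-generated ID
        "product": "benchmark",
        "project": "cassandra-driver-test",
    }
    body = os.urandom(payload_bytes)  # we never look inside the blob
    return metadata, body
```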
Okay, so I've built a message and I've submitted the query. What does the actual query submit look like? I grab the body and I do an execute_async in Python, and then I add my callbacks. My callback is where I'm going to get something back. Now, you'll notice that I pushed my limit of concurrency into the pipeline, so every time one comes back, all I need to do is ask: am I done yet? If not, push another one. If I'm done, then finish.
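Here is a minimal sketch of that pattern using the driver's execute_async and add_callbacks (those two calls are the driver's real API; the refill bookkeeping is a reconstruction, not the talk's exact code):

```python
from threading import Event
from cassandra.cluster import Cluster

CONCURRENCY = 50      # keep this many queries in flight
TOTAL = 22_000        # size of one breaking-news-style burst

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("fabrik")
insert = session.prepare("INSERT INTO messages (message_id, body) VALUES (?, ?)")

finished = Event()
state = {"submitted": 0, "completed": 0}

def submit_one():
    metadata, body = build_message()          # from the sketch above
    state["submitted"] += 1
    future = session.execute_async(insert, (metadata["message_id"], body))
    # Results come back in no particular order; callbacks fire on the
    # driver's event-loop thread and keep track for us.
    future.add_callbacks(on_success, on_error)

def on_success(_result):
    state["completed"] += 1
    if state["submitted"] < TOTAL:
        submit_one()                           # refill the pipeline
    elif state["completed"] == TOTAL:
        finished.set()                         # everything came back

def on_error(exc):
    print("query failed:", exc)
    finished.set()

for _ in range(CONCURRENCY):                   # prime the pipeline
    submit_one()
finished.wait()                                # wait for it to wind down
cluster.shutdown()
```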
Okay, so that's basically how it works: I keep 50 in the pipeline all the time until it winds down at the end. So that's an example of a concurrent program. Now let's look at the push. This is my disappointing first graph of performance, and what's wrong with this picture? What's really wrong is that I didn't use DC-aware routing: we're running with six nodes in Dublin and six nodes in Oregon, and 170 milliseconds of round-trip latency.
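The fix is the driver's DCAwareRoundRobinPolicy, pinned to the local data center so queries stop paying the 170 ms Dublin-Oregon round trip (the DC name below is an assumption about how the cluster's snitch names the region):

```python
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Route queries to the local DC; wrap in TokenAwarePolicy to also prefer
# replica nodes (the talk saw no measurable benefit from token awareness).
policy = TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="us-west-2"))

cluster = Cluster(["127.0.0.1"], load_balancing_policy=policy)
session = cluster.connect()
```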
But let's look at a little more detail here; I'm going to be relatively quick going through these, I think. Basically, we're looking at how many workers. This Python program will multi-process to any degree, which is how I'm doing this, and it'll also take levels of concurrency, which I'll show you a little bit later. So I add workers. There are a couple of things to look at here. The concurrencies are down at the bottom. I'm doing LOCAL_QUORUM, because I found that that's how we run, and there wasn't that much difference between consistency level ONE and LOCAL_QUORUM in my benchmarks, so I focused on that. The percentages are interesting because I'm running on an eight-way processor here (it's a c1.xlarge), so there's a limit to how much I can allow any one of my services to take, and it's probably around 400 of the 800 percent that's available.
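A hedged sketch of that worker fan-out: each subprocess creates its own Cluster and session, since a driver session and its event loop shouldn't be shared across forked processes. The table and workload here are carried over from the earlier sketches.

```python
import multiprocessing

def worker(worker_id, total):
    # Each subprocess builds its own Cluster/session so it owns its own
    # event loop; sessions must not be shared across forks.
    from cassandra.cluster import Cluster
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("fabrik")
    for i in range(total):
        session.execute(
            "INSERT INTO messages (message_id, body) VALUES (%s, %s)",
            ("worker-%d-msg-%d" % (worker_id, i), b""),
        )
    cluster.shutdown()

if __name__ == "__main__":
    workers = 4                                # roughly one per core
    procs = [
        multiprocessing.Process(target=worker, args=(i, 22_000 // workers))
        for i in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```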
Okay, and this is what they all look like together. Basically, when you look at something like this, I like the ones where the slopes are rising rapidly; that's kind of where I want to be, one of these guys up here. But it's also got to end well, and these top four don't really end that well. Those ranges are an indication that I only ran three iterations here; they show the high, the low, and the median. Of those, for push in particular, I'd probably pick number four, because most of the time I'd operate it with concurrency up around 100 or 120, and most of the time I would expect it to be operating in this range. The benchmarks are tailored to our environment, so these are bursts of around twenty-two thousand messages.
When I get a burst, it'd be okay for it to go ahead and rise up the graph. At that rate, we're talking under 10 seconds of processing to take a burst, and a burst like that would be typical of what I get on a breaking news alert. So I guess the message here is that tailoring to your environment is important; this benchmark is really tailored for us.
This is the Python program. You can pretty much play with everything: how many remote DC hosts, the prefetch (prefetch is the concurrency), how many workers, whether it's DC-aware, and whether it's token-aware. I didn't talk about token awareness; I haven't found any benefit to it, and I don't know whether that's my code or the driver code, so I'm going to have to dig into it a little more deeply to figure out just why. We're running replication factor 3 on six nodes in each region.
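A sketch of that kind of command-line surface (the flag names are invented; the actual script is the one on GitHub):

```python
import argparse

parser = argparse.ArgumentParser(description="fabrik Cassandra benchmark")
parser.add_argument("--hosts", nargs="+", default=["127.0.0.1"],
                    help="contact points, including any remote-DC hosts")
parser.add_argument("--prefetch", type=int, default=50,
                    help="concurrency: queries kept in flight per worker")
parser.add_argument("--workers", type=int, default=1,
                    help="number of subprocesses to fork")
parser.add_argument("--dc-aware", metavar="LOCAL_DC",
                    help="pin the load balancer to this data center")
parser.add_argument("--token-aware", action="store_true",
                    help="wrap the policy in TokenAwarePolicy")
args = parser.parse_args()
```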
Just to give you an idea, this is kind of what it looks like when it's running: it builds up its 22,000 messages and then runs them. If you look at the unacknowledged count (unacknowledged in rabbit terms), that's my concurrency, so here I'm running with an overall concurrency of 3; I've only got one process, one consumer. If I look at one of the later trials, I'm running with 165: three consumers, which is a concurrency level of 55 per sub-process. Now, in Python these are not particularly expensive in terms of memory.
We're talking about 30 megabytes for each sub-process that you run, so it's not expensive that way. You do have to be careful about how much CPU is being consumed overall to stay efficient. Okay. The polls are similar, except there's a lot more data, so I would expect my graphs to be flatter, and the reason they're flatter is that I'm using a lot more CPU to accomplish the same thing; I just run out of CPU much more rapidly and don't get near the same throughput, but it's still pretty good throughput.
So my overall graphs look like this, a lot flatter, and this helped me with my planning. We do get bursts that are looking for reads, but they're nowhere near as common as the bursts of pushes that we get. Where I get a read burst like this is typically when a WebSocket gateway node goes down: if it goes down, all of its clients have to move at once.