Description
Speakers: Seán O Sullivan, Service Reliability Engineer & Tim Czerniak, Software Engineer, at Demonware
This presentation covers the eight-month evaluation process we underwent to migrate some of Call of Duty’s core services from MySQL to Cassandra. We will outline our requirements, the process we followed for the evaluation, decisions we made around our schema, configuration and hardware, and some issues we encountered.
Tim: So that's me, I'm Tim, I'm a software engineer. That's Seán, he's an operations engineer, and we both work for Demonware. So who is Demonware? Demonware is a subsidiary of Activision Blizzard; we're owned by Activision Blizzard, and we write, deploy and maintain client and server applications for Activision and Blizzard games.
We've been around for about 11 years at this point; we were bought by Activision in 2008, or 2007 actually, I think, so we've been doing it for a little while. Here are some of the titles we've been involved in this year. There's also Destiny, which is not up there, which released on Tuesday.
So we've got Advanced Warfare and Diablo, which it turns out was our 100th title; we've been involved with 100 titles. There's Call of Duty in China, which is releasing later this year, and Skylanders, which is releasing, I guess, next month. We've been involved in a lot of titles: Skylanders, every Call of Duty since Call of Duty 3, loads of Guitar Heroes and DJ Heroes and other Heroes, Bond games, mobile apps, etc.
We have a lot of services, and we do things like matchmaking, which is: if you want to play a game with someone, you have to say 'hey, I want a game', someone else has to come along and say 'hey, I want a game', and then we match you up and off you go. Leaderboards, chat, file storage, leagues, social network integration, uploading things to YouTube, streaming them, big content servers. We have a lot of stuff.
We have about 100 services that we use in various configurations for each game. Some of the technologies we use: on the client side we've got C++, of course, because consoles run C++ and need to run fast, and HTTP for web-based applications and websites. On the server side, these are the main things we use: we're a big Python house, we use Erlang as well, MySQL mostly, we run everything on CentOS, and we use Puppet for automation. There's plenty more, but those are the main things.
So we have a bit of an unusual use case. Most people in this industry talk about ramp-up: you'll start out small, you might have a few customers, you'll gradually maybe gain some popularity, and eventually, maybe with a couple of spikes, over the years you'll gradually ramp up and up and up. We have the exact opposite case. So here's one of our games; you can see there, that's release day.
That is the peak. (Is this not working?) That is Christmas, and what happens is that over the next year or two these numbers will gradually just tail off. So we have the exact opposite case: straight up, and then tail off. This means that, because we're in America I thought I'd use an American quote, we need to be prepared on day one.
Seán: The presence service is high write, high read. It's constantly pinging, saying 'I'm online, I'm online', and your friends every so often check who's online. So it's transient data, and it's a very small data size. The last one we looked at was messaging. Messaging is mail-style messaging, so it's sending mails or messages between users. It's low read, low write; it doesn't happen too often in game. It's things like in-game invites, 'join my party', that kind of thing, and again transient.
We also looked at a fourth service, but after about two or three weeks we decided it was too relational and kept it on MySQL. That one ended up with two MySQL DBs, one in each DC, doing cross-DC replication. It works, but it's not pleasant to maintain or administer. So, the requirements. I mentioned cross-DC: that was our main push, so whatever we chose had to do that, and happily master-master, with masters in both DCs and writes to both. The next was consolidation and expansion.
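To illustrate what a cross-DC, master-master setup like that typically maps to in Cassandra, here is a minimal sketch; the keyspace, table, data-centre names and replication factors are assumptions, and it uses the modern DataStax Python driver purely for illustration:

    # Hypothetical sketch: an active-active keyspace replicated to two data
    # centres, with LOCAL_QUORUM so each DC serves its own reads and writes
    # while replication to the remote DC happens asynchronously.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])  # contact point in the local DC (made up)
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS game
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'dc_west': 3, 'dc_east': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS game.progress (
            user_id  bigint PRIMARY KEY,
            progress blob
        )
    """)

    # A write succeeds once a quorum of replicas in the *local* DC acknowledge,
    # which is what makes "masters in both DCs, writes to both" workable.
    write = SimpleStatement(
        "INSERT INTO game.progress (user_id, progress) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(write, (42, b"save-data"))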
As Tim mentioned earlier, we have to build high early on, and then months or a year later we consolidate and reduce our server footprint as the user base drops off. We typically might spend three-ish months a year doing consolidations, maybe less, maybe more, depending on the year. It's a lot of ops time: we have to get capacity planning involved, operations involved, development involved, do a lot of planning, and then actually do the work itself, and it's MySQL.
The way we typically do it is by title. We have a single cluster, and that single cluster serves a single title. If we're getting another title, we bring up a new cluster, and the new cluster serves that new title. We keep things very separate, because we never want a situation where one title impacts another title. Again, this doesn't help with the whole MySQL situation: you have many, many more SQL hosts for sure, and consolidation then becomes a bigger task as well.
So another requirement was that, whatever we picked, consolidation and expansion had to be easy. Then manageability: we have an operations team of about 15 people, 15 to 20 people, and they're all used to MySQL.
So we're going to rip out MySQL for some of our core services. The progress store especially is a required service: if that's down, the game doesn't work, you don't get on. So if we're going to replace one of our core services with an alternative, we need to make sure the alternative is easily manageable.
Now, each of the rates I mentioned is at the application level. If you convert this to NoSQL, non-relational databases, you may be turning each of those into two to five different requests.
One of our operations engineers was getting introduced to the whole CAP theorem, and as this process went on he got the AP part, sort of, but not too much of the C bit.
We shortlisted suitable options, and Riak and Cassandra came out of that list. They're both Dynamo-based, they both roughly fit our data model, and looking into the available options they came closest to what we wanted. They both make expansion easy, both with vnodes; we deployed on Cassandra 1.2, and vnodes were a major reason for that.
We also wrote our application back end twice. We believe in testing our stack: we don't want to use some automated testing tool where you put numbers in and see how Cassandra performs; we want our complete stack built out, to see how that actually performs. So we built our stack to use Cassandra and Riak in the back end, and then we did load testing of each for approximately six to eight weeks as an evaluation, before actually doing proper, full load testing.
The reason we used two very different clusters was to identify bottlenecks. If we had used the same cluster and you see a bottleneck in I/O or a bottleneck in memory, there's very little you can do other than shipping hardware in or trying to upgrade your hardware, which is a pain. By doing this we could actually see: OK, CPU is fine here, but memory is not so good there; we could tweak it and get a rough idea of what we were looking at.
We also ran a soak test. From the very start we decided a soak test was one of the most important things we had to do. Typically our load tests run approximately three to four hours, maybe five or six, but because Cassandra changes a lot over time, we decided we needed at the very least three to four days, ideally two weeks. I think we ended up with five days because, as always, we ran out of time. These things gave us some interesting numbers.
Some of the numbers we load tested at: we loaded it with five million users, and at that it was about six thousand reads for a standard Cassandra node and about fifteen thousand writes (these are per-node figures), and at that it was about 2.5 millisecond read latency at the 98th percentile.
Tim: So, as we said, we load tested Riak and Cassandra, and the winner was... Riak. No, it wasn't Riak. We thought Riak would be a slam dunk because, well, it's Erlang-based and we know Erlang.
The tooling is actually excellent for Riak, and it performed very well. We did the Riak testing first, and so we thought, OK, this is looking pretty good. Oh yeah, and we had previously evaluated it as well: one of our engineers had gone and evaluated it for another product previously. However, Cassandra was actually the winner in the end, otherwise we wouldn't be here. (Is this just not working?)
So, the write performance seriously beat Riak; it was about four times better, for our workload that is. Cassandra also had a richer feature set, and it has compound keys: you've got the partition key and the clustering columns, and that enabled us to do a bit more. Riak is quite simple, really; it's mostly just key-value, or at least it was at that time.
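To make the compound-key point concrete, here is a rough sketch of a partition key plus clustering columns; this is our own illustrative table, not Demonware's schema:

    # Hypothetical example of a compound primary key: user_id is the partition
    # key (it decides which nodes own the data), while msg_time and msg_id are
    # clustering columns (they order rows on disk within the partition).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("game")

    session.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            user_id  bigint,
            msg_time timestamp,
            msg_id   uuid,
            body     text,
            PRIMARY KEY ((user_id), msg_time, msg_id)
        ) WITH CLUSTERING ORDER BY (msg_time DESC, msg_id DESC)
    """)

    # Because rows are clustered by time, "latest 20 messages for this user"
    # is a single-partition slice, something a plain key-value store cannot
    # express without extra work in the application.
    rows = session.execute(
        "SELECT msg_time, body FROM messages WHERE user_id = %s LIMIT 20", (42,))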
So from a developer perspective and a feature perspective, Cassandra was much more inviting at that time. The maturity of the code base was another factor.
It was a bit more mature, there was a bit more community around it, and it seemed like it had gone through the wars a little bit and was a bit better off for it. (And is this not working?)
(I might have to use yours in a second.) So we continued testing 24/7 until launch. Oh yeah,
the other thing I meant to mention there: generally speaking, if you mention Java to our engineers we run a mile, because we're not Java heads. But one thing about Cassandra is that it doesn't really reveal too much of the technology on which it's built, which is really nice, whereas Riak shows its guts a little bit more. Even though we are Erlang heads, it's always nice to have something that's a bit more rounded, a bit more,
you know, something you don't have to get too into if you don't want to. So we continued testing until launch. We have two offices doing this testing, one office in Vancouver and one office in Dublin, in Ireland, so that obviously helped; we were able to do very big soak tests. As Seán pointed out, we had also planned a MySQL/Cassandra hybrid for one of our services, just because it was one of our really, really, really critical services.
That's the progress store. We had kind of planned to, you know, write to both and then have a percentage of reads from one, and then gradually move over. It turned out it was overly complex and just had too much overhead, operationally and developmentally, so we dropped it and just went with Cassandra. So, a bit about schema.
The progress store was a perfect fit. You always know the key, because it's user-based; it's just a user's progress. It's mostly writes: you just read when you log in, and then you just update and save as you go along. So it's perfect, really, really simple; the schema just wrote itself. Sorry.
Then we have presence, which is a bit more relational. It was kind of a little bit tricky, because there were various ways to index into the data and we had to figure out the best schema there. It's a high-throughput service, so we had a lot of tombstones, and as Seán was mentioning, the peaks and troughs of, you know, daily logins and logouts actually helped us iron out a couple of issues with that.
So we had an issue. We had to use TTLs because, obviously, it's high throughput, and if someone logs off, or doesn't log off, or disconnects uncleanly, you need to just kill the data eventually. We had an issue where we had two keyspaces, or sorry, two tables, with the same TTL, but one was being written more often than the other, and so what happened was that eventually one just died, or just went away, and so we had all these errors where one thing was indexing into the other. We learned a lesson from it, but that was just an interesting issue we encountered there.
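The mechanism involved looks roughly like this; a generic sketch of TTL-based expiry, not their actual tables:

    # Hypothetical sketch of TTL-based expiry for presence-style data. Every
    # heartbeat rewrites the row and refreshes its TTL; if the client
    # disconnects uncleanly and the writes stop, the row simply expires
    # (leaving tombstones behind until compaction clears them).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("game")
    session.execute("""
        CREATE TABLE IF NOT EXISTS presence (
            user_id bigint PRIMARY KEY,
            status  text
        )
    """)

    def heartbeat(user_id, status="online"):
        # If two tables carry related data but one is written far less often,
        # the quieter table's rows expire first: the mismatch described above.
        session.execute(
            "INSERT INTO presence (user_id, status) VALUES (%s, %s) USING TTL 300",
            (user_id, status))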
And then for the messaging service: it was time-series data, so well suited; it's literally just 'you've got some messages, read the messages'. But again we had an issue with tombstones, where our Nagios check actually caused it.
Interestingly enough, the Nagios check was using the same partition key every time for its check, and it built up a load of tombstones in one partition, and gradually the performance of the cluster went down, until it eventually hit a point we couldn't sustain. So we were like, what is going on? We figured it out: it was all the tombstones in the one partition, and we had actually hit a Cassandra bug, which had actually been fixed already, but we had not deployed the fix. So yeah.
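The usual way out of that kind of monitoring-induced tombstone buildup is simply to stop hammering a single partition. A hedged sketch of the idea, our own illustration rather than Demonware's actual check:

    # Hypothetical health check: write, read and delete a canary row, but use
    # a fresh partition key on every run so the resulting tombstones are
    # spread across many partitions instead of piling up in one.
    import uuid
    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("game")
    session.execute("""
        CREATE TABLE IF NOT EXISTS healthcheck (
            check_id uuid PRIMARY KEY,
            ts       timestamp
        )
    """)

    def run_check():
        check_id = uuid.uuid4()  # new partition every time
        session.execute(
            "INSERT INTO healthcheck (check_id, ts) VALUES (%s, %s)",
            (check_id, datetime.now(timezone.utc)))
        row = session.execute(
            "SELECT ts FROM healthcheck WHERE check_id = %s", (check_id,)).one()
        session.execute(
            "DELETE FROM healthcheck WHERE check_id = %s", (check_id,))
        return row is not None  # maps to OK / CRITICAL in the Nagios wrapper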
That was an interesting issue we encountered. So, lessons we learned from schema: keep it simple. There's a shift in thinking, coming from a MySQL background or an SQL background: it's not a relational DB, and you need to rethink the way you model your data. Listen to Patrick, he knows what he's talking about; we worked with him on it and figured it out, and it was really, really good. So keep it simple.
Some issues are not evident in unit tests, and they will not show their faces until you do things at scale, or, as we said, with peaks and troughs and whatever else. That's it from me; over to Seán.
Seán: So, config. The default settings when you install Cassandra, at least in the packages for DSE 3.1.4 and 3.1.6, are probably not what you want: they're somewhere in between a VM and a high-performance server, and they didn't really work very well for either, for us at least. We took a quick run through the config, changed a ton of settings based on common sense, and then reverted a few of them.
After that quick run through the config with the common-sense settings, we then did a cycle of: change one setting, load test for three to four hours, check the results, compare to the previous load test, and then load test again. We never touched the heap. If we'd had more time, we possibly would have looked at it, especially after seeing some of the talks here and other YouTube videos we looked at.
It does look like possibly a worthwhile thing to do, but again it takes a fair bit of time, I would imagine. There is an appendix at the end of the talk where we list all the config changes we made, so when you get the slides you can see them.
So, the hardware we ended up going with. Some background: in Demonware we typically have two specs of hardware each year. We have an app spec, which is high CPU, high memory and pretty crappy disks, and a DB spec, which is low CPU, low memory and very good disks. For Cassandra we ended up merging the two, pretty much. We went with two Intel CPUs at two gigahertz.
We went with RAID 1 with two 40-gig SSDs, 32 gigs of memory and a one-gigabit network.
Some of the explanation for this: talking to Patrick and DataStax, they're really pushing the RAID 0 thing, and it got us thinking about the CAP theorem a bit more, and about traditional relational databases versus something like Cassandra with high availability, where typically you don't care if nodes die. We're not really used to that. So there was some internal discussion, as well as discussion with DataStax, and it ended up being a case of: we trust our hardware.
Our hardware vendor is pretty good and very stable; we get very low failure rates in hardware, and disk failure rates are probably the highest we get, and even they're pretty low. So we decided RAID 1 would suit us better, because we'd actually have to replace hardware less often, which is less effort and better for the game.
Disk capacity was our bottleneck, so we ended up pretty much having to scale the cluster based on disk size. That's kind of why you've got high CPU there: the plan is we can increase our disk size when required, going up to one terabyte, or maybe two terabytes when they come out, and when we do that we don't touch CPU or memory or anything else; they'll be fine.
So, monitoring. When we first started doing our load tests, we ended up load testing not only Cassandra but Graphite as well. One of the settings for the Graphite reporter is how often to send metrics. We didn't really notice or see this too quickly, and it was sending metrics once a second. We had a 60-node cluster for load testing, and we were sending from each node 170 metrics every second, which brought down Graphite, so we quickly changed that to our normal interval, which is one-minute metrics.
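The arithmetic is straightforward, using the node and metric counts quoted above:

    # Back-of-the-envelope figures for the Graphite load described above.
    nodes = 60
    metrics_per_node = 170

    per_second_reporting = nodes * metrics_per_node         # 10,200 datapoints/s
    per_minute_reporting = nodes * metrics_per_node / 60.0  # ~170 datapoints/s
    print(per_second_reporting, per_minute_reporting)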
The key cache hit rate is an interesting metric. We used it a lot and thought we were doing very well, until we understood it better. On the key cache hit rate we were typically seeing about 90 percent, low 90s, under load, and we thought, this is great, we're doing well. Then one of the other operations engineers, after launch, looked into it a bit and noticed we were actually only at about 60 or 70 percent.
What's happening is that when compaction runs, compaction reads all the SSTables, and that would bump our key cache hit rate to 100 percent. So the average, which is the number you generally see, shows close to a hundred percent. To get the actual key cache hit rate, you have to take the column family key cache hits versus the actual queries done, and from that you can get the accurate figure. That didn't bite us, but it could have.
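Put another way, the honest figure is the per-column-family hit counter divided by the request counter over the same window, not the averaged hit-rate gauge. A minimal sketch of that calculation, with made-up counter values:

    # Hypothetical calculation of the "real" key cache hit rate for a column
    # family from raw hit/request counters sampled at two points in time,
    # rather than the averaged gauge that compaction pushes towards 100%.
    def key_cache_hit_rate(hits_before, requests_before, hits_after, requests_after):
        hits = hits_after - hits_before
        requests = requests_after - requests_before
        return hits / requests if requests else float("nan")

    # Counters sampled a minute apart: roughly 64%, not the ~100% the gauge showed.
    print(key_cache_hit_rate(1_000_000, 1_500_000, 1_090_000, 1_640_000))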
It could have been bad if we had been relying on it heavily. Next thing: we generally use Nagios for our monitoring, and typically when a new service comes along you Google 'Nagios' plus the service name and you get a lot of plugins. That didn't work so well for Cassandra: there were very few plugins available, and we ended up writing a lot of them ourselves. Jolokia helped a lot with this. Jolokia exposes JMX metrics, and JMX in general, over HTTP, and then you can quite easily write your own checks, or your own tools to actually change JMX settings, all over the HTTP interface. Again, we're a Python and Erlang shop, so getting into Java, Java tooling and JMX wasn't something anyone really wanted to do, so the fact that we could easily do it over HTTP was really good.
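As a sketch of what such a check can look like, here is a small example that reads a Cassandra metric through Jolokia's HTTP read endpoint; the MBean name and threshold are illustrative assumptions and vary between Cassandra versions:

    # Hypothetical Nagios-style check that reads a Cassandra JMX metric via
    # Jolokia's HTTP /read endpoint instead of speaking JMX from Python.
    import sys
    import requests

    JOLOKIA = "http://cassandra-node:8778/jolokia"  # default Jolokia agent port
    # Illustrative MBean; check what your Cassandra version actually exposes.
    MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"

    resp = requests.get(f"{JOLOKIA}/read/{MBEAN}", timeout=5).json()
    p99 = resp["value"]["99thPercentile"]

    # Classic Nagios exit codes: 0 = OK, 2 = CRITICAL.
    if p99 > 50_000:
        print(f"CRITICAL - read p99 {p99}")
        sys.exit(2)
    print(f"OK - read p99 {p99}")
    sys.exit(0)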
We're still changing, excuse me, we're still changing what we actually monitor and alert on. We've gone through several iterations of long-term and short-term read-latency and write-latency checks, and the problem is that once you deploy, the cluster changes over time. So if you look at a metric during load testing, for instance, one metric may never have gone over half a millisecond, so we set the threshold based on that.
We had problems with rack awareness, where enabling rack awareness when using vnodes led to a thirty percent data-distribution difference on some nodes. As I mentioned, our main capacity issue was actually disk capacity, so when we started seeing some of our nodes with 30 percent more data than other nodes, that was pretty bad for us, so we decided to turn rack awareness off. We weren't really using rack awareness anyway.
We actually have a blade system, so what we were trying to do was use a kind of chassis awareness based on rack awareness, but we eventually didn't end up going with it. Load balancers weren't really a gotcha, but because we're using the old Python driver, doing CQL over Thrift, we have no node awareness or token awareness, so we put load balancers in front of it all. About halfway through our load testing, we realised
that we must get the clients to reconnect every 10 minutes, because otherwise, in a production situation, if you need to expand your cluster and the clients aren't actually reconnecting but are keeping persistent connections open, then you're not going to use any new nodes you add to your cluster. So you have to get clients to reconnect, for every application server spec, for every type of VM.
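A rough sketch of that workaround, our own illustration; it is shown with the modern Python driver for brevity, whereas the setup described here used the old Thrift-based driver behind a TCP load balancer:

    # Hypothetical periodic-reconnect wrapper for a client with no node or
    # token awareness sitting behind a load balancer: tear the session down
    # every 10 minutes so new connections can land on newly added nodes.
    import time
    from cassandra.cluster import Cluster

    RECONNECT_AFTER = 600  # seconds

    class ReconnectingSession:
        def __init__(self, contact_points):
            self._contact_points = contact_points  # e.g. the load balancer VIP
            self._connect()

        def _connect(self):
            self._cluster = Cluster(self._contact_points)
            self._session = self._cluster.connect("game")
            self._connected_at = time.monotonic()

        def execute(self, query, params=None):
            if time.monotonic() - self._connected_at > RECONNECT_AFTER:
                self._cluster.shutdown()  # drop the old persistent connections
                self._connect()
            return self._session.execute(query, params)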
And then, our dev cluster differs from production, or it did. Dev, for us, is production as well, because game studios actually develop against dev. So dev is prod, cert, the certification environment, is also prod, and prod is also prod; we have very few real testing environments. But in our development environment we typically use VMs, so we built CentOS VMs, and that was pretty much it after load testing.
Now, on those VMs compaction throttling was turned off, the heap memory size and a lot of those sorts of settings were modified, and it hit us pretty hard. Eventually we actually did push for getting dev to be the same as production, at least for Cassandra, because I think we ended up wasting about a week or two debugging issues which weren't, and wouldn't be, affecting the game, but were affecting the development environment, which is kind of important.
We also ran into a couple of different issues initially, before launch, around the network. During the evaluation period, where we had tested across two different DCs, we had a problem with network capacity. Not network capacity as such, but each individual connection was only pushing up to about one or two megabytes, while the overall link was two gigs. I think we spent about a month debugging that, and eventually we figured out the cause.
We ended up doing much the same thing after another month of testing; we weren't doing exactly the same thing again, this time it was the GRE tunnel rather than turning off IPsec. We test each individual component of the network, and each individual area, and it all works fine, except when you do the full link, and then they all fail horribly.
So launch was boring, thankfully. As Patrick mentioned in his talk: in the first week the dev manager came over to my desk and asked us to simulate a node failure, and in the second week he came over to my desk again to simulate a node failure. Because, you know, we'd put the practice in, we've got documentation, we did runbooks, we'd gone through it all, but you never really know what it's going to be like until you're actually doing it in prod. So, you know, take a node down; I was quite pleased to do it.
We did lose two of the nodes over Christmas, and that didn't go as well. The nodes dying themselves were fine; that didn't cause any problems for the cluster. What did cause the problem was replacing the nodes: replacing the nodes and running repairs, and that caused a lot of compaction.
Then the repair would actually stall in some cases; you'd have to kill the new node, put the node in again, start the repair again, and hope that the repair worked this time. It didn't affect users too badly. I think latency went up about 10 or 20 times, but we had a pretty good buffer there, so that wasn't a problem for us. But it did cause Nagios to spam alerts and the NOC to start ringing us every 10 minutes or so, and we had to explain it to other titles.
So this talk was again about launching Call of Duty with Cassandra, and now that it has worked well, and we've actually seen that it works well and reduces operations time, we're actually planning to put other titles onto it. We've put a certain aspect of Diablo onto it, the newest Diablo, Reaper of Souls, and we're hoping next year to put more titles onto it.