Description
Speakers: Seán O Sullivan, Service Reliability Engineer, & Tim Czerniak, Software Engineer, at Demonware
This presentation covers the eight-month evaluation process we underwent to migrate some of Call of Duty’s core services from MySQL to Cassandra. We will outline our requirements, the process we followed for the evaluation, decisions we made around our schema, configuration and hardware, and some issues we encountered.
A: So, what have we been working on this year? We did a bunch of stuff for Call of Duty: Advanced Warfare, which released not very long ago and is doing pretty well. Diablo III: Reaper of Souls on the consoles — we didn't work on the PC version, only on the console versions — we did a lot of stuff for that earlier this year. You've got Skylanders Trap Team, which came out not too long ago; we do a lot of stuff for the little figures that go on the portals and appear in the game, and all the kids love it. And then there's also a title in China based on Call of Duty, which is releasing just for the Chinese market early next year, that we've been working on as well. And there's also Destiny, which isn't up there, but Activision works with Bungie, who originally did Halo, and they're working on Destiny, which is like a big MMO. We also helped out with that.
A: So, some of the stuff we've done in the past: every single Call of Duty since Call of Duty 3 — there's one every year and we've had a big hand in all of them — a bunch of Guitar Hero games, DJ Hero, Band Hero, all the Heroes, GoldenEye and all the 007 games, Skylanders, and a bunch of other stuff.
A: I think we're at around our hundred-and-second title or something that we've been involved with at this point — lots of stuff. So we provide services, things like matchmaking. You want to play a game: when I shoot someone in the face online, I say "I want to shoot someone in the face", and someone else goes "hey, I want to shoot someone in the face", and they both talk to the matchmaking service, and then we match them up and they go shoot each other in the face.
A: We also do things like leaderboards, file storage, progress storage — which we'll hear about a little bit — leagues, social network integration, etc. We have about a hundred services or more; I can't even remember at this stage. Yeah, lots of stuff.

So, some of the technologies we use: we use C++ for the client, because of course games are all written in C++. We give the studio that's making the game a library written in C++, they integrate it into their game, and hey presto.
A: We also use a lot of HTTP-based stuff: we integrate with websites, and we have lots of RESTful web services internally, that kind of thing. On the server side we use a lot of Python — mostly Python, Erlang as well — MySQL mostly for databases, and we run everything on CentOS and use Puppet for automation. Those are the main players; we have a lot of other technologies in our stack. So we have a slightly unusual use case.
A: Most services — you know, you start a company, you start small, you get a few users, you might gain a bit of popularity, you gradually grow and gradually get more and more users, more and more people online. You might get a couple of spikes if someone writes about you on some famous website or something, but generally speaking it's a slow, long, uphill scaling-up. For us it's the exact opposite: on day one, a game launches.
A: Everybody goes online, and then gradually, over a long period of time — maybe a few years — it tails off. So you can see here, this is an online-users graph for one of the Call of Duty games. There's the release day; that first weekend is the peak number of online users the game will ever have, and that's Christmas, so you can actually see the weekends there as well, and the run-up to Christmas. So, in the words of Benjamin Franklin: by failing to prepare, you are preparing to fail.
B: So our predicament is that we typically have one title per DC — or, I should say, titles per DC — in separate clusters, so we don't generally share data between titles. That changed for Ghosts, where you had new platforms: PS3 and PS4, Xbox 360 and Xbox One. People would often want to play on the 360, then later buy the new copy on Xbox One, migrate, and keep on playing from where they left off. So we needed to share data across DCs, and MySQL is not great at doing that.
B: So we looked at using Cassandra — at using a non-relational database for some of our services — and we targeted it at services which were mostly non-relational, key-value-type services. The first service was the progress store. This is a kind of blob of data — we don't really know what's in it — but it's typically used for what level you are, some of your stats, and your loadout, so what guns you've equipped, that kind of stuff.
B: The read/write ratio for that is about 1:24, the value size is pretty small — around four kilobytes — and it's persistent data: it never goes away. You always want the loadout to be there, the level to be there, all the information about the player to be there.
B: The next one is presence. The presence service is: when you go online, you tell the service "I am online", and you write to that service a few times a minute; when people check which of their friends are online, they check the presence service. So it's essentially: is this guy online, which of my friends are online? Again, the read/write ratio for that is about 1:10 — you're reading quite frequently, but you're writing all the time to make sure the information is correct.
B: The data size is minimal — it's pretty much just the user's online status and some very small metadata — and it's transient data as well, of course. The messaging service is quite generic: it can be used for messaging, for game invites, for mails, for any of that kind of messaging-type stuff. Again, the read/write ratio is about 50:1, so you're constantly checking: do I have emails, do I have invites, has something changed, has someone sent me a message?
B: The next part is consolidation and expansion. As Tim mentioned, we typically have our biggest growth in the first few weeks and then it slowly eases off. That means we spend a large portion of our time consolidating hardware and making clusters smaller, because once Christmas is gone, for certain titles the biggest boom is over, so we need to reclaim hardware from them. And then expanding again: sometimes we'll add titles to current clusters, so we need to make the cluster larger.
B: So we needed to make sure that whatever we looked at, we could easily consolidate and expand the clusters without much effort. We've been using MySQL for approximately 10 or 11 years at this stage, so we're pretty confident we can do that well: we've got a lot of tooling around it and we've got a sharded setup, which means we can just keep on adding shards, so write throughput isn't an issue. We needed to make sure that whatever system we chose, we could manage it just as easily.
B: We need to be able to automate to the point where you can click a button and it will deploy a new server, add it to the cluster, and the cluster will expand — that's where it needs to be. And also, for the operations teams, we needed the case where, if a server died, they could easily replace it without much effort. And then of course the last requirement was throughput.
B: So we looked at the previous titles we had and estimated what the new one would require. Our performance requirements were: for the progress store, a million and a half requests a minute; for presence, a quarter of a million requests a minute; and for messaging, 850,000 a minute. Now, those are requests at the application layer, so each one could actually turn into two, three, four or more requests at the Cassandra level — the database level.
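As a rough illustration of that arithmetic, here is a tiny back-of-the-envelope sketch; the two-to-four fan-out comes from the talk, while the node count used for the per-node figure is purely hypothetical:

```python
# Rough capacity arithmetic for the figures quoted above.
APP_REQUESTS_PER_MINUTE = {
    "progress": 1_500_000,
    "presence": 250_000,
    "messaging": 850_000,
}
DB_OPS_PER_APP_REQUEST = 3   # midpoint of the 2-4 fan-out mentioned in the talk
NODES = 30                   # hypothetical cluster size, for illustration only

total_app_rps = sum(APP_REQUESTS_PER_MINUTE.values()) / 60
total_db_ops = total_app_rps * DB_OPS_PER_APP_REQUEST
print(f"~{total_app_rps:,.0f} app requests/s -> ~{total_db_ops:,.0f} DB ops/s "
      f"(~{total_db_ops / NODES:,.0f} ops/s per node across {NODES} nodes)")
```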
B: During the evaluation, one of the operations engineers came into the office with mismatched footwear, so the joke went around that he was highly available but not consistent. And this did require a shift in thinking — going from a relational database to a non-relational one, and the whole CAP theorem. We're used to MySQL; we're used to large, heavy, beefy boxes, and we know how to configure it, we know how to tune MySQL.
B: When you start doing NoSQL — Cassandra, or Riak, or DynamoDB-type things — you have to really think differently and realise this isn't just MySQL with a different API; you have to understand how the architecture actually works. So we shortlisted two of the available options: Riak and Cassandra. The main reason we shortlisted those was that they both — Riak with its ring and LevelDB backend, Cassandra with vnodes — make consolidation and expansion easy, and the tooling was pretty good.
B: We rewrote our application backend — it already supported MySQL, so we added Riak and Cassandra support. That made load testing much easier. For load testing we built two clusters on pretty production-like hardware. The first cluster was single-CPU, with SSDs — fast disks — and average memory, I think about 32 gigs. The second cluster, in a different DC, was dual-CPU, with spindles — so larger but slower disks — and high memory, around 96 gigs I think it was.
B: This was a deliberate decision we took, because we wanted to see where the bottlenecks were. If we just had one cluster and made a tweak or a change, we wouldn't know exactly whether it had fixed one bottleneck, and we wouldn't know where the next bottleneck was — whether it would be IO or CPU. Having two clusters meant we could actually see where certain bottlenecks were, make a configuration change, and see how that affected both types of hardware.
B: It also meant, since we hadn't yet bought the hardware for this project, that by comparing the two we could see what hardware we actually needed. We used our own software stack to load test — we didn't have special test software driving the load; we used our own, the normal software we run in production, modified to use Cassandra or Riak, so our load tests would be accurate. I think the most important thing we learned from load testing is that you really have to use realistic data and realistic user profiles.
B: You can check your development environment to see what calls are made, check your previous production environments to get a rough idea of the quantity of calls, and then actually emulate that. We have our own load-test clients which act as users and do the whole login/logout process. One issue: we didn't include peaks and troughs — users logging in, users logging out — and that bit us pretty hard, as Tim will go into later.
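A minimal sketch of the kind of load-test client being described: simulated users log in, issue reads and writes in roughly the ratios quoted earlier, and log out. The function names and timings are illustrative assumptions, not Demonware's actual harness, and — as the speakers note they should have done — a realistic version would also model peaks and troughs:

```python
import random
import threading
import time

# Illustrative read probabilities per service, loosely based on the ratios
# quoted earlier (progress ~1:24, presence ~1:10, messaging ~50:1).
SERVICE_READ_PROBABILITY = {"progress": 1 / 25, "presence": 1 / 11, "messaging": 50 / 51}

def fake_call(service: str, op: str) -> None:
    """Stand-in for a real service call; replace with actual client code."""
    time.sleep(0.001)

def simulated_user(user_id: int, service: str, session_seconds: int = 60) -> None:
    """One fake user: log in, issue a request mix, log out."""
    fake_call(service, "login")
    deadline = time.time() + session_seconds
    while time.time() < deadline:
        op = "read" if random.random() < SERVICE_READ_PROBABILITY[service] else "write"
        fake_call(service, op)
        time.sleep(random.uniform(0.1, 1.0))  # think time between calls
    fake_call(service, "logout")

if __name__ == "__main__":
    users = [threading.Thread(target=simulated_user, args=(i, "progress"))
             for i in range(100)]
    for t in users:
        t.start()
    for t in users:
        t.join()
```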
A: The tooling for Riak is actually excellent. It was the first one we did the load tests on, so we thought, okay, this is looking pretty good — everything is performing well — and we had also previously evaluated it for something else, so we thought, okay, this will be good, this looks pretty decent. But then we did Cassandra and it was way better, so it won in the end: the write performance was about four times better than Riak for our particular use case.
A: It also had more features: it had compound primary keys, it had partition and clustering keys, and that added to the available query set you could use. It had CQL, which is a bit more friendly — it was better for our developers — and it was a bit more mature. Riak was fairly early in its cycle at the time; this is maybe a year and a half ago.
A: So we were a little bit happier with Cassandra being a bit more mature. The only thing, I guess, that we were wary of with Cassandra was Java — a lot of our developers, and engineers in general, would run a mile if you mentioned Java. But the nice thing about Cassandra is that it really doesn't reveal its innards at all; you don't really have to deal with Java, you don't need to know that it's Java, which is great — it's the way it should be.
A: So that was good, and that was it — we just continued testing on the Cassandra cluster, twenty-four/seven. We have three offices — one in Dublin, one in Vancouver and one in Shanghai — so we were able to use that round-the-clock coverage to monitor the thing and do loads of soak testing, load testing, tweaking, etc.
A: You know, new technologies make you a little bit paranoid, so we planned: okay, we'll have a MySQL cluster but also a Cassandra cluster; we'll write to both, read from MySQL, and then maybe dial up the Cassandra reads; and we had this fancy disaster recovery plan where, if something terrible happened to the Cassandra cluster, we could eventually migrate the MySQL data back over into Cassandra. And it ended up being way too much — way too much overhead.
A: It was overly complex — too much overhead for our development and ops teams — and in the end we had load-tested so well that we trusted Cassandra, so we dropped that idea and just went straight to Cassandra.
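For illustration, the dual-write plan that was eventually dropped could be sketched roughly like this — a hypothetical wrapper, not code from the talk: writes go to both stores, reads come from MySQL, and a ratio lets you dial reads over to Cassandra.

```python
import random

class DualStore:
    """Hypothetical dual-write wrapper: writes go to both backends, reads are
    dialled over from MySQL to Cassandra via a ratio (0.0 = all MySQL,
    1.0 = all Cassandra)."""

    def __init__(self, mysql_store, cassandra_store, cassandra_read_ratio=0.0):
        self.mysql = mysql_store
        self.cassandra = cassandra_store
        self.cassandra_read_ratio = cassandra_read_ratio

    def put(self, key, value):
        # Dual write: keeping both stores in sync is exactly the operational
        # overhead the speakers decided wasn't worth it.
        self.mysql.put(key, value)
        self.cassandra.put(key, value)

    def get(self, key):
        if random.random() < self.cassandra_read_ratio:
            return self.cassandra.get(key)
        return self.mysql.get(key)
```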
So, the schema. For the progress store it was pretty easy — a perfect fit: it's key-value, you always know the key, it's mostly writes, and Cassandra loves that. Presence was a bit more relational — we had two tables — and we had an issue with TTLs.
A: You're writing that data, deleting it again a minute later, doing the exact same thing over and over and over — keep writing and deleting — so it was building up a load of tombstones in one of the partitions. And we actually hit a Cassandra bug there, which bit us pretty hard, but it turned out it had already been fixed — we just didn't have it on our cluster yet — so we duly updated our cluster.
A: That was another interesting thing we came across. So, the lessons we learned from the schema: keep it simple — it's not relational, you need to relearn things a little bit. Get your partition keys and your clustering keys right, because if you do, Cassandra will work very, very well and you won't have to worry about it. You need to understand how the data will be stored and how it will be accessed before you design your schema. Listen to Patrick McFadin.
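To make the partition-key and clustering-key point concrete, here is a hypothetical sketch of the two table shapes discussed — a key-value progress store and a presence table whose rows expire via TTL — issued through the current DataStax Python driver (the talk mentions they were still on the older driver, and all keyspace, table and column names here are made up; the real schema isn't shown):

```python
import uuid
from cassandra.cluster import Cluster  # DataStax Python driver

# Assumes a keyspace named "game" already exists.
session = Cluster(["cassandra-host"]).connect("game")

# Key-value progress store: user + platform form the partition key and the
# payload is an opaque blob -- a natural fit for Cassandra.
session.execute("""
    CREATE TABLE progress (
        user_id   bigint,
        platform  text,
        payload   blob,
        PRIMARY KEY ((user_id, platform))
    )
""")

# Presence: partition per user, one clustering row per session.
session.execute("""
    CREATE TABLE presence (
        user_id   bigint,
        session   uuid,
        status    text,
        PRIMARY KEY (user_id, session)
    )
""")

# Presence rows are refreshed a few times a minute with a short TTL, so they
# simply expire if the client stops writing; this constant write-then-expire
# cycle is the pattern that produced the tombstone build-up mentioned above.
session.execute(
    "INSERT INTO presence (user_id, session, status) VALUES (%s, %s, %s) USING TTL 120",
    (42, uuid.uuid4(), "online"),
)
```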
B: So, again, we started on 1.2 — DSC 3.0.4, I think it was — and at that stage at least, a lot of the settings that were there by default aren't really right for anyone. The hardware we were using was relatively high-spec and the defaults weren't right for that; and from the VM point of view — we also ran a dev environment on VMs — the default settings are wrong for that too. So don't just trust the defaults; definitely go and look at all the settings.
B: We took one pass through the config and changed the settings that seemed reasonable to us to change — one example being multi-threaded versus single-threaded compaction — and it mostly went fine, except for a fair few changes that didn't go so well, such as that compaction setting, where multi-threaded compaction was actually slower than single-threaded in our testing. It's a case of making one change at a time, doing a full load test, and comparing the results and graphs to your next test.
B: One other example — one of the early changes — was the SSTable size, where I think the default was 12 megs or something. We load-tested everything from 12 megs up to 512 megs and stuck with 192 megs as the best for us. I think the Cassandra guys have since changed their default to 160, so it's nice to see it ended up close anyway.
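For reference, the SSTable size being tuned here is presumably the sstable_size_in_mb option of leveled compaction (an assumption — the talk doesn't name the exact setting). Applying the 192 MB value they settled on might look like this, again on a hypothetical table via the Python driver:

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("game")

# Assumes leveled compaction; 192 MB is the value the speakers say tested
# best for them (they tried everything from 12 MB up to 512 MB).
session.execute("""
    ALTER TABLE progress
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 192
    }
""")
```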
B: For the actual hardware, we chose a kind of mixture of the two standard specs of hardware we run: two CPUs, two SSDs in RAID 1, 32 gigs of RAM and one-gig networking, and that's mostly due to the infrastructure we have in the data centre. We typically have an app-spec host and a DB-spec host. The app spec is typically high CPU, high memory and low disk — spindles or smaller disks — whereas the DB spec has large disks, lower memory (oddly) and also lower CPU.
B: So we merged those two specs to create this spec, and with this spec, for us, disk capacity is the factor we have to scale by — we already scale by disk capacity — but it does mean that in the future we can easily swap the disks, add in larger disks, and increase the capacity of the cluster.
B: There was a question earlier about RAID 1 versus RAID 0. The Cassandra guys, the DataStax guys and many people advise RAID 0 — you know, your nodes are expendable, don't rely on any individual node surviving. We decided that we don't actually lose nodes that often, and disks are probably the component that fails most frequently, so we went with RAID 1: should a disk go down, we get reduced performance on one node for a while, but we can survive it.
B: For monitoring, there's a check on GitHub — I think it's a Cassandra check which uses Jolokia — and that initial check is okay, but by deploying Jolokia, which exposes JMX metrics over HTTP, we can easily write our own Nagios checks in Python, and then quite easily do things like long-term and short-term checks. We actually look at Graphite data for short-term and long-term usage, averaged out over the day, and then do Nagios checks on top of that — has our read latency increased or decreased, that kind of thing.
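A minimal Nagios-style check in the spirit of what's described: poll Jolokia's HTTP read endpoint for a Cassandra JMX metric and exit with the usual 0/1/2/3 status codes. The port, MBean, attribute and thresholds below are assumptions for illustration, not values given in the talk:

```python
#!/usr/bin/env python
"""Nagios-style check: read a Cassandra metric over Jolokia's HTTP API."""
import json
import sys
import urllib.request

# Assumed values -- adjust for your environment.
JOLOKIA = "http://localhost:8778/jolokia"
MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"
ATTRIBUTE = "OneMinuteRate"          # reads/s over the last minute (assumed attribute)
WARN, CRIT = 5000.0, 10000.0         # made-up thresholds

url = "%s/read/%s/%s" % (JOLOKIA, MBEAN, ATTRIBUTE)
try:
    payload = json.load(urllib.request.urlopen(url, timeout=5))
    value = float(payload["value"])
except Exception as exc:             # any failure -> Nagios UNKNOWN
    print("UNKNOWN: %s" % exc)
    sys.exit(3)

if value >= CRIT:
    print("CRITICAL: %s=%s" % (ATTRIBUTE, value)); sys.exit(2)
if value >= WARN:
    print("WARNING: %s=%s" % (ATTRIBUTE, value)); sys.exit(1)
print("OK: %s=%s" % (ATTRIBUTE, value)); sys.exit(0)
```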
B: Jolokia is a fantastic way — for non-Java houses especially — a really great way to get at metrics and monitoring for Cassandra, because you can get everything over HTTP. You can also make changes: you can do JMX MBean writes, config or setting changes — you can tell it to flush something, all over the Jolokia endpoint. One of the interesting graphs we hit, which we found pretty weird, was the key cache hit rate; we monitored this for a fair while.
B: We knew it to hover around 90-plus percent and we were very happy with that, so we put in monitoring checks to alert if it dropped below 90. It wasn't until a few months after launch that we realised this metric is actually wrong — or at least it doesn't show you what you think it's showing: when you're doing compaction, every single read during the compaction hits the key cache.
B: So for periods of time, while compaction is running, you get a hundred percent key cache hit rate, which then skews your actual percentage — maybe you're really at seventy percent. The only way to get the real figure is by looking at the metrics per keyspace: looking at your reads per keyspace and your cache hits per keyspace, you can compute the actual rate. The single key-cache-hit-rate metric doesn't really work.
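A hedged sketch of that idea: sample the read count and the key-cache hit count twice and compute the rate from the deltas, rather than trusting the single gauge. The MBean names below are assumptions for illustration and will vary by Cassandra version:

```python
import json
import time
import urllib.request

JOLOKIA = "http://localhost:8778/jolokia"
# Assumed MBean/attribute paths -- check the names for your Cassandra version.
READS = "org.apache.cassandra.metrics:type=Keyspace,keyspace=game,name=ReadLatency/Count"
HITS = "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Hits/Count"

def read_counter(mbean_and_attr):
    url = "%s/read/%s" % (JOLOKIA, mbean_and_attr)
    return float(json.load(urllib.request.urlopen(url, timeout=5))["value"])

def hit_rate(interval=60):
    """Key-cache hit rate over an interval, computed from counter deltas."""
    r0, h0 = read_counter(READS), read_counter(HITS)
    time.sleep(interval)
    r1, h1 = read_counter(READS), read_counter(HITS)
    reads, hits = r1 - r0, h1 - h0
    return hits / reads if reads else float("nan")

if __name__ == "__main__":
    print("key cache hit rate over the last minute: %.1f%%" % (100 * hit_rate()))
```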
Alright — some of the gotchas we ran into. The first one was vnodes and rack awareness.
B: I don't know if this is still the case now — we're still on 3.0.6 at the moment — but vnodes and rack awareness don't, as far as we know, work very well together. We saw up to a thirty percent difference in data distribution when we had the two enabled, and again, as I mentioned, disk capacity was and remains our scaling factor, the limiting factor. A thirty percent disk-usage difference across nodes was a no for us. We were using blades.
B: Because of that, we had hoped to treat each chassis as a kind of rack and get data moved around that way, carefully, but we ended up having to abandon that and just go with vnodes without rack awareness in the end. Load balancers: we don't use the new Python driver yet — we're currently using the old Python driver from GitHub — and because of that we have a load balancer in front of Cassandra; all reads and writes go through the load balancer.
B: Because of that, we needed to make sure each client would disconnect after approximately 10 minutes and then reconnect, because otherwise, if we expanded the cluster, we'd start seeing problems: without the clients reconnecting, a client just keeps its connection open and never uses the new nodes.
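The periodic reconnect can be sketched as a small wrapper around whatever connection factory the client uses; nothing here is the actual Demonware code, and make_connection is a placeholder for the real (old-driver) connect call:

```python
import time

MAX_CONNECTION_AGE = 600  # seconds; the ~10 minutes mentioned above

class ReconnectingClient:
    """Recycles its connection periodically so that, after the cluster grows,
    traffic ends up spread across the new nodes behind the load balancer."""

    def __init__(self, make_connection):
        self._make_connection = make_connection  # placeholder factory
        self._conn = make_connection()
        self._connected_at = time.time()

    def _connection(self):
        if time.time() - self._connected_at > MAX_CONNECTION_AGE:
            try:
                self._conn.close()
            finally:
                self._conn = self._make_connection()
                self._connected_at = time.time()
        return self._conn

    def execute(self, query, params=None):
        return self._connection().execute(query, params)
```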
And of course — I'm sure a lot of people here have this — dev differs from production. Our development environment was VMs; our production environment was real machines, because of the cost of hardware.
B: So, close to launch, when we were pretty busy with things, all of a sudden we started seeing problems in our dev environment — a few nodes were crashing — and it ended up being that we were using the same settings in dev and prod. We needed to modify the settings pretty heavily in dev: reduce memory, put in compaction thresholds — because we had no thresholds on compaction — so we made a few changes there.
B: Eventually we moved our development environment onto production-class machines, so at least it would more closely match the production environment, and into a new data centre as well. We hadn't had a chance to load test the new data centre — it was still being built at the time — so we ran into a few issues there, where our Linux builds were running CPU frequency scaling by default. We were wondering why one was slower than the other — the newer, faster machines were actually slower than the older machines — and it was because the CPUs were being throttled down to one gigahertz or so.
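A small sketch of the sanity check that would have caught this: read the scaling governor and current frequency from the standard Linux cpufreq sysfs paths on each box (whether those paths exist depends on the kernel and driver):

```python
import glob

# Standard Linux cpufreq sysfs paths; present when frequency scaling is active.
for gov_path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
    cpu = gov_path.split("/")[5]
    with open(gov_path) as f:
        governor = f.read().strip()
    with open(gov_path.replace("scaling_governor", "scaling_cur_freq")) as f:
        cur_mhz = int(f.read()) / 1000  # sysfs reports kHz
    # A power-saving governor idling at ~1 GHz on a database node is the
    # symptom described above.
    print("%s: governor=%s current=%.0f MHz" % (cpu, governor, cur_mhz))
```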
And between data centres we're doing geo-replication.
B: So, launch. Launch was boring, thankfully. The week of launch, one of the development managers came over and requested that we simulate a node failure — he was a bit concerned: yes, we'd tested some of this in load testing, but we'd never actually done it on our production systems. So I was quite pleased to tell him that in the first week after launch we had a real node failure and there was no impact — users didn't notice. It was nice, then, to get two dead nodes for Christmas.
B: That did impact us a little bit, and we expected it to — not the node failures themselves; we didn't expect those to cause any problem or impact. The impact came from the repairs. We started repairs on the Cassandra cluster — again, we run two clusters: 30 nodes in each DC for the large cluster, and 16 nodes per DC for the smaller one. So when two nodes failed out of 30, we had repairs running in both data centres.
B: The repairs started hanging, which was quite common in the 1.2 branch, and we started seeing latency increase noticeably. At first that was fine, because our application can handle it, but it set off alarm bells for a while, and it meant that during Christmas, instead of saying "it's a failure we can deal with after Christmas", we had to actually go in there, fix it, and keep an eye on it while those nodes were repaired.
B: Also, onto other titles: this Cassandra setup was built for Call of Duty: Ghosts, and we've expanded since then. Diablo III: Reaper of Souls also uses it now, Call of Duty: Advanced Warfare uses it too, and we're planning to expand more titles onto the same clusters. Any new title we build will push more and more to Cassandra, because from a manageability and operations point of view it's a lot nicer and easier to use: you don't worry about sharding, and expanding and consolidating a cluster comes easy. Questions?
B: Yes — the repairs, how do we go about running them? We kind of break the canonical advice here. They say run repair once a week — run a repair at least once within gc_grace_seconds. We don't do that. Our nodes don't change that often, so we typically run repairs only if there's a change to the topology or we change a node; then we'll run repairs. We don't run repair very frequently.
B: When we do, we have it scripted ourselves — a script that we run from one node. It creates a screen session, creates a lock file, gets a node list, hops onto every node, does a repair there, and if that finishes successfully it moves on to the next node. I think OpsCenter has a repair service, which is kind of designed to run all the time; we don't really want that, because it impacts latency pretty heavily, so instead we just have this script that we run as needed.
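A minimal sketch of that kind of rolling-repair driver: take a lock, walk a node list, run nodetool repair on each node over SSH, and stop on the first failure. The host list, lock path and SSH invocation are placeholders, and the real script's screen-session handling isn't shown here:

```python
#!/usr/bin/env python
"""Rolling repair: one node at a time, stop on the first failure."""
import os
import subprocess
import sys

LOCK_FILE = "/var/run/rolling_repair.lock"   # placeholder path
NODES = ["cass-01", "cass-02", "cass-03"]    # placeholder node list

if os.path.exists(LOCK_FILE):
    sys.exit("another rolling repair appears to be running")
open(LOCK_FILE, "w").close()

try:
    for node in NODES:
        print("repairing %s ..." % node)
        # '-pr' repairs only the node's primary ranges, so work isn't
        # duplicated as we walk the ring node by node.
        result = subprocess.call(["ssh", node, "nodetool", "repair", "-pr"])
        if result != 0:
            sys.exit("repair failed on %s; stopping" % node)
finally:
    os.remove(LOCK_FILE)
```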
Okay — thank you.