Description
Speaker: Paul Makkar — DevOps at Sky
How to bring up a new data center and take down the old one with zero downtime, using Apache Cassandra.
I'm going to give you a bit about myself, a bit about Sky, a bit about Cassandra at Sky, the data center migration project without downtime, and some of our future plans.

So, about me. I'm a member of the Cloud and DevOps team at Sky, in the online sales and service area; obviously Sky is a very big organization, so that's primarily my focus.
I consider myself a DevOps DBA, with a background in RDBMS, and for some reason DBAs seem to have this reputation for saying no. Is that right? Does that ring any bells?
We do hundreds of releases a month, on a mixture of auto-provisioned physicals and VMs, as well as some cloud apps. What's interesting for this talk is developer empowerment: we try to give the developers as much power as we can to get on with it, with the least resistance, to do their job and deliver for the business. And I've been told to say that we're hiring.
So what was our initial use case for Cassandra at Sky? Back in 2011, online sales were starting to ramp up, which was awesome for us as a department. We were taking over sales from the call centers, but we were having problems with reliability in persisting the customer shopping basket. This is literally: you're on the Sky website, in the shop pages, you select your box, your packages, you know, the movies; you might take that out, put in the sports, that sort of thing. We'd been using Ehcache at that time, but we were finding that under load, during busy periods, we'd start experiencing cache misses.

That obviously leads to a very disappointing experience for the customers: they suddenly have nothing in their basket, whereas five minutes ago it was full of stuff which, obviously, we would like them to buy. We also had Oracle in that stack, so we thought, well, why not give it a try? Maybe we could persist our shopping basket there.
There are various solutions out there that do this sort of thing, and we decided that Cassandra looked like a pretty good fit, so it's the one we tried, partly because you get multi data center out of the box, which is just a brilliant feature, and I don't think there are many other solutions out there that would do that for you; and disaster recovery had become quite a hot topic at Sky itself.

So we set ourselves up a four-node physical cluster. Acunu were holding our hand for part of this time, to get us set up to run a performant cluster, and it was pretty impressive: we were getting ten times the throughput, from memory, that we'd had with Oracle, without any misses on the shopping basket. So after some time this went into production, and it has been running our shopping basket persistence ever since. So now we've got Cassandra alongside Oracle and MySQL: we've got NoSQL and RDBMS in the stack. Can we go further with it?
You know, now the options are open. We're not just talking about RDBMS any more, which is the default position of most users; when the developers ask 'where's the database we can put this in?', they've now actually got a choice of a couple. Before I move on to the adoption part, I'll just say that I took a bit of a timeout and thought, you know, think like a data scientist: consider this whole thing of ACID versus NoSQL, the CAP theorem, the RDBMS-versus-NoSQL debate. After scratching my head for a little while, I thought, you know what, this is what it boils down to for me. Your RDBMS is like your New York copper (or maybe not New York, but you know): he applies the rules, he won't flex very easily, he'll tell you when to get out of line. Whereas Cassandra and NoSQL seem to be all about keeping on running: it'll just keep going, whatever you throw at it. Data integrity? That's your problem, I don't care about that.
You make sure you take care of that. Want to write a new column of data? Just start writing it down, what are you waiting for? You don't need to define your data structures. So, coming from an RDBMS background, this was the kind of new world we were moving into for databases, for me, in those early days. At the end of the day it's all about trust: you've got the one paradigm which says we trust the developers completely, and that's where you have something like a NoSQL solution.
The question then, in this new world of ours, is: why should we consider using anything other than an RDBMS? What does Cassandra, or NoSQL, offer us? And I would say, most importantly, these features here are really about what it offers from a development point of view. So what does it give us out of the box (the ones with stars against them)? Multi data center comes out of the box. We get the high throughput; in our case that's the one use case so far, which is the shopping basket persistence. And you get the auto-sharding, and what the auto-sharding has done for us so far is less about massive volumes of data and more about being able to use the local disk of the nodes themselves instead of SAN, which is obviously much, much more expensive, and where you can't necessarily guarantee the sort of I/O you're going to get.
So the teams started to get interested in Cassandra, and, you know, it was all very new, and we were trying to find the best way to sort of bootstrap the teams and developers, if you like. So I came up with this questionnaire, essentially, out of which drops your Cassandra configuration; it's kind of a springboard from which to think further about how you want to use Cassandra and what it can do for you.
So I'll just run through this with you. Out of this we'll be able to determine replication factor, consistency level when writing and reading, and time to live. Of the two teams mentioned here, one wanted to write some journaling data, which is literally about, you know, storing journeys as customers are clicking around the site, moving around, maybe adding things to the shopping basket, what they're buying; it is literally a dump of their journey into Cassandra.
The other team I've already talked about: the shopping basket experience. So the first question is: can you afford to lose data? In the case of journaling, yes, we can afford to lose the data; you know, this is only for debugging, we'll only be referring to it occasionally, and it doesn't matter if it all disappears. The shopping basket? Absolutely not, as already described. Do you need disaster recovery? Again, for the journaling, no, because a journey tracked in one data center is relevant only to that data center.

Journeys will be tracked individually per data center, so we do not need disaster recovery for that. For our shopping basket, yes: if we fail over, we would like to not disrupt the customer journey, if possible. Will you be updating the data? Journaling, again, no; the shopping basket, of course, we will be adding and removing stuff from our basket. If updating, what are your read requirements: must you guarantee seeing the latest data, or is eventually consistent okay? For journaling that doesn't apply, but for the shopping basket we obviously want to see the latest content in the basket.
We also need to make sure, overall for the whole department, that we're managing the expected IOPS and volumes of data, so we ask: what kind of volumes of data do you expect, and what TPS? Out of all that drops, for our journaling team, a replication factor of one, in data center one only, and they will have a similar, separate keyspace in data center two.
So that is their keyspace configuration. As they're reading and writing, they'll be reading at ONE and writing at ONE, because this data goes down once, and therefore you only need to read it once to get a consistent view of it, and the TTL is a week, in seconds. For the shopping basket we have a replication factor of three in each of our data centers, so when we do our reads and writes using LOCAL_QUORUM, this gives us a twofold guarantee, really: it guarantees that we'll get the latest data when we read back, at any time, in either data center, even if we've had a disaster; and it means you can afford a failure, so if a node fails, you've still got two copies of your data, two replicas left, from which to retrieve the latest shopping basket.
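As a sketch of where those answers land, this is roughly what the two keyspace definitions could have looked like in the cassandra-cli of that era; the keyspace and data center names are illustrative, not our real ones, and the TTL is just a week expressed in seconds (7 * 24 * 3600 = 604800):

    # keyspaces.txt: journaling gets RF 1 in one DC, reads and writes at ONE,
    # columns written with TTL 604800 (one week in seconds);
    # the basket gets RF 3 per DC, reads and writes at LOCAL_QUORUM
    create keyspace journaling
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = {DC1 : 1};
    create keyspace basket
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = {DC1 : 3, DC2 : 3};

    # applied from any live node with:  cassandra-cli -B -h <node> -f keyspaces.txt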
So I'd say yes, we've crossed the chasm. Most of our teams are now using Cassandra; we have 22 keyspaces in production. We've probably still got just one specific use case, the shopping basket, that has to be done in Cassandra; for the rest of them it's more a case of 'I have a choice now', and when it's sometimes very simple data, they'll choose to write it to Cassandra.
So now I'll spin forward to the data center migration without downtime, and please note, in doing all this, the empowerment of the users, of the developers I should say, and of the teams: you get much more control with Cassandra than you would with an RDBMS, and some of these things are not relevant, or not configurable, in an RDBMS. So you have control of keyspaces, being able to create, drop and update them, which is the same as creating a user in Oracle or a database in MySQL.
Control of the replication factor: that's something that doesn't exist in an RDBMS, but the developers now have the power to control it. Control of the read and write consistency. Control of column families: create, drop, update. And control over the nodes that they will connect to in the cluster; with an RDBMS you're normally connecting to one node, or, using something like Oracle RAC, you'd be connecting to a service which would then fan out to maybe several servers behind it.
So the pre-migration topology looks like this, which is simply four nodes in each of two data centers. It's useful to note that the nodes are distributed evenly, so this will result in an even distribution of data per data center, and I'll come on to that again in a minute. Another way of viewing it: our target topology is very similar, now with DC1 out of the way and DC3 having taken over, but the nodes will have slightly different tokens, because across the whole cluster tokens must be unique.
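For what it's worth, the usual recipe for a layout like this with the RandomPartitioner is to space the tokens evenly within each data center and then nudge one data center's tokens by a small offset, so that no token repeats cluster-wide. A minimal sketch, with illustrative node counts and offset:

    # evenly spaced tokens for 4 nodes per DC over the 0 .. 2^127 - 1 range,
    # with DC2's tokens offset by +1 so every token is unique cluster-wide
    for i in 0 1 2 3; do
      echo "DC1 node $i: $(echo "$i * 2^127 / 4" | bc)"
      echo "DC2 node $i: $(echo "$i * 2^127 / 4 + 1" | bc)"
    done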
So how are we going to get there? There are several ways to do it, but the ones I was considering to start with involve downtime. Essentially, because the data files are immutable, you copy them from your like-for-like nodes in DC1 over to DC3, then you shut down your node in DC1 and start up the one in DC3, as if it's a node repair; basically you're building a recovery node for the one that's just gone down. But that requires downtime.
So the first step is obviously to get your nodes into DC3 and install the Cassandra binaries. Then, on a like-for-like basis, copy over the cassandra.yaml from data center one, and adjust the listen address in there, because obviously it has to match the IP of the node in DC3. Then we just take away one from the token as compared to DC1: the token from the counterpart node in DC1 is copied, minus one.
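As a sketch, the handful of cassandra.yaml lines being touched on each new node would look something like this; the addresses and token value are illustrative:

    # cassandra.yaml on a DC3 node, copied from its DC1 counterpart, then edited
    cluster_name: 'OurCluster'    # unchanged: must match the existing cluster
    listen_address: 10.30.0.11    # changed to this DC3 node's own IP
    initial_token: 42535295865117307932921825928971026431
    # i.e. the counterpart DC1 node's token minus 1, keeping tokens unique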
Finally, there's the network topology file that lives on all of the nodes, so all 12 nodes, and those need to be edited to take into account all the nodes and their locations. And when I went on to one of the live nodes in data center one and ran a nodetool ring, within a couple of minutes I could all of a sudden see these new nodes appearing in DC3, but in a Down state, of course, because we hadn't started anything yet.
yeah,
just
briefly
about
network
topology,
essentially
by
using
network
topology.
This
range
that
we're
familiar
with,
which
goes
from
0
to
2
2
127
minus
1,
which
is
our
token
range
when
you're
using
network
topology
that
effectively
forms
these
lines,
or
these
rings
on
a
per
datacenter
basis.
You're kind of good to go: you've got a good configuration, and your data will be balanced across all nodes. So I ran a check to make sure all of our keyspaces were indeed using NetworkTopologyStrategy, which was the case. Had I found any with SimpleStrategy in there, I'd have had to do some remedial work, because essentially that would have bunched all the data up.
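That check is easy enough to script; one way to eyeball it with the CLI of the time might be (host name illustrative):

    # print each keyspace name with its replication strategy;
    # anything reporting SimpleStrategy needs remedial work
    echo "show keyspaces;" | cassandra-cli -B -h <any-live-node> \
      | grep -iE "keyspace:|strategy"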
From the cluster's point of view, it would basically collapse those lines down into one, and then you would see a very uneven distribution of nodes along that single line: some nodes are going to get hardly any data, whereas others are going to get a hell of a lot more. So it's critical to get any SimpleStrategy out of your configuration.
I would almost say that, really, once you've moved to network topology and you've defined that in your cluster, you should not be allowed to use SimpleStrategy at all; that would be my point of view, rather than having the choice of both. I think when you've consciously made the choice to configure for network topology, and you've got multiple data centers, it should basically stop you going back to things like this. So the next step was to start up the nodes in DC3, one at a time.
Now, at this point, I have to say, I went to my first node expecting it to be pretty much bootstrapped within a couple of minutes: to create the system keyspace and bring the schema up to date. But half an hour later I was still seeing the log file doing stuff; it just didn't seem to settle down. I was checking the keyspace count, and I was expecting to see it steadily increase until I had all my keyspaces on this new node.
But I was seeing stuff like this, where it would, say, be at a count of ten; I'd count again and it would go up to 14, great, more data coming in, more keyspaces; then, after a bit longer, it was dropping down to 12 again. So it kind of didn't make sense to me, and when I described the cluster I'd see that the schema for this node was not in agreement with the other live nodes, which were all matching each other in terms of the schema version.
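The schema disagreement itself is visible from the command-line client; a minimal sketch, host name illustrative:

    # list schema versions and which node IPs report each one;
    # a settled cluster shows a single version across all live nodes
    echo "describe cluster;" | cassandra-cli -B -h <any-live-node>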
So that's when I came across this fun page on the DataStax website, an article called The Schema Management Renaissance, which describes the releases before 0.7 as the dark ages, 0.7 up to 1.0 as the middle ages, and 1.1 as the Renaissance. And we found ourselves in the middle ages, which was awesome.
What this meant, really, is that when you set up a new node and try to bootstrap it, it has to replay the schema changes as they were applied on the other nodes. So if you've done lots of changes, like adding keyspaces, dropping them, adding more, dropping more, making changes, changing the replication factor, all of that is rolled forward on your new node. So basically, once I knew that that was happening, well, you know.
Whereas in 1.1, you basically start up the new node and it just grabs the latest schema and works with that. So that was the problem I found myself with there, but once I knew what it was, I just let it run, and the history actually was so big that it took about a couple of hours before the schema agreed with all the other nodes.
So the next part was to update all the keyspaces. This is really our first pinch point in terms of the developer empowerment: you know, we've said it's all for you guys, do what you want to do, make sure you meet your business targets, but now we need to regain control of the keyspaces to account for data center three. This is really a two-fold process.
One part is, for each of the teams, to work out where their deployment scripts are, or get involved with them, to adjust the replication strategies in their keyspaces to now account for data center three. So we were going from, you know, a create keyspace with strategy options naming DC1 and DC2, to one where it says DC1 and DC2 and DC3, because what we're attempting to do is have three data centers, all with sort of independent sets of replicas, going at the same time.
So that was the first thing: get control of the scripts, so that any new deployments would take account of the fact that DC3 is there. The next step was to actually go in through the command-line client and run a sort of manual update on the keyspaces as they already existed. You can just run update keyspace to take DC3 into consideration, and that covers you for the existing keyspaces, to make sure nothing gets missed out.
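Concretely, a sketch of that manual pass for one keyspace (names and replication counts illustrative):

    # update-keyspace.txt, run once per existing keyspace:
    # strategy options go from {DC1:3, DC2:3} to include DC3
    update keyspace basket
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = {DC1 : 3, DC2 : 3, DC3 : 3};

    # applied from any live node with:  cassandra-cli -B -h <node> -f update-keyspace.txt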
Nothing gets left behind in the next phase, which I thought was going to be the hardest phase, to be honest with you. This is where you actually try to recover the data onto each node: you're basically saying to the cluster, please provide me all the data, so that I can get up to date with my partition range. You do it node by node, and then the data center is good to go in that respect.
But you know, we don't have masses of data; we're not talking terabytes, I think we're in the tens of gigabytes. This is the sort of thing you'll see when you run the nodetool repair -pr, and the -pr just means repair the primary range for that node, and not any of the other replica ranges it's responsible for. You'll see this sort of thing: it will say it's starting a new repair session.
It will give you the keyspace and the column families it's trying to do the repair on, and you get messages saying the column family is synced for the session; it will go on to do more, as more ranges exist for the column family, and eventually you'll get this last line down the bottom, which says the repair session completed successfully.
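The whole rebuild-the-data step, node by node across DC3, is then a loop like this (host names illustrative):

    # stream each DC3 node's primary partition range to it from the replicas;
    # -pr repairs only the range the node itself owns, so running it on
    # every node in turn covers the whole ring exactly once
    for node in dc3-node1 dc3-node2 dc3-node3 dc3-node4; do
      nodetool -h "$node" repair -pr
    done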
So once this was done, these are the kind of checks I ran on the cluster to make sure we were good to go, and I would say that to some degree it's not an exact science. It's not like doing a row count in your RDBMS and doing the exact equivalent on the other node that you're making a clone of, for example, when you're doing a migration project. These are the sorts of checks
I was doing. The keyspace count matched; the column family count matched. You can go to the data directories and run a du, just to see whether your data volume roughly matches what you have on the other nodes in the cluster, because, using the random partitioner, all nodes should hold roughly the same amount of data. It's not going to be exact, because some nodes will have just completed a compaction, for example, and there may be snapshots in there that also account for some of the data that shows up in the du. Then you can run nodetool ring, which will show you a list of all the nodes and the balance of data across them, across the whole cluster. That helps give you confidence.
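Those checks, as a rough sketch (paths and host names illustrative):

    # 1) keyspace and column family counts match the old data center
    echo "show keyspaces;" | cassandra-cli -B -h <dc3-node> | grep -c "Keyspace:"

    # 2) on-disk volume roughly in line with other nodes (compactions and
    #    snapshots mean it will never match exactly)
    du -sh /var/lib/cassandra/data

    # 3) ring balanced: load spread evenly across all nodes in both DCs
    nodetool -h <dc3-node> ring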
So the next thing is the reassignment of clients. Whereas the clients before were pointing at DC1 and DC2, they now point at DC2 and DC3, which requires a redeployment of the applications. So you have to go round the teams one by one and make sure they redeploy to point to DC3, because we know we're good to go in terms of the data now. That's another pinch point we have there, in terms of, you know, logistics and organizational concerns.
Happy to say there weren't any complaints or faults or bugs or errors reported by any of the developers. So, for the future, how can we make this easier? Yes, the organizational challenges were actually harder than the Cassandra ones, I would say, so how are we going to help with that in the future? We're moving, like lots of organizations, more towards cloud applications, and that's great for us in this respect, because what we can do is abstract away the Cassandra services.
So now the teams are no longer involved with sort of hard-coding data center one, data center two, replication three, and all that sort of thing; it would basically be provided to them, so that they select the service from the cloud application, for example 'Cassandra, consistent, with disaster recovery', or it might be 'Cassandra, single value, one data center', something of that nature. And then, behind the scenes, we've got one place to make our adjustments.
Should we have to do things like this in the future, maybe do DC migrations or add nodes or what have you, the same goes for the client connection pools as well. The teams are no longer defining the nodes that they're going to connect to to read and write their data; they're basically saying they want to use a Cassandra service in the cloud, and that will handle the connection pools for them. We can then adjust that in one place, behind the scenes.
We've got a whole monitoring thing going on; we use Nagios. We don't have any specific Cassandra monitoring tools at the moment; we'll probably look to do something like that, but at the moment it's more or less just node monitoring: disk, CPU, memory usage, stuff of that nature, checking that the nodes are up and the system looks healthy in a general sense. We also check things like the heap size, again not necessarily Cassandra-specific, but there's enough coverage there to cover most eventualities.
[Audience question.] No, now we wait; it's been remarkably resilient. One of our philosophies is to be able to bare-metal provision our systems, and we go through a process of rebuilding our systems once in a while. Now, to be honest with you, with Cassandra we haven't done that in a while, but we have in the past, where we will literally take a node down, rebuild it, start it and bring it back into the cluster.
[Audience question.] Yes, yeah. All our teams are responsible for their own support; it's part of the whole agile way of life, I believe: they get full empowerment, they get full responsibility. We have, you know, five-minute SLAs to get to initial incidents, and some form of resolution, or an understanding of the resolution, within half an hour, even if it might take longer than that to achieve. So definitely we've tried to decentralize as much as possible, although obviously you still need a DevOps team.
[Audience question.] How much heap? It's around about 1.7 gig, I think, something like that, per-node JVMs, yeah, about 1.7. I'm not convinced we're using all of that, and, as Jonathan talked about in this morning's talk, by going to use memory directly outside of the heap you can liberate things: you can then make more use of your system memory. We're not in that position yet, but hopefully we will be once we've done our upgrade.
[Audience question.] What I observed is that when it ran the repair, it seemed to open up... I was looking at netstat and I saw lots of, you know, ports open from various nodes; it wasn't just two or three of them, it seemed to open up to lots of them. I don't know if any of that's been optimized now to have a sense of geography or latency across the nodes; I suspect that on the version we were using it wasn't, you know, that intelligent, yeah.
[Audience question.] Nope, because, and I didn't necessarily say this at the time, at one point we were actually running with three data centers: the clients were still connected to two of them, DC1 and DC2, but we actually had three, and so it was just a question of switching over to the third one in favor of the second one. Sorry, the first one. So yeah.
[Audience question.] It may be, it may be, if you have partial data on there, so that you don't want to go back to the beginning. If you do a rebuild, you're basically saying go from the ground up, whereas if it's a repair, it's like, well, I've actually already got fifty percent of the data, which is what would happen if you copied the data files on there and then said repair.
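For reference, a sketch of the two commands being contrasted, assuming a version where nodetool rebuild is available (it arrived around the 1.1 era), with an illustrative host and source data center:

    # build an empty node's data from scratch, streaming from DC1
    nodetool -h <new-node> rebuild DC1

    # reconcile a node that already holds partial data (e.g. copied-in
    # data files), fetching only what's missing for its primary range
    nodetool -h <new-node> repair -pr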