From YouTube: PagerDuty: One Year of Cassandra Failures
Description
Speaker: Donny Nadolny, Scala Developer
Despite being a highly available system, we have had three outages caused by problems with our production Cassandra clusters over the past year. We'll take a look at each of these outages: what we saw from the inside, the actions we took to recover, and most importantly the procedures and monitoring that will help prevent it from happening to you.
So, distributed systems can be really complicated. There's a whole bunch of problems that can happen, and you can get the unintuitive result that what should be a highly available system actually has lower availability in practice than a single-machine server, just because of all the complicated things that can go on. Now, one of the things that I think can help improve that is talking through failures like these.
What I'm going to talk about today is three different outages that PagerDuty has had with our production Cassandra cluster. If you don't know what PagerDuty does: our customers are mainly tech companies. They have a lot of different monitoring systems, and when those detect a problem, like a server going down or high latency, they'll send an event to us. We do a whole bunch of processing, and then we call or text or email you to let you know what the problem was, so that you can fix it quickly.
Now, this part in the middle is where we care about high availability a lot, because if your stuff is going down, you want to be able to rely on us to stay up. So in the middle here we have a whole bunch of different services, mainly using Cassandra and ZooKeeper, running in multiple data centers. And actually, one of the fairly uncommon parts about our Cassandra setup
is that we're running in multiple data centers and we're doing quorum reads and writes, where our quorum actually goes across the wide area network, with significant latency. A co-worker of mine is giving a talk tomorrow about some of the cool things that happen when you have that kind of setup, so check it out if you want.
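To make that setup concrete, here is a minimal sketch of the quorum arithmetic involved; the replication factor and replica placement below are illustrative assumptions, not PagerDuty's actual configuration:

```python
# Cassandra's QUORUM consistency level needs floor(RF / 2) + 1 replicas
# to acknowledge each read or write.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

# Assumed example: RF = 5 with only 2 replicas in the local data center.
rf = 5
local_replicas = 2
needed = quorum(rf)                         # 3
wan_acks = max(0, needed - local_replicas)  # 1
print(f"quorum = {needed}; at least {wan_acks} ack(s) must cross the WAN")
```

With a placement like that, every quorum operation waits on at least one round trip over the wide area network, which is where the significant latency comes from.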
So, the first outage I want to talk about is the one I have honestly nicknamed "the backlog".
First, a bit of background about our cluster. We have a shared cluster: all those arrows from the different services that were pointing at one cluster, that is actually just one cluster, spread across multiple data centers but still just one. And it's a fairly small cluster; we're talking low tens of gigs of data. Even more than that, with our usage pattern we tend to write a bit of stuff into Cassandra, and then pretty shortly after that we'll process it, maybe a few seconds or a few minutes later, and then we're done with it.
So in terms of the active data set, it's even smaller than that; it's more like tens or a few hundreds of megabytes. The first outage actually begins the day before, when we had some warning signs: a few small degradations going on. What triggered this was that cron kicked off a repair process on our Cassandra cluster, and it caused a fair bit of load. We were seeing high latency from the application, and what we ended up doing to resolve
B
This
was
disabling
thrift
on
some
of
the
nodes
and
turning
some
of
them
off
for
us
we're
still
on
the
old
thrift
interface.
For
so
disabling
thrift
is
forcing
our
application
to
use
a
different
note
for
the
coordinator,
and
by
doing
that,
we
can
kind
of
push
the
load
around
a
little
bit
kind
of
shuffle
it,
and
with
that
we
were
able
to
recover
somewhat.
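For reference, toggling thrift on a node is a one-line nodetool call; here is a small sketch of the kind of shuffling being described (the host name is a made-up placeholder):

```python
import subprocess

def set_thrift(host: str, enabled: bool) -> None:
    """Enable or disable the thrift RPC server on one node via nodetool."""
    action = "enablethrift" if enabled else "disablethrift"
    subprocess.run(["nodetool", "-h", host, action], check=True)

# Push coordinator traffic off a struggling node, then restore it later.
set_thrift("cass-node-3.example.com", enabled=False)  # hypothetical host
# ...watch the latency graphs; once the node looks healthy again...
set_thrift("cass-node-3.example.com", enabled=True)
```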
B
So
what
it
looked
like
was
we
have
a
read,
latency,
very,
very
low,
what
it
should
be
and
then
it
spikes
way
up,
and
then
we
use
these
techniques
of
kind
of
disabling
thrift,
disabling
nodes,
shuffling
around
the
load
and
then
it
kind
of
improved.
But
we
can't
leave
it
like
this
because
we
have
notes
turned
off
and
that
will
affect
our
availability.
B
B
At the end of this, we did have everything up, everything recovered, and we were pretty much okay. But the problem was that we had killed the repair process that had started this whole thing, and you need to do repairs fairly often on your cluster, otherwise you can have data coming back from the dead. So we had to actually do that repair again.
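The repair in question is the standard nodetool anti-entropy repair; a minimal sketch of what a scheduled run looks like (the host and keyspace names are assumptions):

```python
import subprocess

def repair(host: str, keyspace: str) -> None:
    """Run an anti-entropy repair for one keyspace on one node."""
    subprocess.run(["nodetool", "-h", host, "repair", keyspace], check=True)

# Each node should be repaired within gc_grace_seconds (10 days by
# default), or tombstones can expire before being propagated and
# deleted data can "come back from the dead".
repair("cass-node-1.example.com", "events")  # hypothetical host/keyspace
```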
So our plan was to get a bunch of people around to keep an eye on the cluster, trigger that repair, and hopefully be able to react really quickly if anything went wrong. And as I mentioned before, this was a shared cluster that all of our services were using, so we were able to turn off a couple of non-critical services to decrease the load a little bit, and our plan was to use the same strategies we had used the day before if things went wrong. Well, the problem was that when we manually triggered this repair, what we didn't realize was that we had cron set up on that machine to trigger a repair on a different keyspace.
You can see almost a little bit of recovery in the middle; I'm not sure if that's actually real or just a blip, but it gave us a bit of false hope. And if you want to see what an unhealthy cluster looks like, by the way, we have lots of graphs of that here. We have the read stage: when you're doing a read in Cassandra, it has a bunch of internal queues that things go through, and we can see here a growing backlog and a kind of very unhealthy queue,
pegged at a level where we're not really doing any good. This went on for around two hours, and unlike the last outage, we didn't really have any recovery periods here; we just had more and more of a backlog. So we were struggling, trying to figure out what to do; what we were doing before wasn't working. Eventually, what we concluded was this: all the data that we have in our cluster is basically in-flight data.
B
So
it's
not
a
real
user
data,
it's
just
kind
of
things
that
we
put
in
there
that
should
be
handled
quickly,
so
we
had
an
option
which
was
to
delete
all
of
our
Cassandra
data
and
bring
up
the
cluster
to
recover
it,
and
so
we
really
very
quickly
tried
this
out
in
our
low
test
environment
to
make
sure
that
we
actually
could
do
it
and,
in
the
meantime,
try
to
fix
it.
We
couldn't
fix
it.
So
that's
what
we
ended
up
doing.
B
B
After this, we did a ton of investigation into what went wrong. Now, in kind of a very selfish way, I'm just a tiny bit happy this happened, because I got to spend a lot of time learning about Cassandra, figuring out what went on, learning about the internals. So the benefit from this is that a lot of people at the company learned a lot about Cassandra. What we kept on trying to find during this was: what were the leading indicators?
B
What
would
have
told
us
that
this
that
things
were
going
wrong
because,
while
the
average
was
going
to
on
it
was
really
clear
that
things
were
bad
but
leading
up
to
it?
A
lot
of
the
metrics
that
we
were
looking
at
didn't
really
show
that.
So
what
we
found
was
that
we
needed
to
put
more
effort
into
looking
at
the
Cassandra
specific
metrics.
We
were
looking
at
mainly
host
level
metrics,
like
CPU
network.
That
kind
of
thing-
and
those
were
fine,
almost
all
the
time
or
right
up
until
you
actually
have
an
outage.
B
So
we
didn't
really
tell
you
anything.
The
Cassandra
metrics,
though
I'll
talk
a
little
bit
more
about
what
we
found
there
after
what
we
ended
up
concluding
from
this
was
basically
that
our
cluster
was
just
underpowered,
may
mean
CPU
during
normal
usage,
when
we
had
a
regular
request,
things
were
fine,
but
when
you
have
repairs
and
compaction
is
going
on
on
a
somewhat
growing
data
set,
it
was
just
too
much
and
we
were
on
to
core
machines
and
in
the
shared
clusters
it
was
just.
B
It
ended
up
kind
of
overwhelming
the
cluster
another
one
kind
of
a
specific
one
to
our
use
case.
But
we
did
a
lot
of
operations
that,
at
the
beginning,
you
kind
of
needed
a
whole
bunch
of
operations
in
a
row
to
succeed
before
you
could
do
other
work.
And
so
if
your
cluster
is
unhealthy
and
even
a
small
percentage
of
operations
are
failing,
that
whole
sequence
fails
and
then
you
can
do
basically
nothing
rather
than
doing
a
little
bit
of
work.
B
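A back-of-the-envelope sketch of why those chained operations hurt so much; the failure rate and chain length are arbitrary illustrative numbers:

```python
# If a task needs n Cassandra operations in a row to all succeed,
# a small per-operation failure rate compounds fast.
def chain_success(per_op_failure: float, n_ops: int) -> float:
    return (1 - per_op_failure) ** n_ops

print(chain_success(0.05, 20))  # ~0.36: at 5% failures, most
                                # 20-operation chains fail outright
```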
Some of the lessons that we learned from this first one: the big one was capacity planning. This was just an oversight on our part. We had what we thought was a fairly small amount of data, and we thought the cluster could handle it, so we weren't really paying attention to capacity planning, but we should have been. The other one is that we added a lot of Cassandra-specific monitoring: Cassandra exposes a lot of metrics through JMX, and we were actually collecting them, but we didn't have them on easy-to-use dashboards.
B
So
we
put
a
lot
of
work
into
that.
The
other
really
big
thing
that
we
were
in
from
this
was
more
isolation
as
good.
We
had
all
these
different
services
hitting
one
cluster
and
it
made
it
really
hard
during
the
outage
and
after
the
outage
to
figure
out
what
was
going
on
was
it
you
know
which
service
was
causing
the
load?
Was
it
in
Cassandra?
It
just
makes
it
really
tough
to
figure
out,
and
it
also
means
that
when
it
fails,
everything
fails
rather
than
having
just
some
things
fail.
B
Some
of
the
metrics
that
we
found
I
won't
go
over
all
them.
I'll
just
mention
probably
the
most
important
one
is
the
drops
messages
one.
This
one
is
Cassandra's
sign
that
it's
overloaded,
so
when
it
gets
a
request
and
it
can't
handle
it
in
time,
it
will
drop
that
message
and
record
it
in
a
metric
and
we
weren't
paying
attention
to
that.
But
that
would
shown
in
advance
that
our
cluster
was
becoming
overloaded.
B
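One way to watch this is the dropped-message table at the end of `nodetool tpstats` output; below is a rough sketch of scraping it (the two-column parse is a naive assumption, and the exact output format varies across Cassandra versions):

```python
import subprocess

def dropped_messages(host: str) -> dict:
    """Scrape per-type dropped message counts (MUTATION, READ, ...)."""
    out = subprocess.run(["nodetool", "-h", host, "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    dropped = {}
    for line in out.splitlines():
        parts = line.split()
        # The dropped-message section prints "<TYPE> <count>" rows.
        if len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped

# Nonzero, growing counts are the leading indicator described above.
print(dropped_messages("cass-node-1.example.com"))  # hypothetical host
```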
All right, now a completely different outage, unrelated to the first one. We had made some improvements since the last outage. One of the big ones was the isolation I mentioned before: we did a live migration of all of our services onto their own independent clusters, and we actually did this not just for Cassandra but for ZooKeeper as well. I have a ton of stories I can tell about that, but the short version is that isolation pays off there as well: even though ZooKeeper should be highly available,
B
There
are
lots
of
problems
where
it
might
not
be.
We
also
added
a
new
surface,
which
is
what
this
outage
is
about,
and
last
one
we
bumped
up
our
cassandra
version.
It's
a
little
bit
embarrassing
how
old
of
a
version
we
were
on
so
I'll.
Just
tell
you
the
version
that
we
went
to,
which
is
we
upgraded
to
version
1.2,
so
we
were
a
bit
old
at
the
time.
Actually
so
now
so
what
happened
with
this
new
service?
B
Was
we
needed
to
add
a
bit
of
data
and
have
it
wired
through
a
few
different
column
families?
So
we
did
our
our
schema
migration
on
our
cluster
and
then
we
did
a
deploy
to
actually
use
that
and
we
started
getting
a
few
errors
recorded
in
the
application.
This
invalid
request
exception
telling
us
also
what
what
key
space
in
kollam
family
had
problems.
B
So
we
immediately
checked
our
cassandra
danger,
metrics,
dashboard,
that
this
is
actually
a
real
name
for
it
too,
since
the
last
edge,
which
we
made
a
dashboard,
which
is
lots
of
different
metrics,
which
can
be
a
sign
that
Cassandra
either
is
overloaded
or
is
having
problems.
But
in
this
case
it
was
clear,
but
from
the
exception
before,
meaning
that
it
was
something
schemer
related.
So
we
ended
up
running
described
cluster
through
the
CLI
and
we
saw
this
output,
which
shows
us
the
problem.
B
Every
node
here
is
on
one
version
of
the
schema,
except
for
one,
which
is
on
some
other
version.
So
we
know
we
need
to
take
some
action
against
that.
No,
but
we're
trying
to
figure
out
what
to
do
and
the
solution
ends
up
being
turn
the
note
on
and
off
and
then
once
you
do,
that
everything
is
fine.
This
is
what
the
output
should
look
like
everything
on
one
line,
meaning
that
all
of
the
hosts
are
on
the
same
Cassandra
version.
So
at
this
point
we
haven't
really
had
an
outage.
B
We had a few errors, but with retries, all the tasks we were doing were successful, and it was fine for a couple of hours. And then we get this. This is the graph of outgoing notifications for PagerDuty, where everything is flowing along for a while, and then it drops off to nearly nothing. Oh, sorry: actually, what we saw first in the application was really high latency to Cassandra. It should be low, but it was spiking up to much, much higher than it should be.
B
Checking
the
Cassandra
danger
metrics
page
this
time
we
do
find
something
which
is
the
mutation
stage.
This
is
similar
to
the
reed
stage
I
showed
earlier,
but
this
is
for
right
operations
that
are
going
on
in
cluster.
It
should
be
basically
not
how
many
are
queued
up,
and
it
should
be
basically
nothing,
but
instead
one
host
goes
off
on
its
own
and
has
a
ton
of
operations
that
are
backing
up.
B
So
we
know
we
need
to
do
something
in
this
case,
we
immediately
disable
thrift
on
that
note
to
prevent
the
application
from
using
it
a
little
while
later
we
notice
that
we
have
a
repair
process
running,
and
so
we
kill
that
ensuring
we,
after
that,
we
just
kill
the
node
completely
and
that
ends
up
working
for
us.
This
is
how
many
operations
we
were
able
to
perform
on
our
cluster.
B
Now
the
cool
thing
about
what
we
did
before
was
that
there
ended
up
being
a
bit
of
a
gap
in
between
each
operation
that
we
did
so
zooming.
In
on
this
graph,
we
can
see
what
the
effect
of
each
individual
operation
was.
When
we
do
that,
we
see
that
first,
we
disable
thrift
and
then
immediately
everything
recovers.
So
the
other
actions
that
we
took
didn't
actually
fix
it.
It
was
only
disabling
thrift
that
fixed
it
now.
Disabling
thrift
is
just
forcing
our
application
to
not
use
that
note
as
coordinator.
B
B
So, and keep in mind that we had a fairly small cluster: even though only some of the operations were going to a bad node, it doesn't take a very large percentage for you to end up with a really high multiplier on how long your average request takes.
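A quick sketch of that multiplier effect; the node count and latencies are invented for illustration:

```python
# With coordinators chosen uniformly, 1/n of requests land on the bad
# node. One very slow node out of five dominates the average latency.
def avg_latency_ms(n_nodes: int, normal_ms: float, bad_ms: float) -> float:
    p_bad = 1.0 / n_nodes
    return p_bad * bad_ms + (1 - p_bad) * normal_ms

print(avg_latency_ms(5, 5.0, 500.0))  # 104.0 ms: a ~20x jump from one bad node
```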
Now, there's also the question of what happened to Cassandra. Given that Cassandra was acting up, we know why the other things went wrong, but as for the question of what happened to Cassandra itself, we don't really know. We do have some theories (some people who recognize the picture are laughing), but we don't have anything actually reproducible that we've found, at least not yet. Now, some of the lessons that we learned: one of the big ones is that we got some payoff from this isolation.
B
We
had
problems
with
our
cluster,
but
it
only
affected
one
service
now,
because
I
services
kind
of
form
a
pipeline
where
you
need
all
the
ones
in
the
chain
to
work.
It
did
still
cause
notifications
to
be
delayed,
but
it
meant
that
all
the
requests
from
our
clients
coming
in
they
could
just
be
queued
up,
and
it
would
just
happen
a
little
bit
later,
rather
than
being
dropped
on
the
floor.
B
We
also
learned
how
we
should
be
doing
schema
changes,
which
is
you
do
describe
cluster
make
sure
everything
looks
good,
run
the
schema
change
for
one
column,
family,
and
then
you
describe
cluster
at
the
end
to
verify
that
everything
actually
went.
Okay,
I'm
even
added
a
bit
of
monitoring
for
this
schema
disagreement
to
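A minimal sketch of such a check; this uses `nodetool describecluster` (the newer equivalent of the old cassandra-cli "describe cluster"), and the line-counting parse is a naive assumption:

```python
import subprocess

def schema_version_count(host: str) -> int:
    """Count distinct schema versions reported by the cluster."""
    out = subprocess.run(["nodetool", "-h", host, "describecluster"],
                         capture_output=True, text=True, check=True).stdout
    # Schema versions print as "<uuid>: [ip1, ip2, ...]" lines.
    return sum(1 for line in out.splitlines() if ": [" in line)

if schema_version_count("cass-node-1.example.com") > 1:  # hypothetical host
    print("ALERT: schema disagreement; some node may need a restart")
```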
All right: the third outage. This one is particularly painful. It had the lowest impact on our customers, but it was pretty directly caused by me.
B
So
what
happened
with
this
one
was
we
were
scaling
out
our
one
of
our
Cassandra
clusters.
We've
been
adding
nodes
after
they
were
added,
you
run
repair
on
them
and
then
they're
good.
We
had
added
a
new
node,
we
ran
repair
and
then,
after
a
couple
of
hours,
we
noticed
that
nothing
was
happening
on
the
node
Cassandra
wasn't
logging
anything
out,
it
wasn't
having
any
network
traffic,
so
we
restarted
that
node
and
just
to
be
safe.
We
did
a
slow,
rolling,
restart
across
all
the
nodes
in
her
cluster
and
partway.
B
B
It was trying to replay its hints to another node, and it did do some of them successfully, but partway through, it failed. Hinted handoff, by the way, is when you know that you want to write something to another node, but you've either tried and failed, or you think the node is down so you don't even bother trying: you write a hint locally, and then you replay it later. So with this replaying, it replayed some of the hints, but not all of them. Now, how do we get the cluster back into a healthy state?
What we ended up doing was picking the host which seemed to be having problems and killing it, and then everything recovered. We waited for a little while, and after things were good, we brought that node back up again, and then everything was bad again, so we shut it down. And actually, we were curious to see whether it was just a coincidence that this had happened after we brought it back up, so we brought it up another time, and then things started going bad again, so we killed it and said: okay, clearly something is wrong
with this node. So we started investigating, and we found something strange in the commit log directory of Cassandra: we saw a couple of files that were owned by root, and these were actually from about a month before the outage; they were really old files. And so, after a bit of digging, we found a couple of commands that were run around the time those files were created, by me. I had run this sstable2json command as root on our machines, and it ended up creating these files. Now, a quick detour into sstable2json.
B
If
you
don't
know
what
this
is,
it's
a
really
cool
command,
where
you
can
look
at
kind
of
the
low-level
details
of
what
cassandra
has
stored
on
disk
in
the
SS
table,
and
it's
really
cool
and
I
was
trying
to
run
it
in
our
low
test
environment,
even
even
before
the
month
before
the
outage,
and
if
you've
never
done
this
before
what
happens
when
you
first
run
it
is
you
get
an
exception?
Some
partitioner
doesn't
match
this
other
thing,
and
so
you
look
for
a
while.
You
try
to
find.
Is
there
another
argument?
B
I'm
supposed
to
pass
to
this,
to
tell
it
what
partitioner
to
use
after
giving
up
on
that
and
just
searching.
You
find
a
bunch
of
unhelpful
advice,
telling
you
to
delete
files
and
other
stuff,
and
eventually
you
find
what
works,
which
is
you
need
to
set
an
environment
variable
to
tell
the
tool
where
to
find
your
config
file
for
the
partitioner
that
it
needs
to
use.
So
after
you've
done
that
you
get
another
exception,
and
this
one
you're.
Just
thinking
like
come
on.
B
I
just
want
this
thing
to
work:
I,
don't
care
about
from
there's
just
a
run,
so
you
run
it
as
root
and
then
it
works,
and
this
is
a
low
test
by
the
way
notnot
production.
But
the
problem
here
is
that
this
SS
table
to
Jason
command
has
a
bug
in
it.
It
was
reported
after
the
outage
where
it
will
write
commit
log
segments.
B
This happens because the sstable2json command actually lives in the same jar as the rest of the Cassandra code, and so it manages to accidentally call the commit log code and write out a commit log segment. And if you run it as root, you get a commit log segment owned by root. So: we have a commit log segment written as root.
What's the problem here? Well, the problem is that Cassandra will try to recycle commit log segments. That is, rather than creating a new one and then deleting the old one, it will rename the old one to the new file name that it wants and then truncate the file. So if you try to do that with a file owned by root, you get a permission problem. But the other part here is that this was a very delayed effect.
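A simplified sketch of the recycling idea, to show where the root-owned file bites; this is illustrative Python, not Cassandra's actual Java implementation:

```python
import os

SEGMENT_SIZE = 32 * 1024 * 1024  # made-up fixed segment size

def recycle_segment(old_path: str, new_path: str) -> None:
    """Reuse an old commit log segment instead of allocating a new one."""
    os.rename(old_path, new_path)     # move it to the next segment name
    with open(new_path, "r+b") as f:  # a root-owned file fails here
        f.truncate(SEGMENT_SIZE)      # with a permission error
```

If the daemon user doesn't own the leftover file, the reuse dies with an IO error the first time Cassandra tries to recycle that segment, which is exactly the delayed failure described next.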
I had run this command a month before, and it wasn't until we restarted the process that this happened. The reason is that Cassandra will only reuse the commit log segments from files that it knows about. Because this one was written by another tool, Cassandra didn't know about it until you restart the process; then it reads in all the files from the file system and tries to reuse them later on, after its current segment has filled up.
So that's also why some of the hints went through, but then, after the commit log segment fills up, nothing else can go through. And so this is the whole chain that breaks: you have a mutation stage, which is just a write in Cassandra; to do that, it has to add something to the commit log; to add it to the commit log, if it runs out of space on the current segment, it has to fetch a new segment; and the code that populates
these segments had failed a while ago, with an IO exception trying to rename this file which was owned by root. So with that, what we have are the lessons from this outage, which is: you need to be careful about what habits you develop. I think this one was particularly hard because there's a delayed effect: you run something in load test and it works there, and you kind of build up this habit of doing it
a certain way: just run it as root, and everything is fine. But if I had been running it in production for the first time, I wouldn't have just done that; I would have, you know, looked more carefully, probably switched to the Cassandra user. But I had built up this habit before. The other part, from some discussion that went on in the ticket for the sstable2json command, is that they should have made the command more isolated, so that it wasn't able to do that.
You can also have these delayed effects, where you do an action and it's not until much later that you feel the consequences of it, like some kind of failure. We've had other problems like this too, where we change a config file but haven't restarted the process that read that config file, and it's not until you do that that you find out that whatever changes were made to the config file broke something.
So one of the things that I'm happy about with these outages, at least the trend here, is that they seem to be getting less severe. The first one was really terrible (and these are in chronological order, by the way): we had a really terrible outage; the second one was bad, but not quite so bad; and with the third one, we were even faster at fixing things. I think that the main reason why that has happened is because we put so much time into investigating these problems when they happen.

Q: [Audience question, not fully captured: which of those metrics make good leading indicators?]
A: The dropped messages metric certainly is a good leading indicator. Blocked flush writers, at least in the case where we saw it, was a good indicator; it can also go up for a few other reasons, but I still think it's generally pretty good. GC behavior is a bit hard to read, so it's kind of hard to tell. And the lagging ones that I gave are still quite valid, but they are lagging indicators: they just tell you when something is already wrong.
B
We
use
pagerduty
to
a
little
self.
We
actually
do
so
for
most
of
the
problems
that
we
have
the
vast
majority
of
them.
We're
actually
still
up.
We
just
notice
kind
of
weird
problems,
and
so
we
can
still
use
our
own
application
to
alert
us.
We
do
have
a
couple
of
backup
ones
that
we
use
just
in
case
if
we
have
something
that
actually
causes
all
or
it's
not
to
be
able
to
go
out
or
the
account
that
we're
using
for
a
self
not
to
be
able
to
go
out.
C
A
A
B
Q: [Audience question, not fully captured: do you try to reproduce these failures in a test environment?]

A: We often do try to reproduce these problems in our load test environment. Usually we don't have to bring data over. We did that when we were expanding our cluster, to make sure that nothing went wrong when we were adding nodes; we had heard about problems with that happening in the past. Most of these weren't really data-related, though. The permission problem we've been able to reproduce pretty easily, and we've also been able to reproduce a lot of the schema disagreement problems that we had, just by adding packet loss.
Q: We're back here, hi, yes, this way. First of all, thank you for sharing your experiences with the Cassandra failures over the last year; those were very good, thank you. We have gone through similar things, but I wanted to understand more about the danger metrics that you guys are using: which tool are you using, may we know?

[The answer was not captured in the transcript.]
Q: Yeah, so you mentioned that process startup can be a very rare event. In the spirit of "if it hurts, do it more often": have you guys ever considered running restarts on your clusters continually? And if so, did that help reduce those types of errors and issues, or have you thought about doing that?
A: Continual restarts, right. What I would worry about with that is that it would be masking other issues. I think the ideal solution is: any time you have one-time process startup code, you should also write some other code that monitors for whatever condition it established, and do that almost continually. In the case of config files, rather than reading it once and then using it: sure, maybe you only read it once, but you should also have monitoring so that if it changes, you get alerted, and then you can take some action based on that.
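A minimal sketch of that kind of monitor; the path, interval, and alert hook are all made-up placeholders:

```python
import hashlib
import pathlib
import time

CONFIG = pathlib.Path("/etc/cassandra/cassandra.yaml")  # hypothetical path

def digest(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Snapshot what the running process loaded at startup, then alert on drift.
loaded_at_startup = digest(CONFIG)
while True:
    time.sleep(60)
    if digest(CONFIG) != loaded_at_startup:
        print("ALERT: config changed on disk; the next restart will "
              "pick up untested changes")  # stand-in for a real alert
        break
```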
Q: How does the schema disagreement occur? Bugs, I believe; it should never actually occur.

A: We are using 1.2, and I've heard that it's better in 2.0, but at least in 1.2 you can quite easily cause a schema disagreement by adding some packet loss to a node and then doing a few schema changes. After a little while, you'll see that the cluster disagrees; it's not…
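For reference, a sketch of that reproduction, injecting loss with Linux tc/netem; the interface name and loss rate are assumptions, and this must run as root on the node under test:

```python
import subprocess

def set_packet_loss(interface: str, percent: float) -> None:
    subprocess.run(["tc", "qdisc", "add", "dev", interface,
                    "root", "netem", "loss", f"{percent}%"], check=True)

def clear_packet_loss(interface: str) -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"],
                   check=True)

set_packet_loss("eth0", 5.0)
# ...run a few schema changes, then check "describe cluster" for
# more than one reported schema version...
clear_packet_loss("eth0")
```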