Description
PagerDuty had the misfortune of watching its abused, underprovisioned Cassandra cluster collapse. This talk will cover the lessons learned from that experience like:
• Which of the many, many metrics we learned to watch
• What mistakes we made that led to this catastrophe
• How we have changed our usage to make our Cassandra cluster more stable
Owen Kim is a Software Engineer at PagerDuty and enjoys whiskey, riding his Honda Shadow 600 (named "Chie") and discussing the finer points in narrative and expression in video games.
Hey everybody, my name is Owen. Yes, it's me, Owen. To be clear, I am from PagerDuty. Yes, thank you. So, to start off with, I was told that presentations need GIFs. I thought I had to have at least one, and this one seemed the most appropriate.
Basically, this is a talk about how PagerDuty, back in June, had the misfortune of watching our Cassandra cluster crumble, fall to pieces, and melt. This is essentially the story of that and the lessons learned from it. You always just shut your eyes, right? But there's always the one guy who doesn't, and that was us. Anyway.
So I'd be a little bit remiss if I didn't talk about what PagerDuty is to start with; someone would probably yell at me. If you don't know, PagerDuty is an alert notification and incident management system. Essentially, if you're running any kind of web service, you've probably got a swath of different monitoring services to make sure that everything is fine and dandy. We integrate with all of those monitoring services, centralize them all into one place, and also collect your on-call setup:
notification contact methods, people's preferred ways to be contacted, their on-call schedules, and your escalation policies in case someone doesn't answer. Essentially, when something goes wrong within your system, we make sure that someone is notified so that they can hop on it and handle it. With that in mind, the key point is that if what we're providing is a notification when your systems are down, obviously a huge constraint on us is to be up.
What good is an alert notification service that doesn't actually receive events? And if you can't trust us, who can you trust? So, to go into how PagerDuty uses Cassandra and what we're using it for: a lot of it is about redundancy, independent faults, and distribution. We use Cassandra to provide durable and consistent reads and writes in a critical pipeline in PagerDuty.
For us, we integrate via HTTP endpoints, which are either first-party integrations with monitoring services or a generic email endpoint. Basically, we have a set of distributed endpoints, and then events flow through the whole system: into incident management, through a notification management service, and ultimately to the messaging service that actually reaches out to people. So we have this pipeline of separate services that must always be up, flowing, and healthy.
This is built on top of separate Scala services. We leverage Cassandra as a durable platform, and we also use other distributed technologies like ZooKeeper for coordination. In terms of the volume we take in, we currently get around 25 requests per second from various monitoring services and people's infrastructure, and each one of those turns into a handful of Cassandra operations, followed by the actual asynchronous processing that goes through the pipeline.
So each one of those actually turns into a fair amount of load. PagerDuty also has an internal policy that we've built a lot of architecture around, which is that we don't lose events. We don't lose a message: if we've accepted the message, if we've returned a 200 on the HTTP endpoint, we do not drop it. We never drop it, except for when we do, and I'll talk about that a little bit. In our heart of hearts, we don't do that.
We don't do that, except for the little corner of our hearts that sucks. A lot of our infrastructure and design is built around this principle. So, to go a bit more into the details of what we've built in regards to Cassandra: we use Cassandra 1.2.
A
We
are
built
on
top
of
the
the
thrift
ap
api.
We
have
not
migrated
off
to
cql
we've
been
we've
had
cassandra
in
production
for
about
two
over
two
and
a
half
years,
so
we're
a
little
slow
it's
in
migrating
to
the
new
cql.
Also
over
the
years
we've
used,
you
know
hector
cassie
astin
acts
as
our
cli
as
our
various
clients,
we're
still
on
assigned
tokens
in
terms
of
the
how
to
assign
assigned
token
ranges
within
within
our
cluster.
We're not on vnodes. And probably the most key thing is that PagerDuty is not using Cassandra for big data; our data set in Cassandra is relatively small. I think right now it's on the order of tens of gigabytes.
The idea behind that is, as I said, we're using Cassandra as a platform for a critical pipeline, and once something has gone through that pipeline from start to finish, it's not super critical anymore. So we try to keep our data set, and therefore our working set, as slim as possible in Cassandra. And this is what our cluster looks like.
We generally use five nodes; we are in the process of upscaling one of them to ten nodes. So it's a fairly small cluster, but we use those five nodes with a replication factor of five, placed two, two, and one across multiple DC regions. In this graph, DCA is AWS us-west-1, DCB is AWS us-west-2, and DCC is Linode Fremont, so we have this Cassandra setup across multiple regions. We use quorum-consistency-level operations and, like I said, a replication factor of five. When we scale up we're going to keep the same replication factor, so the scaled-up clusters look like four, four, and two nodes. So what are the implications of that?
Because we're doing quorum-consistency-level operations on top of a two-two-and-one cluster, every operation, every read or write, crosses DCs, and we take the inter-DC latency hit. For our purposes that's fine, because we're talking about a pipeline that's not as human-visible. We're not latency sensitive in that sense; we are latency sensitive in that the pipeline needs to maintain a reasonable SLA.
But that's on the order of a couple of minutes, not milliseconds, so we're able to absorb that latency and keep the SLA. We are more throughput sensitive, and more concerned about throughput, when it comes to our Cassandra setup.
And because we're using Cassandra at the quorum consistency level, we have consistent reads and writes, and that's back to this principle: if we write an event and we take it in, it's not lost. It doesn't get lost into the ether of the internet, or into a single data center, never to come out until that data center suddenly reappears. If we've accepted it, it needs to flow through, and quorum reads and writes are how we make sure, and validate in production, that that's the case.
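The arithmetic behind that is worth spelling out. Here is a minimal sketch, assuming the standard Cassandra quorum rule (floor(RF/2) + 1) and the two-two-one placement described above, of why every quorum operation in this cluster necessarily touches more than one data center:

```python
# Sketch: QUORUM on an RF=5 cluster placed 2/2/1 across data centers.
# The placement mirrors the talk (AWS us-west-1, AWS us-west-2, Linode Fremont);
# the quorum size is Cassandra's standard floor(RF/2) + 1.

replicas_per_dc = {"us-west-1": 2, "us-west-2": 2, "linode-fremont": 1}

replication_factor = sum(replicas_per_dc.values())      # 5
quorum = replication_factor // 2 + 1                     # 3 replicas must acknowledge

max_replicas_in_one_dc = max(replicas_per_dc.values())   # 2

print(f"RF={replication_factor}, QUORUM={quorum}")
assert quorum > max_replicas_in_one_dc, (
    "no single DC holds a quorum, so every QUORUM read or write crosses DC links; "
    "that is the inter-DC latency cost, but it also means losing any one DC "
    "still leaves a quorum available"
)
```

The flip side of that same arithmetic is that any write quorum and any read quorum overlap in at least one replica, which is where the consistent reads and writes come from.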
So, back to the core of the story here. June 4th was the fateful day: on June 4th everything fell to absolute pieces, and it kept me up for a couple of nights as I cried.
It got to the point where the Cassandra cluster, at the very front end of the pipe, was beginning to refuse all requests. No new events were coming into the pipeline, effectively all in-flight messages in the pipeline were halted, and things ground to a complete halt. We obviously saw degraded performance in the overall pipeline, and essentially PagerDuty took a three-hour outage on June 4th. As part of rectifying the situation, and I'll go into this more deeply,
we ended up having to blow away all the data in our Cassandra cluster in order to get things flowing again. If you want all the gory details that I have wept about, we have a very public postmortem on our blog that talks about the extent of the damage this did to PagerDuty. We tag those posts as "postmortem", so it should be fairly easy to find if the link doesn't get posted somewhere.
So, yes, June 4th. June 4th was a fairly regular day. There were no particular changes in our input volume; the rate of traffic didn't change significantly. But we had had an incident on June 3rd, the day before. What happened on June 3rd was that we saw some degraded performance, just a minor blip in our cluster. Someone's laughing at me because I called it minor.
We had an issue with our Cassandra cluster the day before. What we were seeing was that our compactions and our repairs had begun to take longer and longer, so long that compactions were starting to happen at the same time as repairs, and it was happening on multiple nodes, up to the point where enough nodes were having issues that the whole cluster was not performing properly.
During that day, on June 3rd, we took a couple of remedial steps to try to mitigate the issue. Some of the nodes had particularly high load, so on those we used nodetool disablethrift, because we didn't trust them to work as coordinators anymore, and that saw some pretty major improvement.
A
We
actually
had
some
started
taking
down
some
nodes,
we
tr
and
we
also
for
a
head
foregone
like
we
were
supposed
to
do
a
repair
during
this
degraded
performance
and
we
actually
stopped
the
repair
before
it
started
and
to
put
it
off
to
another
time,
the
disabling,
the
node.
Things
really
did
work
and
we
saw
first
and
I've
never
been
able
to
explain
this
particularly
well,
but
for
some
re
for
some
reason.
During
this
time
we
took
our
cluster
down
to
three
nodes.
instead of five, and load just dropped. We'd bring one back up, load would skyrocket again, and we'd take it back down. And it was any three out of five; it was the most mysterious thing. The cluster would only work with any three of the five nodes. I have no strong theory on this one. I thought perhaps it was hinted handoff:
every time we brought a node up, hinted handoffs would come into play and degrade the whole cluster. But it eventually repaired itself, even though hinted handoff should have still been a factor, so I've never really had a very satisfactory theory on that one.
Anyway, that's a bit of a tangent. The next day, since we had forgone a repair, we sat down to make up for that missed repair, and then this happened. I don't know how clear this is: this is a graph of one-minute system load on our Cassandra cluster, and if you can't see the y-axis very well, the high-water mark where it peaks is 55.
At various points the average load across our five-node cluster was over 10, and that was not good. It was a very unhappy cluster at that point.
During this whole fiasco, we tried a number of different things to mitigate the damage we were seeing. We stopped the less critical clients, the less critical tenants of this Cassandra cluster, hoping that would reduce load, and it had no real meaningful effect. We tried disabling Thrift on the more unhappy nodes, to no real effect. We tried replicating the unexplained thing where we took down nodes, and that had no effect. So at that point
we basically had no choice but to do what we hate to do, and lose events. We blew away all of our Cassandra data. We deleted the commit log, the saved caches, the data directories, absolutely everything, and then we restored some of it: we have a very small subset of non-ephemeral data, so we restored that, restored the schema, and then everything worked just fine.
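For context on what "absolutely everything" covers on disk, here is a small sketch that only reports the sizes of the pieces involved. It assumes the stock package-install locations from cassandra.yaml (data_file_directories, commitlog_directory, saved_caches_directory); your paths may differ, and the actual remediation removed the contents of these directories with the nodes stopped, then restored the schema and the small non-ephemeral data set.

```python
import os

# Assumed default Cassandra 1.2 on-disk locations; adjust to your cassandra.yaml.
CASSANDRA_DIRS = [
    "/var/lib/cassandra/data",          # SSTables, one subdirectory per keyspace
    "/var/lib/cassandra/commitlog",     # commit log segments
    "/var/lib/cassandra/saved_caches",  # key/row cache snapshots
]

def dir_size_bytes(path):
    """Total size of all files under path (0 if it doesn't exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total

for d in CASSANDRA_DIRS:
    print(f"{d}: {dir_size_bytes(d) / 1e9:.2f} GB")
```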
Again, there were no meaningful changes in the data set, or in the traffic patterns or traffic volume, in any way. But we blew away the data set and everything worked again. So, let's talk about what went horribly, horribly wrong.
One thing I kind of hinted at about how we were using Cassandra at the time: even though we had these services in a critical pipeline, they were all being served by a single Cassandra cluster. In retrospect, with twenty-twenty hindsight, that was a mistake. I suppose there is an argument for operational ease in running a single Cassandra cluster and supporting multi-tenancy, but I don't think it's worthwhile.
Basically, even during this horrendous event, we were trying to figure out: where is this load potentially coming from? What data sets, what traffic patterns are causing it? And because a lot of the metrics we were getting were very much at the cluster level and not at the keyspace level, we had no real way to quickly, at a high level, narrow down where it was coming from.
The other issue was that, again in retrospect, we were underprovisioned. The AWS nodes we were using at the time, the EC2 nodes, were m1.larges, so they only had two cores and about eight gigabytes of memory. Looking back, you can definitely see in our metrics that the CPU was hurting, but I think it was also the memory constraints that were really, really harsh, and I'll show a little bit more about that in a second.
The next thing was that we had a fair bit of monitoring and metrics, but the hard part was that we didn't necessarily have meaningful thresholds and high-water marks. Yes, load has skyrocketed, that's bad, but how many pending compactions is bad? How many blocked flush writers is bad? We didn't have great thresholds for those, because that's something you usually only figure out when things go wrong. We also had a misguided and very twisted desire to get everything we possibly could out of this little cluster we had spun up, and that was definitely a mistake in retrospect.
Basically, what I mean by multi-tenancy, and I meant to clarify this, is that we have this multi-service pipeline with different Scala applications in it, and each one was talking to the same Cassandra cluster, with their data sets separated by keyspace. That's what I mean by multi-tenancy: our services, our applications, were logically separate apps within our infrastructure, but they were all talking to the same Cassandra cluster.
So why did we not see this sooner? Clearly there should have been some kind of warning signs, and to some degree, yes, there were. But I like to tell myself a few things about the mistakes I made, to help myself sleep better. Even though we were abusing this Cassandra cluster, 99.9 percent of the time everything was fine. Everything was great.
Our read and write latencies in this cluster were, on average, pretty close to the inter-DC latencies that we would expect: if the DC latency is 20 milliseconds, our reads and writes were about 20 milliseconds. This was true probably 99.9 percent of the time. Even when load was around one, which is pretty high, we would see that hey, it was still operating okay.
So, in my mind, there was sort of this lesson that Cassandra has two modes of operation: everything is fine, or everything has gone to hell. Furthermore, we thought: hey, if everything is fine most of the time and it's wrong sometimes, it must be when we misconfigured something. We thought about tuning the Java heap memory. We thought about throttling compactions.
We told ourselves it's not us, it's the configuration; we can make it work, right? And then finally we thought: we don't have a lot of data, and Cassandra is supposed to be able to handle big data, so again, something must be misconfigured; we should be able to make this work. All lies we kind of told ourselves. So, back to that constant memory pressure thing.
As I said, we had eight gigabytes of RAM on the boxes, so we followed the JVM heap-size defaults, which gave us a quarter of that: about two gigabytes of memory for the Java heap on Cassandra. You can see it there, or maybe you can't; the y-axis is kind of small, and I apologize. On the left, heap usage is hanging around 1.8; the dip in the middle is when we blew everything away; and the part on the right is after we blew everything away.
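To make that "quarter of it" concrete, here is a rough sketch of the default sizing rule that cassandra-env.sh applied in that era when MAX_HEAP_SIZE was not set explicitly. Treat the exact formula, max(min(half the RAM, 1 GB), min(a quarter of the RAM, 8 GB)), as an assumption and check your own cassandra-env.sh:

```python
def default_max_heap_mb(system_ram_mb):
    """Approximate the default JVM heap Cassandra's startup script picks when
    MAX_HEAP_SIZE is unset: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)).
    (Assumed from the 1.x-era cassandra-env.sh; verify against your own copy.)"""
    half = min(system_ram_mb // 2, 1024)
    quarter = min(system_ram_mb // 4, 8192)
    return max(half, quarter)

# m1.large: ~8 GB of RAM -> about a 2 GB heap, which is what we were running with.
print(default_max_heap_mb(8 * 1024))    # 2048
# m2.2xlarge: ~32 GB of RAM -> the 8 GB cap kicks in.
print(default_max_heap_mb(32 * 1024))   # 8192
```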
Basically, what we were seeing was that GC was happening in the JVM on Cassandra and never finding more space. It wasn't the much healthier up-and-down pattern, where garbage collection happens and then there's more space; it was garbage collection happening and still struggling to find space. That's bad. That is something we definitely should have seen. Furthermore, at that point we also saw some other things in our logs.
The logs were reporting things like "flushing this column family to relieve memory pressure". That was there to be seen, and we did not act on it. Another thing, harder to measure but definitely there: like I said, we were seeing slower repair and compaction times, and likely that was actually related to the memory pressure, but it was definitely another warning sign we should have acted on.
And another thing that I've perhaps only really seen in retrospect: we were measuring operation latencies in terms of average, median, and 95th percentile, and we were largely looking at the median. What is the median, does that look sane right now? In what we consider to be a healthy Cassandra cluster,
we see maybe a 30 to 40 percent difference between median latency and p95 latency. Some of that can be chalked up to the operations we're doing, some of it can just be jitter from the network, and some of it is because we have a heterogeneous cluster.
Because we're on multiple different providers, some of the difference can be chalked up to the nodes being different. Some of it we've also seen because we have different latencies on the links between DCs, so load doesn't actually distribute perfectly in our cluster; we see some asymmetric load, and that could explain a difference in our p95s.
But looking back, we see these deviations even on our very standard, very simple key-value reads and writes on skinny rows, which should be very quick, easy, and stable in Cassandra; the median and the p95 were deviating pretty severely.
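In hindsight, a cheap check would have been to alert on the spread between the median and the p95 rather than eyeballing the median alone. A minimal sketch of that idea follows; the 1.4x ratio is a hypothetical threshold derived from the 30 to 40 percent gap we considered normal, not a number we actually alerted on at the time:

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, int(round(pct / 100.0 * (len(ordered) - 1)))))
    return ordered[rank]

def p95_spread_alert(latencies_ms, max_ratio=1.4):
    """Return a warning when p95 deviates from the median by more than max_ratio."""
    median = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    if median > 0 and p95 / median > max_ratio:
        return f"p95 {p95:.1f} ms is {p95 / median:.1f}x the median {median:.1f} ms"
    return None

# Example: mostly ~20 ms cross-DC round trips, with a long tail creeping in.
samples = [20, 21, 19, 22, 20, 23, 21, 20, 80, 95]
print(p95_spread_alert(samples))
```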
So after that particular day, the aftermath was particularly rough. We had blown away the data set and everything was fine, but in our minds that was only going to buy us a little bit of time, so we had to take some pretty clear steps. First off, we immediately bumped every single node up to m2.2xlarges, so they had four cores and 32 gigabytes of RAM; I think the gigabytes are around there. It made a huge difference.
We stopped serving multiple apps off of the same Cassandra cluster. This actually required a huge amount of work on our side at the application level to do this kind of hot migration. It turned into a lot of work, but it was absolutely necessary and an absolutely great benefit. And then we also began watching metrics we hadn't been watching before, in particular a lot of the pending-task metrics you can get out of JMX.
Things like pending read repairs, pending compactions, and blocked flush writers; that last one I've found to be a pretty good indicator of load and of needing to scale. And also dropped messages: dropped messages are just bad, and if those start happening, you should begin to worry.
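A minimal sketch of that kind of watching, built on `nodetool tpstats`, is below. The thresholds are placeholders to illustrate the idea rather than the values we settled on, and the output layout varies a bit between Cassandra versions, so treat the parsing as an assumption:

```python
import subprocess

# Placeholder thresholds -- tune for your own cluster; these are not our production numbers.
PENDING_LIMIT = 32
BLOCKED_LIMIT = 0
DROPPED_LIMIT = 0

def tpstats_warnings(host="localhost"):
    """Parse `nodetool tpstats` and flag pending/blocked thread-pool tasks and dropped
    messages. Assumes the usual two tables (thread pools, then 'Message type / Dropped');
    column layout differs slightly across Cassandra versions, so treat this as a sketch."""
    output = subprocess.check_output(["nodetool", "-h", host, "tpstats"], text=True)
    warnings, in_dropped_table = [], False
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Message":              # start of the dropped-message table
            in_dropped_table = True
            continue
        if in_dropped_table and len(parts) == 2 and parts[1].isdigit():
            if int(parts[1]) > DROPPED_LIMIT:
                warnings.append(f"{parts[0]}: {parts[1]} dropped messages")
        elif not in_dropped_table and len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            pool, _active, pending, _completed, blocked, _all_time = parts
            if int(pending) > PENDING_LIMIT:
                warnings.append(f"{pool}: {pending} pending tasks")
            if int(blocked) > BLOCKED_LIMIT:
                warnings.append(f"{pool}: {blocked} blocked tasks")
    return warnings

if __name__ == "__main__":
    for warning in tpstats_warnings():
        print("WARN", warning)
```

The FlushWriter row's blocked count in that output is the "blocked flush writers" signal mentioned above, and the dropped-message table at the bottom is where the dropped messages show up.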
And then we have a set of lessons learned. As I said before, Cassandra feels like it has two modes: basically, everything is fine, or everything is very much not fine.
It has very steep performance degradation, and with that in mind, it is very, very important to stay ahead of the scaling curve. Don't ignore warning signs that something might be wrong; just scale. It should be easy to scale.
And don't try to squeeze your way out of it with configuration tweaks. "It should be easy to scale" is the next lesson: scaling is a pretty common operation that you'll be doing with a Cassandra cluster. One of the reasons you're probably here, as I am, is because it scales well. So the operations you need to perform in order to scale are something you should definitely practice and be comfortable with, in case
you have to do them in a pinch. The next thing is to understand that the performance of Cassandra is tied to how the data set takes shape. It's not necessarily about the operations that are going on right now; it has a lot to do with the data set
that's underneath. A lot of that comes from things like the width of your rows, or just how much work it takes to do repairs and all the background asynchronous anti-entropy work.
I don't think multi-tenancy is worthwhile. Whatever operational ease you can get out of multi-tenancy, out of running a single cluster with multiple clients, should be mostly offset if you have a nice, robust, flexible, and extensible configuration-management infrastructure. Internally we use Chef, so it should be fairly easy to use that kind of management tooling to bring up, configure, and start running new clusters.
It wasn't easy for us at the time, I'll admit, but we've since made it easier. Cool. Having talked a little fast, I think that's all from me. Thank you very much for listening. I would be yelled at if I did not say that PagerDuty is hiring; someone really will yell at me. We're hiring people all over.
In particular, we're hiring Cassandra people, anywhere between enthusiasts and experts, for our real-time team, which I work on (that critical pipeline again), as well as for persistence engineering: people focused on and dedicated to our Cassandra clusters.