From YouTube: Netflix: A State of Xen - Chaos Monkey & Cassandra
Description
Speaker: Christos Kalantzis, Director of Engineering
This talk will cover how Netflix monitors its Cassandra fleet and the steps we take to make sure we can survive even the worst unplanned outages.
Christos: This time last year, Cassandra Summit was early; September, I believe it was September 11th. And something happened a week or so after that talk. We're going to talk a little bit about that incident, and then I'm going to hand it off to two of my senior software engineers, who will tell you a little bit about our automation stack and how we handled that event. But first, my name is Christos. I lead Cloud Database Engineering at Netflix, and today I think you're all very lucky to get John Sebastian and Nir instead; they're both funnier than I am. Sorry.
Anyone who gets really bad news goes through five stages; they're called the five stages of grief. And just like in the five stages of grief, when we got very bad news, we had very similar reactions. On Wednesday they told us: starting Thursday, and for the next four days, we're going to be rebooting EC2 instances. Well, our Cassandra fleet runs on EC2 instances, so our first reaction was denial.
No, they can't be doing this. This is ridiculous. This is AWS; this doesn't happen. Well, it turns out it was going to happen whether we liked it or not, and the reason was a security flaw in the Xen virtualization software they use to carve out EC2 instances on their hardware. So obviously the next stage was anger. For lack of a better word, we were pissed. Are you kidding me? You're going to do this?
We actually had a party set up in Los Angeles to celebrate our 50 million users, and now we were all going to miss that party. It wasn't a good end to a Wednesday to be a Netflix CDE employee. So then we went through the bargaining phase. We called up our TAM and said: hey, look, please, please, please don't do this, we've got a trip planned.
I basically told the guys: hey, we might want to update our resumes, and maybe go see the people at the Apple booth for a job, because if this all goes down and goes to hell, well, there might not be a Netflix come next week. But the final stage of grief is acceptance, and it wasn't a capitulation of "we can't do anything about it." It was more an acceptance of: wait a minute, we test for this. So maybe, maybe we're going to be alright.
Well, how do we test for this? A lot of you have probably heard of the Simian Army: Chaos Monkey, which made a cameo yesterday during the keynote; Chaos Gorilla; and Chaos Kong, where we evacuate all traffic from one AWS region to another to ensure continuity during events like this weekend's DynamoDB outage. So about a year and a half ago, we started pointing Chaos Monkey not only at our stateless systems but at our stateful systems as well. We turned on Chaos Monkey on our Cassandra fleet.
So Friday came along, sorry, Thursday came along, and they started doing the first set of reboots, and things looked pretty good. Then they did another zone, and another zone, and we started building our confidence that hey, maybe we can go to that party over the weekend. Friday night we were all still on standby, and we decided: no, let's do it, it's working, everything's fine. So we all hopped on planes and went to LA and had a good time.
We hung out with Snoop Dogg and R2-D2, and it was lots of fun. So Monday morning comes along, all the reboots were done, and the total was 218 Cassandra nodes rebooted; 22 didn't come back. Our automation detected it, found out why they didn't come back, and initiated auto-remediation: the instances were restarted, data streamed to the new nodes, and Netflix suffered zero downtime.
That's one heck of a feat. So I'm now going to give the stage to John Sebastian, who is going to talk a little bit about what the stack looked like back last September, and then he'll be followed by Nir, who will tell you what the stack looks like today and how we leverage a new product called StackStorm; they have a booth here.
John Sebastian: So I'm going to start by talking about what our stack looked like at the time, our automation platform. On the left you see a Cassandra cluster, a classic external cluster. The only difference from what you might have on your clusters is the Priam process. Priam takes care of maintenance operations on the cluster, for example repair and compaction, and it also takes care of token management: when a new node comes in, it takes care of assigning a token to it and lets it join the cluster.
And on the right, you're probably wondering: is there something missing here? We don't have any system there doing the monitoring. You probably know about OpsCenter; we don't use OpsCenter at Netflix. At the time we started using Cassandra, OpsCenter was non-existent, so we use Priam and our own monitoring system, which we just call Atlas. It's open source now; you can look it up. This is a standard dashboard for our Cassandra clusters.
Every few quarters we reassess OpsCenter to see if we're going to use it in the next version, but so far we've been using Atlas, and one of the reasons for that is what you can see in the top right corner: our Atlas cluster. Priam is responsible for sending the metrics to Atlas: it sends a heartbeat, and it sends a lot of metrics about coordinator latency and all the other metrics we use, such as dropped messages, Cassandra exceptions, and so on.
And on the left, here's an example graph. You'll see that we can put client-side and server-side metrics on the same graph, which gives us a good idea of how the system reacts on the server side. All the blue lines are client-side latency, the 99th percentile for reads, and we have multiple lines because we check the latency at the column family and keyspace level. The red one is the server-side latency.
So now that I've explained the monitoring system that we have, there's another way we monitor our fleet of Cassandra: our health check script. The health check script runs through Jenkins. It's more of a pull approach: it contacts all the clusters, all the nodes, and runs a deep diagnostic to see if there's anything going wrong.
I'm not going to go through this whole slide, but focus on green and red: green good, red bad, that's pretty simple. Green means nobody got paged, and the health check either didn't see any problem or saw a problem and fixed it automatically. Red means we got a problem we couldn't fix automatically, so we page an engineer.
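As a rough illustration of that green/red logic, here is a minimal Python sketch. All the helpers are hypothetical stand-ins, not the actual Netflix health check code:

```python
# Minimal sketch of the health check's green/red decision. The helpers
# below are illustrative stubs, not the real Netflix diagnostics.

def run_diagnostics(node):
    return []  # stand-in: return a list of detected problems

def try_auto_remediate(node, problem):
    return False  # stand-in: True if the problem was fixed automatically

def page_engineer(node, problem):
    print(f"PAGE: {node}: {problem}")

def check_node(node):
    problems = run_diagnostics(node)      # deep diagnostic on the node
    if not problems:
        return "green"                    # nothing wrong, nobody paged
    for problem in problems:
        if not try_auto_remediate(node, problem):
            page_engineer(node, problem)  # red: a human has to look
            return "red"
    return "green"                        # problems found but auto-fixed

print(check_node("cass-node-001"))
```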
The two cases that we had automated at the time were an instance that disappeared, for example when Chaos Monkey terminated one of our instances, or it got rebooted, or something like that; and the other case, which I won't go deep into, checks whether there are hardware issues on the disk. If the ephemeral drive just fails, there's nothing we can do, so we terminate the instance. For all the other cases, at the time, we just paged an engineer. So what about the reboot?
During the reboot it was kind of a mix of the Chaos Monkey exercise and single instance reboots, because, like Christos said, some of our instances never came back. They got rebooted, but the EC2 health check, Amazon's health check, actually contacts the nodes, and if a node doesn't respond, it terminates it. Some of our instances were so old that they got terminated because of that.
So what does that look like in our workflow? The instance disappeared because it never started back up, so we detected it and launched a new instance. For those of you familiar with AWS, the process of launching a new instance goes through auto scaling groups, which normally control a fleet of servers that can dynamically scale; but for Cassandra we keep it stable. We just tell it: we need six nodes in that zone. So the ASG is configured with six.
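For those less familiar with AWS, pinning an auto scaling group to a fixed size looks roughly like the snippet below, using boto3. The group name and region are made up; this is a sketch of the idea, not Netflix's actual tooling:

```python
import boto3

# Pin the group to exactly six nodes: AWS replaces a terminated instance
# automatically, but never scales the Cassandra ring up or down on its own.
# The group name and region here are made-up examples.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="cassandra-example-us-east-1a",
    MinSize=6,
    MaxSize=6,
    DesiredCapacity=6,
)
```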
The other case is that the instance gets rebooted but actually comes back up. The first check looks at it and says: oh, there's an instance missing. But it's not missing from the ASG; the ASG is full, there are six instances. Yet when we run a ring command, nodetool ring, we see there's a node that is down. So what do we do? We start looking at why it's down. We go through all these steps, and we find that the node is down because the Cassandra process is not running: the timing of the health check caught it while it was rebooting.
So how do we make sure we don't page an engineer for that case? Because we actually don't need to page; you just need to wait a little. That's actually what we already do: there are always transient failures in the cloud. Sometimes a node goes down, but if you wait 10 seconds it's just going to come back up from any network issue or disconnect.
So this process actually has a stage which asks: is it the first time it failed? If it's the first time this workflow is failing for that node, it's going to sleep for X minutes; at the time it was 10 minutes. The only thing we changed before going to the party was to switch a configuration value to say: let's give it 20 minutes, because we knew there were going to be a lot of reboots and it might take longer to reboot them.
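A minimal sketch of that first-failure grace period, with hypothetical helpers standing in for the real process and ring checks:

```python
import time

GRACE_MINUTES = 20   # was 10; bumped to 20 for the reboot weekend
failure_counts = {}  # per-node count of consecutive workflow failures

def node_is_healthy(node):
    return True  # stand-in for the real process/ring check

def page_engineer(node):
    print(f"PAGE: {node} is still down")

def handle_down_node(node):
    # Most cloud failures are transient: on the first failure, wait a
    # while and re-check instead of paging someone immediately.
    if failure_counts.get(node, 0) == 0:
        failure_counts[node] = 1
        time.sleep(GRACE_MINUTES * 60)
        if node_is_healthy(node):
            failure_counts[node] = 0
            return "recovered"
    page_engineer(node)
    return "paged"
```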
So that's our stack at the time. We identified a couple of gaps when we went through the reboot, and even before that we were already planning the next system. One of the gaps that we had was the fact that our automation system was a big monolith. It was a big bunch of Python scripts and Python libraries that we cooked up: a hundred thousand lines of Python.
This is actually a picture of our Netflix data center, the power cable, where we're hoping nobody will trip on it. No, just kidding. What I want to illustrate by this is a single point of failure. The Jenkins master was our single point of failure: if the master goes down, our scripts that were currently running will page the engineer. We wanted to reduce these false positives. Yes, there was a way to have a multi-master Jenkins, and there are always plugins that can do it.
But we wanted to go back to the basics; we wanted more of a clustered solution. We work on databases, so we know about high availability, and we wanted our monitoring system and automation system to have high availability as well. So, with all of these gaps, I'm going to hand it off to Nir, who's going to talk about how we actually fixed those gaps and what the new system looks like. Thank you very much.
Nir: Can you hear me? Oh, great, thanks. Yes, so we decided to embark on a new journey to figure out how we were going to fill in those gaps, and with that in mind, we didn't want to lose all the principles and lessons learned we had so far. That health check script you just saw: you just saw a simplified diagram of it, but it actually handles lots of edge cases, and throwing it all away would be a shame, definitely not what we wanted to do.
So we started by going out there and talking to other companies to see what they're doing. We went to Facebook and spoke with their engineers about FBAR. FBAR is a system that they designed that detects machines experiencing all kinds of issues in production and automatically takes them out of the fleet in order to heal them. We also went to LinkedIn, met with their engineers, and spoke with them about Nurse. Nurse is a platform that they also built in-house.
In addition to that, we also started looking outside at all kinds of open source projects that could fill in our requirements, checking different things and running proofs of concept, and then we had to come to a decision: do we build our own in-house solution, or adopt something that already exists, and of course contribute to it and enjoy it? We decided to go with StackStorm. StackStorm is an event-driven automation platform.
It has lots of capabilities. We mainly use it for auto-remediation, but it also has additional capabilities: providing an audit trail, integration with ChatOps, and many other things. As Christos mentioned, they have a booth right here. Go ask them questions, they're great guys. They have a great community; join in and contribute, you'll have fun.
So let's see how StackStorm integrated into our stack. This is the setup we saw before, with Jenkins remotely running scripts on our Cassandra fleet on the right side. Our Cassandra fleet reports metrics through Priam: Priam gathers metrics through JMX from Cassandra and sends them up to Atlas, our telemetry system. In addition, the whole Netflix environment, all the services, also reports metrics to Atlas, and if Atlas detects any kind of weird behavior, or that something's wrong...
...it hits the PagerDuty API in order to page our on-call. So now, instead of Jenkins, we have a cluster of StackStorm. Actually, to be honest, we didn't retire Jenkins totally; we still have some scheduled tasks, like repairs and compactions, that are still running there. But now, instead of hitting PagerDuty whenever an issue is detected, Atlas sends an event to StackStorm. StackStorm has a really fancy rule engine that can figure out which rule it should fire for that event.
By choosing that rule, the rule triggers a single action or a workflow, which is a chaining of actions where each action depends on the result of the previous action. So you have a full workflow of if/else, depending on the previous result. That way it's much easier to auto-remediate things. I want to provide you an example; a small code sketch of the idea follows it.
Take a disk usage alert that we used to get. Previously, it would just page the on-call. Now StackStorm receives the disk usage alert event, and it goes and gathers additional context. By additional context I mean things like: is there any offline process running? Are compactions running? When compactions are running, as most of us know, data size may creep up, maybe even up to double. So if compactions are running, maybe we shouldn't page the on-call.
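Here is a toy Python model of that workflow: the event comes in, context is gathered, and each step's result selects the next step. This only models the shape of the idea; it is not StackStorm's actual rule or action API:

```python
# Toy model of event -> rule -> chained actions. Each action returns the
# name of the next action (or None to stop), mimicking an if/else workflow.
# All helpers are illustrative stubs, not StackStorm's real interface.

def gather_context(event):
    event["compacting"] = True  # stand-in: ask the node what's running
    return "decide"

def decide(event):
    # During compaction, on-disk data size can temporarily grow, possibly
    # close to double, so high usage may be expected and no page is needed.
    return "email" if event["compacting"] else "page"

def email(event):
    print(f"email: {event['node']}: disk high, compactions running")

def page(event):
    print(f"PAGE: {event['node']}: disk high, no compaction to explain it")

ACTIONS = {"gather_context": gather_context, "decide": decide,
           "email": email, "page": page}

def run_workflow(event, step="gather_context"):
    while step is not None:
        step = ACTIONS[step](event)  # the result chooses the next action

run_workflow({"type": "disk_usage_alert", "node": "cass-node-001"})
```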
We can just send an email instead; a clean win for operations, since they're getting fewer pages. So that's already one good thing, but how does it fill all the gaps that we discussed before? Remember the monolith part. Now, instead of having monolithic scripts, we had to break them down into smaller, chewable actions that can be reused in different workflows. So breaking down the monolith is already a win. Another win is the chaining of actions.
It's not natural to chain Jenkins jobs, but it's much easier to chain actions using a workflow. And as for the single point of failure, we run StackStorm in a cluster. When Atlas triggers an event, one of the StackStorm instances will take that event and handle it. So even if a machine goes down, another machine will pick it up; as long as we have one machine that is functional in that cluster, the event will get handled.
So that's another win for resiliency. But if we can't auto-remediate, we can still use PagerDuty to page the on-call, and here there is still another big win. Why is that? Because StackStorm, by collecting all that additional context and information, can check stuff like the ring output, can check stuff like offline maintenance running, and can provide links to relevant dashboards, and things like that.
Now for a part that is personally important to me in this presentation. Remember I was talking about all the principles and lessons learned? I want to go over a few of them with you. Hopefully you'll be able to take something with you and make the processes in your systems more resilient.
The first principle is making your processes idempotent. An idempotent process means that you can run it multiple times, but it will only be applied as if it were run once. Now, the good thing about that: imagine you're upgrading a fleet of 200 nodes, and after 150 nodes the process crashes.
What do we do now? Start manually and remotely checking which versions are on which node, and which ones we need to upgrade? Not so fun from an operational perspective. By making your process idempotent, you can simply run it again. And here the implementation of idempotency is actually something really simple: before you start upgrading, just ask a simple question. What is the version running on that node? Run nodetool version: is it the desired version?
If it is, just keep going and move on to the next node. By doing that, we save a lot of time and a lot of resources, and we gain resiliency. So that's a huge win. One more thing to keep in mind: making your process idempotent can be even better if you break it into steps and make each and every step idempotent. AJ put it nicely when we spoke a couple of days ago; he said that idempotency means making a stateless system feel stateful. Think about it.
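A minimal sketch of that ask-before-acting pattern for the upgrade example; the version string and helpers are hypothetical:

```python
# Idempotent upgrade loop: check the installed version before acting, so
# rerunning the whole job after a crash is harmless. Helpers are stubs.
DESIRED_VERSION = "2.1.11"  # made-up target version

def installed_version(node):
    return DESIRED_VERSION  # stand-in for running `nodetool version`

def upgrade(node):
    print(f"upgrading {node} to {DESIRED_VERSION}")

def upgrade_fleet(nodes):
    for node in nodes:
        if installed_version(node) == DESIRED_VERSION:
            continue  # already at the desired version: skip it
        upgrade(node)

# If the job crashed at node 150 of 200, rerunning simply skips the
# first 150 nodes and picks up where it left off.
upgrade_fleet([f"node-{i}" for i in range(200)])
```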
It's a nice sentence. The second principle is simplicity. That one might be the hardest one to implement, because taking a complex system and making it simple requires some kind of genius, so it's not always feasible. But it's a good principle to keep in mind whenever you design a new system: if you can avoid complexity or do anything more simply, it's always a win, mostly from an operational point of view, but also for maintaining the code and so on. The combination of simplicity and idempotency is really powerful.
An example that I wanted to bring you here is what we call the Netflix resumable repairs. Same as the upgrades that we discussed before: running a repair on a 300-node cluster can take a while, say seven days. If, after five days, the Jenkins job crashes, no problem: repair is idempotent, right? We can run repair on a node multiple times, so we just run it again. This time, after another couple of days, the Jenkins job crashed again.
So now we're seven, eight days into the process. Say that one of our column families has a GC grace period of ten days. What does that mean? Does anyone here have an idea? I need your help. [Audience: the GC grace period is about to expire, some of the nodes weren't repaired, there are tombstones there, data resurrection.] Right, we're in danger of data resurrection. Thank you very much.
So we came up with a simple yet brilliant idea: before running a repair on a node, just check which Jenkins build number was last run on that node. This way, if the process crashes, we can simply run it again with the resumable repair ID, and by checking that ID and comparing, we don't need to rerun it on the same node. All the nodes that were already repaired get skipped quickly, and we start exactly from where we stopped, so it won't take another seven days to repair the whole cluster.
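A minimal sketch of the resumable-repair idea, with an in-memory dict standing in for wherever the last-run marker is actually stored:

```python
# Resumable repairs: tag each node with the ID of the repair run that last
# completed on it, and skip nodes already tagged with the current run ID.
last_repair_id = {}  # stand-in for a durable per-node marker

def repair(node):
    print(f"repairing {node} ...")  # the long, expensive part

def run_repair(nodes, run_id):
    for node in nodes:
        if last_repair_id.get(node) == run_id:
            continue                   # already repaired in this run
        repair(node)
        last_repair_id[node] = run_id  # mark only after success

# If the job crashes partway, rerun it with the SAME run ID and it
# resumes from the first unrepaired node instead of starting over.
run_repair([f"node-{i}" for i in range(300)], run_id="build-42")
```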
Another lesson learned is that we moved from using remote synchronous SSH connections to async HTTP connections. Again, take repairs: that's a long-running process, and there are a lot of long-running processes that we run. With a synchronous connection, your process becomes more vulnerable to transient network issues; it can easily fail. By using an async HTTP request with either callbacks or polling, you make your process more resilient. Also, moving from SSH to HTTP, there are many libraries out there that support it and it's easier to test, so we increased developer velocity by doing that.
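The submit-then-poll shape looks roughly like this; the management endpoint and its routes are made up for illustration:

```python
import time
import requests

# Instead of holding an SSH session open for a days-long repair, ask a
# (hypothetical) management endpoint on the node to start the task, then
# poll its status. A dropped connection now only delays the next poll.
BASE = "http://cass-node-001:8080"  # made-up endpoint

def run_long_task(task_type):
    task_id = requests.post(f"{BASE}/tasks",
                            json={"type": task_type}).json()["id"]
    while True:
        status = requests.get(f"{BASE}/tasks/{task_id}").json()["status"]
        if status in ("done", "failed"):
            return status
        time.sleep(30)  # tolerate transient network blips between polls
```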
Related to that: if all the clients keep hitting the servers with retries, without sleeping in between, the server will be overwhelmed and might fall over.
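The usual cure, hinted at here, is to sleep between retries, commonly with exponential backoff plus jitter so clients don't retry in lockstep. A generic sketch, not a specific Netflix library:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry fn with exponential backoff and jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, 4s ...
            time.sleep(delay * random.uniform(0.5, 1.5))  # spread clients out
```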
And the last principle I wanted to go over is using fallbacks. Some of you might have already used fallbacks without even realizing it: you try to get some kind of configuration or something from a remote service, the request fails, and you default to a hard-coded value. That's called serving a fallback.
In the Netflix context, most of you probably know that there is a personalized recommendation engine, so when you log into your Netflix account, you see personalized recommendations. If that service starts to become latent, or even goes down, you wouldn't want to get a 404, which would be a horrible customer experience. Instead of that, we default to some kind of general, global recommendation service. This way, most of you would probably not even notice the difference; you keep getting served with a nice page.
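A minimal sketch of serving that fallback; both services are illustrative stand-ins:

```python
# Fallback pattern: try the personalized service; if it is latent or down,
# serve a generic global list instead of an error page. Stubs throughout.
GLOBAL_TOP_TITLES = ["Popular Title A", "Popular Title B"]  # made-up data

def personalized_recommendations(user_id):
    raise TimeoutError("recommendation service is latent")  # simulate trouble

def recommendations(user_id):
    try:
        return personalized_recommendations(user_id)
    except Exception:
        # Degrade gracefully: the member still gets a usable page.
        return GLOBAL_TOP_TITLES

print(recommendations(user_id=42))
```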
You can hit the play button on a movie, watch it, and enjoy Netflix. I'd like to wrap up with our plans for the near future. We keep brainstorming all the time, thinking about what else we can do better: how can we improve our platform and our automation services and make them even more resilient?
One thing we came up with is to start gathering data, collecting metadata about the clusters, all kinds of statistics. This way we'll be able to try to predict things before they even happen. By being proactive instead of reactive, we can make our platform more resilient. Take the disk usage example that we just discussed: if our prediction platform can detect that we're going to run out of disk, it can trigger an event in StackStorm to scale up the cluster.
Audience: [question inaudible]

John Sebastian: What do we do, though, if StackStorm fails? Actually, right now, we get paged. Eventually we'll probably try to have a way to make it auto-heal, but right now it's configured with auto scaling in mind, so if an instance gets killed or anything, it'll bootstrap and rejoin the cluster. So far we haven't had any issues, but we'll see.
Audience: [question inaudible]

Nir: If it goes down mid-run? Oh, I'll take that. So, as we said, in a workflow each action's result might trigger a different action in the workflow, and you can define for each workflow an on-success and an on-failure scenario. The on-failure scenario could even mean that your script crashed. So even if something goes really bad, the on-failure will still get run, and you can leverage it to either page your on-call or send an email. That's what we currently do. Thank you.
Christos: And just one extra clarification, since the guys talked about operations: we don't have an operations team. Everyone on the team, whether they're developers or working on this stack, takes a turn on call. So when we say operations, that's actually everyone, and that's how all teams are set up at Netflix.
Audience: [question inaudible]
Christos: Yeah, so, applications: when there's a new application, we spin up a cluster for them, and most applications have their own dedicated cluster. We come up with an initial schema; applications evolve and new features get added, so yes, we reevaluate the schema on a regular basis. There are members on our team who are really great at schema design, NoSQL schema design, specifically Cassandra schema design, and part of what Cloud Database Engineering does is offer consultation and best practices.
So the question is: when we upgrade our Cassandra fleet, do we go to the latest and greatest, are we running 3.0, or do we take a step back? Well, DataStax Enterprise, which is what we run out of the box, is a version behind. So we run the latest DataStax Enterprise, which in and of itself is about a version behind.
Audience: [question inaudible]

John Sebastian: It was because, at the time, our Cassandra instances were long-running instances, and some of them were a year and a half, even two and a half years old. That was very old hardware, and sometimes when they rebooted, their boot sequence was too long to actually come up properly. EC2 has a timeout on this, and if they don't boot up in under 20 minutes, they just terminate them. So they got terminated by Amazon before they came up.
Audience: [partly inaudible] My question is: generally, I see the industry moving towards the cloud. You talked about the incident with AWS, where you had no control over when they were going to reboot. Well, if I have my own data center today, I have control over that. Where do you see the industry going, and how do you think things will play out?
Audience: I think a lot of companies do that today, but the question I have is: here you are, a company as well-known as Netflix, and you had no leverage against your provider, when all your customers could have been impacted. From a customer service perspective, I think that's control some companies would not want to lose.
Christos: So actually, in hindsight, Amazon did the right thing. It was a very bad security bug in the Xen virtualization software, where, if they hadn't done this and there was a zero-day flaw, everyone on AWS could have been hacked. So in a way, yes, it was a little annoying, it was kind of inconvenient, but in hindsight it was the right thing to do.