From YouTube: DataStax: Diagnosing Performance Problems in Production
Description
Speaker: Jon Haddad, Apache Cassandra Evangelist at DataStax
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We'll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
My name is Jon Haddad, and I'm a Technical Evangelist for DataStax. I'm going to be talking to you today about diagnosing problems in production. I don't know how many of you were here for training yesterday (everybody did some training, and we got some data modeling talks), and I don't know how many of you have put Cassandra into production yet, but with any database, with any project, there's always a time when something goes wrong in production.
What I want to do here is outline some things that, if we get them done ahead of time, if we know about them ahead of time, should help minimize the number of problems we encounter. So the first thing we're going to talk about is preparation. This is stuff we just should do, and obviously it has to happen beforehand; you can't prepare afterwards. It doesn't work like that.
The first thing is OpsCenter. You've probably seen this already; it's going to help you out with 90% of your problems. You can graph individual servers, JMX information, memory usage, and compaction history. It's basically an operational tool designed to work exactly the way you need with Cassandra, and it's going to be the first place you look when there's a problem. The first thing to know is that the community version is free, and there's a version that comes with DSE that has a couple more features, but the free version is really useful.
It should be on every cluster. Aside from OpsCenter, there's some other stuff you want to take a look at, because you need to monitor the rest of your application. You're going to want tools like Monit or Munin set up, things like collectd to collect system metrics, and tools like Nagios or Icinga, or you can use a hosted solution. Basically, you want to have a really good picture of what's going on.
If there's a problem with your cluster and one of your machines dies, let's say a disk dies or an application keeps quitting, you should know pretty quickly without having to SSH into your servers and look around. Especially if you have more than a handful of them, that's kind of a nightmare. You definitely want to know right away which server is freaking out and which resource on it is constrained.
So if it's a bad disk, you should know about it right away, and that shouldn't necessarily mean you have to get up in the middle of the night (with Cassandra you shouldn't ever have to), but you definitely want to know what's wrong. So basically: if you're going to roll it yourself with open source, that's cool; if you're going to use some proprietary thing, also cool; or if you're going to use a hosted thing, do whatever you want.
Just make sure you get something in place so that when something does go wrong, you are able to quickly diagnose which server has the problem. Now, aside from just looking at individual servers and processes and knowing when disks are bad, it's a really good idea to have application metrics built in. I don't know how many of you have ever worked with a tool called statsd, but it's really easy to integrate into your applications: you can have timers, you can have counters, and it also has gauges.
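As a minimal sketch of how simple that integration is (the metric names and the statsd host here are made-up examples), statsd accepts plain-text UDP packets, so you can emit a counter or a timer from a shell with nothing more than netcat:

```bash
# counter: one page view
echo "myapp.views:1|c" | nc -u -w1 statsd.example.com 8125

# timer: a 320 ms request
echo "myapp.request_ms:320|ms" | nc -u -w1 statsd.example.com 8125
```

Application code would normally use a statsd client library instead, but the wire format is exactly this simple.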
Aside from the metrics integration, if you don't want to use that, there's another tool called jmxtrans that's really useful, and it can ship the JMX metrics out to wherever you want. So it's pretty cool; you can end up with some really flexible graphs. You can see the two I have on here: there's a nice dashboard with nine graphs, and at the bottom there's a tool called Grafana. The point is that server metrics are really useful for quickly diagnosing problems that you might have.
The third thing that I'm going to talk about is log aggregation, and this is something that I've actually had people fight me on in the past, and I don't really understand it. If you have a problem and you can take a look at logs that tell you exactly what's going wrong, with meaningful data like user information and stack traces, just please do it. You will save yourself a huge amount of headache.
Make sure that you don't have a ton of noise in your logs. If you look in your logs and you had a million errors in the last minute and nothing's actually broken, you're just totally wasting your time. Take the time to have good logs; it will help you out in the long run. I've solved dozens of problems that would have taken hours and figured them out in seconds just from looking at the logs.
So there are a couple of things along the way, if you're running a Cassandra cluster, that you can just trip up on. They're definitely outlined in the docs, but if you don't happen to read that one paragraph, or if you just don't care and you roll things into production, or if you don't read as much as you should (I've done that pretty much every time), these are things that can hit you. They can impact you in a really negative way, and they're really confusing.
The first thing that we're going to talk about is incorrect server times. I don't know if anyone here has run into this issue, but it's kind of a nightmare if you don't realize it. Basically, if you have the wrong time on your servers, you can have a whole bunch of weird issues, especially if they're customer facing, but in the case of Cassandra you can actually have some really unpredictable data.
Data can either show up or disappear, or just weird conditions can occur. Basically, whenever you do a write with Cassandra, everything comes with a timestamp, and if you have two servers with wildly different times, like in my example over here, you can end up doing a write on, let's say, the first server, and it thinks it's ahead in the future.
Then you can do a delete on the second server, and it thinks it's way in the past, and it turns out the delete won't actually take effect because its timestamp is behind the original one. I honestly don't know how to diagram this to make it really obvious, because it's such a weird problem, but I can tell you that you can do things and they won't actually be deleted. You don't want that. It's weird, and the solution to this is really simple.
A
It's
just
always
make
sure
that
you're
running
ntpd
and
your
clocks
are
correct.
So
you
want
to
do
this
on
your
application
side
and
on
your
server
side,
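A minimal sketch of checking that, assuming a typical Linux install (service and package names vary by distribution):

```bash
# is ntpd actually running?
pgrep -l ntpd || echo "ntpd is not running"

# how far off is the clock? small offsets (a few milliseconds) are what you want
ntpq -p
```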
A big thing that's kind of confusing for a lot of people is tombstones and the effect they can actually have on your application performance. A tombstone, as everyone here probably already knows, is a marker that data no longer exists, and it has a timestamp just like normal data, kind of like what I was showing you with the delete.
A delete comes with a tombstone, the tombstone comes with a timestamp, and it basically says that as of this time, this data no longer exists. Now you can run into a problem which, on the mailing lists, frequently gets called tombstone hell: if you have too many tombstones, you can end up with some massive performance failures. The classic example of the thing that Cassandra is terrible at is a queue. Everyone seems to want to put a queue on top of Cassandra, and everybody runs into massive performance problems.
It just does not work very well. The example that I have up here on my slide is what happens if you have a really big partition: it's a hundred thousand rows, and at the front of it are 99,999 tombstones, and you only want to get one thing out of it. Well, Cassandra is going to have to read a hundred thousand rows, or a hundred thousand tombstones, just to give you one row in response, and you're going to have a really crazy trace.
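As a rough sketch of that anti-pattern (the keyspace, table, and queue names are all hypothetical, and the keyspace is assumed to already exist), the data model usually looks something like this, and the SELECT at the end is the read that ends up wading through the tombstones:

```bash
cqlsh -e "
CREATE TABLE IF NOT EXISTS demo.job_queue (
    queue_name text,
    id         timeuuid,
    payload    text,
    PRIMARY KEY (queue_name, id)
);
-- each consumed item gets DELETEd, leaving a tombstone at the head of the
-- partition; after enough deletes this 'pop' has to skip all of them first
SELECT id, payload FROM demo.job_queue WHERE queue_name = 'jobs' LIMIT 1;"
```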
Another problem that is kind of a bummer is when you roll into production really quickly and you didn't take the time to research what a snitch does. A snitch is really useful because, well, it's got a few uses, but primarily it's for high availability: it lets us distribute our data in a fault-tolerant way.
So you definitely want to do that, and switching the snitch after you've gone into production is really time consuming, because you're going to have to run a repair, and if you have terabytes of data sitting in your cluster, all that data is going to be streaming all over the place. It's going to take a long time, it's a huge pain, and you can save yourself all of that just by doing this up front.
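A quick way to sanity-check this before go-live, assuming the default config location (paths vary by install method):

```bash
# the snitch is set in cassandra.yaml; pick it deliberately before production
grep '^endpoint_snitch' /etc/cassandra/cassandra.yaml
# e.g. endpoint_snitch: GossipingPropertyFileSnitch  (a common production choice)
```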
This next issue is becoming less of a problem with newer versions of Cassandra, but I still recommend that you don't do what I'm about to describe, which is the thing that I did. If you're running, let's say, an older version of Cassandra, like version 1.1, and you decide to upgrade to 1.2 or 2.0, what you don't want to do is try to bootstrap a new node of a different version. The reason is that the SSTable format changes between versions, and that breaks streaming.
So I tried this, and it basically just sat there for a few days, and I was like, I don't know what to do with this cluster. You end up having to kill the node in a brutal way and then repair, and it just turns into a total disaster. So if you want to upgrade your cluster, don't try to do it the clever way of putting new nodes in and having them be different versions.
Just upgrade your nodes in place and you'll have a much better time than I did. And if you actually have the right snitch set up, which you should by now because I just talked about it, you can upgrade an entire rack at a time, as long as you're not using consistency level ALL, which nobody in here should ever be using unless you love downtime. So don't do that.
Running out of disk space actually comes up a lot, and it shouldn't, because people have alerts in place and they should know what to do. But okay, we ran out of disk space, what do we do? We're going to add new nodes. The problem here is that when you add a new node, we're going to stream data to the new node from the old ones, but the old ones won't actually delete that data on their own.
You have to run what's called a cleanup, and cleanup will get rid of the old data and let you reclaim your disk space. You could theoretically expand your cluster from one node to a thousand, but your original node is going to have the same amount of data on it. So you have to run the cleanup, or you're still going to have a full disk, and that sucks.
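A minimal sketch of that step, run on each pre-existing node after the new nodes have finished joining:

```bash
# removes data the node no longer owns after the ring changed
nodetool cleanup
```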
You've probably heard this a bunch of times, and I'm going to say it again: do not use shared storage. That means a SAN, that means NAS, that means EBS.
It is not good; it's a single point of failure. Everybody says, oh, my SAN has 300 drives and 26 power supplies and I don't know what else, they have lots of things and it sounds really cool, except when your SAN stops working, which it does, and your entire Cassandra cluster stops working with it.
That's pretty terrible, so don't use a SAN or NAS or EBS. The other thing to remember is that even if it worked perfectly, you're still dealing with the fact that SANs are ridiculously expensive compared to just throwing solid-state drives in your servers. So it's going to be less expensive for you to use local storage, it's going to be more reliable, and it's going to be a lot faster. Those are all good times. All right, compaction trips a lot of people up. It's basically the process of merging SSTables.
You've got a couple of different options. The default one, size-tiered compaction, has been the one that ships with Cassandra for a long time. It's really good for write-heavy workloads, and it performs better on spinning disks than the other option that's been around for a little while, called leveled compaction. Leveled compaction will give you a performance increase on read-heavy workloads and update-heavy workloads.
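If you decide a table fits that profile, switching strategies is a one-line schema change (the keyspace and table here are hypothetical):

```bash
cqlsh -e "ALTER TABLE demo.users
          WITH compaction = {'class': 'LeveledCompactionStrategy'};"
```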
So if you've got one of those, and those are actually the workloads that I've worked with primarily, leveled compaction can be a huge help, because it's going to minimize the number of SSTables that a particular partition falls in. Read-heavy workloads, anyone here? Good times. Not a single read-heavy workload? This guy with his camera, probably. All right, whatever. You can take a look at how much compaction is going on, you can throttle it via nodetool, and that's useful, and you can stop it if you want to.
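A sketch of the nodetool knobs being referred to (the 16 MB/s figure is just an example value, not a recommendation):

```bash
nodetool compactionstats             # what's compacting right now, and how much is pending
nodetool getcompactionthroughput     # current throttle
nodetool setcompactionthroughput 16  # throttle to 16 MB/s (0 disables the throttle)
nodetool stop COMPACTION             # abort the currently running compactions
```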
A
If
you
want,
you
shouldn't
feel
like
I,
said
level,
compaction
really
good
for
solid-state
drives
and
Reed
heavy
workloads
and
you're
gonna
want
to
stick
to
sized
here,
you're
on
spinning
disks,
there's
another
one
that
just
came
out
gate
to
your
compaction,
really
good.
If
you've
got
time
series
data
with
TTLs,
because
it'll
drop
an
entire
SS
table
at
a
time,
but
it's
pretty
new,
so
kind
of
test
it
out
a
little
bit.
I
haven't
used
it
yet
in
production,
so
I
don't
know.
So the first tool: a lot of people are probably familiar with this, and it's pretty straightforward. htop is just a nicer version of top. Everybody should have htop, everyone should be familiar with it, so I won't get into it, but it's just nice for process management and gives you a good overview of what's going on. The resource that's really the most limiting when you're dealing with Cassandra is usually your disk, so iostat has been a really useful tool.
A
There's
a
awesome
combination
of
flags
DMX
that
I
like
to
use
not
because
he's
an
incredible
rapper,
but
because
those
flags
are
actually
useful
and
you
can
basically
get
a
quick
overview
as
to
what's
going
on
in
each
your
discs.
There's
a
you
can
see
how
much
read
and
write
is
happening
in
any
given
moment
what
your
queue
size
is.
So
if
you've
got
a
it
says,
average
request
queue
size
in
there.
If
that
numbers
really
high,
it
means
that
there's
a
ton
of
requests
that
are
queued
up
on
your
disk.
That's
a
bad!
A
You
don't
want
that.
A
wait!
You're!
Looking
at
how
long
it's
gonna
request
has
to
stay
in
the
queue
before
it
gets
serviced.
So
if
those
numbers
are
high,
it's
bad
percent
utilization
is
unreliable
kind
of
a
ignore.
It
I
wouldn't
bother
with
that
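A minimal example of that invocation (the interval and sample count are arbitrary):

```bash
# per-device stats in MB, extended columns, every 5 seconds, 10 samples;
# watch r/s, w/s, avgqu-sz and await rather than %util
iostat -dmx 5 10
```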
Another tool that's really nice is vmstat. Basically, it tells you about virtual memory, which answers that question.
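For instance (a 5-second sampling interval, just as an example):

```bash
# the si/so columns show swap-in and swap-out; anything persistently
# non-zero on a Cassandra node is a problem
vmstat 5
```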
dstat is kind of a combination of the tools that I've already talked about. If you hit an issue, this is actually my first go-to; I'll check htop to see if anything shows up there, but dstat gives you a nice overview of your resources. You can take a look at disk, CPU, and memory, and there's a whole bunch of flags; you can literally get anything you can think of. So I recommend everybody put dstat on their machines.
I don't think it comes installed by default on Ubuntu, but install it, just make it part of your regular setup, and whenever you have an issue, check dstat right away, because it answers a lot of questions.
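For example (this flag selection is just one reasonable combination):

```bash
# CPU, disk, network, paging and system stats side by side, refreshed every 5 seconds
dstat --cpu --disk --net --page --sys 5
```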
If you want to get a little bit more intense, and this is really helpful if you have an application server that's freaking out, I recommend learning a little bit about strace. strace can show you all the system calls that are happening in a particular application at a moment in time. You can filter it down to network or disk or whatever, and what it will do is show you the individual calls that are happening.
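A couple of hedged examples of how you might invoke it (the PID is obviously a placeholder):

```bash
strace touch /tmp/somefile            # trace a command from the start
strace -e trace=network -p 12345      # attach to a running process, network calls only
```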
What I did here was run strace on just the touch command, touching a random file, and you can see all the system calls that happen along the way. I've used this to figure out some really gnarly issues, where I'm like, what is this thing even doing? And then I found out I had a process that was trying to connect to some remote IP address.
It was just timing out, and it was really easy to figure out what was going on with strace; it would have been kind of a nightmare to go through and put debugging statements all the way through the code. So this is a good time. Kind of along the same idea, if you want to get a really good picture of exactly what's happening on a particular port on a machine, you can run tcpdump.
It lets you watch network traffic, and you can get a really good idea of all the queries that are being executed on a machine. I've used this to check out statsd and make sure that data was actually coming into it, or you can sit and watch client requests as they come in. You can just spy on everything, and it's really convenient; I've solved a ton of problems with this tool.
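For instance, something like this (the interface name and the CQL native port 9042 are assumptions about your setup):

```bash
# print packets on the native protocol port as ASCII so query strings are visible
tcpdump -i eth0 -A port 9042
```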
Cassandra itself comes with something called nodetool, which I hope everybody has heard about, since I think I mentioned it already and you're all paying attention. nodetool has something called tpstats, and tpstats is pretty cool because it can give you an idea of the different thread pools doing work within Cassandra and what's been blocked. There's a ton of things in here, but a big cause of a lot of problems is one you can spot right away: the memtable flush writer.
If you've got blocked memtable flush writers, then you've probably got really slow disks. The other thing this can result in is a lot of garbage collection issues: if you've got a ton of memtables just sitting in memory, they're going to be promoted when they shouldn't be; they should just be flushed. Instead they sit in memory taking up your old gen, and you don't want that. So lots of issues there; make sure that nothing is blocked.
A
Blocked
is
bad
and
the
other
thing
you
want
to
take
at
take
a
look
at
is
drop
communications.
So
if
you've
got
drop
mutations,
then
there's
a
good
chance
that
you
have
inconsistency
in
your
cluster
and
if
you're
doing
queries
it
consistency.
One
then
you're
going
to
get
back
some
old
data,
so
you're
gonna
want
to
run
a
repair
if
you've
got
dropped.
Mutations.
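A sketch of what that check looks like in practice:

```bash
nodetool tpstats      # look at the "Blocked" column and the dropped-message counts at the bottom
nodetool repair       # if mutations were dropped, repair the node to restore consistency
```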
Another really useful tool in nodetool is histograms. This is actually from Cassandra 2.0; there's a new version of the histograms output that's actually nicer in 2.1. The first one is proxyhistograms, and that basically gives you an idea, at any given moment in time, of how long the queries being performed on that single node are taking. It actually takes into account all the network time; it's the full time to execute a query.
cfhistograms is a much more useful tool once you've looked at proxyhistograms and figured out, okay, maybe this node is having a problem. You can use cfhistograms to look at individual tables on individual nodes, and you can get a really good idea of which tables are slow, and that's really the thing you want to figure out.
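Roughly, that workflow looks like this (the keyspace and table names are hypothetical):

```bash
nodetool proxyhistograms               # node-wide read/write latency, including network time
nodetool cfhistograms demo job_queue   # per-table latency and sstables-per-read for one table
```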
Once you figure out which tables are slow, you can ask yourself which queries you're performing against those tables, and hopefully, if you set up your metrics to begin with, you should already be able to look at your logs, which are being aggregated, and at the dashboards you have set up, and that should help you identify the queries that are broken. Once those broken queries are identified, you should be able to use query tracing.
Effectively, what I have here is a table called tombstone_mayhem that I built and filled up with tombstones; this is basically the queue that I was talking about. You can see from the trace that there are a bunch of steps along the way, and down at the bottom you've got "Read 0 live and 100000 tombstone cells", and you can see how that number just jumps way up. This was actually performed on a solid-state drive, with everything compacted, and it's still super slow. This reiterates my tombstone point, but you can use query tracing to figure out,
you know, did someone put a queue in production, and why? Why did they do it?
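As a sketch, tracing is just a cqlsh toggle (the table name below is the one from the slide; treat it as illustrative):

```bash
cqlsh
# then, inside the cqlsh prompt:
#   TRACING ON;
#   SELECT * FROM demo.tombstone_mayhem LIMIT 1;
# the trace ends with a line similar to "Read 0 live and 100000 tombstone cells"

# alternatively, sample a small fraction of all live traffic:
nodetool settraceprobability 0.001
```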
So the last thing that I'm going to go over is the JVM and garbage collection. This is kind of a pain for a lot of people; it's confusing. Even people who have been Java programmers for ten years are like, I have no idea what's really going on in the JVM, because there are about a trillion flags. But whatever, it's really not that crazy, so I'm going to attempt to break it down in just a few minutes.
Everybody should walk out of here with at least a fundamental idea of what's going on in the JVM, so that we can start to reason about how it impacts the performance of our cluster. The JVM's nice automatic garbage collection basically means that we don't have to manage memory anymore. I know some people love doing that, they love just allocating memory and freeing it; personally, I don't care, because I never do it right and everything ends up broken. The collector that Cassandra defaults to is ParNew for the new gen and CMS for the old gen.
Okay, so you've got new stuff in Eden, promoted into the survivor spaces, and if it stays around in survivor long enough, it gets promoted into the old gen. Whenever you hit a minor GC, it's a stop-the-world operation, so literally everything stops, and it's going to go through, look at every object in your new gen, and determine whether that object needs to be promoted. If it doesn't, it can be thrown away.
How does this impact garbage collection? I'll have a couple of examples in just a minute. Once we're in the old gen, we've got a bunch of stuff sitting around in this giant slab of memory, and there's a concurrent operation happening all the time in the background, taking a look and marking objects, so garbage collection is happening continuously. There are two short stop-the-world pauses in there, but they're generally not a big deal and shouldn't impact your performance.
The thing that does impact performance is when your old gen gets filled up. When that happens, it's basically going to stop everything, collect a bunch of stuff, and reorganize things, and that's super slow. When everything gets filled up, you run into a full GC, and that can take a really long time; I've heard of full GCs taking over a minute. You don't want that; it's a really bad time.
So we run into two problems, and they come about from different workloads. The traditional advice that you read on the Cassandra mailing list, in IRC, or even in the config file itself has always said: don't put your new gen above 800 MB. That's based on the idea that we have a write-heavy workload, and with a write-heavy workload we're not really creating a lot of short-lived objects, and that's okay.
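Those knobs live in cassandra-env.sh; as a sketch (the values are only examples, not recommendations):

```bash
# conf/cassandra-env.sh
MAX_HEAP_SIZE="8G"      # total JVM heap
HEAP_NEWSIZE="800M"     # new gen; the traditional ceiling mentioned above
```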
So you want to keep your new gen small, because there's a lot of stuff that's going to stick around and be promoted, and you don't want that: ten gigs of copying, like I just talked about, sucks. You're going to have a lot of minor GCs, and like I said, that's just slow.
A
Sorry,
the
the
problem
with
early
promotion
is,
if
you
have
short,
live
objects
that
are
being
promoted
prematurely
into
the
old
gem
and
that's
when
you
have
a
lot
of
minor
GCS
and
what
I've
seen
this
on
is
read
heavy
workloads
on
solid-state
drives.
So
in
this
case,
you
actually
want
to
increase
the
size
of
your
new
gem,
and
you
can
see
here.
We've
got
early
promotion
into
there.
To figure out if you're having promotion issues, you want to use a really convenient tool called jstat. There are other tools like jvisualvm, and it's really useful, but I personally prefer the command-line tools because they're just faster to get up and running, and jvisualvm is kind of a pain to use against a remote server.
A
But
you
can
survivor
or
a
lot
of
date
or
if
your
old
gen
is
filling
up
really
quickly,
then
you
definitely
want
to
do
some
GC
tuning
and
you
can
see
there's
a
couple
flags
in
here.
There's
young
gen
count
on
those
yong-jin
count
time
and
then
there's
an
old
gen
or
there's
full
gen
count
time,
and
you
can
see
how
much
time
was
actually
spent
at
each
interval.
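For example (the PID and interval are placeholders):

```bash
# one line per second: survivor/eden/old occupancy plus young and full GC
# counts (YGC, FGC) and cumulative times (YGCT, FGCT)
jstat -gcutil 12345 1000
```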
Basically, the thing you want to look out for is long, multi-second pauses. If you've got stop-the-world pauses happening all the time, your cluster performance is going to be terrible, and the thing that's going to be really confusing about it is that, if you're running in Amazon, you're probably running Cassandra on one of the instance types with 16 cores, and your CPU won't actually be very high, your memory usage won't be very high, and your disk won't be busy.
So it's going to be really confusing to figure out which one of your resources you're constrained on, and you definitely want to check the GC side of things and see how much time you're spending in there. Like I said, if you've got long multi-second pauses, that's happening because of full GCs, or you could have long minor GCs, and that's because you're promoting too many objects. The question that kind of comes up here is: how much does this matter? Is this having a real impact, or is this one of those things
where we're optimizing for a 5% increase in performance? This graph that I'm going to show you we took after doing GC tuning at my last company; we actually got about a tenfold increase in performance. We had a 20-node cluster, or sorry, I think this was a 12-node cluster, and its performance was pretty terrible, and we decided to do some GC tuning.
So what do you do when you get the call: stuff's broken, please fix it? We're basically going to narrow down the problem. You want to ask yourself: is the problem even Cassandra? With all the metrics that we have put in place and all the logging, we should be able to know which part of our system is actually broken, which one is throwing the errors, and where they're coming from.
Is it the JVM? Use your query tracing and use your logging. Figure out if you're misusing Cassandra by trying to use it as a queue. If you have compaction problems, get to the bottom of them; figure out which resource you're constrained on, make sure it doesn't happen again, and also set up the logging so that you'll know if this becomes an issue in the future. So I think now we have a few minutes for Q&A. Yes, three minutes, all right, Q&A. Yes?
[Audience question, off mic]

Really, it depends. If you do the delete and then you're trying to read it again, then you will have a problem. So I would not recommend creating a partition with, let's say, a hundred thousand rows, then deleting it, and then trying to select from it. You will have some massive performance problems.