Description
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We'll also give a crash course in basic JVM garbage collection tuning. Viewers will leave with a better understanding of what to look for when they encounter problems with their in-production Cassandra cluster.
A: Hello everybody, and welcome to this week's webinar, Diagnosing Apache Cassandra Problems in Production. I am delighted to have with us this morning Jon Haddad. Jon is a Technical Evangelist here at DataStax, and prior to coming to DataStax he ran Cassandra in production at his previous company. He has a lot of experience with operations, and we're going to learn a lot from him this morning. Before I hand over to Jon, just a couple of housekeeping items.
If you would like to ask Jon a question, please use the Q&A tab inside of WebEx. Type your question there, and at the end of this morning's webinar we'll save some time and get through as many questions as we can in the remaining time. So that's enough from me. Welcome, Jon, and thank you very much for agreeing to present today and lay the education on us.
B: All right, very excited to be here. So, as Christian said, today we're going to be talking about diagnosing problems in production. I actually gave a similar talk at the summit, but unfortunately, due to time restrictions, I really wasn't able to go into the details I would have liked on certain topics. One of the things that's interesting about diagnosing problems in production is that there can be a problem anywhere throughout your entire stack, and being able to diagnose just Cassandra isn't really enough.
You have to be able to narrow down the problem: figure out if it's your application, figure out if it's the JVM or Cassandra itself, whether it's configuration that you need to change, or the hardware. So we're going to walk through all the things that you need to know if you're running Cassandra in production: how to figure out what's broken and what steps to take to try and fix it.
The first thing we're going to talk about is preparation. If you haven't put Cassandra into production yet, these are the things you're going to want to know about. Take care of these things, take some notes, and make sure that you're ready for once you get into production. You want to understand what's happening on your system, so the first thing that's really important to put into place is OpsCenter. OpsCenter comes from DataStax, and it's basically a custom-built operations tool.
It's meant specifically to tell you what's going on with Cassandra, and it will help you with about 90% of the problems that you encounter, because it's so tailored towards the Cassandra use case. So if you do end up with an issue in production, it should be the first place that you go, and there are two versions.
There's the community version, which is free, and I definitely recommend that anyone running open-source Cassandra download it and set it up, because, as you can see from these pictures here, it tells you a lot of information about your cluster. The enterprise version has some extra features that are pretty useful; I won't really go into detail on them now, but they're both very, very useful and I strongly recommend that you use them with any cluster that you have. The next thing I'm going to talk about is just general stuff.
A lot of this comes with OpsCenter, but throughout the rest of your application you're going to want to use these tools. You're going to want to have server monitoring and alerts in place, and you can do it yourself through open-source software or you can use a third party. It really doesn't matter in the end which one you use; it just matters that you pick one. Monit is a really useful tool if you want to monitor processes and disk usage and get alerts when things go wrong.
So if all of a sudden you've got a disk at ninety percent full, you should get an email about that. This is a pretty big deal and it's really easy to set up. These things do not take a long time to learn; put a few days' work into it and you can have your whole system monitored and have a really good understanding of what's going on. This is nice because it can prevent a lot of problems before they even begin.
You can see in this example from Nagios that you've got a critical alert on this disk. So instead of hitting a real performance problem when you run out of disk space, you definitely want to know about this kind of thing ahead of time. This is stuff that's pretty basic; it should be on every system.
We have Nagios, an excellent tool, and a fork of it called Icinga, which has been picking up a lot of the development. And then, as I said before, there are a lot of third-party services; I'm familiar with Datadog and Server Density, and they can provide a lot of the same tools. The thing that's nice about those is that you don't have to host them yourself. So whatever you want to do, just pick something and roll with it. The nice thing about this kind of thing is that you can swap out services without interrupting your application.
The next thing that you're going to want to make sure you have put into place is your application metrics, and this is where developers and operations really need to work together in order to make sure that the system is functioning right. A lot of times things are put on operations people, and they don't really have the insight that they need in order to diagnose what's going wrong, as it may not be an operations problem.
What you want to be able to do is track a few things: you want to be able to track events in the system, and you want to have micro-timers around small blocks of code. The easiest way to do this is with StatsD and Graphite. StatsD actually also works with a few other applications; Librato is one.
You can output StatsD information into Librato, which is a hosted service, and then you can use either Graphite or Grafana. I have the two on the right over here: Graphite is on the top, Grafana is on the bottom. Grafana looks a little bit nicer and it's gaining in popularity, but both tools will just allow you to graph things that are not system metrics: user signups, error rates, people from different countries.
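As a point of reference, here is a minimal sketch of pushing custom metrics into StatsD from a shell (the metric names are hypothetical; 8125 is StatsD's default UDP port):

```bash
# increment a "signups" counter in StatsD over UDP (default port 8125)
echo "myapp.signups:1|c" | nc -u -w1 127.0.0.1 8125

# record a 320 ms micro-timer around a block of code
echo "myapp.checkout_latency:320|ms" | nc -u -w1 127.0.0.1 8125
```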
You can get a lot of really good information out of this once you start correlating it with your system metrics. If you see that all of a sudden user signups just drop, well, there may be a serious problem. It's nice because you may not have, let's say, super high load on your Cassandra servers, but you can tell that the number of user signups or logins or whatever has just dropped to zero. Maybe there's a DNS problem.
The other thing that's really cool about these tools is that since Cassandra 2.0 there has been an integration of the Metrics library that came out of Yammer, and the Metrics library essentially allows you to pipe Cassandra metrics into a whole variety of different places. So you can spit your metrics from Cassandra into Graphite and start to correlate metrics.
You can see on the bottom graph here there are a few graphs with multiple lines, so you can start to correlate things like JMX information from Cassandra with user signups. All of a sudden you understand that there's a correlation between events, and it just helps you get a little bit more information out of your system. And if you're running other Java utilities and you have some JMX stuff that you want to plot on this,
there's a whole bunch of different libraries that you can use, but a really useful one is jmxtrans, and that will allow you to kick any JMX metric out to StatsD, and it will go into Graphite. So, really useful stuff. Just make sure you have all of your application metrics and can get to them quickly, coordinate between developers and operations, and have some really good dashboards, because it will help you out a lot.
Log aggregation is the next thing that's really important, and again this requires developers and people in operations to work together. It's the same deal: you can either go for a hosted service, such as Splunk or Loggly, or you can go open-source and use Logstash and Kibana, which I've used a lot. Graylog is also really popular, and then there's a whole bunch more; I'm not going to list them all. It doesn't need to be an exhaustive list.
The thing that's important is that you have really good logs and you have them aggregated somewhere, and this shouldn't just be application logs. This can be your Cassandra logs; if you're running Elasticsearch, if you're running nginx or Python apps, you can always put your logs in this tool. Take the time to make sure that the logs are parsed correctly, and finding errors when they happen is actually going to be really easy. Also make sure the logs are meaningful.
If you've got, let's say, user data, make sure you put your user ID in the log message. Then, if a user calls you up and there's a problem, you can just search on the user ID, and all of a sudden you have a really nice, easy way to understand what happened for a particular user. You can solve the problem quickly, and people will really appreciate you for it, because you don't sound like an idiot saying, "oh, I have no idea."
So if you have servers that have different times, one ten seconds ahead, one ten seconds behind, things can get screwed up. If you take a look at the example I have on the right, the time on the server on the left is ahead of the real time. The real time, you know, is twelve; I just picked a number that was relatively low and easy to talk about. Obviously we don't have time twelve right now, but for the sake of the discussion I think it'll work.
The server happens to be eight seconds ahead, so it's at time twenty, and the server on the right is behind, five seconds behind. So let's say at time twelve we do an insert. That insert is going to carry with it a timestamp, and unfortunately the timestamp is going to say twenty. Basically the server thinks it's in the future; it goes ahead and writes this data with twenty. Now, a few seconds later, someone goes to read, and then they go:
"You know what, I need to delete this data; it's no longer valid." The timestamp should be fifteen, three seconds later. But the problem here is that this server actually has a time in the past, so it's going to say, "you know what, let's do the write," and the timestamp on it is actually going to be ten. Well, when we look at the delete versus the insert, the insert looks like it was done further in the future, and so the delete actually doesn't occur.
So you end up with this problem where you're trying to delete data and it just doesn't disappear, and that's kind of confusing. Because Cassandra is eventually consistent, people will go, "well, I did a delete at consistency level ONE and it didn't take; what happened?" Then they do another delete later and it works, and it's really weird because your servers are behaving inconsistently. Well, it's because you have a really weird race condition.
So the thing to take away from this is to make sure your server times are right, and the easiest way to do this on Linux is to install ntpd. This will just make sure that your servers are constantly checking against the correct time and drifting back to the correct time, basically. And if your servers are way off, you actually want to run ntpdate, and ntpdate will basically jam the server time back to the correct time.
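As a rough sketch, checking and correcting clock sync might look like this (the service name and NTP pool host vary by distribution):

```bash
# check that ntpd is syncing; the peer marked with * is the selected source
ntpq -p

# if the clock has drifted far, stop ntpd, jam the time back, then restart
sudo service ntp stop
sudo ntpdate pool.ntp.org
sudo service ntp start
```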
So the next thing that we're going to talk about, and this actually has a little bit to do with the last slides, is tombstones. Tombstones are basically a marker that says this data no longer exists. When you do a delete, Cassandra doesn't just delete things off the servers. It's a distributed system, and deletes don't really work that way; you could have data come back.
You have to put a marker in place that says there's nothing here. We've got our tombstone on the right, and it's actually a value with a timestamp that's stored in Cassandra: it has the key and it has the timestamp. So in that scenario that we just talked about, where we have these servers and we have this delete,
basically we have to have that delete with the right timestamp, and that's what exists here. It's a really, really useful tool and it prevents a whole bunch of errors from coming up. However, there is a problem that you can run into that a lot of people call tombstone hell, and basically what happens is that you've got a really, really big partition.
You can see here we've got 99,999 tombstones, and you actually have to read through all of them in order to get the data that's at the front of the queue. This is why people say do not use Cassandra as a queue. Every library that comes out that tries to use Cassandra as a queue will hit a limit and get a performance bottleneck, and you will end up doing a lot of I/O and a lot of CPU just to do something that's really simple. Use a queue; use something like Kafka.
But the problem is, if you don't use a snitch, then you're not fully taking advantage of Cassandra's high availability. There are two purposes of a snitch. On a read, a snitch will keep track of the fastest replicas for reads, so it effectively lets you get the best performance out of your system. On a write, what the snitch does is actually ensure that your data is spread out across different racks or availability zones, however you want to arrange your data.
The snitch will just help pick which servers everything goes to, and that's pretty awesome. What you want, ideally, if you're using multiple racks, is to not have multiple copies of a single piece of data in one rack. If you have three replicas, you don't want them all to be in the same rack: the rack goes down, you lose all your data. This just makes sure that that doesn't happen, and you can see there's a whole bunch of snitches listed on the left.
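As a hedged sketch, a rack- and DC-aware setup typically touches two files (the paths and the dc/rack names below are only examples):

```bash
# cassandra.yaml: pick a topology-aware snitch instead of SimpleSnitch
grep endpoint_snitch /etc/cassandra/cassandra.yaml
# endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties: declare this node's datacenter and rack
cat /etc/cassandra/cassandra-rackdc.properties
# dc=us-east
# rack=rack1
```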
This next one is actually becoming less of a problem, and it's version mismatch issues. I've actually run into this one. What can happen is, say you've got a 1.1 cluster and you're trying to upgrade to 1.2, or you're trying to go to 2.0.
What I tried to do, which is totally incorrect, is adding a new node of a different version into an existing cluster. You just want to make sure that all the nodes in your cluster have the same version. If you add in a new node, there's a process called streaming, and streaming happens whenever you bootstrap a new node, whenever you decommission a node, or whenever you run a repair. When that happens across versions, the file formats are not going to be the same, and the node won't be able to read the data.
So it'll basically just hang, and you end up with this weird system where it says that it's joining, and it'll say that it's joining for a while, like days, and it's really hard to figure out what's going on. Basically, what you want to do is make sure that you avoid introducing new nodes into existing clusters that use different versions, even minor versions. Just stick with the same version; it's much safer and it decreases the number of things that are different.
If you've got a big cluster, let's say you've got 10 racks or 20 racks or something, you can upgrade one rack at a time, as long as you're using the right snitch. You can shut down ten servers or whatever, upgrade them all, and bring them all back up, and that will actually be fine, because hinted handoff, which is the process that happens when a node comes back up, works just fine between versions. So the gist of this is: just be safe.
Basically, the gist of this one is: if I add a new node into my cluster, it gets data from the existing nodes; it has to get it from somewhere. But the nodes that it gets the data from actually won't delete the data that they streamed off. So if you've added a new node in because you're running low on disk space, then unless you actually run nodetool cleanup, you won't have solved your problem. You'll still be getting these alerts, and you'll be wondering
why you're running out of disk space. If you change your replication factor, it's the same thing: Cassandra won't just delete data. You have to run cleanup, and you can reclaim a ton of space. The other thing that can happen is, if you're running incremental backups, you can run into this problem as well. So you're going to want to make sure your backup strategy isn't resulting in you piling up a ton of SSTables that you don't need.
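A minimal sketch of reclaiming that space (the keyspace name is hypothetical; run it on each node that gave up ranges):

```bash
# drop data this node no longer owns after adding nodes or changing RF
nodetool cleanup my_keyspace

# compare per-node disk usage before and after
nodetool status
```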
Shared storage has a whole slew of issues associated with it. One is that you're using a distributed database: Cassandra is meant to handle failure, and you've effectively added a single point of failure into the design. It's just not a good idea. Your latency is way higher than it would be if you were using local disks; going out to the network is always going to be slower than using your SSDs, if you have SSDs, which is a good idea.
A SAN is really expensive, and it doesn't work well with Cassandra. I haven't heard of anybody running Cassandra on a SAN and getting either the performance or the value that they want. And remember, your Cassandra performance, or your performance with any database, is about latency. If you've got fast disks, then your server is going to be fast. It's not about IOPS; IOPS does not measure latency.
Basically, you can have IOPS through the roof and still have 15 to 30 millisecond latency, and that just doesn't work. It'd be like having great throughput to Mars: you may be able to get a billion IOPS to Mars, but you've got huge, huge latency, so who cares about your throughput? It doesn't really help. So stay away from shared storage, and that includes things like EBS.
OpsCenter actually gives you some insight into what's going on throughout your cluster: how much compaction is going on, and you can look at it at a per-server level. Really, really useful. Like I said on the original slide, OpsCenter is ninety percent of what you need. If you take a look at OpsCenter and you see a ton of compactions occurring and your disk usage is out of control, well, there's a good chance that the two are correlated, and using the nodetool command
you can actually adjust the throttle on compaction. You can set the compaction throughput through nodetool on a running machine. You can say, "I want you to slow down," or, "I don't want to throttle you at all," and that's great. If you've got a solid-state drive, you can turn that throttle way up, or, I suppose, down.
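For example, a sketch of adjusting the throttle at runtime (the MB/s values are arbitrary):

```bash
# see what's compacting right now
nodetool compactionstats

# remove the throttle entirely (plausible on SSDs); 0 disables the limit
nodetool setcompactionthroughput 0

# or throttle compaction down to 16 MB/s on spinning disks
nodetool setcompactionthroughput 16
```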
There are a few different compaction options, and these are pretty important. It depends on your workload, and you just want to read up a little bit on the differences between leveled and size-tiered compaction. The gist of it is that leveled compaction is really, really good on solid-state drives with read-heavy workloads, and also update-heavy workloads, but it does a ton of I/O.
It tries to keep a partition in as few SSTables as possible, and, like I said, it does a lot of I/O to figure out which data goes where. You probably don't want to be doing this on a traditional spinning drive; it's going to be really slow and you're going to introduce some performance problems.
So if you do have a spinning drive, you're probably going to want to stick to size-tiered compaction. That's kind of the old-school default, and it's really, really good for write-heavy time-series workloads. There's actually a new compaction strategy called DateTiered compaction that's been introduced in Cassandra 2.1 and backported into Cassandra 2.0.11, and it's kind of an experimental compaction option that's even better for time-series workloads, especially if they have TTLs.
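To make that concrete, here is a sketch of switching a table's compaction strategy from a cqlsh session (the keyspace and table names are hypothetical; the class names are the actual Cassandra strategies):

```
cqlsh> ALTER TABLE metrics.events
   ... WITH compaction = {'class': 'DateTieredCompactionStrategy'};

cqlsh> ALTER TABLE metrics.users
   ... WITH compaction = {'class': 'LeveledCompactionStrategy'};
```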
So the next thing that we're going to talk about is diagnostic tools, and these are basically the tools that, if you're sitting on a server and you've got a command line open, you're going to want to know in order to understand what is going on with a machine at this exact moment in time. This is real-time monitoring stuff, for when you're digging into one machine.
I've found these tools all extremely useful, and they start out really basic and get more complex. This one's a no-brainer; I was actually using it right before our webinar here: htop. Really simple, it's just a process overview. You can do all the stuff that you can do with top; it just looks a lot better, and it's a good first tool to fire up. If you're having problems with a machine, you throw up htop and it's going to rank everything by CPU.
So it's really useful to see if there's a performance problem; you're going to know quickly: okay, I've got low memory, I'm swapping, or CPU is really high. It tells you really, really quickly. The next thing that we're going to look at, and this is, like I said, a little more advanced but shouldn't be too bad, is iostat. iostat basically gives you disk stats, and the idea here is to understand what is going on on each device. What is my read rate?
What is my write rate? Do I have a queue? There's a column, avgqu-sz, the average queue size, so you can understand: is my disk queued up? And await: how much time am I waiting? These things are really nice because, if you're using a RAID, you can quickly identify if one disk in the RAID is slow. So this is a super useful tool.
If a disk looks like it's not behaving right, especially in the RAID case, you can identify the problem really, really quickly, and maybe you need to swap out that drive; maybe it doesn't work right anymore. There's a percent-utilization column that is not always accurate, so I would definitely ignore that.
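A typical invocation might look like this (the flags are standard sysstat options; 5 is the sampling interval in seconds):

```bash
# extended per-device stats (-x), in megabytes (-m), every 5 seconds
iostat -dmx 5
# watch avgqu-sz (queue depth) and await (ms per request) per device
```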
vmstat is a really useful tool; it gives you an idea of what's going on with virtual memory statistics. Are you swapping? There are swap columns in the middle here: si, that's swap in, and so, that's swap out. You can also see bi and bo, that's blocks in and blocks out, and what it does is give you an idea of whether you've actually hit the memory limit. If you still have swap turned on, then this would be the place to figure out if you're actually hitting a swap issue.
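A minimal sketch (5 is the sampling interval in seconds):

```bash
# memory, swap, and block I/O stats every 5 seconds
vmstat 5
# non-zero si/so columns mean the box is actively swapping
```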
Normally, on production servers, I personally turn swap off. I'd rather have a server just crash, and with Cassandra it doesn't really matter, because if you have multiple replicas, the other ones will continue to work. In this particular case, I would hope that if you reach your memory limit, you would actually be getting alerts through Nagios, Monit, Server Density, or whatever tool you put in place. You should find out about this before the problem happens.
So hopefully you don't have to use this tool too much, because you were told about it ahead of time. That's where prevention is much better than having a problem: if you hit a problem with swap, there's a good chance that all of your servers are close to hitting the same problem. You definitely don't want that to happen, because it's going to be harder to introduce new servers when other servers are crashing.
Actually, I usually go to dstat before I go to iostat, and basically it is what it looks like: a lot of information. Generally, on a system, if I was trying to figure out what's wrong, I'd have like four different tools open, and dstat just took all that away; I don't need to worry about them anymore.
I used to run iostat and htop and vmstat, all these tools, and now you can just run dstat; it does like 95% of the same thing. If you need to dig into disk statistics, you can run iostat to get a little bit more information, but for the most part dstat is going to do it.
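For instance, one plausible invocation:

```bash
# time, cpu, disk, network, and memory stats, sampled every 5 seconds
dstat -tcdnm 5
```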
The next one, strace, is really, really useful if you've ever run into a misbehaving system; this is part of my general toolkit. It shows you every system call, so I like it a lot, and, like I said, it's helped me debug some really weird things. You can optionally filter: if you do -e trace=network, you can just see the network calls. So, like I said, super useful if you want to understand exactly what your application is doing.
strace is going to print out a ton of data, so it might be good to output it to a file, but yeah, it's great.
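A hedged sketch of attaching strace to a running Cassandra process (the pgrep pattern is an assumption about the JVM's command line):

```bash
# attach to the Cassandra JVM, follow its threads (-f), show only
# network syscalls, and write the very verbose output to a file
strace -f -p "$(pgrep -f CassandraDaemon)" -e trace=network -o strace.out
```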
Kind of in that same vein of generally useful utilities, but a little bit more intense, is tcpdump. What's nice about tcpdump is that you can get a really great idea of the traffic that's going across the wire, and you can look at a particular port. So if you run this, let's say, on your Cassandra port, like I did over here,
I can see that I've been doing queries. This is actually from a project I have called meatbot; meatbot is just a chat bot, and you can see the queries that are actually being sent over here. It's pretty cool. It lets you trace really anything, so I've used this to watch Redis, Elasticsearch, and Cassandra. It's great. You can also see what's going on if you're on an application server; you can see what requests are coming in. There's a whole ton of flags and a lot of options; it's a really, really flexible tool.
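For instance, a sketch of watching native-protocol traffic (the interface name is an assumption; 9042 is the CQL native protocol port):

```bash
# print packet payloads as ASCII (-A) on the CQL native protocol port
sudo tcpdump -i eth0 -A port 9042
```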
I strongly recommend getting familiar with it. So those are good system-level utilities, and they're great for getting a general idea of what's going on with a machine. Cassandra actually includes something called nodetool, which is really useful for understanding very, very specific things about Cassandra. The first thing that we have in nodetool is called tpstats, and it gives you a really good high-level overview of things that are blocked on your system. You can see the columns over here: Pool Name,
then Active, Pending, Completed, Blocked. Blocked is the one that you want to take a look at. For example, there's a memtable flush writer; if that's blocked, then generally you've got some disk problems. Blocked flush writers can also lead to garbage collection issues, because it means memtables are sitting around in memory, that memory can't get freed, and we need to do more garbage collection.
The other thing that's really nice down here: if you take a look all the way at the bottom, it shows dropped counts per message type, and one of them is mutations. If you've got dropped mutations, then you need to run a repair. If you've got data consistency problems, it's a good idea to take a look at tpstats, see if you've got dropped mutations, and do a repair if that's the case.
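For reference:

```bash
# thread-pool overview: watch the Blocked column, plus the dropped
# message counts at the bottom of the output
nodetool tpstats

# if mutations were dropped, repair this node's data
nodetool repair
```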
Another tool that's really nice is histograms, and there are two of them.
The first one is proxyhistograms. It gives you high-level read and write times under your cluster, in microseconds, so you can get an idea of how fast queries are being serviced, both reads and writes. Once you've done that, if you determine that there's a problem, you can use cfhistograms, which is on the right, to get statistics for a single table on a single node. This is really nice because it can help you narrow performance problems down to the table level.
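A quick sketch of both commands (the keyspace and table names are hypothetical):

```bash
# cluster-level read/write latency distribution, in microseconds
nodetool proxyhistograms

# drill into one table on this node
nodetool cfhistograms my_keyspace my_table
```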
Now, if you've identified problems at the table level, the thing that you want to ask yourself is: which queries am I executing against that table? Once you have those queries, you can use query tracing to determine the query path and what's happening. So this is an example here, my tombstone problem: I've got a hundred thousand rows in this partition, and I
do a select, and, well, guess what: they're all tombstones, so nothing comes back, except Cassandra still needs to do a ton of work to figure that out. This doesn't really look like that big of a problem at first, but the actual time here is pretty bad, and this is on solid state. So this goes back to my tombstone issue: you don't want to have a lot of tombstones.
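A minimal sketch of query tracing in a cqlsh session (the keyspace, table, and key are hypothetical):

```
cqlsh> TRACING ON;
cqlsh> SELECT * FROM my_keyspace.queue WHERE id = 0 LIMIT 1;
-- the trace that follows lists each step of the read, including
-- how many tombstone cells were scanned to answer the query
```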
The last topic we're going to be talking about is the JVM and how garbage collection works. So what is garbage collection? Basically, it's an alternative to managing memory yourself. The JVM will keep track of which objects point to which other objects, and when objects aren't being used anymore, it will get rid of them and reclaim that memory to be used again. The way this works with Cassandra is that we're using ParNew and CMS.
When garbage collection happens in the new gen, we have what's called a minor GC. This is a stop-the-world operation, and it occurs when the new gen fills up. Dead objects are removed, and then live objects are promoted into the survivor area. You can see, down at the bottom, objects are promoted out of Eden into survivor, and they're promoted back and forth between the survivor generations as well. After a certain amount of swapping back and forth, an object will actually be promoted into the old gen.
The important thing to take away from this is that removing objects is actually pretty fast, and promoting objects is really slow. There's some accounting that needs to happen in the background, there's a memory copy, and it's a lot slower than removing; removing is super fast. So there are a couple of patterns that we're going to see as a result of this, but before we talk about that, we're going to talk a little bit about the old gen. After an object has been promoted from the new gen to the old gen, we've got this
huge lump of memory lying around, and over time, actually constantly, we're going to have what's called a major GC. A major GC is mostly concurrent, so most of the work is actually going to happen in the background, with two short pauses. What you don't want is a ton of major GCs happening one after the other.
You don't want your old gen to get totally full and your new gen to get totally full, because then you will hit what's called a full GC. A full GC is what happens when the old gen fills up, and it is basically stop-the-world. If you've got a 20-gig heap, it's going to have to walk all 20 gigs of memory.
It's going to have to trace its object graph and do a ton of work, and I've heard of people having full GCs that have gone on for hours. That's what happens with a really big heap; you don't want to use really big heaps, it doesn't work that well. Basically, your system will be completely unresponsive; that node will be totally unresponsive during the full GC.
So, as you can see in my notes, and hopefully you've inferred by now, these are really bad. There are two problems that we can have. The first one is early promotion. If we've got a bunch of really short-lived objects and our new gen size is too small, and we're creating these short-lived objects really, really quickly, what happens is that your new gen fills up, and then objects get promoted to your old gen. Your old gen is supposed to have
long-lived objects; they're supposed to be things that are going to stick around for at least a little while. But if you put these short-lived objects there (by short-lived I mean around 100 milliseconds, which is a pretty short lifetime), your old gen is going to be filled with objects that don't need to be there.
The other problem that you can run into is a long minor GC, and the problem here is a new gen that's way too big. If you've got a write-heavy workload, you've got all these memtables sitting in memory, and then we're going to copy them over from the new gen to the old gen. Well, if we've got to copy over like three gigs of data, it's going to take a long time.
So, as a result, you've got a ton of data being promoted, and it's really slow. A really useful tool to understand what's going on with garbage collection, and to do some profiling on it, is jstat. jstat has a flag called -gcutil; you can pass it a process ID and an interval and count, and it will actually show you what's going on the whole way. This is also available via JVisualVM, but I personally prefer the command line.
You can take a look at the survivor spaces on the left, then Eden, old gen, and perm gen, and then there are counts and times for what's going on with garbage collection. Really, really useful. Other things that are really useful: OpsCenter will actually show you garbage collection stats, and you can see correlations between GC spikes and read/write latency. You can also turn on garbage collection logging in Cassandra.
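For reference, a sketch of watching GC on the Cassandra JVM (the pgrep pattern is an assumption):

```bash
# per-generation heap utilization and GC counts/times,
# sampled every 1000 ms, 10 samples
jstat -gcutil "$(pgrep -f CassandraDaemon)" 1000 10
# S0/S1 = survivor spaces, E = Eden, O = old gen, P = perm gen;
# YGC/YGCT and FGC/FGCT are young/full GC counts and total times
```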
Rather than change something that will maybe make the problem go away for a little while, you want to understand what's broken, and that's why I recommend jstat. If you're hitting a problem on a Cassandra box and you're seeing not too much I/O, and your CPU usage isn't that bad, and your memory is not full, it can be really confusing; you're not going to understand where your bottleneck is. You may want to check garbage collection.
So basically, what you want to look out for is long, multi-second pauses. Those are caused by full GCs, and it basically means your old gen is filling up really fast, which means data is being promoted out of the new gen too soon. The other option is long minor GCs; that's when you've got a lot of objects being promoted into the old gen, and generally it means your new gen is too big. And it matters; it's a big, big deal.
This was our cluster at my last company, when we were trying to understand what was wrong. We did a bunch of JVM tuning, and this is what we got out of it; sometimes it's absolutely necessary. So what do you do when something's broken, right? You get that call. Well, this is where all this work that you've done along the way, getting familiar with these tools, pays off. The first thing you need to understand is: is your problem even Cassandra?
You need to check your metrics. You should have all of these in place already; I've given you the tools, so there's no reason not to have them. You could have nodes going up and down; for that, OpsCenter is going to be really useful. Look at your system metrics, and if you've got slow queries, you're going to find the bottleneck using histograms and figure it out.
If you've got disk issues, it might be compaction. So you just want to use all these tools that I've talked about in order to figure out what exactly is wrong with the system. If you put all these things in place ahead of time, this should be really easy, and basically you look awesome. You really don't have to deal with that many problems, and they won't take a long time to figure out. So that's actually it, and we're right on time.
A: We will be talking about Spark and Spark Streaming with Apache Kafka and Apache Cassandra, and then, especially if you are in Europe, make sure that you register for the Summit 2014 on December 3rd and 4th in London. There are a few free tickets left for the main conference day, and we are also having training the day before. Jon, I think you're heading over the pond, is that right?

B: Yes, I am.
B: OK, so if you've got a ton of tombstones: if you were to create 100 gigs of data and then delete all of that data, you would actually have the same amount of information, just as tombstones. Tombstones will be removed; there's a setting, gc_grace_seconds, and effectively what it says is that tombstones will exist for a certain amount of time. You can play with that to determine when tombstones become eligible for removal, but they only get removed through compaction.
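As a sketch, that setting is tunable per table from cqlsh (the table name and the one-day value are examples; the default is ten days):

```
cqlsh> ALTER TABLE my_keyspace.events WITH gc_grace_seconds = 86400;
-- caution: every node must be repaired more often than gc_grace_seconds,
-- or deleted data can come back to life
```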
B: I would say that that's pretty slow. Two milliseconds of latency for a SAN with SSDs is not very good; you're going to get sub-millisecond latency with local SSDs. We're talking 10 or 100 times faster than that, so, worst-case scenario, local disk is 10 times faster, and a SAN is pretty bad compared to that. And remember, a SAN is really expensive; you're not just paying for the drives in it.
A: It's funny, this is a topic we feel very passionately about; basically, it's an anti-pattern for Cassandra. So if you need more information on that, please reach out to community@datastax.com and we will reach out to help you. Next question: does Cassandra treat double zero values as null, Jon?
B: Some of them, like OpsCenter, are per cluster; something like Nagios is per cluster too. The command-line tools that I was looking at are per server, so iostat, vmstat, dstat, tcpdump: those are per server.
A: Great. Here's a question I can answer: we post the recordings and the slides to PlanetCassandra.org, usually within 24 hours of the webinar. We've had several requests for the slides and the recording, so they'll be on there. And apologies for the mix-up for those who did not receive the correct password; you missed a few minutes at the beginning. Sorry about that. Okay, Jon: does proxyhistograms report client request times or local disk time?
B: It's the full time for the request. It's not necessarily the client time: if there's, say, 100 milliseconds of latency between the client and the coordinator, it wouldn't report that. But it reports the total time from the start of the request until the end, so if you've requested something at consistency level QUORUM or ALL, it will actually take that time into account. If you want client-level times, I didn't have time to include this in the webinar, but I would include
something like a query timer, and if you exceed a certain amount of time, I would send that into Logstash or into whatever log aggregation tool you want. That's one of the uses for those logging tools I talked about: you can log individual slow queries to figure out what's going on with them.
A: So this question asks which of these diagnostic tools can be used to diagnose after the fact.
B: That basically means that you have to have been recording everything along the way. That's why I like OpsCenter; that's one of the reasons why I like these tools, because if you're recording all these metrics, ideally you should be able to do that. The command-line tools are definitely all about looking at a system at a given moment, trying to figure out what's going on right now. There's another tool, which I didn't cover today, that's really good for recording system metrics. But ideally you should be recording all of your JMX metrics using that jmxtrans tool, and you should be recording all of your system metrics, and if you're doing log aggregation, then you should be able to piece together all this stuff to figure out what happened after the fact. So if you had a problem with a machine at 3:00 in the morning and you're going to take a look at it the next day, ideally you want that information already there. It's a combination of all the tools together.
B: It depends on your application, honestly. I personally have seen like 20 milliseconds in there, and I kind of got nervous about that. The problem is that during those minor GCs, if you have a query that started before the pause, and you hit a 20-millisecond pause, you can end up doing, let's say, 2 milliseconds' worth of work, one millisecond before the GC and one after, and now your query just took 22 milliseconds. Is that a problem in your application?
B: I don't know anyone using it in production yet, so I would personally probably not push it out; I would wait a few more bug-fix releases before I rolled it out. I would just try it on a local dev cluster to see if it's even applicable for you, and the nice thing is that you can change a table later: you can change the compaction strategy, and it's totally fine.
A: So we have a lot more questions left and we're not going to get through them, because we've got one minute here. But on PlanetCassandra.org there's a way to book office hours, which are meetings with the evangelist team, Jon and others. So if you have a burning question that we haven't got to this morning, please feel free to reach out.