From YouTube: Pythian: Monitor Everything!
Description
Speaker: Chris Lohfink, Engineer at Pythian
This session walks through the key metrics critical to operating a Cassandra cluster effectively. Without context, the metrics are just pretty graphs. With context, we have a powerful tool to determine problems before they happen and to debug production issues more quickly.
I'm a senior engineer at Pythian, where I lead the Cassandra practice. Like a lot of people at Pythian, I work remotely; I'm based out of Minnesota.
I like doing software development. Like a lot of people here, I do Java, Clojure, and Python in particular, but the language isn't so much what's important; it's just going out there and playing with it. I like big data.
I'd say I'm one of those guys who enjoys having large data sets, and the algorithms and data structures involved in doing analytics and statistics over them. And I like to set my house on fire and electrocute myself; you know, hobbyist electronics. So, Pythian is a data outsourcing and consulting firm, and very ops-focused.
So here, let's talk about Cassandra from an operations perspective. One of the features that is most loved about it is the fault tolerance. This matters when you get that phone call at 3am, particularly if you don't have appropriate escalations or multiple people to handle it. It's really nice that if my phone dies and no one's there to take care of it, the system is going to keep running, even though it failed at 3am, until I wake up in the morning and see all the red alarms.
So that's really great, but it's then really easy to forget about Cassandra, because you won't necessarily notice when things start going wrong; little hiccups and such can be easily glazed over. Maybe one of the nodes will go down for a minute or two, but it'll come back up as things get queued up, and it'll just keep running, which is great, and it gives you a nice buffer.
What we really want to do is utilize this buffer, this time where things can start going wrong before anything actually breaks. We do this in two different ways: we can be both proactive and reactive. Proactive being your daily and weekly checkups, and this is something people should really be doing: you just go look at the metrics and see how things are going, get a standard health check. This helps with predicting capacity issues.
So if, for example, there's a CQL collection or something in your data model that's slowly growing over time, and you're not capping it but are still doing reads on it, you could eventually end up having really bad memory issues and garbage collection. If we find any data modeling issues like that, we can address them before they become a problem, as opposed to waiting for the actual crash. But no matter how hard you try, things are going to go down. You're going to have hardware failures, sometimes catastrophic ones.
Some person drives their car into a transformer at the Amazon region and takes the whole thing down: it happens. Bugs in Cassandra: it happens. And sometimes you have users who may use you in ways that your data model doesn't actually support, either on purpose or not so much. This is ultimately where you have your alarms, your metrics, and PagerDuty. I saw a couple people from PagerDuty walk in, so you guys are awesome. Thank you.
Having appropriate escalations is really important, but for both of these, really, what you need is data. You need metrics. You need to be able to form the alerts, to be able to see trending, to be able to debug problems after they've happened. There's a quote from Coda Hale: you need to bridge the gap between how you think the application is running and how it's actually running. Metrics are really the window into the application.
This is how you see how things work. And there's probably at least one person who came to this talk expecting that picture, so I threw it in somewhere.

There are a lot of metrics; I'm kind of breaking them down here between the Cassandra metrics and the environmental metrics, so that way everyone's happy. But you really need to understand what they mean, so as I go through this talk, I'm going to try to provide some context and explain them. I can't go into too much detail just because of the time limits, so I'm going to give a really high-level overview of a lot of the subsystems. So, JMX: a lot of people here are probably already pretty familiar with it.
It's pretty standard with Java, and it's pretty complex; there's a lot to it, but at this stage we're going to consider it objects with attributes and operations. A lot of Java applications, Cassandra included, use it pretty extensively for monitoring and user input. It's pretty annoying, and it's very slow. It requires Java to access; you can use things like Jolokia to access it through other languages, but you still need that Java wrapper.
It's had memory leaks in some versions of the JVM, and it's been pretty frustrating for an operations team in general, primarily because of this mechanism JMX uses where, when you make a connection to the JMX port, it actually replies back with a different hostname and port that you then reconnect to. And of course, initially that second port was random, which makes it virtually impossible to set up a firewall and still have this work. In the more recent versions of the JVM,
there is an option you can use to set the port that the second connection uses, and it can actually reuse the port that the initial JMX connection used. This is configured by default in Cassandra after 2.0.8. That's a newer version, so some people haven't gotten to it yet, but you can set that attribute yourself if you're using a newer Java 7 JVM. There are a lot of ways to access JMX. Visually, you have JConsole and VisualVM.
These come with the JDK, so a lot of people will already have them on their computer, and provided the firewall issues aren't causing a problem, you can just connect right from your system. But ultimately, I think it's really important to become familiar with the command-line tools. jmxterm is a great one; that way you can just SSH to one of your systems and quickly poke at something that isn't exposed elsewhere. There's also MX4J and Jolokia, which are pretty great
if you don't want to use Java at all, because they provide SOAP and REST wrappers for your JMX interface. So JMX looks kind of like this: there are beans, objects which have a domain and then a series of key-value attributes. Here we have an example: we have com.pythian and then just a set of attributes to narrow down what that bean is for. So it looks hierarchical, but only this first level, the domain, really is. Cassandra originally had four different domains: db, internal, net, and request. These are still there, but they're effectively deprecated now.
The first attribute is a type, and there are a lot of them; I'm not going to walk through them all here. After type, the majority of these beans are going to have a scope and a name, though the scope may or may not be there. Three special cases are the thread pools, which have a path; column family, which has the keyspace; and keyspace, which has the keyspace but no scope.
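As a concrete illustration of reading one of these beans programmatically, here's a minimal sketch that connects to a node's JMX port and pulls a couple of attributes off a ClientRequest latency bean from the newer org.apache.cassandra.metrics domain. The host, port, and attribute names are assumptions based on Cassandra's default JMX port (7199) and the attribute names the Metrics JMX reporter typically exposes:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPeek {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199; localhost is an assumption here
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // A coordinator-level read latency bean: domain, type, scope, name
            ObjectName bean = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
            System.out.println("read count: " + mbs.getAttribute(bean, "Count"));
            System.out.println("p99:        " + mbs.getAttribute(bean, "99thPercentile"));
        }
    }
}
```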
All these metrics come from Metrics, which is a toolkit made by Coda Hale at Yammer, and it's pretty great, actually; I'm a fan of it. It's really easy to use. There was a project I worked on where we had a total of about fifty metrics in the entire application, but then we installed this and started adding things, some dynamic and such, and within three months we had
thousands and thousands of metrics that we were able to collect, and then it just becomes interesting trying to store and keep all of those. If you're familiar with Java, it's on GitHub and it's pretty easy to look at, so you can just open it up and it gives you a good understanding of how it works. Just open it up and look at the source code; I'd highly recommend it. It's really popular and used in a lot of projects.
So in Metrics there's a bunch of different types, and Cassandra uses pretty much all of them. The first and simplest is a gauge, which is just a value: it can be a string, an array of strings, an integer, whatever. A counter is something that's incremented or decremented; pretty self-explanatory. A meter is just the rate of things, so you have the number of requests per second or the number of requests per minute.
It all depends on your units, and it keeps one-, five-, and fifteen-minute moving averages. There's a histogram, which gets a little more interesting, because this is where we have the statistical distribution of your data. It keeps a bunch of percentiles, and then the mean, standard deviation, and all those. What this is for is more like if you have something like the payload of a request and you want to keep track of how big those payloads are. If you just keep the max, min, and average, you can end up with something where the max is two megs, which can throw off the average a lot, when in reality the 99th percentile is 100 bytes and you just have one huge outlier thrown in. So having the statistical distribution is really helpful to understand outliers and how the data actually looks. And then there's a timer, which is one of the really common ones, and that's just a combination of a meter of the events of whatever is happening and a histogram of the duration each one took.
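To make those five types concrete, here's a minimal sketch against the Dropwizard Metrics 3.x API. This is the library Coda Hale's toolkit grew into; the Cassandra-era versions lived under com.yammer.metrics with slightly different class names, so treat this as illustrative of the concepts rather than exactly what Cassandra does internally:

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class MetricTypes {
    public static void main(String[] args) throws Exception {
        MetricRegistry registry = new MetricRegistry();

        // Gauge: just a value, read on demand
        registry.register("queue.depth", (Gauge<Integer>) () -> 42);

        // Counter: incremented or decremented
        Counter pending = registry.counter("pending-tasks");
        pending.inc();

        // Meter: rate of events, with 1/5/15-minute moving averages
        Meter requests = registry.meter("requests");
        requests.mark();

        // Histogram: statistical distribution (percentiles, mean, stddev)
        Histogram payloadSizes = registry.histogram("payload-sizes");
        payloadSizes.update(100);

        // Timer: a meter of events plus a histogram of their durations
        Timer writes = registry.timer("writes");
        try (Timer.Context ctx = writes.time()) {
            TimeUnit.MILLISECONDS.sleep(5); // simulated work being timed
        }

        System.out.printf("p99 payload: %.1f bytes%n",
                payloadSizes.getSnapshot().get99thPercentile());
    }
}
```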
Here's an example of how one would look inside of JConsole. We have a histogram of the write requests from the coordinator's perspective, and one of the nice things is that it includes the units inside of it. So I'm able to read this and say that, well, the 75th percentile is 683 and the latency unit is microseconds, so I know that 75% of writes took 683 microseconds or less. The same goes for the meter side, where I can say there have been thirteen thousand calls per
second; I'm able to just read that right off there, including all the units. So this can be a little overwhelming: there are a lot of attributes, a lot of operations, a lot there, and there isn't really any documentation to explain them. You ultimately end up having to go to the source code to figure it out, and even then, between versions, they move, they change, they get renamed.
So it's really hard to follow, and this is where there's this really great tool, nodetool, which I'm sure everyone in this room has used, and all it is is a command-line wrapper around JMX. Similar to JMX, it has a lot of options, and I can't go through them all here, but I'm going to go through some of the ones that I think are the most important from a monitoring standpoint. tpstats: this is the thread pool statistics.
What the thread pools are in Cassandra: Cassandra is based off of a staged event-driven architecture. This means that it takes a bunch of common tasks and breaks them into thread pools, and then it just throws a queue in front of each one, so each one can take a task and pass it on to the next. This is kind of a simplification of the process, but I think it's a decent one. So let's say we have a read request come in on node 1.
So that's a simplification, but it's how things work. Then, optionally, that request/response stage may randomly (it's ten percent by default) kick off a read repair. When it does, it'll create another task and push it onto its stage. Now, that's interesting because you're outside of the feedback loop of the actual request being made, so those read repairs can potentially end up taking longer than the requests themselves are taking.
The pending is how many are in that queue in front of the thread pool, and the completed is how many tasks it has completed. Blocked is where it gets interesting: you shouldn't see many of these get blocked. In particular, the FlushWriter and the ReplicateOnWrite might in 1.2 and 2.0 and below, but for most of the others you shouldn't see them blocked.
Blocking happens when there's a limit to how deep that queue can go, and when that limit gets reached, it will actually block requests from putting anything more on the queue, so it blocks the caller. That's a really bad thing and you don't want it to happen. And even if you're polling and you still missed the moment when that blocking happened, you'd be able to see the all-time blocked counter increment, so you wouldn't miss any.
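As a rough sketch of what one of these stages looks like, here's a version built from plain java.util.concurrent primitives rather than Cassandra's actual internal classes: a bounded queue in front of a small pool, with a rejection handler that blocks the caller when the queue fills, which is roughly the condition tpstats surfaces as "blocked":

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StageSketch {
    public static void main(String[] args) throws InterruptedException {
        // A "stage": a small fixed pool with a bounded queue in front of it.
        // The rejection handler makes the *caller* wait for room in the
        // queue, so a full queue stalls whoever is submitting work.
        ThreadPoolExecutor mutationStage = new ThreadPoolExecutor(
                4, 4, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1024),
                (task, executor) -> {
                    try {
                        executor.getQueue().put(task); // block until there is room
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });

        mutationStage.execute(() -> System.out.println("apply mutation"));

        // Rough analogues of the tpstats columns:
        System.out.println("active:  " + mutationStage.getActiveCount());
        System.out.println("pending: " + mutationStage.getQueue().size());
        mutationStage.shutdown();
        mutationStage.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("completed: " + mutationStage.getCompletedTaskCount());
    }
}
```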
There's another section underneath that: the dropped messages.
When a task is originally created, it takes a timestamp, and there's a timeout associated with it. So when a read happens, if by the time the task gets to one of the stages the read timeout has already passed since it was created, it's not going to process it; it's just going to throw it away, because we would have already returned to the client saying we had a timeout exception, so we don't bother wasting any CPU on it.
There's a lot more to this, and I don't have time to go through all the different thread pools, their limits, and what they can mean, but there's a blog post that walks through them all. These slides should become available, so if you're interested, you can read it there.
Okay, so nodetool cfhistograms, or column family histograms. Within a column family there are going to be a lot of statistics, and some of those are histograms; this is going to print those out. Column family is just the old name for a table; they kind of renamed them recently. You do need to specify which keyspace and table you're looking at.
Hopefully you've had some exposure to the read and write path, but just in case you haven't, I'm going to give a very high-level overview of it here. When a write comes in, it's going to write to an in-memory table; when reads happen, they check that in-memory table and any SSTables on disk. When the writes stack up enough and the memtable gets large enough, it flushes to disk and creates another SSTable.
So then reads have to access all those SSTables continually, which can become a problem, because that can mean a lot of disk seeks. You can avoid reading the ones that don't have the data you're interested in by putting a Bloom filter in front of each of them, which will basically say, when you're doing a read, "the data you're looking for isn't here, so don't bother checking." But it still gets bad, so there's a periodic task that comes through, called a compaction, that merges the SSTables together.
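The Bloom filter idea is easy to play with outside Cassandra. Here's a minimal conceptual sketch using Guava's BloomFilter, which is an assumption for illustration and not the implementation Cassandra uses internally: a "no" answer is definitive, so the read can skip that SSTable entirely, while a "yes" may still be a false positive:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomSketch {
    public static void main(String[] args) {
        // One filter per SSTable; 1% false positives is in the ballpark of
        // Cassandra's size-tiered default (bloom_filter_fp_chance = 0.01)
        BloomFilter<String> sstableFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        sstableFilter.put("partition-key-1");

        // Ask the filter before seeking into the SSTable:
        if (sstableFilter.mightContain("partition-key-2")) {
            System.out.println("might be there: do the disk read");
        } else {
            System.out.println("definitely not there: skip this SSTable");
        }
    }
}
```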
So when you're looking at cfhistograms, what you're looking at at the top, SSTables per read, is how many of those SSTables are touched in a read. That's really important, because especially when you're on spindles, that's going to be pretty expensive. Usually your read latency is going to look pretty similar to your SSTables per read, and then the write latency is how long it takes to basically write to that memtable, which shouldn't take long at all.
There are cases where it can take longer, particularly during a flush. The interesting thing is how this looks: here I'm saying that ninety-eight thousand reads went to one SSTable, or that four thousand reads went to two SSTables. Now, this is a little different from how they used to look, for people who are familiar with the old style of cfhistograms, which provides a lot of information but isn't easy to read. In that case
A
We're
saying
that
from
the
SS
tables
perspective
that
three
thousand
reads
looked
at
one
ss
table
and
or
that
even
hard
look
reading
it
now
so
or
ten
rights
took
60
microseconds.
So
it's
it's
it's
pretty
hard
to
read,
which
is
why
a
lot
of
times
it's
a
good
idea
to
then
do
something
like
take
a
use,
Python
or
something
and
Matt
plot
lib
and
just
read
that
ring
and
write
them
out
as
a
as
a
bar
chart.
That way you can read them a little easier, and here you can see it makes a little more sense: at, you know, 24 microseconds, there were seventy thousand requests. It's a little easier to read that in a graph format, so I would recommend doing that. I included a little link to a gist at the bottom there; once again, you can get these slides later and hopefully grab that. It's just a modified version of something someone else wrote, but I kind of made it work.
So this is a lot more convenient to look at. Something of interest here: the min and max are of all time, but the rest of the percentiles are actually of the last five minutes. In particular, it's a forward-decaying priority reservoir, which doesn't necessarily mean it's exactly the last five minutes, but it means that it's exponentially weighted toward data that has entered in the last five minutes.
So when you're looking at this, you're essentially looking at the last five minutes. This is drastically different from the old style, still current in 2.0 and 1.2 and previous, where every time you called cfhistograms it reset. That's great for benchmarks, but it's really inconvenient for operations teams and people debugging, because one person logs in and looks at it, it resets, and everything's back to zero. So then the next person goes and looks at it, and it looks like, oh, everything's awesome.
So for this style, just from being able to debug and diagnose, it's a little better. But there are a lot more statistics for each table, and that's where cfstats comes in, the column family statistics. You can give an option specifying a keyspace, or a keyspace and a table, or you can do the -i to exclude one, and in particular it's nice to do that
with the system keyspace: you do a -i system and it will remove it, because you never really care about the system statistics; usually, well, usually you don't. At the first indentation level there are going to be a couple of keyspace-scoped metrics. These are really just the sum and the average of the individual table metrics, and it's really not that useful to look at averages or sums, so usually just skip by those.
Each table is going to be named first, and then it goes into the different metrics for it. So in this case we have three SSTables total. Usually when you see this, it's going to pretty much correlate to your SSTables per read: you're usually going to see three at the highest, so that's going to be like your upper bound, particularly with size-tiered compaction. But with leveled compaction
it's going to look a little different. You're going to get an extra line there that tells you how many SSTables exist at each level. Your reads should only touch one SSTable at each level if everything is working correctly, but that's not always the case. This is kind of how it would look
in that scenario: L0, the first level, actually uses size-tiered compaction, and the /4 is basically saying that four is the threshold before it would do a compaction. So it should be four at the most, but since it's 14, that means compaction isn't keeping up, which is bad. Moving on, you'll then see a batch of information about the size.
That usually isn't particularly interesting, but you know, it's good to know and good to see. The memtable section, where those writes are going to the in-memory table, shows how many columns, how many cells, exist in it. The data size is an estimate, but it's an estimate of how much space in memory the memtable is taking up, including the JVM overhead. And the switch count is how often this memtable has been flushed to disk as an SSTable.
The local read and write latencies and counts are the amount of time and how many times reads and writes are taken in the local case. This does not include going to the other replicas and such; it's just how long it takes to get the data off disk and look at it, so these should usually be a lot smaller than what your application sees. Pending tasks is what I like to consider the most useless metric here.
It's actually the number of mutations that are backed up on the switch lock during a flush, basically, and you don't really need to know what that means. It doesn't mean much, because in 2.1 it doesn't even exist, so I would just never pay attention to that one. There's a bunch of statistics on the bloom filters; in particular, the one I think is the most useful to look at is the amount of space the bloom filter takes up, because sometimes you can sacrifice a bit on the read side, accepting a higher false-positive chance, to keep it smaller.
It all depends on what you can accept, but ultimately, if that gets too large: even though the bloom filters are now kept off-heap, they still need to be read. When you're doing a read, these bloom filters need to be in memory, so if the OS is paging them off to disk, then during a read they can be pretty expensive. You want that to be a reasonable number based on how much memory your system has.
You should keep that below the thousands if possible, but there are scenarios where it's acceptable to be larger. Oops, wrong direction. The tombstones count is how many tombstones get scanned during a read. This in particular will go high if you are, for example, using Cassandra as a queue: you're going to end up deleting things on a partition as you're adding things, and this will get very, very large over the period of gc_grace_seconds. You don't want this in the thousands; if this is in the thousands, you're doing something wrong.
I've been mentioning a lot about the column family read and write latencies, and that is useful, but really, what you care about most of the time is how long it takes for your application's reads and writes to complete, and that's where proxyhistograms comes in: instead of just the local time it takes to insert the mutation into the memtable or read the data off disk, it's the latency of the full request from the coordinator's perspective.
If you want to get at these metrics but you don't want to just be polling nodetool or polling JMX, the Metrics library has a nice interface provided for you where it can actually push the metrics out to whatever you're using. By default, Metrics comes with JMX, console, CSV, and SLF4J reporters; the Metrics library also has Ganglia and Graphite reporters, but they're not compiled in.
So if you want those, you have to include the JAR in your classpath, but they are maintained along with the rest of the Metrics library. There are a lot of community reporters as well; in fact, there are probably more than I could list, and they're also really easy to create. So if you want, you can just build your own and have it push to something. If you have New Relic or something and you want your Cassandra metrics to go to it, it's pretty easy to set that up.
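For example, wiring up the Graphite reporter by hand looks roughly like this with the Metrics 3.x API; the host, port, and prefix here are placeholder assumptions to point at your own Graphite carbon listener:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class GraphitePush {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Placeholder host/port: point these at your Graphite carbon listener
        Graphite graphite = new Graphite(
                new InetSocketAddress("graphite.example.com", 2003));

        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("cassandra.node1")           // namespace this node's metrics
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MICROSECONDS)
                .build(graphite);

        reporter.start(1, TimeUnit.MINUTES);               // push once a minute
    }
}
```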
It's easier with console, CSV, Ganglia, or Graphite, because there's just a YAML file in which you can configure the metrics reporting. Unfortunately, if you're not using one of those reporting interfaces, you actually have to create a Java agent that runs as a premain and sets up the reporter. This is how things were done previous to 1.2, and you still have to do it if you want to use one of those specialty reporters,
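as shown in the sketch below. This is a minimal sketch of what such an agent might look like, assuming the yammer 2.x library that Cassandra 1.2-era versions shipped (where the registry is a static singleton, so a reporter enabled in premain picks up Cassandra's metrics as they are registered); it would be packaged in a JAR with a Premain-Class manifest entry and loaded with -javaagent:

```java
import com.yammer.metrics.Metrics;
import com.yammer.metrics.reporting.ConsoleReporter;
import java.lang.instrument.Instrumentation;
import java.util.concurrent.TimeUnit;

public class ReporterAgent {
    // Invoked before Cassandra's own main() when the JVM is started with
    // -javaagent:reporter-agent.jar (plus a Premain-Class manifest entry).
    public static void premain(String agentArgs, Instrumentation inst) {
        // Cassandra 1.2-era metrics land in the yammer 2.x default registry,
        // so a reporter enabled here sees them as they are created.
        // Swap in whichever reporter you actually want to push to.
        ConsoleReporter.enable(Metrics.defaultRegistry(), 60, TimeUnit.SECONDS);
    }
}
```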
Something that I think anyone who's worked with Java has experienced is garbage collection. I in particular love it; I think it's really fun. I like tuning it, I love the metrics, I love how complex it is; I think it's interesting. Unfortunately, it's a whole other talk in itself, but if you're interested, come talk to me sometime and I'd love to rant. So something you should do with Cassandra,
that everyone should do with Cassandra, is enable the GC logging. There's virtually no overhead to it, and it provides a lot of information; there's no reason not to do it. In the cassandra-env.sh file, you just have to uncomment a bunch of lines at the end that set all these lovely garbage collection flags, things like -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, and the -Xloggc: log destination. There's one exception to that, I would say: the FLS statistics line (-XX:PrintFLSStatistics=1) I would leave commented out,
because most tools that parse GC logs can't handle that output: if you take GC logs from Cassandra with it enabled and try to load them in, they're just going to crash and die. It's useful if you're looking at the logs by hand or building your own custom parser, but from the perspective of having nice tooling to look at them, it's not worth it, so I would actually recommend not including it. As I mentioned, there's a lot to garbage collection, and I
definitely don't have the time left here to talk about it, but I would recommend grabbing one of those tools like GCViewer and being able to just open up the logs periodically and look at them. Ultimately, though, I think if you want to get really serious about it, you should use Python or R or whatever statistics package you like, and analyze the logs yourself. So, there are a couple of logs in Cassandra, and I think it's really important to do log rotation.
It's also kind of bad because there are some exceptions that will not be logged in the system log. In particular, if an uncaught exception gets propagated to the top of a thread pool, it's just going to dump the stack trace out to standard error, and it's not going to be included in the system log. So in those scenarios it's still good to have the output log captured, or at least, maybe before you restart, create a backup, so that you're still able to look at it.
There are a lot of system logs you should also be monitoring, just from a standard Linux perspective: syslog, dmesg device messages, and such. There's a lot to monitoring the kernel, which is itself another talk, and in fact I'll make a recommendation here: Brendan Gregg has a great set of talks and explanations on what you should use to monitor which aspects of the kernel.
The JVM has a lot to monitor as well; in particular, the two things you should be watching are the heap and the threads. For the heap, you're already capturing the garbage collection logs, hopefully, from the previous slide, but there's also the possibility to do things like take heap dumps in case you're having high memory pressure and want to analyze it on a deeper level. JMX provides a way to just trigger a dump, or you can use jmap to do it as well.
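For instance, triggering a heap dump through JMX might look like this: the HotSpot diagnostic MBean is reachable remotely under com.sun.management:type=HotSpotDiagnostic over Cassandra's JMX port, and the dump path here is a placeholder. Note the .hprof file is written on the node's own filesystem, not where this client runs:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TriggerHeapDump {
    public static void main(String[] args) throws Exception {
        // Connect to the node's JMX port (7199 is Cassandra's default)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Invoke dumpHeap on the HotSpot diagnostic bean;
            // "true" means live objects only (forces a full GC first)
            mbs.invoke(new ObjectName("com.sun.management:type=HotSpotDiagnostic"),
                    "dumpHeap",
                    new Object[] { "/tmp/cassandra-heap.hprof", true },
                    new String[] { "java.lang.String", "boolean" });
        }
    }
}
```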
There can be a lot of buildup of threads, so it's kind of useful to look at the thread count in JMX, just to make sure you're not exceeding any limits on your kernel or anything. But it's also nice to look at what the threads are doing, and you can do that by calling kill -3 on the process, which dumps all the thread stack traces to standard out.