Description
Speaker: Jon Haddad, Technical Evangelist
Company: DataStax
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We'll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but experience running it in production is not required.
All right, so I'm going to be talking about diagnosing problems in production. We've talked a lot today about data modeling, how Cassandra's different, how amazing it is, and how everything is perfect when you're using it. But you're going to put this in production, and you are going to need to understand your system. That's just how it works.
The first thing is to monitor your system. For that, you're going to want to look at DataStax OpsCenter. This is going to do about 90% of what you need as far as monitoring: what's going on inside Cassandra, what's happening with compaction, things like that. We can get visual histograms of everything that's going on inside our cluster. You're probably going to want to integrate with some other tools too, and that's cool, that's available: you can access any of this information over JMX, or using the metrics library, which I'm going to talk about today.
There is a community version of DataStax OpsCenter, and it's free, and there's also the enterprise one that comes with DSE, which gives you some additional features. You can use it to launch servers on Amazon, and there's a whole bunch of other functionality provided; I won't go into too much detail here.
You're going to want general-purpose monitoring: is Cassandra running on this box, alerts, the basics familiar to those of us who have been doing operations. A lot of this is very, very basic, but it's also easy to kind of forget about or skip. I'm absolutely guilty of putting stuff in production without monitoring. I have regretted it, which is why it is now in my slides. I consider myself pretty thorough and, yeah, I definitely still got caught by it.
You're going to want to know about tools like collectd and Munin; you can get nice graphs of CPU usage and disk usage. If your disks are filling up, you want to know about it: Cassandra, a disk-based database, does not perform well on full disks, because writes simply stop working. So don't do that. There are various tools along those lines, Nagios and, I think, several others that are forks of it, that are pretty popular.
You can also use third-party services if you don't want to run your own infrastructure, and I definitely recommend that if you're a smaller company and you're looking to not have to worry about the headache of all this monitoring. Just make sure you get something in place that lets you know everything works, and test it out. Next, you're going to want to collect application metrics. Who here is collecting application metrics?
Okay, not a lot of people in the room are using this right now. I strongly recommend that you integrate this if you build out an application. You can do things like monitor very, very small blocks of code: you can put micro timers around them, you can graph those micro timers over time, you can understand how many times a section of code has been called, and you can put counters around them. So you can understand how many user logins are happening per hour right now — did that just drop off?
We want to put alerts around this, so we can tell, based on what's happening in these graphs, whether my system is behaving abnormally. This is a pretty useful thing; I strongly recommend you set it up. It also works very well with Cassandra itself: there's the metrics library from a guy named Coda Hale, and Cassandra's metrics integration effectively lets you spit out the internal metrics of Cassandra into these same graphs.
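As a rough illustration of the micro-timer and counter idea with the Coda Hale (Dropwizard) metrics library — the class, metric names, and console reporter here are invented for the example; in production you'd wire up a Graphite or similar reporter:

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class LoginMetrics {
    static final MetricRegistry registry = new MetricRegistry();
    static final Timer loginTimer = registry.timer("user.login.time");   // hypothetical name
    static final Counter loginCount = registry.counter("user.login.count");

    public static void main(String[] args) throws Exception {
        // Dump all registered metrics to stdout every 10 seconds
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(10, TimeUnit.SECONDS);

        // Micro timer around a small block of code (Timer.Context is Closeable)
        try (Timer.Context ctx = loginTimer.time()) {
            Thread.sleep(42); // stand-in for the real login query
        }
        loginCount.inc(); // one more login recorded

        Thread.sleep(11_000); // let the reporter fire once before exiting
    }
}
```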
So you can take your application-level graphs, like those user logins, and pair them with, let's say, a timer on the user table, measuring how long the query takes. Then you can actually correlate things: is there a problem with user logins? Maybe it's correlated to a performance problem I'm having. Logging is one of those other things that I used to try to avoid, and I have realized that it's insane not to do it. There's a whole bunch of solutions here: Splunk, Loggly, or you can do it yourself — I love Logstash.
I've also used Graylog before. There's a logging tool that I just found out about and managed to get into these slides: it's a DSE-based Logstash, so it uses Solr instead of Elasticsearch. If you've already got DSE, you may want to consider using that. Basically, all of these just let you take your logs, put them in one place, and have them be searchable, so you can easily digest the information, go back, and look at error rates over time.
The important thing here is to make sure that you're not spitting out a lot of noise. At my last company, someone decided to throw an error into the log for no reason, just so they could tell if a block of code was being hit. They never removed it, so we were seeing millions of errors per minute, and the information was pointless; you couldn't get anything out of it. So log real errors, don't log non-errors, and you'll be okay. Now, there are some gotchas.
We talked a little bit before about what happens if you have incorrect server times, and why you need to run ntpd. This is an attempt at visualizing what can happen with your server time. Let's say that our first server is eight seconds ahead — its clock is just wrong — and a mutation comes in as an insert at real time 12. It's going to get a timestamp of 20. Our second server is actually behind, five seconds behind. At real clock time 15, three seconds later, we issue a delete for that data. That delete, because it's been issued to the second server, carries a timestamp of 10, because its clock is behind. So that data won't get deleted. Now, this example may seem a little ridiculous, but I have personally run into this in production, and I've seen other people run into it too.
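A minimal cqlsh sketch of the same failure, using explicit timestamps to stand in for the skewed clocks (the `users` table is hypothetical):

```
-- Insert lands on the fast server (clock 8s ahead): timestamp 20
INSERT INTO users (id, name) VALUES (1, 'jon') USING TIMESTAMP 20;

-- The later delete lands on the slow server (clock 5s behind): timestamp 10
DELETE FROM users USING TIMESTAMP 10 WHERE id = 1;

-- Last-write-wins compares timestamps: 20 > 10, so the row survives
SELECT * FROM users WHERE id = 1;
```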
The next gotcha is tombstones. This happens when we have a ton of tombstones in a partition, and the anti-pattern here is trying to use Cassandra as a queue. I'll tell you up front: Cassandra is the worst queue on the planet. Do not use it as a queue. I would rather use anything else, including a human and pieces of paper, than use Cassandra as a queue, because that would work better.
So what happens here? In this example we have 100,000 rows in a partition, and we've ended up with 99,999 tombstones at the front. When we go to read this data, all we want is one row, but Cassandra is going to have to read past all of those tombstones at the front first. This takes forever, even on solid state drives. It's a ridiculous problem, and you don't ever want to do this.
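A sketch of how the queue anti-pattern produces exactly this shape — the table and column names are invented for illustration:

```
CREATE TABLE queue (
    shard       int,
    enqueued_at timeuuid,
    payload     text,
    PRIMARY KEY (shard, enqueued_at)
);

-- Each consumed item gets deleted, leaving a tombstone at the head:
DELETE FROM queue WHERE shard = 0 AND enqueued_at = ?;

-- Every poll must now scan past all of those tombstones
-- before finding the first live row:
SELECT payload FROM queue WHERE shard = 0 LIMIT 1;
```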
I have a slide later that actually shows you how bad the performance gets. Next, there's something called the snitch. The snitch is pretty convenient: it lets us distribute our data in a fault-tolerant way. I talked a little bit before about how we can use multiple racks; one of the things the snitch allows us to do is say, "I have three racks — put only one copy of my data in each rack."
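A minimal sketch of rack-aware configuration using GossipingPropertyFileSnitch, one common way to do this — the data center and rack names are placeholders:

```
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (set per node)
dc=dc1
rack=rack1
```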
Another gotcha is upgrades. The streaming protocol, which is what gets used to introduce a new node, will change between major versions, and when you try to stream a new node in across versions, it will just sit there and fail. Repairs will also break; decommissioning will also break. The proper way to upgrade is to shut a node down, upgrade it in place, and bring it back up. If you've got multiple racks — and you should — then the proper way to do it is to shut down one rack, upgrade the entire rack at a time, and bring the entire rack back up.
The snitch is one of those things that's a pain to change after you've already put your cluster into production. All my gotchas are things like: "oh, you know, I didn't think of this before I launched my cluster, and now it's a several-day process to fix," or "hey, I have data loss," or "hey, I can't delete stuff — what's going on?" And you'll spend days, like I did, trying to figure out what's happening, and it turns out, oh, for some reason the server doesn't have ntpd. So don't do that.
Upgrades work the same way: if you upgrade one rack at a time, as long as you're using the right snitch, you're totally fine. And the bigger your cluster, the more comfortably you can take one rack at a time — if you have ten racks, you're only losing 10% of your cluster at once.
When you add a new node into your cluster, data is streamed to it automatically. You don't really have to worry about it: you just turn the node on, it says "hey, give me some data," it gets data from everybody else, it joins the party and starts serving queries, and it tells the application about it — you don't need to restart your app.
The tombstones will not be there permanently: after the default gc_grace_seconds, which I think is 10 days out of the box, the tombstone will be deleted when a compaction occurs.
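For reference, gc_grace_seconds is a per-table setting, and the out-of-the-box default is indeed 10 days (864,000 seconds). A sketch, with a hypothetical keyspace and table name:

```
-- 864000 seconds = 10 days, the default
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000;
```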
All right: shared storage. I don't know if this has come up at all, but you do not want to run Cassandra on a NAS, on a SAN, on floppy drives, over NFS — you name it. You want to use local storage. The best thing you can do is just get solid state drives and run locally. They're cheap — well, they're cheaper than they used to be.
Anything local is a better option than putting everything on a SAN. Running Cassandra on a SAN is ridiculous: you could have 100 Cassandra nodes and still have a single point of failure on your SAN. People have argued with me that their SAN never goes down, and then, like, the next day it does — taking something like 300 servers with it.
(In response to an audience question about EBS:) That's a good question. Yes, I absolutely recommend avoiding EBS. I think EBS has been around for about 10 years now, and for the first seven I saw a complete meltdown of EBS at least once a year. You generally do not want to use EBS.
The other thing you have to take into account is that you're still accessing a drive over the network, so you just have more latency than you will ever have with a local disk. When it's available, you always want to use local ephemeral storage if you're going to be running Cassandra on Amazon, and even on GCE there are now instance types with local solid state drives. Spinning disks are basically dead.
So, I've talked about compaction a whole bunch: it's the process of merging SSTables. It's a good thing, but you can't have too much compaction, and you can't fall behind on it either. If, let's say, you're running on spinning disks, what's going to happen is those compaction threads are basically going to take over and be constantly compacting your data, and you're not actually going to be able to serve any queries, because you're compacting too much. There are some good statistics here: nodetool compactionstats and nodetool compactionhistory, and you can get the throughput for the individual compaction processes.
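The commands being referred to, plus the throughput knob discussed next, as a quick sketch:

```
nodetool compactionstats           # pending tasks and currently running compactions
nodetool compactionhistory         # what has already been compacted, and how much data
nodetool setcompactionthroughput 0 # 0 = uncapped; reasonable on SSDs, risky on spinning disks
```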
This is a really useful thing to have. If you're using solid state drives — which I can't recommend enough — you can kind of uncap this and just let it go nuts, especially if you're using leveled compaction, which uses a lot more I/O; but as a result, your reads are just crazy fast. You've got a whole bunch of compaction options: leveled, which I just mentioned, is great on solid state; size-tiered is the one that's been around forever; date-tiered is the best compaction option if you've got TTL'd time series data.
So let's talk a little bit about some diagnostic tools. These will help you if something does go wrong, and they're tools that everybody, even non-ops people, should be aware of, because they're just amazing. htop is kind of a general-purpose tool: if you know top, htop is a little bit better. Then there's iostat — I was actually thinking about taking this slide out, because iostat can be replaced by the tool two slides from now.
dstat is the tool I just mentioned, the successor to iostat, and I find it's better in every way imaginable. You can get a ton of information about everything going on in your system. If you are running Linux, I recommend you have this installed on every machine, because you will be able to diagnose so many problems very quickly. It covers network, memory, CPU, disk — it's just everything. It's fantastic.
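A typical invocation, as a sketch — the exact flags vary a little by version:

```
# load average, cpu, disk, network, and memory, refreshed every 10 seconds
dstat -lcdnm 10
```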
strace — who here is familiar with strace? Nice. All right, everyone else is going to walk out of here with a new tool; this is exciting.
strace shows you the system calls a process is making. Why is this thing pausing for five seconds? That's a problem that's hard to solve if you don't have any insight — unless you annotate every single line with a print statement, which is impactful. You can filter with the -e flag, which is really useful if you just want to look at I/O or network calls, and, like I said, you can attach to a running process or start the process under strace.
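A sketch of both modes; the pgrep pattern assumes Cassandra's usual main class name:

```
# Attach to a running Cassandra node, watching only network syscalls
strace -f -e trace=network -p "$(pgrep -f CassandraDaemon)"

# Or launch a process under strace from the start
strace -e trace=open,read,write ls
```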
jstack is extremely useful if you're looking at a JVM process and you want to know all the threads inside of it, where they are, and the state that they're in. If you use this on Cassandra, you'll get an output of every single thread that exists and what it's doing. So if something hangs, you can run jstack a couple of times, take a look at the traces, and understand where in the source code it is.
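A sketch of the "run it a couple of times" workflow:

```
pid=$(pgrep -f CassandraDaemon)
jstack "$pid" > threads-1.txt
sleep 5
jstack "$pid" > threads-2.txt
# Threads parked in the same stack frame in both samples are your suspects
diff threads-1.txt threads-2.txt
```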
This is one of those tools where, if you want to be on the open source side of things and hack on Cassandra, you can take these lines and match them up to the source that you're looking at and figure out if there's a problem. If you want to take a look at your network traffic — let's say I want to know what packets I'm sending — tcpdump is a really useful tool. You can monitor any port, and you can do TCP or UDP.
Here I was just watching the queries coming across the wire. I like this not just for Cassandra but for any application server, web servers in general, or any database server. It's really nice whenever you're having connection problems and you're wondering, "am I even able to hit this machine?" — you'll be able to figure that out.
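A sketch of watching CQL native-protocol traffic; the interface name is an assumption:

```
# 9042 is the CQL native transport port; -A prints packet contents as ASCII
sudo tcpdump -i eth0 -A port 9042
```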
If you run nodetool tpstats — tp stands for thread pool — it will show you everything going on in the different thread pools, and it will show you how many tasks are blocked. That's the thing you really want to look at. For instance, if you have MemtableFlushWriter blocked, that means you either don't have enough flush writers configured or your disks are too slow.
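The command, with the columns worth staring at:

```
nodetool tpstats
# Watch the "Blocked" and "All time blocked" columns,
# e.g. for the MemtableFlushWriter pool
```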
If that happens, there's a good chance you're going to hit garbage collection problems, because this is memory that we need to free by flushing data onto disk. So we have all these flush writers piled up, we've got all these memtables sitting in memory, and garbage collection is going to cause them to be promoted, and that ends up resulting in more interesting problems.
If you see dropped mutations, that means you've got networking problems along the way, and you're going to need to run a repair. Histograms are really useful, and there are two. On the left we have proxyhistograms: this is the cumulative time it takes to serve a request, including any network round trips you need. So if you're reading at consistency level QUORUM, this takes into account the time it takes to get a response from the other nodes as well, and you get your high-level read and write times.
cfhistograms is a little bit more useful once you've narrowed down, using proxyhistograms, a performance problem to a particular server. cfhistograms works on a single table in a single keyspace on a single node, and what it does is give you a histogram of your read and write times. So you can look at it, see a little bump, and go, "well, this is where my time is spent," and narrow your problems down to an individual keyspace and table.
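The two commands side by side — the keyspace and table names are placeholders:

```
nodetool proxyhistograms                    # whole-request latency, network hops included
nodetool cfhistograms my_keyspace my_table  # one table on this node: read/write latency
```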
So that's pretty nice. Query tracing is kind of like looking at a query plan, except better: a query plan is what the optimizer intends to do, and the query trace is what the query actually did.
You don't want to run this on every single query you have, because it's actually non-trivial overhead: it stores the entire query trace itself, and I believe it's TTL'd. So if you're doing a hundred thousand or five hundred thousand queries a second against your cluster, you probably don't want to be creating a trace for each one of those — that'd be crazy. But it will tell you, along the way, what happened. I mentioned before that you can see tombstones.
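In cqlsh, tracing is a toggle; a sketch against the demo table discussed next (the keyspace name is a placeholder):

```
cqlsh> TRACING ON;
cqlsh> SELECT * FROM my_keyspace.tombstone_mayhem LIMIT 1;
-- The trace lists each step with its source_elapsed time, including
-- lines along the lines of "Read 1 live and 100000 tombstone cells"
```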
This table is called tombstone mayhem. What I did was create a ton of tombstones and then run a select against it, and if you take a look at the source elapsed time on the right, you can see how it grows.
It's fairly reasonable until you hit zero live rows and a hundred thousand tombstones, and that's where the time just jumps up like crazy. And this was on my laptop, under no real load — that's just how long it takes to read past 100,000 tombstones, so you can guess how bad it would be in production. So don't use Cassandra as a queue. Just a show of hands: who here still wants to use it as a queue? Nobody? Success. Awesome. All right, this is the most intense part of the talk.
This is JVM garbage collection. Unlike a language like C, Java performs automatic memory management for you, and it's very convenient, but it does have a cost. That cost is that every so often — frequently, in fact — the JVM needs to scan all the memory that's currently in use, build an object graph, and determine whether that memory is still reachable: what can be freed, and what can be promoted.
The heap is split into two main spaces, new gen and old gen, and new objects are allocated in new gen's eden space. Once the eden space is full, we get what's called a minor GC. The minor GC, first of all, stops the world: all the threads — even if you had a million cores and 100 million threads — all have to stop. Then all the objects in eden are looked at; some of them are going to be promoted out of the survivor generations, and a bunch of them are going to be promoted from eden into survivor.
If there's not enough space in a survivor, then all the extra spillover gets moved into the old gen. One thing that's really important to keep in mind here is that copying objects in the JVM is very slow. If, let's say, you create a ton of temporary objects and none of them need to be promoted, it's actually very fast — you'll see only a small pause. But if you have a huge new gen and all of it needs to be promoted, that's not going to be fast; it'll be pretty slow.
Historically, Cassandra has shipped with a new gen of around 800 megabytes out of the box. This has been tweaked a little bit over time, and depending on the workload you can change it — I'll talk about that. So that's minor GC.
Then we have our old gen: objects that have been promoted from new gen sit in the old gen, and there's a process called major GC. Major GC is a mostly concurrent process: it's constantly scanning through, and it does two short pauses, called mark and remark. The initial mark pauses to look at the objects directly reachable from the roots, not just in the old gen; then it unpauses and traverses the object graph concurrently.
The remark pause then determines whether those objects are still reachable, and anything we don't need gets cleaned up. This isn't so bad: these pauses are constantly happening, and they shouldn't really impact performance on your system too much. Now, the big problem comes when you fill up your whole heap and you get a full GC. A full GC is pretty much the worst thing that can happen to your application — the node might as well not be up anymore, because it's going to pause for something like 30 seconds; we can effectively call the system down at that point.
If the old gen is completely filled up, or objects can't be promoted, the JVM does a couple of things. One, it collects all generations, which means it has to look through everything, stop-the-world. It's also going to defragment the memory as well, and that is really expensive — it's going to copy a ton of stuff around. You don't want to hit these.
So if you have a write-heavy workload, the objects that are going to be promoted are mostly memtables, and the problem you can run into is that if your new gen is too big, you're going to have these memtables sitting there, then you get a pause and it promotes all of this stuff at once — and remember, copying is slow; in fact, it's ridiculously slow. So what happens?
You promote all this stuff and you get a really long stoppage — I've seen 400 milliseconds. That's 400 milliseconds where, if you had started a query expecting it to take two milliseconds and you just happened to catch the pause right in the middle, you're going to take 402 milliseconds to answer that query. It's very, very frustrating to get these long pauses.
The second workload I want to talk about is the read-heavy workload, which is the complete opposite. When I read a bunch of data out of Cassandra, I'm expecting that maybe it takes a few milliseconds: I read a bunch of stuff off disk, I send it out over the network, and then I never think about it again — those objects are ready to be collected.
So if we have a really, really small new gen, what happens is: we've got this read-heavy workload creating tons of temporary objects in memory, but eden is getting filled up so quickly that the JVM goes, "nope, we have to stop — let's promote everything." You've got these objects constantly being promoted into your old gen that don't belong there, and because they're being promoted so frequently out of new gen, the old gen fills up and you end up with full GCs.
So this read-heavy workload results in lots of full GCs, which we know are terrible. To understand what is happening from a garbage collection point of view, I recommend a tool called jstat. jstat, combined with a flag called -gcutil, takes the process id, the number of milliseconds for the sampling interval, and a count, and it will show you, at that interval, what the state of garbage collection is. You can look here and you'll see the different survivor generations, eden, and old gen.
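A sketch of the invocation; the pgrep pattern assumes Cassandra's usual main class:

```
# Sample GC state every 250 ms, 20 samples: survivor/eden/old occupancy,
# young and full GC counts, and cumulative GC time
jstat -gcutil "$(pgrep -f CassandraDaemon)" 250 20
```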
You also get the young GC count, the amount of time spent in young GC, and the same for full GC — a lot of really useful information. There's also -gccause, which is another really good flag; effectively, it adds the cause of the last and current GC, giving you really good insight into what's happening in your JVM.
This is what happened when we tuned garbage collection on the JVM at my last company: we saw an almost tenfold increase in performance on our Cassandra cluster. This was a screenshot that we took, and the number at the top of the graph was ridiculous — we're talking almost 30 milliseconds per operation, because we were seeing so many pauses. And we already knew something was off: we were on solid state drives, so why was it taking 20 milliseconds?
(In response to an audience question about telling whether a problem is Cassandra or the application:) Without a doubt, it's often an "oh, really?" moment — you're doing an API call to some random website that happens to be down, and it's pretty clearly not Cassandra. So you want to monitor this stuff: you need to put those metrics in place so you can graph everything and look at the rate of failures happening in your system.
If you do think it's Cassandra, you should check OpsCenter. OpsCenter is going to give you information immediately: how many nodes are down, what the JVM profile looks like. You'll have very, very good information as to whether or not something has changed in the last day or whatever, and that should lead you down the right path.
Maybe I have to fix this: do we have slow queries? It's possible that your data model is incorrect; it's possible you're using the wrong compaction strategy; you need to look at your system stats. It's possible that you're just seeing lots of JVM pauses and there's nothing actually wrong — it just happens that you need a bigger new gen size.
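Heap and new gen sizes are set in cassandra-env.sh; the numbers below are purely illustrative, not recommendations:

```
# conf/cassandra-env.sh
MAX_HEAP_SIZE="8G"   # total heap
HEAP_NEWSIZE="2G"    # larger for read-heavy churn, smaller for write-heavy promotion
```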
Then you take a look at the histograms and query tracing, and you should be able to pinpoint whether it's an individual query, an individual table, garbage collection, a misconfigured server, or a dead disk. You should be able to diagnose all of this very quickly if you put the right metrics in place.
That's my slides — questions?