Description
Speaker: Aaron Morton, Co-Founder
Apache Cassandra makes it possible to write code on a laptop and deploy to multi-region clusters with a few configuration changes. But what does it take to create repeatable, scalable, reliable, and observable clusters?
In this talk Aaron Morton, Co-Founder at The Last Pickle and Apache Cassandra Committer, will discuss the tools and techniques they use. From environment planning to implementation with tools such as Chef, Sensu, Graphite, Riemann, and Logstash, this is a discussion of the full-stack ecosystem for successful projects.
A: My name is Aaron Morton, and today I want to talk to you about how to take Cassandra off your laptop and into the data center, and the types of mistakes that people make. Those mistakes mean they come and give me money, which I like, but sometimes I'd like to do some more enjoyable work.
A: A little bit about myself before I start: I run a small consultancy called The Last Pickle. I've been using Cassandra for about five years, I'm a committer on the project, and we have staff around the world and we just help people use Cassandra.
A: So what we're going to do is look at the decisions we make at the design phase, the development phase, and the deployment phase that mean that when we get to our cluster, it works. It always works on your laptop; I saw someone today with a sticker that said, "It worked on my laptop." We want to get it into the data center, and then we want it to still work in the data center after three months, after six months, even after 12 months.
A: There have been a lot of other sessions today and yesterday and the day before around how to actually achieve functionality. This is about how to achieve performance and scale. So the first thing we want to look at is avoiding unnecessary reads: we're going to do what we call a no-look write. Say we've got a table that looks like this, where we're just going to track the days that a user visits our website.
A: If you were doing this in an old-fashioned database, you might have a model like this: read from it, and if it doesn't exist, do an insert. That's a reasonable thing there, because you've got a primary key constraint. But in Cassandra our primary keys don't have constraints on them; you can insert again and again and again with the same value. So don't bother doing the read, you're just wasting time and energy. Just insert it, and then let compaction take care of dealing with those overwrites.
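(A minimal sketch of a no-look write with the Java driver; the keyspace, table, and values are illustrative, not from the talk:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Assumes a hypothetical table:
    //   CREATE TABLE tracking.user_visits (
    //       user_id text, visit_day text, PRIMARY KEY (user_id, visit_day));
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("tracking");

    // No read-before-write: INSERT is an upsert in Cassandra, so writing the
    // same (user_id, visit_day) twice is a harmless overwrite that compaction
    // will eventually collapse into one copy on disk.
    session.execute(
        "INSERT INTO user_visits (user_id, visit_day) VALUES (?, ?)",
        "user42", "2015-07-04");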
A: Okay, I always use the phrase "limit in space or time." So if we look at limiting something by time: imagine we've got that table now, and we're tracking every time the user visits. We record some piece of data, say a hundred KB in size, and we keep it forever, so this table will grow. The partitions in this table will grow with no bound: as long as your site keeps running, the partition keeps getting bigger.
A: That means we have a bigger index inside that partition, one that you don't even see or know about. It means that when it goes through compaction, it goes slower; big partitions, anything above 32 MB, have to go through a slower compaction process. And it means that when you do a repair, you are potentially copying around data that you don't need to, because we repair token ranges, not individual rows.
A: If there's nothing wrong with this partition except that it's next to a partition that does have an inconsistency in it, we will copy this guy around. So if you've got a partition that's 500 MB in size, we will just copy that to all the other nodes and then wait for compaction to do its thing again. So a better approach is to bucket.
A: So we add another column, and this time we're going to bucket by day, and we put that into the partition key. Now every partition is only going to get written to for one day, and that means it's not going to get too big; we have a fair idea of how big it's going to get. If you did the developer training, they went through a whole set of algorithms for working out how many bytes it's going to be.
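(A sketch of what that bucketed schema could look like, reusing the session from the earlier sketch; all names are illustrative:)

    // The day bucket joins user_id in the partition key, so each partition
    // holds at most one day of visits and its size stays predictable.
    session.execute(
        "CREATE TABLE IF NOT EXISTS tracking.user_visits_by_day (" +
        "  user_id   text," +
        "  visit_day text," +      // e.g. '2015-07-04'
        "  visit_ts  timeuuid," +
        "  payload   blob," +
        "  PRIMARY KEY ((user_id, visit_day), visit_ts))");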
A: Another issue we want to look at around tables is avoiding a mixed workload. If you think about the log-structured merge-tree storage engine, we can end up with fragmentation of your partition, because we flush to disk and the rows get fragmented; the job of compaction is to bring those back together, but compaction can only do so much. If you have a bad data model, you will have things spread out.
A: Say we've got the user's password and the last time they visited. The password gets updated infrequently and read frequently; the last time they visited gets updated frequently and, let's just go with it, read infrequently. Two different workloads like that will create a table that's very fragmented, and when we go to do the read, the read path will have to look at all the fragments, even if you're just getting the password and that's only in one fragment.
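(One way to separate those workloads, sketched with made-up names: give each access pattern its own table, so the read-mostly data stays compact while the write-heavy data fragments on its own:)

    // Read-often, written-rarely: stays in very few fragments on disk.
    session.execute(
        "CREATE TABLE IF NOT EXISTS users.credentials (" +
        "  user_id text PRIMARY KEY," +
        "  password_hash text)");

    // Written on every visit: fragments heavily, but nobody reading the
    // password has to walk these fragments anymore.
    session.execute(
        "CREATE TABLE IF NOT EXISTS users.last_visit (" +
        "  user_id text PRIMARY KEY," +
        "  last_seen timestamp)");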
A: Leveled compaction can be a friend if you want to do a lot of low-latency reads, and it will be your enemy if you're trying to put too many writes through, because it uses about twice the disk I/O, and when it gets behind, it gets really behind and things get out of control. So LCS is good, but if you've got a high write throughput, consider the date-tiered compaction strategy, and we add it just by adding a property onto the table. It's pretty simple. There are other properties for compaction, of course, and for the leveled compaction strategy.
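(Setting the strategy really is just a table property; a sketch against the hypothetical table from the earlier example:)

    // Switch a write-heavy, time-ordered table to date-tiered compaction.
    session.execute(
        "ALTER TABLE tracking.user_visits_by_day WITH compaction = " +
        "{'class': 'DateTieredCompactionStrategy'}");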
A: So we want a data model that is going to scale up the throughput when you scale your cluster. If you launch your site and you have three nodes and you're very, very successful, and you get some money and you scale up to 30 nodes, you're going to have a lot more capacity in your cluster, but you also want to get more throughput out of it. So say we had a data model for hotel pricing, and we have the check-in day as the partition key, and then the hotel name and the pricing blob.
A: Now all of the check-in information for checking in on the Fourth of July is in one partition, and that one partition is on three nodes. When you have a three-node cluster it is on three nodes, and when you have a 30-node cluster it is still on three nodes. You will not scale the throughput for reading it: if you have a lot of people who want to check into a hotel on July 4th, you're not going to be able to successfully serve all those requests.
A: So we work out what cities we want to look at and put the city into the partition key, so that now all the hotels in Santa Clara are on one set of nodes and all the hotels in San Jose are on a different set of nodes. On the small-scale cluster, the three-node cluster, they're all on the same nodes; when we move up to a 30-node cluster, they're spread out, and we can increase our throughput as we scale the cluster. And to do that, we're going to want to use asynchronous requests.
A: What this allows us to do is think in terms of paragraphs instead of sentences. If I want to go and get the hotel pricing information for six hotels, don't make six sequential requests; make six concurrent requests using the asynchronous features of the drivers, and then wait on those six futures. It will be a little bit more than the network round trip for one request, but it will be significantly less than six network round trips. So back to our hotel model; this is our table.
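(A sketch of the six-concurrent-requests pattern with the Java driver; the table, the city, and the hotels list are illustrative and assumed to exist:)

    import java.util.ArrayList;
    import java.util.List;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;

    // Fire all six reads at once; each executeAsync call returns immediately.
    List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
    for (String hotel : hotels) {   // e.g. six hotel names in one city
        futures.add(session.executeAsync(
            "SELECT * FROM travel.hotel_pricing " +
            "WHERE city = ? AND checkin_day = ? AND hotel = ?",
            "santa-clara", "2015-07-04", hotel));
    }

    // Wait on the futures together: the total wait is roughly the slowest
    // single request, not the sum of six round trips.
    for (ResultSetFuture f : futures) {
        ResultSet rows = f.getUninterruptibly();
        // ... use rows ...
    }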
A: Make sure, when you're designing your data model, that you document what sort of consistency levels you expect people to be using, because often the people making the data model and the developers are the same people to begin with, and then some new developers come along and they don't really understand why, or they just have to copy what's in the code. If you write it down with your data model, you can explain why you think you can use eventual consistency in this area.
A: The idea of the smoke test is to find problems as early in the process as we can. So for our hotel data we may have a little script that inserts hotels and hotel prices and the cities and their relationships to each other, and then all I do is write a couple of select statements, put some comments in the script, and say: run these in parallel, make this one finish before that one, and I can test my logic.
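(A rough sketch of such a smoke test, reusing the session and hypothetical hotel table from the earlier sketches; all names and values are illustrative:)

    import com.datastax.driver.core.ResultSet;

    // Load one known row, then check that the query the application will
    // actually run brings it back as expected.
    session.execute(
        "INSERT INTO travel.hotel_pricing (city, checkin_day, hotel, price) " +
        "VALUES ('santa-clara', '2015-07-04', 'Hotel Foo', 129)");

    ResultSet rs = session.execute(
        "SELECT price FROM travel.hotel_pricing WHERE city = 'santa-clara' " +
        "AND checkin_day = '2015-07-04' AND hotel = 'Hotel Foo'");
    if (rs.one().getInt("price") != 129) {
        throw new AssertionError("smoke test failed: unexpected price");
    }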
A: The first one is, we want to understand how much we're reading from Cassandra. It may be that you're reading cells, columns, that have 20 megabytes in them, and the thing to do would be to not put 20 megabytes in one column; or it may be that you're reading tens of thousands of rows. The important part is to know what you're reading and how much you're reading back. So again, in the native binary protocol, at the base level, there is support for pagination, whether you see it or not.
A: It's there and it's always enabled. The driver sets this by default to five thousand rows, so if you do a select against a partition and it has five thousand rows in it or more, you will get multiple pages coming back. This is called the fetch size, and five thousand might be a bit much if you've got a bunch of data in there.
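(A sketch of turning the fetch size down on a per-statement basis with the Java driver, against the hypothetical table from earlier:)

    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    // The default is 5000 rows per page; for wide rows that can be a lot
    // of bytes, so page through 100 rows at a time instead.
    Statement stmt = new SimpleStatement(
        "SELECT * FROM tracking.user_visits_by_day " +
        "WHERE user_id = 'user42' AND visit_day = '2015-07-04'");
    stmt.setFetchSize(100);
    ResultSet rs = session.execute(stmt);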
A: If you've got some wide rows, some large rows, that could be quite a significant amount of data, so you might want to turn that down to 100 or 200, whatever. Also, in the background the driver is going to fetch again, and by default it does it synchronously: you exhaust the iterator, and it goes back and gets the next page. Or you can make it do it eagerly.
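(A sketch of the eager variant, reusing the statement above; the look-ahead threshold of 20 rows is an arbitrary choice:)

    import com.datastax.driver.core.Row;

    ResultSet rs = session.execute(stmt);
    for (Row row : rs) {
        // If we're close to exhausting the rows already in memory and more
        // pages exist, start fetching the next page in the background now
        // rather than blocking when the iterator runs dry.
        if (rs.getAvailableWithoutFetching() < 20 && !rs.isFullyFetched()) {
            rs.fetchMoreResults();   // asynchronous prefetch
        }
        // ... use row ...
    }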
A: With the eager option, once you start reading from the iterator, the driver goes out in the background, gets the next page, and brings it back. Again, we're going to use the appropriate consistency level, because reducing to the lowest consistency level improves performance and improves throughput, and wherever we can, we want to use token-aware asynchronous requests at consistency level ONE. We touched on this before in the design phase.
A: What this means is your client is paying attention to the tokens that are assigned in the cluster, and when it's going to write or read a value, it works out which nodes are actually replicas for it and directs the request to one of those nodes. If you're using consistency level ONE, that node can answer the query using local disk only. If it's a write, it will still send it to all of the other nodes in the cluster that are replicas.
A: And if it's a read, it may do some reading in the background to check for consistency and repair it, but you've taken network latency out of it: you're taking one network hop out of the picture, and it can really reduce your latency. You set this up using what are now lots and lots of policies and properties on the driver. You create your cluster, and in this example here we create a DC-aware round-robin load balancer.
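(A minimal sketch of that setup with the Java driver; the contact point and data-center name are placeholders:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    // Token awareness wrapped around DC-aware round-robin: each request is
    // sent straight to a replica in the local data center when possible.
    Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withLoadBalancingPolicy(
            new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
        .build();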
A: If we want to do asynchronous requests, we go to our session object and we call executeAsync, and it gives us back a result set future that we can listen on using Guava, or we can call getUninterruptibly on it if we want to. Pretty simple stuff; it's just going to give us the rows back. Now, if you're doing this, you really should avoid doing a denial-of-service attack on your own cluster, because your client is probably running at the speed of memory.
A: You may overload your cluster. So if you want to send, say, a thousand writes, maybe only have 100 of those in flight at any time. Again, the number is not as important as the fact that you think about it and make sure you haven't set things up so that when your site gets really busy, you have no protections in place and the number of writes in flight on the cluster goes sky high.
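(One simple way to put that protection in place, sketched with a plain semaphore sized to the 100-in-flight figure from the talk; the input collection and table are hypothetical:)

    import java.util.concurrent.Semaphore;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    // At most 100 writes in flight: acquire a permit before sending,
    // release it when the cluster answers (or fails).
    final Semaphore inFlight = new Semaphore(100);
    for (final String[] visit : visitsToWrite) {   // hypothetical input
        inFlight.acquireUninterruptibly();
        ResultSetFuture f = session.executeAsync(
            "INSERT INTO tracking.user_visits (user_id, visit_day) VALUES (?, ?)",
            visit[0], visit[1]);
        Futures.addCallback(f, new FutureCallback<ResultSet>() {
            public void onSuccess(ResultSet rs) { inFlight.release(); }
            public void onFailure(Throwable t)  { inFlight.release(); /* log, maybe retry */ }
        });
    }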
A: I think that monitoring and alerting are part of development: if the developers have access to monitoring early on, they can see how the system is running, or not running, and most developers should be writing metrics, the kind that you want to see on a graph alongside the metrics you get from Cassandra, so you can understand latency going through your stack, that kind of thing. And you should just use whatever works. OpsCenter is getting better and better; Riemann is very good but very complicated.
A: When you're looking at aggregations, remember the metrics are going to be coming in off one node, but if you've got 150-plus nodes, you have to think about what's happening at the cluster level and what's happening at the node level. So we generally want to have some cluster-wide aggregations; we want to know throughput across the whole cluster. If you've got a smallish cluster, you may be able to show the value for every node on the same chart, and it's often useful to have a filter that pulls out the top three and the bottom three.
A: Most of these rates are one-minute rates, so the throughputs for the more recent seconds have a greater weighting than the older seconds in that minute. Sometimes we just have values that are counts, and you're going to want to run a derivative on those to get the delta over time. And then we have lots of latency metrics, and they come out with a min and a max and a median and the standard deviation, things like that; I tend to just grab the 75th, 95th, and 99th percentiles, and these things report in microseconds.
A: Whichever process you use to get your metrics out, they have a naming scheme that is reasonably consistent. You can get your metrics via JMX; you could use Jolokia or MX4J to get them over HTTP; you could have collectd pulling things off; or, what I think is the preferred method, you can use the metrics YAML configuration file and configure Cassandra to push metrics out to the reporting systems, which is a much better approach. You can configure it to push to multiple reporting systems.
A: You can configure it to push at different intervals to different reporting systems, and you can have regular expressions that filter the metrics that are shipped; it's a really useful thing. So the metrics have a naming scheme that starts with org.apache.cassandra.metrics, o.a.c.m, and then we get into the different areas. We have one here for the cluster throughput; this is an aggregate of the throughput on this node, and it's the write Latency metric's one-minute rate. The "latency" bit is a little confusing: this is the throughput, not the latency.
A: We'll see the actual latency in a few minutes, but that's where the throughput hides, in the ClientRequest metrics. Then we have the throughput for the node. Now, if we've got a schema that has RF 3, and the throughput to the cluster is, let's say, a hundred, the local throughput should be around three hundred, because we're going to do three writes for every request that comes in.
A: You can see the throughput at the individual column-family level too: it's org.apache.cassandra.metrics.ColumnFamily, then your keyspace name, then your table name, and then WriteLatency again, the one-minute rate. And we can see the request latency. The timer for this starts when the request hits the coordinator from your client and ends when we send the response back, so it includes all of the wait time, all of the internal network traffic, all of the disk access, all of the queue times; everything's in there.
A: It is, I think, the most important number to understand; everything else comes after it. You may have metrics telling you that your data model is bad, but if your SLA is being met, then you probably shouldn't spend too much time trying to fix your data model; there are probably more important things to do.
A: So we have that request latency, but that one is for all the requests coming through this particular node. It can then be broken down by individual tables: you go o.a.c.m.ColumnFamily, your keyspace name, your table name, and down here it's called CoordinatorWriteLatency and CoordinatorReadLatency. This is really important, because you might look at that one top-level number to begin with and see that the 95th percentile is one second, but you don't know which table it is.
A: So you can drill down here and see that, oh, it's table foo where the 95th percentile is one second. The next level down on latency is the local latency. This is what's happening when the read thread gets the message off the queue on each node and starts pulling information off disk, and this is telling us how fast this particular part is running. So, for example, write latency might be about 50 to 150 microseconds.
A: Read latency might be around one hundred to 800 microseconds, and once you start to look at these numbers, you get a feel for what that network latency is all about, because this number is typically a lot smaller than the overall request latency. Now, most problems that people have are to do with the read path, and there are three key metrics that you can look at to understand the read path. The first one is the live scanned histogram; I've just pulled out the 95th percentile here.
A: This is how many cells, which is the internal representation of a non-primary-key column, we read per read, and it breaks that down by percentile. So you could look at this and understand whether you're doing huge reads on this table; maybe that's why the reads on it are kind of slow. And we've got the tombstone scanned histogram, which is how many tombstones we've read off disk. Remember, those tombstones get read off disk, allocated in memory, and then thrown away, so we don't want too many of those around.
A: So we have some metrics that let us know how many hints we stored: Storage TotalHints. It's a count, so you have to run a derivative on it, but then you can know how many hints you stored. Remember, hints are stored when a node was down before we started the request, or when the node timed out and didn't get back to us for a write. It's also broken down per IP address, so you're going to know that, oh look, the nodes on the other side of the WAN are the ones we're storing all the hints for.
A: But don't forget that hints are an optimization. They can be turned off, and by default, if a node is down for more than three hours, we stop storing hints for it. There's another measure here, which is timeouts: the rate of timeouts you're getting talking to other nodes. It's a good measure of your network health, and it lets you understand whether you've got problems going across the WAN, things like that.
A: On the other side of this, we have the repairs. So we've got read repair. There is read repair that happens in the background and doesn't slow anyone down, and as those background read repairs run, we can know how many times they detected a problem. We've also got read repair that's blocking: this is where you do a read at a consistency level above ONE and a mismatch has to be repaired before we can answer.
A: Occasionally there are errors. There are two types that happen in Cassandra, and the first one is the good one, UnavailableException. It says: you asked me to do this read or this write at a certain consistency level, and I can't, because there are not enough nodes available, so I haven't even tried to do it. The second one is the bad one: timeouts, where essentially we just shrug our shoulders and don't know what happened. Both of these should be tracked, alerted on, and measured if you're managing the system.
A: There are also errors that Cassandra itself hits, not many nowadays, but there can be errors, and there is an unhandled exception handler that will catch those; that's tracked here. You'll also see this if you run nodetool info: it says how many unhandled exceptions there have been. It's normally zero nowadays, but it's handy to know if there's something going on. You might also want to understand how much disk you're using, so there is the org.apache.cassandra.metrics Storage Load count, which gives you the number of bytes, and you can get that per table as well.
A: Speaking of taking up space, there is compaction, whose job is to squish down all of our overwrites and tombstones and make things take up less space. There are metrics that tell you at the global level how many pending compactions there are, there are metrics that tell you how many pending compactions there are per table, and there are metrics that tell you how many you've completed. So you might want to alert on this if your count of pending compactions gets above 50 or 100, something like that.
A: It's a queue with a number of threads at the end of it, and "pending" says how big that queue is. Often it might be zero; you might get a large number of requests come in, and it might take a split second for it to get back to zero. So you can monitor this, and it's a good indicator of how up to date the cluster is.
A: Now, if those messages sit in the queue for too long, they get dropped. It's part of the load-shedding process that Cassandra has. How long they are allowed to sit there is controlled by the timeouts you have in the cassandra.yaml file (read_request_timeout_in_ms and write_request_timeout_in_ms), and by default now it's five seconds for reads and two seconds for writes. If we are dropping messages, then we are shedding load; it means the cluster is overloaded, and we want to track that and know it is happening, because it could be an early indicator of something going wrong.
A: An important thing to remember is that we can drop messages and shed load and the system can still be functioning correctly. As long as we go back to the client and say we successfully processed your request at the consistency level you asked for, we're good; if we drop messages on one node, that's fine. All right, so now we're up to provisioning, getting a system out and running. I mentioned smoke tests earlier on, around smoke-testing your data model to check your logic.
A: You can also go and smoke-test your disks, which is a good thing to do. Al Tobey has a good write-up about how to do this, the types of numbers you'd expect, and the techniques you can use; just make sure the disks are not broken before you go and use them. You can also use the cassandra-stress tool and just run a smoke test with it.
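(For example, a stress-based smoke run might look something like this; the exact flags vary between Cassandra versions, this is the 2.1-style invocation, and the node address is a placeholder:)

    cassandra-stress write n=1000000 -node 10.0.0.1
    cassandra-stress read n=1000000 -node 10.0.0.1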
A: Just stand up your cluster and run this beforehand, and if there are any issues, you can fix them before you go to all the trouble of provisioning your application and everything else; there's a blog post on DataStax about that as well. Then we start to build runbooks. The idea of the runbook is to plan how you're going to handle some bad situation in the future, and my goal with a runbook is always to communicate like a ten-year-old.
A: "I did this because of this; on the weekend I went to the park," that type of stuff. Explain what we're doing, how we're doing it, and why. Then the way you test your runbooks is you run fire drills. The fire drill is there to see that the runbook works and that people understand it, and if you're in the fortunate position of deploying a new system, hopefully you can do this early on.
A: So here are the scenarios that you want to run through. The first one I call a short-term node failure: the node is down for less than the hinted handoff window, which is three hours, so the other nodes collect hints for it. We continue to be available for quorum requests, and at the end of it there's no action necessary, because this is what Cassandra is designed to handle all the time.
A: But you want to test it and make sure your application is happy. The next thing we want to do is break the cluster: take down multiple nodes until the cluster breaks, and hopefully understand how the application reacts in that situation. If you care about this in a runbook situation, you should run a repair when you come back. Then we can maybe lose an entire availability zone or a rack, or something like that.
A: So we could put in iptables rules or shut the nodes down, whatever you want to do for that, and we should still be available for quorum if we've designed it correctly; that would be a good test of your application as well. And when we come back, we may want to run a repair to make sure all that data gets back. We could rely on hints, but if you've got a high throughput, it might be faster to run repairs.
A: The next type of failure I call a medium-term failure. That's where we're down for longer than the hinted handoff window, so we can't rely on hints anymore, but we're down for less than gc_grace_seconds, which is important; we'll see that on the next one. This might be the case where you've disabled hints because you don't need them, or you had a failure overnight and you woke up, or you didn't even bother waking up and you're going to fix it eight hours later. It's a fairly common scenario.
A: You should bring it back and run repair, because it will have missed a significant number of writes, and hinted handoff will no longer be storing and replaying those writes. And then we've got long-term failure. This is where our node is down for more than gc_grace_seconds, which means that any tombstones for the deletes and TTLs that were created may now have been purged off disk; the node that was down may miss the deletions, and data will come back to life. You should never bring this node back into the cluster.
A: Now, with all that practice in place, you can understand what happens in a rolling upgrade: it's just repeated short-term failures. So you might want to make one of the fire drills an upgrade. Sometimes, if you're provisioning, it's good to provision the revision before the revision you want to run: if you want to run 2.1.9, provision 2.1.8, put the application in, run some load tests through it, and then do an upgrade to 2.1.9.
A: And if you're going to scale out, you're not going to have any impact; you're going to scale out and you'll be available the whole time. Again, if you really want to get into this, you can provision your initial system with not all of your nodes: if you're going to go to six nodes, initially provision five of them, put the application in, run it, and then, while it's running, add another one, and get some confidence that that's going to work.
A: So the question is about using different disk layouts, or RAID and things. I've worked with enterprises where they say, the box is a RAID 10 and it has 20 disks in it and that's the only box we have, and that's fine. The biggest concern you have is, if a disk fails, how long does it take for the disk to get replaced? Does the gardening happen once a month or once a week, and how long will it take to get replaced?
A: In terms of performance, JBOD is pretty good and RAID 0 is pretty good. The problem with RAID 0 is that if you lose a disk, you lose all of the data, so generally nowadays people go for JBOD, though it can be a bit of a pain with fragmentation. So I would go with the best hardware-accelerated RAID you have, and if you don't have any, I would just use JBOD.
A: To identify hotspots in your data model, you want to look at the local write throughput, and you want to find the top three and the bottom three per node: for your table, across all nodes, get a line for each node's throughput, and if you see one node that's getting more writes, that's a hot spot; the same goes for reads. Then there's understanding whether you've got fat rows.
A: You can use nodetool cfhistograms and nodetool cfstats, and they will tell you what the biggest partition is; if that's a lot bigger than your average, then you've got a hot spot somewhere, though finding it in the data model can be hard. Okay, thank you. There's a question at the back there.
A: The question is about bringing back a node that's been down longer than gc_grace, and whether to use join_ring false. Yes, join_ring false is what we used to do, but now there's an option called replace node, which takes either the IP or the host UUID of the node you're replacing. The node joins the ring in a joining mode: it accepts writes from the other nodes, but it doesn't accept any reads; the other nodes don't send any to it.
D: My question is about repairing nodes. What I have sometimes observed is that if a node goes down, the coordinator that is storing its hints sometimes also goes down, or gets very busy. In those situations, what's the recommendation? Should I just bring down both the nodes and repair both of them? What's your take on it?

A: Sorry?
D: What I'm saying is that when a node is down, the coordinator that would be storing its hints sometimes also gets very busy with very write-heavy traffic, and also goes down or becomes very slow. So in such situations, just to bring the cluster back to a healthy state, should I kill both nodes and bring both back, or what should the approach be?
A: So the question is about when hinted handoff goes crazy. In the situation where the node goes down and then the nodes that are storing hints for it go down as well, you can bring them all back at any time; it doesn't matter which order you bring them back in. And if you're in a situation where, once you lose a node, you can't handle the throughput, it sounds like your cluster needs to grow. Remember, you're doing N+1: you should be able to handle all of your throughput and latency on N, and then have the +1.