Description
Speaker: Jason Brown, Senior Software Engineer at Netflix and Apache Cassandra Committer
Slides: http://www.slideshare.net/planetcassandra/6-jason-brown
This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.
Hi everybody, thank you for coming out. These days I spend a lot of time thinking about data consistency. Back when I started my career, and even for a lot of my time until recently, I was dealing a lot with things like Oracle and other old-school, big-iron databases. Of course, with those you really didn't have to worry much about data consistency, because it was pretty much just one big massive node. If you were unfortunate enough to have to deal with things like multi-master replication, then you did have to worry.
In addition to that, there's also master-slave replication in the classic kind of setup with MySQL. We thought a little bit about data consistency in those cases, but not so much; we just always assumed that log shipping was usually pretty successful. But now that we're in the new world of this NoSQL kind of land, where essentially everybody's a peer and everybody's a write master, data consistency becomes a lot more interesting and a lot more tricky.
Inconsistencies can really creep in in many, many different ways, and I'm going to be talking about anti-entropy protocols in Cassandra and the way that we deal with alleviating a lot of these inconsistencies. A little bit about me: my name is Jason Brown. I currently work at Netflix as a senior software engineer on the cloud database engineering team, which essentially means I work with Cassandra, and I'm also a committer on the Apache Cassandra project. In my previous life I worked as an e-commerce architect at Major League Baseball.
I also did some wireless development way back in what seems like the dinosaur ages now. So, maintaining consistent state in a distributed system is hard. Essentially, the CAP theorem really comes into play and really gets in your way. Just a very quick, brief summary of the CAP theorem: C is for consistency, A for availability and P for partition tolerance. Consistency means data consistency, which is what I'm going to talk a lot about here.
Availability is just the ability for a client application to get some kind of response, positive or negative, from a distributed system, and P is partition tolerance, typically in reference to a network partition. In the case I'm going to be talking about, that's Cassandra: if a Cassandra node goes down, do you really want the entire application going away, or can you deal with partial failure?
Inconsistencies can creep into distributed databases in a lot of different ways. I'm going to pick on Cassandra, since it's the system I've been thinking about mostly and the one that my employer pays me to think about. The most common thing is that a node goes down. We happen to run inside of EC2, so nodes can frequently just disappear without warning, and often do, late at night. There's also network partitioning.
Dropped mutations are kind of interesting. That can happen when a node gets really overburdened and has so much traffic coming in that it just can't swallow and absorb all of the incoming requests, so the node chooses to drop requests on the floor because it needs to keep going. It could either try to swallow everything that's coming its way, or it can pace itself and just drop some mutations.
The way that we deal with a lot of these problems is by introducing anti-entropy protocols. Essentially, anti-entropy tries to resolve all the inconsistencies that just happen to occur because you're running a distributed system and things will crash. The major things that we use in Cassandra are, at write time, tunable consistency, atomic batches and hinted handoff.
At read time we can do things like consistent reads and read repair, and then at maintenance time there's node repair. I'm going to do a deep dive into each one of these topics so we can see how Cassandra will recover from the problems that creep in. Let's start off with write time, because that's how all data gets into the system.
A client application will send a request to what's called the coordinator node. That coordinator node will figure out all the replica nodes that the request needs to go to in every DC that's in the Cassandra ring. Then it's going to send the mutation to all the replicas in the local DC.
To any remote data centers, it's just going to send the request to one of the replica nodes, and that replica node is responsible for forwarding the request on to its peers within that data center. Then the coordinator node waits for everyone to respond, all the nodes in all DCs. So in my example here, the client application sends a request over to replica one in the first data center, and this is assuming a replication factor of three, and it's going to ship the mutation off to the other replicas.
With tunable consistency, essentially the coordinator will block for a specified number of replicas to respond to it. There are a lot of different levels you can have here for the consistency level. ANY basically means I'm going to wait for anybody to respond to my write; it could be one of the replicas, it could be anybody else. Typically this is really only useful when you're doing something where a degree of data loss is acceptable.
A
So
if
you're
doing
some
kind
of
logging
that
that,
if
you
missed
a
few
rights,
you
may
be
okay,
however,
for
most
other
use
cases
and
in
fact,
or
everything
in
netflix,
we
never
use
any.
Then
one,
two
and
three
I'm
pretty
straightforward,
you're,
essentially
waiting
for
one
two
or
three
replicas
to
respond
back
to
that
coordinator.
So
he
can
so
it
can
consider
that
right,
successful
local
quorum
is
essentially
within
a
within
that
coordinators.
Data
center
I'm
going
to
wait
for
a
quorum
of
replica
nodes
to
respond
I'm sure you probably all know this by now, but a quorum is essentially the replication factor divided by two, plus one (using integer division). To go back to my earlier slide, the first one: replica node one is going to ship off the request to nodes four and two as well as itself; I just made it simple in this picture. It will wait for two of those guys to respond successfully before it considers the local quorum write successful.
EACH_QUORUM is another consistency level; essentially EACH_QUORUM just waits for a quorum from each data center, pretty straightforward. And then ALL basically says I'm going to wait for every replica everywhere to respond successfully to me. All pretty straightforward. Now, the real trick with tunable consistency inside of Cassandra is that, while you can choose things like ALL and EACH_QUORUM, you're really going to run into real problems.
A
You
may
think
you're
actually
going
to
get
a
stronger
consistency
level
with
your
rights,
but
you're
more
likely
to
just
cause
other
sorts
of
problems
in
your
application
that
you'll
have
to
deal
with
the
likelihood
that
you'll
have
all
of
your
nodes
up
all
the
time
and
never
have
to
worry
about
anybody
being
down
is
really
just
a
fallacy.
Somebody's
going
to
go
down,
something's,
going
to
fail
or
you'll,
be
doing
maintenance
and
and
your
rights
will
fail
at
that
time,
EACH_QUORUM is almost as bad, in my opinion, because essentially what that's really saying is that you're going to trust the network between all of your data centers. If you happen to be running in something like EC2, where the guarantees between regions are about zero, and that's not a knock on Amazon, that's just the reality of how the public internet works between those data centers, you can really get into problems by expecting that network connection to be up all the time.
Then your mutation essentially has a hard-coded dependency amongst all of your regions or data centers, and this is actually a problem we deal with at Netflix; there are one or two use cases of EACH_QUORUM and it does get into problems. In my personal opinion it's not highly advisable, but it is a possibility.
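As a rough illustration (not from the talk; the keyspace, table and values are made up), choosing a consistency level per request from cqlsh looks something like this:

    -- CONSISTENCY is a cqlsh command; LOCAL_QUORUM only waits on replicas in the local data center
    CONSISTENCY LOCAL_QUORUM;
    INSERT INTO demo_ks.users (user_id, email) VALUES (42, 'someone@example.com');
    -- ANY is only sensible for fire-and-forget writes where losing a few is acceptable
    CONSISTENCY ANY;

Client drivers expose the same levels per statement rather than per session.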
So, hinted handoff: what happens when one of the replica nodes that the mutation is supposed to go to is down?
On the coordinator, as we're figuring out all the replica nodes that this mutation needs to go to, we'll first figure out who all those nodes are, and then we'll check to see if any of them are down.
If some of them are down and you can still fulfill the consistency level that was part of the request, we can still send off the mutation to the nodes that are alive; however, for the node or nodes that are down, we'll store a hint. Also, if after we send off that mutation we don't get a response back from one of the nodes, say it went down or there's some kind of network congestion or whatever, if we don't get a response within the write request timeout in milliseconds, we'll also store a hint. I believe the default value for that timeout is ten seconds; it's a value in the cassandra.yaml.
One of the other interesting things about hinted handoff is that there's another parameter in the cassandra.yaml called max hint window in milliseconds, and that's basically the amount of time for which a coordinator will store up hints for a node that's down. The default is three hours. So if a node is down for one hour, we'll store an hour's worth of hints, because the default for that value is three hours, and then when the node comes back up, we'll replay all the hints.
A
However,
if
they
note
happens
to
be
down
for
four
hours
after
the
third
hour,
the
coordinator
will
stop
storing
up
hints
and
then
we'll
just
drop
anything
for
that
that
node
and
then
essentially
what
you're
going
to
have
to
do
at
that
point
is
when
that
node
finally
does
come
up.
It
gets
all
the
hints
replayed
to
it,
it's
out
of
sync,
and
it's
not
going
to
get
the
data
until
you
do
something.
A
little
bit
more
involved,
which
I'll
talk
about
later.
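For reference, a sketch of the cassandra.yaml settings being described (Cassandra 1.2-era names; the values shown are the defaults mentioned in the talk, so check your own version):

    write_request_timeout_in_ms: 10000    # no response within this window -> the coordinator stores a hint
    max_hint_window_in_ms: 10800000       # 3 hours; stop storing hints for a node that has been down longer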
Hinted handoff replay is pretty straightforward: it just replays the hints back to the nodes that came back up. It runs every ten minutes. As of Cassandra 1.2 it's now a multi-threaded operation, so if you have multiple nodes that you're trying to replay hints to and one of them is slow, you're not going to block the hints being replayed to the other nodes that are also back up. And one of the best things for us is that this is throttleable.
It's throttleable in kilobytes per second, and why this is interesting is that if you have a node that comes up and there's a bunch of hints backed up for it, this will help the coordinator not just blow through all of its I/O trying to read all the hints off the disk and then shove those bytes out through the network card.
So it's really to help the coordinator node not get bogged down in hints, and instead keep serving the actual reads and writes that it's getting through normal production traffic. Of course, for the node that's coming back up, if the hints are delayed getting to it, obviously it's going to be out of sync for a little bit longer, but you already have that problem anyway since it was down, and now it's getting hints, so the extra time spent shipping them is a minor cost.
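The knob being described is, roughly, this cassandra.yaml setting (1.2-era name; the value is the shipped default, not a recommendation):

    hinted_handoff_throttle_in_kb: 1024   # cap hint replay in KB per second so it doesn't starve live traffic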
In this example, node R2 is down when the request comes in, so the coordinator node, replica one, will store a hint; as soon as R2 comes back up, it replays the hint. Pretty straightforward. However, you get into an interesting problem, and I didn't explain the setup for this intentionally, because it makes it a little bit more complicated.
A
The
way
the
Cassandra
will
do,
mutate
is,
if
it's
got
a
batch,
mutate
I
should
actually
say
if
it's
got
to
mutate,
that's
going
to
be
applied
to
three
keys.
Three
row
keys:
it's
going
to
ship
those
out
it's
going
to
handle
each
one
of
those
mutates
one
at
a
time,
so
serially
soldered
to
the
first
mutate
second
and
then
third,
So the question becomes: what happens if the first one succeeds, the second one succeeds, and then the coordinator fails? It's then up to the client to replay the rest of that mutation, which is a little unfortunate. That's why in Cassandra 1.2 we introduced atomic batches. What that is: the coordinator, upon getting that mutation, the first thing it's going to do is send a copy of the mutation to two peers in its local data center. That way, if it goes down, its peers can replay that batch at that point.
So the coordinator sends copies of those mutations off to its peers, it performs the actual, real job of sending out the mutations to all the real replica nodes, and then it's going to call its peers and ask them to delete that batch.
The peers, if they don't get that delete request within 60 seconds, meaning the batch is older than 60 seconds, are going to replay those mutations. Internally, as of Cassandra 1.2, all mutations are using atomic batches, so it does add a little bit of latency to your write path, but not a whole lot, especially when you consider that Cassandra's write story is actually pretty excellent; this is a very small tax on top of that.
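As a minimal sketch (keyspace, tables and values invented for illustration), a logged batch in CQL, which is the kind that goes through the batchlog just described, looks something like:

    BEGIN BATCH
      INSERT INTO demo_ks.users (user_id, email) VALUES (1, 'someone@example.com');
      UPDATE demo_ks.user_counts SET login_count = 5 WHERE user_id = 1;
    APPLY BATCH;

An UNLOGGED batch skips the batchlog copy, and with it the replay guarantee.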
So let's jump over to what happens at read time.
At read time there are also consistency levels, but we'll get to that in a second. First of all, we determine who all the replica nodes are that we could be contacting in all the data centers. The coordinator node will send a request to the closest replica node and ask it for the full data set; all the other replicas it's going to ask for a digest, and the reason we do that is to optimize on network traffic and also garbage collection on the coordinator.
If the digests don't match, the out-of-date nodes get sent the full, most current set of that data, and the coordinator will block until those guys respond; once they respond, everyone is up to date. The coordinator will do the merge of all those rows and columns and then return that to the caller. That's a consistent read. If there is a mismatch, obviously you are going to pay a little bit in terms of latency, but this is a way to make sure that your data is more consistent.
Read repair attempts to synchronize the client-requested data across all the replicas. How it does this is by piggybacking off of the normal read path, but it's going to wait for all the replicas to respond back to the coordinator, and it does that asynchronously. Similar to the consistent read, it's going to compare the digests of all those replicas, and if anybody is out of date, it's going to ask for the full data set, do the compare, send out the up-to-date information, and then it's done.
So, an example of read repair. The green lines here are the nodes we're going to use for the local quorum read; the lines in blue are the ones we're using for read repair. Our coordinator here is asking replicas four and three for the data set that it will essentially use for returning back to the client app. However, it's also going to send the request over to the other four nodes, shown in pink.
It'll get all those responses back and fix up anybody who's out of date. Also, for a read repair it's only going to wait for the read timeout in milliseconds, to use the actual name of the yaml parameter; essentially it's not going to wait forever, and it's not going to keep some token around waiting for all these nodes to come back to figure out the read repair. It's going to wait for the standard timeout and then try to do its thing.
The way this is configured is that it's a setting per column family, so not every column family needs to have it on, and how it actually works is that it's a percentage of all the reads that come into that column family. Either using cqlsh or the old-school command-line CLI, you set it up as a value between 0 and 1.
That essentially just becomes the percentage chance of a particular read triggering the read repair. You can set this at a local-DC level or globally, and where that actually becomes interesting is that local DC will just make it faster and less involved, whereas if you've got many, many data centers in your Cassandra cluster, it's going to take that much more time and that much more computation to fix up all that data.
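As an illustration only (the table name is made up and the values are arbitrary), the per-column-family setting looks roughly like this in CQL:

    ALTER TABLE demo_ks.users
      WITH read_repair_chance = 0.1            -- 10% of reads repair across all data centers
      AND dclocal_read_repair_chance = 0.0;    -- or use this one to confine the repair to the local DC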
Read repair is great in that it repairs data that's actually requested. There is, of course, the problem of what about data that isn't requested. You get a bunch of mutations; let's say some percentage of those are actually requested by clients or callers, but then you've got some set of data that isn't actually requested, so it never goes through the read repair path. It's just sitting there, potentially out of sync.
What node repair is going to do is essentially repair inconsistencies across all the replicas for a given range. How it actually executes is that on a given node you connect to the Cassandra process with nodetool and execute a repair. How it works is that the Cassandra process will figure out the ranges around the token ring that it owns; the parameters you can pass in on the command line to nodetool are the column families, and they have to be within a given keyspace.
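A hypothetical invocation (the host, keyspace and column family names are made up) would look something like:

    nodetool -h 10.0.0.1 repair demo_ks users user_counts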
A couple of cautions before we dive into how node repair actually works. It should really be part of the standard operations for a Cassandra cluster, especially if you're doing deletes, because what you can run into is the problem of resurrected data. Say you've got a bunch of writes going on and you do some deletes, but those deletes, those tombstones, never percolate over to the other replica nodes.
If the node that got those deletes then goes past the GC grace period, which I think defaults to 10 days, and you didn't sync that data over to the out-of-sync node, the node with the tombstones could compact those tombstones plus the original data out of that node. But then you still have the live data on the guy who never got the tombstone.
So essentially what you've just done is resurrect the data you tried to delete, because if you then run something like node repair or read repair involving the guy who was out of sync, he'll put that data back onto everyone else, since the data will be missing on the other nodes, and now you've just resurrected data that you didn't want there in the first place. So it's pretty important to run node repair.
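The grace period itself is a per-column-family setting; as a sketch (made-up table name, value is the 10-day default mentioned above):

    ALTER TABLE demo_ks.users WITH gc_grace_seconds = 864000;   -- 864000 seconds = 10 days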
However, there's a pretty big drawback, and that's that it's disk-I/O and CPU intensive. Working at Netflix, running in EC2, this is a big problem, and it's really important to schedule these node repairs off-peak. Depending on the size of the cluster, the amount of data it needs to compare, and how frequently you actually run node repair, it can take upwards of six to ten hours.
It actually makes this discussion a little bit easier not to deal with vnodes, but the same basic principles apply. The initiator node, the one you trigger a repair on, is going to contact all of its peers that we discovered in that first step, and it's going to trigger a major compaction on those guys. A major compaction in this sense is going to read all the existing SSTables on those remote nodes, and essentially it's going to build up something called a Merkle tree.
A Merkle tree is essentially a binary tree of ranges and hashes; the hashes are at the leaf nodes, whereas the ranges are the intermediary nodes in the tree. Essentially, each node is going to build up this tree by reading every row in its data set, which means it needs to churn through a bunch of I/O, reading every SSTable off disk and merging. It doesn't write anything back out to disk, but it does have to read everything so it can build up that Merkle tree.
Then it's going to return that tree back to the initiator. The initiator essentially waits for all the trees to come back, including calculating its own Merkle tree, and then it's going to compare every tree that it received to every other tree. Essentially, what it's doing is comparing the hashes that are in the trees, not the actual data.
That's an important point to remember here, because if in that comparison any differences are detected, it's going to ask one of the two nodes that are differing to ship all the ranges that differ over to the other guy, and then the other guy is asked to ship all of its data for those ranges back to the first node. All of that gets written out as new SSTables on those disks, and I'll show that with an example here.
Let's say we've got a six-node cluster here, with a replication factor of three, so each node has three ranges. Again, this is going back to a cluster that is not using vnodes; it's a little bit easier to see. The node at the top is responsible for token ranges A, B and C; the next guy down the line is responsible for B, C and D; this guy for C, D and E; and so on.
The three guys at the top all share range B, and you can see where this is going next: all the guys on the right there share range C. So at the end of the day, what you're going to get is all five of these nodes participating in this repair, meaning they're all going to do a major compaction, build up their Merkle tree, and ship it back to our initiator node at the top.
The node at the top is going to tell this guy, hey, why don't you send all of your data for that range over to this other guy, and then that guy is going to ship all of his data for that same range back over here. At that point the data is technically not merged, but what actually happens is that, in exchanging that data, it gets written out as a local SSTable, so the next reads that come through on those nodes will essentially act like any other read.
The read will merge everything, and if some of the data is out of date, it won't even bother to use it. So those are the anti-entropy protocols and operations that we have inside of Cassandra. At the end of the day, the CAP theorem lives on, and we just need to understand the trade-offs involved and make reasonable decisions based upon what we're trying to do. In Cassandra, we've decided to trade a little bit of consistency for greater availability.
Of course, to deal with the inconsistency you need some processes to help get you back to consistency, and Cassandra has those processes to fix things up. As I mentioned, tunable controls exist at write time, at read time, and at on-demand maintenance time inside of a Cassandra cluster. That's all for now, so please, if you have any questions, I'm happy to take them. Thank you.
What's going to happen is that we'll figure out all the peer nodes that share that same range, and it just won't be as even as in the diagram I put up; it'll be scattered nodes around the cluster. But because of the nature of vnodes, every node could be sharing a range with every other node, so it could happen that you ask every node in the cluster to participate in that node repair, and what's going to happen is that every node is doing a full major compaction.
Well, how it's actually going to execute, to back up to the slide: all five nodes are going to participate when you execute a node repair on that top node. All five of these nodes are going to generate their local tree, ship them back to the original guy, and then life goes on.
As for the window between repairing the first node and the second node, during which inconsistencies could creep in: I don't have a great answer; perhaps Brandon or any of the other committers here in the room have a better answer. But node repair is definitely a very heavy-handed operation, and it definitely does work in getting data consistent.
Audience: Just a quick follow-up: when the Merkle tree is generated, it's generated across all of the data on that node, I believe?

Jason: Not really; the way that the code actually executes doesn't take that into account. And then the other trick that's going to play into account here is vnodes. With vnodes, as far as I remember, there is no real primary range, since primary range really referred back to when you were assigning a single token to a given node. Now that a given node can have up to n number of tokens, primary range is a little bit more fluid.
It depends on what you're doing with that actual cluster. We can take that specific Netflix use case offline, but there could be some benefit there, yeah.
I can't give you a real answer, because to a degree, when we run on SSDs we're running on Amazon's hi1.4xlarge instance, which you really shouldn't think about as being SSDs, because there are n layers of virtualization between your application and the actual metal, so thinking about it as SSDs is not really doing it justice. You can get higher IOPS, but it's not like a real, rockin' SSD.
The use case that I know about off the top of my head is actually not fully representative, because for that specific use case we actually switched from leveled compaction to size-tiered compaction in the middle, after we switched to SSDs, so I don't know if it would reflect the more current timings. Plus, at that time we were still running a version of Cassandra that had a bug doing repairs on leveled-compaction clusters, so it's a little squirrely.
If those six nodes happen to have the updated data set and are all consistent with each other, and if you're not executing read repair, or even if you are, those six will respond to the coordinator, the coordinator will check all those digests, they'll check out, and it will return back to the calling application saying, hey, everything's cool. It's only when those other nodes are involved in that quorum read that you'll trigger the consistent-read algorithm to fix up that data. Again, it's just the luck of the draw which nodes you happen to hit.
On the EACH_QUORUM use case: my specific beef with that is that it essentially is now tying your write to the availability of every data center and the connectivity to every data center, which for Netflix is now US West, US East 1, Europe; if we launched in Brazil, if we launched in Singapore, you've now just tied that write to five regions and a quorum in all five, and your availability is definitely going to suffer.
Internally, we're really trying to push a notion of just accepting eventual consistency, especially in these use cases. There's actually one use case at Netflix where we actually use EACH_QUORUM, and there's an argument about whether it's even really necessary. Where things get interesting is with the compare-and-swap feature that's being introduced in Cassandra 2.0. I still need to do a lot of reading up on that, but it could potentially have helped this use case; I'm not sure yet.
Let's say you have a cluster and there's a rogue application that's essentially a batch application, and for some kind of consistency policy, or referential integrity, if it fails it's not going to move the message to a dead-letter queue or put it off to the side so that you can do some special processing on it.
It's just going to keep processing the thing over and over and over and over again, and this thing happened to do a bunch of deletes. So essentially what happened is that it would do a bunch of deletes and try to read the same row. Within the span of about two hours we would accumulate about 200,000 tombstones in a given row, while trying to read it at the same time, so essentially you have all these nodes trying to read the data as fast as possible.
Oh, and read repair chance was also set to one, by the way, globally. Well, the first time it was on, the second time it wasn't; either way it still screwed us. So essentially you've got this conflict of trying to read a bunch of tombstones out of the SSTables and merge that with the one live column that's alive, and it was just choking on GC, as it had to have all those tombstones and TTL'd columns in memory.
Was the order compaction and then repair? Yeah; basically we lowered the GC grace, ran the repair, and then we did a major compaction to make sure all the data was gone. At that point we had thrown away all the tombstones, and once all those compactions finished, the JVM recovered within about a minute. It no longer had to worry about all that garbage sitting around; basically the pauses had just been killing it.
Audience: We rolled it out very fast, going from 1.1.9 and then upgrading to 1.2, because of the atomic batch feature introduced in 1.2, and it turns out the CPU overhead is something like nine hundred percent, and it was just killing our Cassandra clusters. I realized that if I reduce the batch number to around 100 it behaves much better; it turns out the CPU cost rises as the batch size rises. So to me it seems like atomic batches might not...
You know, you can run into problems, but it's going to be the same advice that I would tell anyone: you really need to test major versions before pushing them out. Unfortunately I don't have a better answer than that. I'm not aware of any bugs off the top of my head with that, but there certainly could have been.
Audience: Because I checked the usual case; the typical batch number we're using is around 100 or something, and I chose five thousand before, and it turns out it doesn't stand up to that. So it seems that probably not many people realize that this is a problem for 1.2, essentially when you have a batch number which is very large.
Audience: So typically that's not been an issue in your experience, although I've seen it in other similar types of systems. All right. The other thing was, you mentioned file corruption, but you didn't say much about it after that. Is there much in Cassandra to detect such corruptions and repair them automatically, or...
Yeah. The reason why I brought up compaction strategies is that in Cassandra 1.0 there was a bug doing repair on leveled-compaction clusters. I don't remember all the details, but there was a bug that prevented us from running repair, so what we did is just swap that cluster from using leveled compaction to size-tiered; then we were able to do repair and everything was happy after that. So that's the only place where that angle comes in on this side.
Disk space shouldn't jump, because we're actually not writing out new files; we're just reading all the existing ones.