From YouTube: Apple Inc.: Cassandra Internals — Understanding Gossip
Description
In this talk we'll dive into how Cassandra nodes discover and communicate with each other, and share global state information via gossip. As the gossip subsystem seems shrouded in mystery to many folks, we'll peel back the layers and learn how it powers the underbelly of Cassandra.
Hi everybody, thanks for coming out. My name is Jason; I'm a committer on the Apache Cassandra project and an employee over at Apple. Just a quick note before we start: this presentation is not a contribution to the Apache Foundation.

So, let's dive in. Let's say that there's something big, new, and exciting happening in your life: you're going to get married, or you're going to have a new baby.
How would you go about letting everyone in your social circle know that this is going to happen? These days, what you'd probably do is put something up on Facebook or send an email to all of your friends. But what you're probably not going to do is sit around at home just waiting for everybody to send you that congratulations message. Instead, what's probably going to happen is your cousin will see your Facebook post and say:
Congratulations! Then the cousin will tell the aunt, the aunt will tell the uncle, the uncle will not do anything, the aunt will then proceed to tell everyone else in the family, and so on. So, essentially, everyone then knows that you're getting married, and this piece of knowledge has essentially been broadcast out through your social network. There's a similar thing that happens inside of Cassandra. It's called the gossip subsystem, and that's what I'm here today to talk about. So the first thing I'm going to do is talk about, well:
So, first of all, what is gossip? It's a broadcast protocol for disseminating data, and it's decentralized and peer-to-peer. What that basically means is that there's no centralized server holding all the state that a cluster node needs to contact in order to learn everything it should know about.
It's peer-to-peer: everyone is sharing information, replicating what it has heard about other nodes, from those nodes, on to new peer nodes. In that style it's called an epidemic broadcast, and what that means, similar to my potentially trivial example at the beginning, is: I told one person, that person told another person, that person told another, and so on. This is much the same way that biology works, where you have some infected organism which infects another, which infects another, and so on.
However, in computer science it's actually a good thing that everything gets infected, unless it's a security worm. Gossip is a fault-tolerant kind of protocol: once that knowledge has been disseminated across a couple of nodes, even if one or two or most of those nodes go down, a few are still infected and can continue broadcasting that information out to the rest of the cluster. And gossip is an efficient and reliable broadcast protocol.
It's got pretty minimal overhead in terms of the mechanisms with which we broadcast these messages, and it's pretty lightweight. Back in the '80s there was an excellent paper called "Epidemic Algorithms for Replicated Database Maintenance" by Demers et al. They built a system called Clearinghouse, and they were the first to really talk about this whole concept of epidemic broadcast and to give us the language for talking about this style of replication. An excellent read, by the way.
So let's look at a quick example of how this actually works. Let's say you've got a 16-node cluster (or 16 of anything), and this first node has some piece of information it wants to share. It invokes another peer, and now that peer has the data. Both of those peers invoke two more, and now four out of the sixteen are infected. Can you guess what's going to happen next? Those four are going to call four more.
They now have the data, and lastly those eight call eight more, and now the entire cluster has that data. This is just one node talking to one other node on each round, so it's a log-base-two kind of operation. You can have a wider fan-out: instead of talking to just one peer, you can talk to five, six, ten, whatever you want, but there's extra network cost involved in sharing.
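The doubling just described can be sketched with a tiny simulation. This is illustrative Python, not Cassandra code, and it assumes the best case where every contact reaches a node that doesn't yet have the data.

```python
# Illustrative sketch, not Cassandra code: simulate an epidemic broadcast
# where every node that has the data contacts `fanout` peers per round,
# assuming (best case) each contact reaches a node that lacks the data.

def rounds_to_infect(cluster_size: int, fanout: int = 1) -> int:
    """Number of gossip rounds until the whole cluster has the data."""
    infected = 1          # one node starts with the new information
    rounds = 0
    while infected < cluster_size:
        infected = min(cluster_size, infected * (1 + fanout))
        rounds += 1
    return rounds

print(rounds_to_infect(16))     # 1 -> 2 -> 4 -> 8 -> 16: four rounds
print(rounds_to_infect(16, 3))  # wider fan-out: fewer rounds, more traffic
```

With a fan-out of one, a 16-node cluster converges in four rounds; tripling the fan-out halves the rounds but multiplies the per-round network cost.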
So why do we use this in Cassandra? That's probably the more interesting question. We use it to disseminate node metadata across peers. Those types of metadata are: cluster membership (basically, what nodes are in this cluster), heartbeat, the node status, and other metadata points about a given node. Each node maintains a view of all of its peers in the cluster, including itself.
This information is, of course, much as I just described, disseminated out in an epidemic sort of fashion, each node talking to another node and spreading it out. Now that I've talked about what we actually use it for in Cassandra, let's take a quick second to talk about what it's not used for. It's not used for streaming, repairs, reads or writes, compaction, CQL, or any other fancy things, and it's not responsible for them either.
So, let's dive into the details. There are three main data structures inside of the gossip subsystem. I'm not going to talk so much about code or dig into that, but talking about the data structures helps us structure the discussion. There's a heartbeat state, an application state, and an endpoint state inside of each node. The endpoint state is more or less a wrapper around a heartbeat and a collection of application states, and inside of each node in Cassandra, for every peer node in the cluster,
there's a map of endpoint states, one for each peer. The heartbeat state holds two pieces of information: a generation and a heartbeat. The generation is essentially the timestamp of when the process was launched. It's largely immutable during the lifetime of the Cassandra process, but there are some special tricks, which I'll talk about at the very end, around why we would actually violate this. The heartbeat, then, is just a periodically updated, monotonically incrementing value; in layman's terms, a counter that only goes up.
A
Four
endpoints
data
collection
for
each
peer
in
the
cluster,
so
I
will
talk
about
Union
in
names
in
a
second,
but
then
the
version
is
essentially
a
version
of
value
for
that
meta
data
point.
So
every
time
we
increment
the
value
will
also
increment
that
version
number
that
way
it
will
help
with
out
comparisons
later
on
to
help
assure
convergence
of
all
this
data.
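The three structures can be sketched like this. This is Python for brevity (Cassandra itself is Java), and the class and field names mirror the talk, not Cassandra's exact source.

```python
# Sketch of the three gossip data structures described above. This is an
# illustration in Python, not Cassandra's (Java) implementation.
from dataclasses import dataclass, field

@dataclass
class HeartBeatState:
    generation: int       # set at process launch, fixed for its lifetime
    version: int = 0      # periodically bumped, monotonically incrementing

    def update_heartbeat(self) -> None:
        self.version += 1

@dataclass
class VersionedValue:
    value: str
    version: int          # bumped on every change, used for reconciliation

@dataclass
class EndpointState:
    # wrapper around one heartbeat plus a collection of application states,
    # keyed by enum-like names: STATUS, DC, RACK, SCHEMA, ...
    heartbeat: HeartBeatState
    app_states: dict = field(default_factory=dict)

# each node keeps one EndpointState per peer in the cluster, itself included
endpoint_state_map = {}
```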
So, the enums of this application state: there are probably a dozen or so inside of Cassandra, but I've only put up the most common ones here, or at least the ones I find to be the most interesting. Data center and rack are pretty straightforward: you want other peers to know where you are physically, or where you are logically. There's schema,
one used with the dynamic snitch, which we'll talk about at the end of this talk, and the status. The status is actually pretty important, probably one of the most important of all of these. Bootstrap: when a node is launched, it's going to set its state to bootstrap and then gossip that out. It doesn't last very long, assuming that the node has been launched before and has data.
Normal means that the node is a normal part of the cluster and is behaving as it should. Leaving and left have to do with decommissioning a node out of a cluster. Removing and removed have to do with when you remove a node that you no longer have physical access to, so you can't run decommission on that machine; you have somebody else, essentially, remove it from the cluster.
The thing that's really important about the status is that it's the status a node has declared about itself; it's not an evaluation by any other node. So this is what a certain node says about itself: "I'm bootstrapping" or "I'm normal." Up and down is a rather different concept that we'll talk about soon.
So, let's take a quick little example of just the gossip metadata points out of a small, tiny, three-node cluster I launched on my laptop. As you can see, I've got nodes A, B, and C. The generations are slightly different. The generation is more or less a timestamp, the milliseconds since the epoch, and since I launched the cluster on my laptop from a script, they all launched at pretty much the same time, the only difference being the last millisecond. The heartbeats are slightly different too, but that's okay.
They don't have to be exact even if they're launched at the same time. If you'll notice, node A has a value that's two less than the other two. It could be that node A went through some stop-the-world garbage collection pause for two seconds and wasn't able to update that heartbeat because it was paused, but ultimately the values lining up really doesn't matter, as long as each is an incrementing value.
The gossiper has a timer task that kicks off and starts a new round, and what we do is select between one and three peers to gossip with. We'll always pick a live peer if there are any in the cluster, we'll probabilistically pick a seed node to talk with, and we'll probabilistically try to contact an unavailable node, one that we've previously labeled as down.
A
Well,
then
try
to
reconnect
with-
and
we
just
probabilistically
do
that,
rather
than
always
doing
it,
because
if
the,
if
any
for
the
seeds,
we
don't
just
want
to
pummel
that,
with
with
gossip
traffic
and
for
the
unavailable
nodes,
you
don't
want
to
flood
the
network
with
packets
that
are
just
going
to
be
dropped
on
the
floor.
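That per-round selection might look roughly like this. The probabilities here are illustrative guesses at the idea, not Cassandra's exact weighting.

```python
# Rough sketch of per-round gossip peer selection. The probabilities are
# illustrative only; Cassandra's actual weighting differs in detail.
import random

def choose_gossip_targets(live, unreachable, seeds, rng=random):
    targets = []
    if live:
        # always gossip with a random live peer when one exists
        targets.append(rng.choice(live))
    # only sometimes bother a seed, so seeds aren't pummeled with traffic
    if seeds and rng.random() < len(seeds) / (len(live) + len(seeds) + 1):
        targets.append(rng.choice(seeds))
    # only sometimes retry a node we currently believe is down, so we don't
    # flood the network with packets that will just be dropped on the floor
    if unreachable and rng.random() < len(unreachable) / (len(live) + 1):
        targets.append(rng.choice(unreachable))
    return targets
```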
So let's dive into the messaging. Cassandra's gossip messaging is very similar to the TCP three-way handshake. In the three-way handshake you have a SYN, an ACK, and a SYN-ACK; inside of gossip we have a SYN, an ACK, and an ACK2. Now, why it's called ACK2 rather than SYN-ACK is an excellent question that I don't know the answer to.
If you'll notice, there are three messages per round. With a broadcast protocol we could just ship out one message, be done, and let all that information eventually percolate out through the cluster, but having three messages for each run of gossip allows us to add a degree of anti-entropy into the gossip protocol.
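A hypothetical sketch of that exchange (not Cassandra's actual wire format): the SYN carries only digests, the ACK returns newer state plus requests for missing state, and the ACK2 answers those requests. State here is simplified to a map of endpoint to (generation, version, data).

```python
# Hypothetical sketch of the SYN / ACK / ACK2 exchange, not Cassandra's
# actual wire format. State is simplified to: endpoint -> (gen, ver, data).

def make_syn(my_state):
    # SYN carries only digests (endpoint, generation, version), no data
    return [(ep, gen, ver) for ep, (gen, ver, _) in my_state.items()]

def handle_syn(my_state, digests):
    # Build the ACK: newer state I hold, plus requests for what I'm missing
    newer, requests, seen = {}, [], set()
    for ep, gen, ver in digests:
        seen.add(ep)
        mine = my_state.get(ep)
        if mine is None or (gen, ver) > (mine[0], mine[1]):
            requests.append(ep)        # sender is ahead: ask for full state
        elif (gen, ver) < (mine[0], mine[1]):
            newer[ep] = mine           # I am ahead: ship my newer state
    for ep, st in my_state.items():
        if ep not in seen:
            newer[ep] = st             # sender never heard of this endpoint
    return newer, requests

def handle_ack(my_state, newer, requests):
    my_state.update(newer)             # absorb the peer's newer state
    # ACK2: answer the peer's requests with full state for those endpoints
    return {ep: my_state[ep] for ep in requests if ep in my_state}
```

After the receiver applies the ACK2, both sides hold the newest state either of them knew; that exchange of "what I have versus what you have" is the anti-entropy.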
So, if you remember, inside of those endpoint states there's this generation and heartbeat. These are all integers, and I mentioned that they're all incrementing values. Let's talk about how, inside of a run of gossip, we actually reconcile who is out of date and who has a more updated value. Cassandra's gossip anti-entropy is based upon the van Renesse paper "Efficient Reconciliation and Flow Control for Anti-Entropy Protocols": a big, long title, but it's essentially about how to make data converge fast in a gossiping kind of system.
A
It's
also
nicknamed
the
scuttlebutt
paper,
as
you
can
see
in
my
poor
handwriting
that
I
didn't
erase
so
the
appstate
reconciliation
has
essentially
three
levels
of
precedence:
the
generation
and
the
and
the
and
comparing
the
app
states.
If
you
remember
those
individual
metadata
points
based
upon
that
version,
that
I
mentioned.
So,
let's
take
a
look
at
an
example
of
how
we
can
reconcile
the
data,
so
in
this
example,
I've
got
a
four
node
cluster.
A
You
know
it's
a
b
c
and
d,
and
this
is
a
round
of
gossip
between
nodes,
a
and
b
hope
this
lisa
semi
legible
over
there,
but
them
so
first
they're,
going
to
compare
the
a
metadata
about
a
the
generations
are
the
same
one
two
three
four.
The
heartbeats,
however,
are
different
a
since
it
is
the
owner
of
that
data
thinks
think
said:
it's
heartbeat
version
is
nine.
Ninety
four
B
thinks
that
the
heartbeat
is
nine
nine
zero.
Obviously,
nine
nine
four
is
more
current.
So, at the end of this run of gossip, B will update its notion of A's heartbeat to the value of 994. For node B, you see that the generations are again the same, and the heartbeats are again different: A thinks it's 10, B says it's 17. More interestingly, though, the status differs between the two.
A currently thinks that B is bootstrapping, and you'll see that I put the number 1 in braces just to indicate the version that it's at. B, however, has now entered the normal state, has updated that status to normal, and has also incremented the version to 2. How we actually compare these application states is strictly based upon that version number; we don't try to compare the values to see which is bigger. It's always based upon the version number.
A
So
in
this
case,
two
is
bigger
than
one
at
the
end
of
this
round
of
gossip,
a
will
think
that
B's
status
is
now
a
normal
version
of
two.
Now,
if
the
version
on
B
was
say
seven
and
a
still
knew
about
version,
one
of
that
value,
we
actually
don't
care
about
any
of
the
intermediary
state.
So
two,
three
four
five
and
six:
a
won't
care
about.
We
just
care
about
the
latest
value.
A
It
doesn't
done
I'll
care
about
any
of
the
intermediary
values,
so
looking
at
nodes
see
be
totally,
doesn't
even
know
that
that
note
exists.
So
in
this
case,
B
will
just
take
anything
beat
B
will
take
anything
that
a
has
to
say
about
C
and
at
this
and
after
this
round
of
gossip
B
will
now
know
that
C
exists
and
is
part
of
the
cluster.
A
Know
D:
the
generation
is
different,
a
thinks
that
the
the
generation
is
2
2
to
2
and
B
says
that
it's
3
3,
3
3
in
this
case
I
know
D
has
has
a
bounced.
It
was
a
restarted
and
now
has
an
incremented,
a
generation
value,
which
means
a
will.
Take
anything
that
that
B
says
about
D,
because
D
was
bouncing.
It
has
a
new
set
of
a
potentially
new
set
of
metadata
points,
so
at
the
end
of
this
run
of
gossip
able
to
take
everything
that
B
has
about
node
D.
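The precedence rules from this example can be condensed into one small sketch: generation wins outright, and otherwise each metadata point is settled purely by its version number. This is illustrative Python, not Cassandra's implementation.

```python
# Sketch of the reconciliation precedence just described. A state here is
# (generation, {name: (value, version)}). Illustrative only.

def reconcile(local, remote):
    lgen, lstates = local
    rgen, rstates = remote
    if rgen != lgen:
        # a restarted node has a higher generation: take that side wholesale
        return remote if rgen > lgen else local
    merged = dict(lstates)
    for name, (value, version) in rstates.items():
        # highest version wins outright; values are never compared,
        # and intermediate versions are simply skipped
        if name not in merged or version > merged[name][1]:
            merged[name] = (value, version)
    return (lgen, merged)

local  = (1234, {"STATUS": ("bootstrapping", 1)})
remote = (1234, {"STATUS": ("normal", 7)})
print(reconcile(local, remote))  # version 7 wins; versions 2..6 never mattered
```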
The metadata passed back and forth in the ACK and ACK2 is going to be pretty much constant across all the rounds of gossip, across all the nodes in the cluster, so gossip itself won't necessarily cause any network spikes. It's always going to be a pretty flat, constant rate of network traffic. The reason I bring this up is that sometimes I've heard complaints about things like a "gossip storm" inside of Cassandra,
A
Where
we're
you
know
certain
Network
events
trigger,
and
it
just
happens
to
get
blamed
upon
gossip-
that's
usually
almost
never
the
case.
Gossip
is
a
very
constant
trickle
in
the
background
of
your
cluster.
Of
course,
if
you
get
into
clusters
of
you
know,
10,000
nodes,
the
packets
pass
back
and
forth
are
inherently
going
to
be
larger,
but
then
again,
you've
got
a
cluster
of
10,000
nodes.
A
The
reason
why
you
have
a
cluster
that
large,
because
you
have
that
much
data
and
you're,
probably
already
maxing
out
the
network
bandwidth
anyways,
so
gossip
really
isn't
going
to
be
the
cause
of
that.
However,
when
gossip
does
choose
to
mark
a
node
up
that
was
previously
down.
Well,
it
could
be
happening.
Is
that
if,
if
if
there
was
one
node
down
in
the
cluster,
and
then
it
comes
back
up
and
all
the
node
seeds
said
it's
up-
there-
probably
gonna
start
streaming
hints
and
things
like
that
over
to
that
node.
that was down, and that's more likely the cause of what is frequently called a gossip storm. Gossip didn't necessarily cause it, but it has a practical effect inside of Cassandra: seeing that a node is now up means that we're going to start streaming data over to that peer node. So I want to spend some time talking about the practical implications of gossip, continuing on the previous thought. Some questions that would be really good to answer: who's in the cluster?
When does a node stop sending another node traffic? Good question. When is one peer preferred over another? Closely aligned with the last question. And the last thing we'll talk about: when does a node actually leave the cluster, and what happens when it doesn't leave the cluster when it's supposed to? So, cluster membership. We kind of talked briefly about this in my discussion of the reconciliation during gossip messaging.
A
So
when
and
when
a
node
and
node
starts
up,
it
basically
needs
someone
to
start
gossiping
with
right.
It's
the
whole
point
of
gossip
is
to
have
someone
to
talk
with
well,
if
you're
starting
up-
and
you
need
to
know
the
address
of
someone
to
talk
to.
But
how
do
you
know
what
that
is
until
you
actually
gossip
with
someone
to
find
out
what
the
other
addresses
are
that
you
should
guess
be
gossiping
with
well,
if
you've
ever
dealt
with
the
Cassandra
yamo
file,
there's
a
seed
provider.
A
The
most
common
example
is
a
simple
seed
provider
which
you
just
give
it
a
hard-coded
list
of
IP
addresses.
There's
a
couple
of
other
seed
provider,
implementations
that
essentially
I'll
provide
a
list
of
host
names
or
IP
addresses,
so
that
new
node
that
comes
up
gets
this
list
of
a
seed
addresses
and
starts
gossiping
with
with
with
one
of
them
randomly
and
through
that,
it's
going
to
learn
about
all
the
other
peers
that
are
in
the
cluster
and
then
because
it
knows
about
all
the
other
peers
in
the
cluster.
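For reference, the relevant cassandra.yaml stanza looks roughly like this (the addresses are placeholders):

```yaml
seed_provider:
  # hands new nodes a hard-coded list of peers to bootstrap gossip with
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2,10.0.0.3"
```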
So, up and down. The measurement of up and down is specific to a node itself. We never, never, never broadcast to any other node that I think some other peer is up or down; it's always local to me. How we actually determine it is based upon the heartbeat that I mentioned earlier, and a node's updates about another peer's current heartbeat value don't necessarily need to be communicated directly; they can come in indirectly.
Say node A can't talk to node C, for whatever networking or partition problem might exist, but B is still getting all those heartbeat updates from C. Because A knows that B is up and legit, and because A is also gossiping with B, A will get the heartbeat updates about C even though A can't talk to C. Therefore, A will still think that C is up, even though all the network traffic between A and C could be getting completely dropped on the floor: packets timing out, no ACKs, no SYNs.
A
So
for
the
general
purposes
of
this
discussion,
the
only
thing
inside
of
Cassandra
that
can
market
peer,
node
is
down
is
the
failure
detector
and
that's
based
upon
other
receiving
the
those
heartbeat
updates.
Almost
nothing
else
will
mark
it
down.
Conversely,
only
Gosper
can
mark
a
note.
The
gossip
are
the
primary
gossip
cost.
Sorry,
that's
the
only
component
that
can
mark
a
notice
up,
there's
a
couple
minor
details,
but
nothing
that
should
interest
anybody
else,
besides
our
coders
inside
this
thing.
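As a toy illustration of that locally scoped, heartbeat-driven judgment: each node watches how recently a peer's heartbeat last advanced (whether the update arrived directly or indirectly) and marks the peer down, locally only, when it goes stale. Cassandra's real failure detector is adaptive (a phi-accrual style detector) rather than a fixed timeout; this sketch only shows the idea.

```python
# Toy heartbeat-driven failure detector: purely local verdicts, based on
# when a peer's heartbeat last advanced. Cassandra's real detector is
# adaptive (phi accrual), not a fixed timeout; this only shows the idea.
import time

class SimpleFailureDetector:
    def __init__(self, timeout_secs=10.0):
        self.timeout = timeout_secs
        self.last_seen = {}   # peer -> (last heartbeat value, when it changed)

    def report(self, peer, heartbeat):
        # called whenever gossip, directly or indirectly, delivers a heartbeat
        known = self.last_seen.get(peer)
        if known is None or heartbeat > known[0]:
            self.last_seen[peer] = (heartbeat, time.monotonic())

    def is_alive(self, peer):
        known = self.last_seen.get(peer)
        return known is not None and time.monotonic() - known[1] < self.timeout
```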
So what does this notion of up and down really affect? How does it affect your cluster? It does not affect writes, because we always want to send writes to the target node they need to go to. Whoever owns a certain range for a key, we always try to shove the write over there, and if we fail to get an acknowledgment or a receipt from that node, we'll store the write as a hint; pretty common in Cassandra.
Now, probably because you haven't been getting those heartbeat updates, the node is down, or there's some network partition, and probably the socket connections are dead or nothing is being ACKed, or whatnot. So basically, all we do when a node is marked down is terminate the sockets and close any in-memory sessions, just to clear up space. That's great for the heartbeat, but what really happens if that peer is running really slowly and everything is timing out? Do we ever mark it as down?
A
The
reason
why
we
reset
the
scores
every
10
minutes
is
that
that
allows
us
nodes
that
maybe
we're
having
some
long
GC
period
for
whatever
unfortunate
reason
to
actually
get
a
rearrange
correctly,
where
it
should
be
in
a
list
so
that,
if
it,
if
it
normally
is
a
fast,
faster,
responding
node
yet
had
some
heavy
I/o
or
some
heavy
a
garbage
collection
going
on.
It
can
then
be
re
ranked.
So let's switch a little bit and talk about how nodes actually leave a cluster. There are a couple of different mechanisms. One is to use the nodetool decommission feature; for that, you actually need to log into the node that you want to exit the cluster and run nodetool decommission. What happens is that the node is going to change its status to leaving and then broadcast that out.
A
It's
then
going
to
find
the
peers,
who
should
now
own
the
ranges
that
it
has
and
stream
the
data
over
and
then
send
any
hints
over
as
well.
That
should
be
played
to
any
peer
nodes
who
might
be
unavailable
at
that
time.
Then,
after
all,
that
activity
is
done,
it's
going
to
change
its
status
to
left
and
it's
going
to
set
an
expiration
time.
A
The
next
way
of
removing
a
node
is
a
is
removed,
node
I,
once
again
it's
another
node
tool
command.
However
node
remove
node
is
when
the
Anoat
that
you
want
to
to
kick
out
of
the
cluster
is
no
longer
available
for
you
to
log
in
and
run
decommission
on.
So
you
had
to
go
to
some
other
random
node
in
the
cluster
and
execute
this
so
that
that
initiator
is
going
to
set
the
status
to
removing
for
that
peer
of
who's
gone
and
then
every
node
is
responsible
for
rebalancing
the
cluster.
A
So
they
need
to
figure
out
what
what
nodes
now
over,
which
token
ranges
and
then
stream
some
data
over
to
them.
They'll
each
all
then
delete
any
local
hints
they
had
for
that.
Node
that's
now
gone
and
then
finally
notify
the
the
the
coordinator
or
initiator
that
they're
done
with
all
their
actions.
At
the
end
of
all
this
activity,
that
coordinator
is
going
to
set
the
status
removed
and
once
again
send
another
expiration
time
so
that
everyone
can
drop
that
nodes
information
after
three
days
replace
node.
We perform what's called a shadow gossip round, which is basically kind of like a mini gossip round. What we want to do is probe the cluster to see what nodes are out there; since we're trying to replace somebody, let's see if it actually was in there, and find out the tokens that it actually owned. So it's called the shadow gossip round.
It's really just a SYN and an ACK, without any ACK2, because clearly the node that's trying to replace somebody doesn't know anything about the rest of the cluster; it's just trying to grab information from the live nodes. So we take the tokens and the host ID, then we check that the owner actually is dead and isn't still actively gossiping in the cluster, and then we just stream the data from any live nodes to ourselves.
So that's what happens when everything goes right, when you want to kick a node out of the cluster. What happens when things don't quite go so right, and you remove a node, but it kind of hangs around in your cluster for days or weeks, past that three-day period? A lot of you I see nodding your heads, kind of knowing what's coming: we assassinate that damn node to get rid of it.
So there's a JMX command called unsafeAssassinateEndpoint. The naming of "unsafe" is a little unfortunate, but it should be used with caution. It basically forces a change to a peer. Similar to how the removenode command works, you invoke it on some other node, because the original one is gone, and give it the IP address of that previous node. What we do is force the generation to be updated, and we force the status to, I think, dead.
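In nodetool terms, the three removal paths just described look roughly like this; the host ID and IP are placeholders, and newer Cassandra versions also expose the assassination operation as a nodetool command rather than raw JMX.

```shell
# Clean exit: run ON the node that is leaving; it streams its ranges
# away, sends its hints, then announces status "left".
nodetool decommission

# Dead node you can no longer log into: run from any OTHER node.
# The host ID comes from `nodetool status`.
nodetool removenode <host-id>

# Last resort for a node that lingers in gossip past the expiration:
# the unsafeAssassinateEndpoint JMX operation on the Gossiper bean
# (exposed as `nodetool assassinate <ip>` in newer Cassandra versions).
```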
Epidemic broadcast protocols provide a resilient and efficient mechanism for data dissemination. Basically, I tell you, you tell a friend, your friend tells another friend; at the end of the day it's just rumor mongering, sharing information all around in kind of a decentralized manner. Inside of Cassandra, we use it for peer discovery and metadata propagation. That way, everyone eventually comes to kind of an agreement, even though it's never quiescent, because that heartbeat is always updating.