From YouTube: C* Summit 2013: Cassandra Internals
Description
Speaker: Aaron Morton, Apache Cassandra Committer
Slides: http://www.slideshare.net/aaronmorton/apachecon-nafeb2013
My name is Aaron Morton, and today I'm going to talk about the Cassandra codebase. Obviously it's quite a complex codebase; it solves complicated problems, so we're not going to dive into it in great detail. What I'm going to do instead is pull out some of the more important classes and point to some of the places where the exceptions that you might get as a client happen, or where timeouts are raised. A little bit about me before we get started.
I'm a freelance Cassandra consultant, I'm based in Wellington down in New Zealand, and I'm a committer on the Apache project. We're going to start by talking about the architecture and then we're going to dive into the codebase. Architecture-wise, we can draw a pretty simple diagram: there are some clients, and they connect to APIs. Those APIs talk to some software components that we can call cluster aware. They in turn talk to software components that are cluster unaware, which read and write from disk.
When we put this stack together in a cluster, as you'd expect, the cluster-aware components talk to each other. The cluster-unaware components just continue to do their thing, reading and writing from disk, without worrying about the other nodes. But we can put some better labels on this diagram. Cassandra is a Dynamo-based system, so let's call our cluster-aware components the Dynamo layer, and let's call our cluster-unaware components a database layer.
Cassandra is just a fast log-structured storage engine with this Dynamo layer put on top of it.
The database components take things from memory and put them on disk, and take things from disk and put them in memory. The Dynamo layer collects that memory from the various machines and brings it together on the coordinator. That's where we take care of eventual consistency and all those guarantees; the result is made available to the API and it gets sent back to your client.
So we can expand our overview. Today we're going to talk about an API layer, a Dynamo layer and then a database layer, and we'll break the API layer up into two parts: transports and services.
We use thrift as a transport. If you do nothing, your API call starts in the CustomTThreadPoolServer; in these slides, when I name a class I'm talking about the Java package in the codebase, so this one is org.apache.cassandra.thrift.CustomTThreadPoolServer. This guy uses a thread pool that has one thread per client connection. That thread stays with the client connection for the life of the connection; at the end of it, it cleans up and then goes on to process another client.
You might also find in the codebase a CustomTNonblockingServer; that's an older, deprecated server. What you will find, though, is the CustomTHsHaServer. Whichever you choose, you set the server type using the rpc_server_type setting in the yaml configuration. The default, sync, is the CustomTThreadPoolServer; hsha chooses the CustomTHsHaServer. The difference is that hsha uses a thread pool where each thread processes one request from a client and then goes on to look for another request to process.
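The difference between the two threading models can be sketched outside of Cassandra. Here is a minimal Python illustration (not Cassandra code, which is Java): serve_sync pins one worker to each connection for its whole lifetime, while serve_hsha shares a small pool whose threads each take one request at a time.

```python
# Illustrative Python sketch of the two thrift threading models (not
# Cassandra code): "sync" pins a worker to a connection for its lifetime,
# "hsha" shares a small pool that handles one request at a time.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

def serve_sync(connections):
    # One worker per connection; it handles every request on that connection.
    results = []
    def handle_connection(requests):
        return ["done:" + r for r in requests]
    with ThreadPoolExecutor(max_workers=len(connections)) as pool:
        for batch in pool.map(handle_connection, connections):
            results.extend(batch)
    return results

def serve_hsha(connections, workers=2):
    # A small shared pool; each thread pulls one request at a time.
    queue = Queue()
    for requests in connections:
        for r in requests:
            queue.put(r)
    def drain(_):
        out = []
        while True:
            try:
                out.append("done:" + queue.get_nowait())
            except Empty:
                return out
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for out in pool.map(drain, range(workers)):
            results.extend(out)
    return results
```

Both produce the same results; hsha just needs far fewer threads when there are many mostly-idle connections.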
It's a bit more efficient in terms of the number of threads that you use. Now, that's thrift, and it's pretty simple, because a lot of that code is generated thrift code, and we don't want to talk about that today. The native binary transport is new in Cassandra 1.2; in fact, it's labeled beta. It uses Netty under the hood and it's disabled by default; you enable it using start_native_transport in the yaml. You can run both thrift and native binary: they use two different ports and they happily coexist.
You can dig into this guy a little bit further. Start in our Server class in the transport package; in the run function we build the execution handler. This is our thread pool: a thread processes one request from a client and goes on to find another request to process, from a different client perhaps. To do that it uses NIO server sockets under the covers, and it sets up this thing in Netty that we call the server pipeline.
This has a series of stages that your request goes through: frame encoding and decoding, frame compression and decompression, and message encoding and decoding, up to a stage called the dispatcher, which is where your request actually runs. Our dispatcher is the Dispatcher class inside the transport Message class. It's got a pretty simple messageReceived function on it. This gets called with a message: it gets our server connection and validates that we expect to receive this message for our current state. If we're already authenticated, we don't expect to be given credentials.
We take the request and we execute it. That gives us a response that we want to send back to the client. Before we do that, though, we get that server connection again and check to see if we want to make a state transition: do we now move this server connection into an authenticated mode, because we've actually got the credentials and everything is good?
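That validate-then-transition flow is a small state machine. Here is an illustrative Python sketch; the state and message names are assumptions, not the real ServerConnection API.

```python
# Illustrative sketch of the connection state checks; state and message
# names are assumptions, not the real ServerConnection API.
class ServerConnection:
    UNINITIALIZED, AUTHENTICATION, READY = "uninitialized", "authentication", "ready"
    EXPECTED = {
        UNINITIALIZED: {"STARTUP"},
        AUTHENTICATION: {"CREDENTIALS"},
        READY: {"QUERY", "PREPARE", "EXECUTE"},
    }

    def __init__(self):
        self.state = self.UNINITIALIZED

    def validate(self, msg_type):
        # Reject messages we do not expect in the current state.
        if msg_type not in self.EXPECTED[self.state]:
            raise ValueError("unexpected %s in state %s" % (msg_type, self.state))

    def transition(self, msg_type, credentials_ok=True):
        # Decide whether this message moves the connection to a new state.
        if self.state == self.UNINITIALIZED and msg_type == "STARTUP":
            self.state = self.AUTHENTICATION
        elif (self.state == self.AUTHENTICATION and msg_type == "CREDENTIALS"
              and credentials_ok):
            self.state = self.READY
```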
The messages in that native binary transport are described in a human-readable document that's in the source tree. You can go and have a look and get a really good idea of what this protocol is talking about. On to the services layer now. JMX: you should all be familiar with JMX, and you need to know how to use it to admin your servers. The management beans that are exposed are spread around the codebase. They all implement interfaces that end in MBean, so they're pretty easy to find. The interfaces themselves all have pretty good documentation on them, so you can go and look that up and find out what they're doing. They're implemented in a really sensible way as well.
There's an interface called StorageProxyMBean; it's implemented on the StorageProxy class and it's exposed in JMX under org.apache.cassandra as StorageProxy. Pretty easy to track those down. On to thrift at the service layer: this is our RPC interface, and it's implemented on the org.apache.cassandra.thrift.CassandraServer class.
It implements the entire thrift interface and does the sorts of things you would expect it to: it has access control and input validation, and it maps from thrift types to our internal types and back again, because we don't use thrift internally. The thrift interface itself is in an IDL (interface definition language) file. It's not a very human-readable document, but you can go and have a look; there are a few comments in there, and you can get a bit of an idea of all the functions that are available on the thrift interface.
Let's take a look at probably one of the more popular calls on that: get_slice. This is where we want to get some named columns, or a slice range of columns, from one particular row. This code example is from 2.0. The first thing it does is set up request tracing, if that's enabled on this request. Then it gets the client state; we have client state for both thrift and native binary connections.
On that client state we have the keyspace that you're working in, and we have your username if you've authenticated; we use that to do our access control. Because this is a pretty common function, we've got to go through a few abstractions here to complete our read. We go over to multigetSliceInternal. This is where we validate your thrift request (did you ask us to return minus one rows, something like that?), and then we create an internal representation of your request in an object called a ReadCommand.
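For illustration, the shape of those internal read representations might look like this. The class names SliceByNamesReadCommand and SliceFromReadCommand do appear in the codebase, but the fields shown here are a simplified guess, sketched in Python rather than Java.

```python
# Simplified sketches of the internal read representations. The class
# names follow the Cassandra codebase; these fields are an illustrative guess.
from dataclasses import dataclass
from typing import List

@dataclass
class SliceByNamesReadCommand:
    keyspace: str
    key: bytes
    column_family: str
    column_names: List[bytes]

@dataclass
class SliceFromReadCommand:
    keyspace: str
    key: bytes
    column_family: str
    start: bytes
    finish: bytes
    reversed: bool = False
    count: int = 100
```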
To get the data that it's going to return, it hands the ReadCommand over to the readColumnFamilies function, which takes the ReadCommand and returns our internal representation of its result, just called a ColumnFamily. It does that by handing the ReadCommand down to a class called the StorageProxy. We'll see this guy later; the StorageProxy lives in the Dynamo layer. This is the end of your API call.
That's if you came in through thrift. Over to CQL 3 now: the equivalent of the CassandraServer is pretty much the QueryProcessor. This guy prepares and executes CQL 3 statements, and it's used by both thrift and native binary. If you call execute_cql3_query, you come into the CassandraServer and jump over here for the execution. It has the access control and input validation that you'd expect, and it returns transport messages instead of thrift messages.
So when this is called from thrift, there's a little bit of extra work to thrift-ify that result. CQL 3 itself is described in an ANTLR grammar. Again, this is not a human-readable document (it's designed for the ANTLR compiler), but you can have a look in there and get a pretty good idea of the nature and the structure of CQL 3. The statements in that language are processed in a two-phase approach. First we create these things called parsed statements, which are created by the ANTLR compiler.
A parsed statement is essentially a representation of what you sent to the server. It has some extra things on there for tracking bound terms if you're using prepared statements, and it has a prepare function on it, which then creates something we can execute, called a CQLStatement. These guys have access control and input validation, because we need to check that you're allowed to delete or insert or read, and they have an execute function for when we want to run them.
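The two-phase parse-then-prepare flow can be sketched like this. These are hypothetical Python stand-ins for the Java classes; the names and method shapes are assumptions.

```python
# Hypothetical Python stand-ins for the two-phase flow: a parsed statement
# prepares itself into an executable statement with access control,
# validation and an execute function. Names and shapes are assumptions.
class ParsedSelect:
    def __init__(self, table, columns):
        self.table, self.columns = table, columns

    def prepare(self):
        # Phase two: turn the parsed form into something executable.
        return SelectStatement(self.table, self.columns)

class SelectStatement:
    def __init__(self, table, columns):
        self.table, self.columns = table, columns

    def check_access(self, user_permissions):
        if ("SELECT", self.table) not in user_permissions:
            raise PermissionError("no SELECT permission on " + self.table)

    def execute(self, consistency, variables=()):
        # A real implementation builds read commands; we just echo the plan.
        return {"table": self.table, "columns": self.columns,
                "consistency": consistency, "vars": tuple(variables)}
```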
If you have a look at the parameters there, you can get a feel for this: the consistency level is a parameter, and the query state and the variables that we want to run with (if this is a prepared statement) are all parameters too. Pass those guys in and away we go. Obviously the most popular and most complicated CQL statement is the SELECT statement.
It has an inner class which implements the parsed statement, with a lot of logic in there to understand what you're actually saying in the select clause, and it creates a SelectStatement for us which, in its execute implementation, creates a ReadCommand. These are the same objects, the same classes, that the CassandraServer uses, and it passes these down to the StorageProxy, and this is the same StorageProxy that the CassandraServer uses. So your API call comes into Cassandra, through either the CassandraServer or the QueryProcessor, and ends up at the StorageProxy in the Dynamo layer.
The locator package contains the partitioner, the replication strategy and the snitch, and the stream package contains the logic that we have for streaming files between nodes during repair and bootstrap.
Let's dig into the service package, though. StorageProxy: we saw this guy before. This guy's the workhorse of the Dynamo layer. It's the entry point for all of our cluster-wide storage operations; reading and writing all go through here. It does the sorts of things you'd expect: it selects endpoints.
It checks that consistency-level nodes are available, and it sends messages to stages; these are the stages in the staged event-driven architecture that Cassandra uses, and we'll see more about those shortly. It sits around and waits for a response in a very efficient way, it handles hints if we got a timeout or something like that, and then it returns the result back to the API layer so we can go back to your client. There's another important class in here.
That's the StorageService. This guy's a wrapper for a lot of functionality that's implemented elsewhere: everything to do with tracking the ring state, starting and stopping ring membership, and mapping between nodes and tokens and tokens and nodes, so that we know how to find your replicas for reads and writes. So the StorageProxy comes along and uses the StorageService to work things out.
Say we're going to do a read, and we're going to send the request off to three other nodes to do this read. Hopefully we'll get back a bunch of results, two or three of them, and we need a way to resolve any differences. That's handled by objects that implement the IResponseResolver interface. It's a pretty simple interface: we call preprocess when we get a message from another node, and once we've got enough of those, we call resolve.
Resolve will either give us the response we want to send back to the client or throw an exception. The RowDigestResolver is used the first time we run the read. This is where we ask one node to give us the data that we want to return to the user, and the other nodes to give us a digest of the data that they would send. In the resolve implementation it creates a digest of the data response and compares that to the other digests. If they don't match, it throws a DigestMismatchException.
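The digest-read idea is easy to sketch. Here is a hedged Python illustration, assuming rows are simple column maps and using MD5 as a stand-in for the real digest.

```python
# Hedged sketch of a digest read: one replica returns the row, the others
# return a hash of it; the resolver compares. Rows here are simple
# {column: (value, timestamp)} maps and MD5 stands in for the real digest.
import hashlib

class DigestMismatchException(Exception):
    pass

def digest(row):
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def resolve_digest_read(data_response, digest_responses):
    expected = digest(data_response)
    if any(d != expected for d in digest_responses):
        # This triggers the second, full-data round of the read.
        raise DigestMismatchException("replica digests disagree")
    return data_response
```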
If they do match, it returns the data and we send that back to you. If they don't match, the StorageProxy will run the read a second time, and this time it asks all of the nodes that are involved to return their data, so it uses the RowDataResolver. This guy is a little bit more complicated. It works out the final result that we would send back to the client, using the timestamps to get the correct value, and then it works out the deltas between all of the replicas that participated in this request.
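That timestamp reconciliation and delta calculation could be sketched like so, assuming each replica response is a map of column name to (value, timestamp).

```python
# Sketch of last-write-wins reconciliation plus read-repair deltas,
# assuming each replica response is a {column: (value, timestamp)} map.
def reconcile(replica_rows):
    # Keep the value with the highest timestamp for each column.
    merged = {}
    for row in replica_rows:
        for col, (value, ts) in row.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    return merged

def repair_mutations(replica_rows, merged):
    # For each replica, the columns it is missing or holds stale.
    return [
        {col: vt for col, vt in merged.items() if row.get(col) != vt}
        for row in replica_rows
    ]
```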
It creates mutations to fix those deltas. It sends them one-way to the remote replicas (it doesn't wait for a response), and it returns that correct result that it calculated; we hand that back to the client without waiting for the remote replicas to fix their inconsistency. If you're doing a range scan, we have to use the RangeSliceResponseResolver, which is a bit more complicated again. In a range scan your replicas can return different numbers of rows, so this guy has to handle the fact that one node could return five rows and another node could return ten.
That's great on the read path; obviously we don't use that on the write path. We do, however, have a commonality between the two. It's called the response handler, or a callback. This is an object that implements the IAsyncCallback interface, which comes from the net package, and it's a way for the messaging package to deliver a message to an object that understands your request. This object is in charge of making sure that your request gets completed. It's a simple interface: we call response and we hand over the message.
It's implemented in a class called ReadCallback, and there are equivalents for writes, as you'd expect. The get function on here is handy: we call this when we want to get the response for your request. It waits on a condition, and if that condition is not set before the RPC timeout occurs, we finish waiting and we raise an exception that says read timeout. This read timeout gets turned into a TimedOutException that gets thrown back to you, the client.
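That blocking get can be sketched with a condition variable. This is an illustrative Python version; the real ReadCallback is Java and uses its own condition class.

```python
# Illustrative Python version of the blocking get(): wait on a condition
# until block_for responses arrive, or raise a read timeout (the real
# Java class signals a TimedOutException back to the client).
import threading

class ReadTimeout(Exception):
    pass

class ReadCallback:
    def __init__(self, block_for):
        self.block_for = block_for
        self.responses = []
        self.condition = threading.Condition()

    def response(self, message):
        # Called by the messaging layer; sets the condition when we have enough.
        with self.condition:
            self.responses.append(message)
            if len(self.responses) >= self.block_for:
                self.condition.notify_all()

    def get(self, rpc_timeout=0.05):
        # Called by the coordinator; blocks until enough responses or timeout.
        with self.condition:
            self.condition.wait_for(
                lambda: len(self.responses) >= self.block_for, timeout=rpc_timeout)
            if len(self.responses) < self.block_for:
                raise ReadTimeout("fewer than %d replicas responded" % self.block_for)
            return list(self.responses)
```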
The condition is set once we've received enough responses; that happens in our implementation of response on the IAsyncCallback interface. So what we're waiting for is consistency-level nodes to come back. If you're doing a QUORUM read at RF 3, we're waiting for two nodes to respond, and for the data response from the one node that we asked to give us data. When we get those, the condition is set, we pass through, and then we use the resolver to get the result we want to send back to the client.
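The "how many responses do we block for" arithmetic is simple; here is a sketch covering the three common consistency levels.

```python
# Sketch of how many replica responses a read blocks for, for the common
# consistency levels; replication_factor is the RF of the keyspace.
def block_for(consistency, replication_factor):
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return replication_factor // 2 + 1
    if consistency == "ALL":
        return replication_factor
    raise ValueError("unknown consistency level: " + consistency)
```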
This all comes together in the StorageProxy in a function called fetchRows. This guy gets a list of ReadCommands and iterates through them a couple of times. In the first pass it goes through and, for each one, it creates a list of sorted endpoints that we want to send this read to. This takes into consideration proximity and the consistency level.
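A sketch of that first pass, assuming the snitch gives each endpoint a proximity score (lower meaning closer); the names here are illustrative, not the fetchRows internals.

```python
# Sketch of the first pass: sort live replicas by snitch proximity (lower
# score means closer) and keep as many as the consistency level needs.
class UnavailableError(Exception):
    pass

def choose_endpoints(live_endpoints, proximity, block_for):
    ranked = sorted(live_endpoints, key=lambda e: proximity[e])
    if len(ranked) < block_for:
        raise UnavailableError("not enough live replicas for this read")
    return ranked[:block_for]
```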
It constructs a RowDigestResolver, because this is the first time we're doing this read, and it uses that and some information about the consistency level and the endpoints to construct the ReadCallback. It uses the ReadCommand to create messages, so we can send those ReadCommands to the other replicas, and it passes those messages and the ReadCallback to the MessagingService.
That goes through the sendRR function (send request-response). The MessagingService is now going to send those messages to the remote replicas, and when it gets a response it's going to deliver it into our ReadCallback object; we saw that with the IAsyncCallback interface and the implementation of get. After we've done this for all of these ReadCommands, we iterate through all the ReadCallbacks that we created and we call get. Remember that get is blocking, and it can throw an exception.
A DigestMismatchException means we run the read again, and a read timeout means we have to throw a TimedOutException back to you. Staying in the Dynamo layer, the net package has both a protocol and a transport, and we're just going to talk about the protocol. It talks about verbs: request and response verbs. The MessagingService doesn't know how to process the verbs, though; it delivers them to objects that implement IVerbHandler.
Just having an object isn't enough; we need a thread, so we map those message verbs onto stages, onto the thread-pool stages. This comes together in the MessagingService receive function. We take our message and wrap it in a MessageDeliveryTask; this is just a Runnable. We take the message type and we use that to get the thread pool we want this Runnable to go on, we drop the Runnable into the thread pool, and we go back and look for the next set of bytes to come off the socket.
This is running on the thread that's just reading from a socket for one particular node. In the MessageDeliveryTask we look at the time that this object was constructed, and if that was longer ago than the RPC timeout, and it's a droppable message type, we drop the message. So when you see dropped messages in the logs, they come from here.
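The drop-if-stale check can be sketched like this; it's illustrative Python, and the verb names and timeout value are assumptions.

```python
# Illustrative sketch of the delivery step: wrap the message in a task,
# drop stale droppable messages, otherwise hand off to the verb handler.
# The verb names and the timeout value are assumptions.
import time

DROPPABLE_VERBS = {"READ", "MUTATION"}
RPC_TIMEOUT = 0.05  # seconds

class MessageDeliveryTask:
    def __init__(self, verb, payload, handler):
        self.verb, self.payload, self.handler = verb, payload, handler
        self.created = time.monotonic()

    def run(self, now=None):
        now = time.monotonic() if now is None else now
        if self.verb in DROPPABLE_VERBS and now - self.created > RPC_TIMEOUT:
            return "dropped"  # these show up as dropped messages in the logs
        return self.handler(self.payload)
```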
If everything works, we get the IVerbHandler object, we call it, and we pass over the message; we're now processing your read in the read stage on the remote node. Quickly now, on to the database layer. First, the concurrent package: it just maps stage names to thread pool executors. These are the things you see in nodetool tpstats. In the database package, the thing that you think of as your keyspace is implemented in an object called Table. We have one instance of these for each keyspace, and we call open to get that instance.
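The open() pattern is a per-keyspace instance cache; a minimal sketch (Python stand-in for the Java class, with illustrative fields):

```python
# Minimal sketch of the open() pattern: one cached instance per keyspace.
class Table:
    _instances = {}

    def __init__(self, name):
        self.name = name
        self.column_family_stores = {}

    @classmethod
    def open(cls, name):
        if name not in cls._instances:
            cls._instances[name] = cls(name)
        return cls._instances[name]
```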
The Table has a function called getColumnFamilyStore, because the thing that you think of as your column family, or table, is implemented in a class called ColumnFamilyStore. This has the top-level functions that we use for doing reads and writes: getRow and apply. Over in our ColumnFamilyStore we have a top-level function to do reads, getColumnFamily, and we have an apply function. I'll skip down here, because I'm running a little late: we have ISortedColumns.
This is the guy that contains all your column objects. We have three implementations. ArrayBackedSortedColumns is thread-unsafe and used on the read path. AtomicSortedColumns has isolated updates and is used by the memtable. And TreeMapBackedSortedColumns is a thread-safe structure that we use in multiple places. I have to skip forward a little bit.
So, in summary: we come in through the API in a couple of different ways, but we always come down to creating ReadCommands or mutation commands and dropping those into the StorageProxy in the Dynamo layer. That uses response resolvers, async callback objects and the MessagingService to transport the ReadCommand into the correct thread pool, on the current node or on a remote node. From there it goes through the Table and the ColumnFamilyStore, and down to the SSTables and filters, which we didn't look at, which is where all the reads happen.