There is the Spark Cassandra Connector, which is an open source product available for everyone to download, and I'd like to take my slot here to talk a little bit about data locality and how we, in the Spark Cassandra Connector, achieve data locality with Cassandra. I'm going to start out with a little story, and it's one you might have heard before. It involves this man here, Lex Luthor, who happened to own a large amount of property in an undesirable location. And that's the lesson we need to learn from this.
Luckily for us, there is a different kind of paradigm we can use with Spark. Spark is really our hero that lets us get away from this old pattern of trying to move everything all at once and then do work on it. It's a lot more like having superheroes, and we have superheroes in different cities, and each of these superheroes can do their work locally. The thing is, when we have Superman doing work in Metropolis and Batman doing work in Gotham City, we save all that time of the two going back and forth and doing all this wasted effort. The same thing happens with Spark, because with Spark we can end up saying the only data you're going to work on is the local data on the machine that the executor is running on. So it ends up looking a little like this, where our Spark executors are our superheroes and our nodes are our cities.
This ends up saving us a lot of time, because when we're talking to a database like Cassandra, talking to the local machine is going to cut down a lot of network traffic on that initial load, which is so important for getting data into Spark for processing.
So now that I've gone over this little story, let me just say that this is all automatically implemented inside the Spark Cassandra Connector, which is now compatible with Spark 1.3. And actually, between the time we submitted these slides and today, we've just had our milestone release for Spark 1.4 as well.
So if you'd like to try that out, it's ready right now. We support all of the Cassandra data types, including user-defined types; we can save to Cassandra; we have intelligent write batching; we support collections and all kinds of other great features; and, of course, it's open source, free for anyone to use.
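
To give a feel for the save path, here's a minimal sketch, assuming a hypothetical keyspace "ks" and table "users" with matching columns:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal save sketch; keyspace, table, and column names are hypothetical.
    val sc = new SparkContext(new SparkConf()
      .setAppName("connector-sketches")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    val rows = sc.parallelize(Seq(("Yosik", 830), ("Bruce", 412)))
    rows.saveToCassandra("ks", "users", SomeColumns("name", "score"))

The connector batches those writes intelligently behind the scenes, which is the write-batching feature mentioned above.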
Now let's go into a little bit about how the Spark Cassandra Connector actually reads data node-local. To talk about that, I first have to go over a quick lesson on how Cassandra actually stores data. Cassandra basically stores all data using this thing called the token range, and the token range can be thought of as a long series of numbers. These numbers are then divvied up amongst the nodes in the Cassandra cluster. So say I had three nodes; to keep with my theme here, we have Metropolis, Gotham, and Coast City.
What this ends up meaning is that Metropolis is responsible for all of the tokens between, in this particular example, 750 and 99 (wrapping around the ring), Gotham is responsible for 350 to 750, and Coast City is responsible for the remaining part. Of course, in reality our partitioner is going to go from a very, very small number to a very, very large number, but this just makes it a little bit easier to illustrate the point. Once we have this assignment of primary responsibility for each of these tokens, we can start saying where data actually belongs in the cluster.
So when we apply that, for example, to this row, if we say that the name part here is actually the partition key, we get out a single value, and that single value will tell us, based on the ownership of that range, where that piece of data belongs. In this case we have "Yosik" hashing to 830, which means that this particular row belongs on the Metropolis node. So based on the content of the row, we know exactly what node it should live on.
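
To make that concrete, here is a toy sketch of the hash-then-route idea. This is illustrative only: real Cassandra hashes with the Murmur3Partitioner over a vastly larger range, and the 0-999 ring and city nodes just mirror the example numbers:

    // Toy model of token routing, not the connector's real code.
    object TokenRingSketch {
      val ringSize = 1000 // hypothetical 0-999 ring from the slides

      // Stand-in hash; real Cassandra uses Murmur3 over the serialized key.
      def token(partitionKey: String): Int =
        (partitionKey.hashCode & Int.MaxValue) % ringSize

      // Ownership from the example; Metropolis wraps around from 750 to 99.
      def owner(t: Int): String =
        if (t >= 350 && t < 750) "Gotham"
        else if (t >= 99 && t < 350) "Coast City"
        else "Metropolis" // covers 750..999 and 0..98

      def main(args: Array[String]): Unit =
        println(owner(830)) // prints Metropolis, matching the example row
    }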
Now, this changes slightly if you have vnodes enabled in Cassandra. There, the picture looks a little bit like this, where instead of the ranges being one large contiguous piece each, we actually have them broken up into smaller chunks. And what I'd like to illustrate here is that just because we change who owns which ranges doesn't mean we change what the hashed value of this particular row is, but it does change which node it might belong on.
So with that, let's talk a little bit about loading huge amounts of data. To the Cassandra connector, that's basically a full table scan, and a full table scan means we need that entire token range turned into a Spark RDD.
The way the Spark Cassandra Connector does that is, it starts when you call one of these two functions: either through the DataFrames API or through the connector's cassandraTable method. It will make a listing of all of the token ranges in the cluster and know where those particular ranges are hosted.
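
As a sketch, those two entry points look like this, reusing the SparkContext sc and the hypothetical keyspace and table from the earlier snippet (the DataFrame form is the Spark 1.3/1.4-era syntax):

    import com.datastax.spark.connector._ // enables sc.cassandraTable
    import org.apache.spark.sql.SQLContext

    // Entry point 1: the RDD API, a full table scan as a CassandraRDD.
    val rdd = sc.cassandraTable("ks", "users")

    // Entry point 2: the DataFrames API.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "users"))
      .load()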
So it knows where all of the replicas are for each particular section of the entire token range. Then it takes these token ranges and starts building them into Spark partitions, where the metadata for the partition says that this particular partition is for this particular piece of this particular token range. So, for example, we might break up these purple ranges into two Spark partitions.
Now, the way we break up these ranges into Spark partitions is based on spark.cassandra.input.split.size. This basically says how many live tokens we would like there to be within a single Spark partition, and I'd like to note that this is an estimate: we aren't actually able to do the true count without actually reading all the data, but this is basically as close as we can get.
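
That knob is an ordinary configuration property; here's a sketch, with a placeholder value you'd want to benchmark rather than copy:

    import org.apache.spark.SparkConf

    // Approximate number of Cassandra partition keys ("live tokens")
    // the connector targets per Spark partition (1.x-era property name).
    val conf = new SparkConf()
      .set("spark.cassandra.input.split.size", "100000")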
So we'll continue doing this for each node, and since we know each token range's replicas, we're able to consistently make Spark partitions that map to only a single Cassandra node. At this point we can also assign a piece of metadata: since we know which node we built each Spark partition for, we can assign a preferred location for that particular Spark partition, which is a Cassandra node.
So when we go about actually executing queries, if we take a look at one of our executors, oh sorry, if we take a look at the Spark driver, we'll see that the Spark driver is going to see the Spark partition, know that all the token ranges in that Spark partition belong on a certain machine, and use that to assign the task to that particular node. Note that this is controlled by the spark.locality.wait parameter.
So if you do go past spark.locality.wait seconds, the task will end up getting sent to a different node. If you really want to ensure data locality when getting data from a Cassandra node, make sure this is either very large or set to infinite.
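
In configuration terms that looks something like this sketch; the value is an assumption, anything longer than your worst scheduling delay works:

    import org.apache.spark.SparkConf

    // Give the scheduler a long grace period before it gives up on the
    // node-local preference and sends the task to a different node.
    val conf = new SparkConf()
      .set("spark.locality.wait", "3600s")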
So once the task has been assigned, we know that since this is a Metropolis partition, it needs to be done by Superman.
What will happen is our token range gets split into this query, and the query gets executed. As it's executed, spark.cassandra.input.page.row.size decides how many rows will be paged out of Cassandra at a time. So this is basically how you control how quickly data is pulled from Cassandra while you're running these queries.
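
That's again a single property; a sketch with an assumed value:

    import org.apache.spark.SparkConf

    // How many rows each fetch pages out of Cassandra during a scan.
    val conf = new SparkConf()
      .set("spark.cassandra.input.page.row.size", "1000")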
So again, another cool thing about having CQL here is that any pushdown you can do in a CQL query, any queries on secondary indexes or clustering keys, can be done here as well. We can actually use the interface that the connector provides to push down clauses to Cassandra, to minimize the amount of data we're actually pulling out of our Cassandra tables.
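
Here's a sketch of that pushdown on the RDD API; the table and column names are hypothetical, with time as a clustering key:

    import com.datastax.spark.connector._

    // Both the projection and the where clause are sent to Cassandra as CQL,
    // so only the matching slice of each partition crosses the network.
    val recent = sc.cassandraTable("ks", "events")
      .select("name", "time", "value")
      .where("time > ?", "2015-06-01 00:00:00")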
So that's basically how we would go about loading a large amount of data. But let's say instead we have just a selection of partition keys. In the case that you have an RDD full of partition keys that you want to extract out of a Cassandra table, we also have a different method. So this is about loading sizable but well-defined amounts of data.
The one thing to notice in this is that the RDD you might give us to pull out of Cassandra may not have all the data arranged in such a way that the partition keys are all mapped to the same node. That gets us to a little problem, because we're back on that unfortunate side of things where we no longer have data locality in our actual execution. But we've also provided a way to get around this, with a function called repartitionByCassandraReplica. This lets us take that RDD
that is partitioned in some way that is not congruent with the underlying partitioning of the Cassandra table we want to use, and basically it will perform a shuffle that places the data so that we get these node-local partitions again. So when we have node-local Spark partitions with an assigned idea of which node they belong to, we can then do our joinWithCassandraTable and have each Spark partition mapped to a single node that is also running Cassandra.
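
Put together, that flow looks roughly like this sketch; the keyspace, table, and key case class are hypothetical stand-ins:

    import com.datastax.spark.connector._

    // Hypothetical type matching the table's partition key column.
    case class UserKey(name: String)

    val keys = sc.parallelize(Seq(UserKey("Yosik"), UserKey("Bruce")))

    val rows = keys
      .repartitionByCassandraReplica("ks", "users") // shuffle each key to an executor on a replica node
      .joinWithCassandraTable("ks", "users")        // then do node-local lookups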
The way that this actually executes under the hood is pretty similar. We end up building CQL statements that look like "select something from your table where the primary key equals" the primary key in the data that's being pulled out of the RDD. The same spark.cassandra.input.page.row.size controls how much data is pulled at a time, and the same pushdowns and all of that apply as well.
So I want to take a quick second to talk about how this works in DataStax Enterprise. That's our commercial offering, and one of the cool things about it is that it has Apache Solr and Apache Spark together. The thing is, if you run both at the same time, one of the cool things you can do is combine your Spark queries with a Solr lookup.
So what we can end up doing is pushing down a Solr query into that same syntax, and what we'll end up doing is hitting a Lucene index locally on that same machine before hitting the same machine for the Cassandra data. So we can end up doing this cool combination of Lucene indexing and Cassandra lookup and Spark, all at the same time, while maintaining data locality.
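
A sketch of what that can look like in DSE; the table and query string are hypothetical, and it assumes DSE Search is enabled on those nodes:

    import com.datastax.spark.connector._

    // DSE exposes the search index through the special solr_query column,
    // so the Lucene lookup runs on the same node as the Cassandra read.
    val hits = sc.cassandraTable("ks", "articles")
      .where("solr_query = ?", "body:locality")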
So I know that was a lot of information to go over, and I don't have a very long slot, so I just wanted to quickly point out that you can learn a lot more about this online at academy.datastax.com. We have tons of videos about how the connector works and how our integration works. And of course, Cassandra Summit is coming up in September, and if you enjoyed being here, I'm sure you will enjoy being there as well. I think I have time for one or two questions.
So, for example, if you have a whole lot of partitions based on name, and you also have events that are clustered based on time, then if you give us a whole lot of names and a specific time that you want to be greater than, we can efficiently grab all of those slices. But it's not as efficient for taking something that isn't on a clustering key.
So, last question. Yeah, so you can set the consistency level, and basically we apply it to the entire RDD action. So you can choose whatever you want.
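
For reference, the consistency level is also just a pair of connector properties; the levels below are illustrative assumptions:

    import org.apache.spark.SparkConf

    // Applied across the whole read or write action, as described above.
    val conf = new SparkConf()
      .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")
      .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")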