Apache Cassandra Meet Up Presentations, 21 Oct 2010

From YouTube: Cassandra and Lucene

Description

Presented by Jake Luciani of Riptano. Slides: http://bit.ly/9Rbuyp

The talk covers:

use cases for search and type of search applications
problems scaling and maintaining Lucene/Solr
Cassandra
Lucandra (Lucene + Cassandra)

A

I'm @tjake on Twitter, if you guys use Twitter. I work for Riptano, which is the Apache Cassandra company, and Lucandra is the project this talk is about.

A

Right, so what I want to cover today is the different use cases for search.

A

I want to talk about how things are currently scaled in Lucene and Solr, and I want to give an introduction to Cassandra and then bring that into the Lucene stuff.

B

A

This was probably said already, but stop me if you have any questions. So what kind of search apps do we need to support today? Lucene is really great. It's been around for 10 years and it's highly optimized for lots of use cases, but they basically fall along two dimensions. One is index size: you can have a really small index, a couple hundred or a thousand documents, up to Wikipedia or anything really large, with millions or billions of documents.

A

At the same time you have this other dimension, which is index freshness: how real-time does it need to be? Wikipedia can update once a day or once a week or whatever, but an application like Twitter needs to stay as close to real time as possible, so you need to update really quickly. Actually, at Lucene Revolution (I didn't go to it, but I heard) Twitter said they're now using Lucene, a custom build, to do their searching, so this is a little out of date. There are slides on that floating around on Twitter. Yeah.

A

I think Twitter's search problem is actually really straightforward. They don't use any faceting, and everything is ordered by time. They do put relevant tweets at the top, but that doesn't necessarily need to come from Solr. So I think doing 12,000 searches a day, given their use case of 140 characters, is actually doable with Lucene, given how they've boxed in the problem. Not out of the box, but with their modifications.

A

But when you need a real-time system that sorts on multiple dimensions and does different kinds of scoring, it gets tricky, and it doesn't really handle lots and lots of indexes. In Lucene and Solr an index is a physical directory.

A

So if you've got a million users, you've got to manage a million directories, which is not very fun, and that's what I'll go into here. These are some of the problems with Lucene and Solr today, and they're all things that they're working on as well, so this isn't a harsh critique, because I know they're all being addressed. But this is where I thought there could be some benefits from combining with Cassandra.

A

So writes are expensive on a live system, because you have to reopen your index every time to see the new writes.

A

And as you index more and more documents, you need to merge your index files together and then reopen them.

A

If you get too many index files, you run out of file handles and get "too many open files" errors. Everyone who has used Lucene, at least in the past couple of years, has had this problem. To fix that you have to re-optimize, which takes a really long time on a really large index.

A

You have to have a lot of memory available for sorting, because in order to sort, Lucene has to have all of the values you want to sort on in memory.

A

So "too many open files" is a good example. Then there's replication.

A

Solr supports replication, but it all has to be maintained manually through the files. They're working on a way to address that using ZooKeeper,

A

but that's still in the early stages. And scaling writes is hard in Lucene, because you can only write to an index through a single writer.

B

A

And then, on the operational side of things, you have to manage your backups, you have to do your monitoring, you have to handle your failures. You need a big ops team just to maintain your search engine. And I'll throw in a little jab at MySQL, because you have a lot of the same operational problems with MySQL.

A

So, Cassandra. What is Cassandra? It's a combination of two NoSQL ideas. There's the Bigtable paper from Google, which is what their Bigtable system and HBase are built on, and then there's the Dynamo paper from Amazon. They describe two separate systems, and Cassandra is sort of a superset that combines the best of those features and gives you this new kind of distributed database. The key thing about it is that it's peer-to-peer. It works sort of like, if you remember, Kazaa or, what's it called?

A

Yeah, Gnutella: those systems where it's basically a distributed hash table and you can just join the cluster. You connect to a seed and you join the cluster. Cassandra works exactly the same way. It's configurable.

A

In terms of Eric Brewer's CAP theorem, you can pick and choose which dimensions you want to adhere to for every read or write, which makes it really interesting. And it's not just a key-value store. For any given key, you can have something like a TreeMap in Java.

A

You have this sortable tree of data that you can keep under a given key, which is what makes it interesting in terms of how to model your data. So there are a lot of data modeling pieces you can think about when you're building something on Cassandra.
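The "sorted map under each key" idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Row` class and its `insert` and `slice` methods are made-up names for illustration, not Cassandra's API.

```python
import bisect

class Row:
    """Toy model of one row: column names are kept in sorted order."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)   # keep names ordered on insert
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish, in order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]

row = Row()
for ts, text in [("2010-10-21T10:05", "b"),
                 ("2010-10-21T09:00", "a"),
                 ("2010-10-22T08:00", "c")]:
    row.insert(ts, text)

# Because columns are sorted, a range query is just a slice.
day = row.slice("2010-10-21", "2010-10-21T23:59")
```

Keeping the columns sorted is what turns "everything for this key on this day" into a cheap range read rather than a scan.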

A

The pluggable replication and sorting is really cool too, because you can write your own type of sort. If you're indexing data and it's a timestamp, those comparators come built in. But let's say you want to index an object and sort it a certain way: if you write into different column families, you can use different ways to sort. The same thing goes for replication.

A

It comes with different types of partitioning and replication strategies, and you can write your own and just drop it in as a class.

C

B

D

A

Reads are very low latency. It integrates really tightly with Hadoop. You don't actually need... And yes?

C

What do you mean by writes are very fast? Is it synchronous or asynchronous? Do you have to write on all nodes, and what do I get back when I say write? It can mean several things.

A

Well, the write path is very low friction. There's not a lot going on: it's basically just appending to a log file, so your only lag is really the time it takes to write to the disk. And you don't have to sync on every write; you can configure that too. I'll walk through the specifics of a write and what I mean by that in a moment. It also integrates really well with Hadoop.

A

You don't actually need to run HDFS as a separate cluster and load all your data into it. You can actually run your MapReduce jobs against Cassandra itself and write the results back to it.

B

C

A

There's a lot of adoption and lots of development. Like I said, there are a lot of big companies that use it, from Facebook to Twitter and all these guys, and Riptano, which is the commercial company, offers commercial support and training.

A

Actually, let me skip this and go into the writes and all this stuff. So, when you do a write in Cassandra, let's say I'm writing two keys. Let's not talk about the internal data model yet; just think of it as storing keys around a ring, so it's a distributed hash table. In this case I'm just using letters, from A to Z. You can see the first node up here is A to C.

A

The next node goes from D to F, the next node goes from G to I, and so on. I'm showing two different partitioners here.

A

There's the ordered one, for when you want to do an ordered row scan, and the random one, for when you want to write randomly into the ring so your data is distributed evenly. You can write to any node and it'll proxy the write to the appropriate node, depending on what you set your write consistency to. If you want to write to all replicas or to a quorum of replicas, you can specify that.
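To make the difference between the two partitioners concrete, here is a minimal Python sketch. It assumes nine hypothetical nodes that each own three letters (a-c, d-f, g-i, and so on), and it stands in for the random partitioner with a plain MD5 hash taken modulo the node count. A real Cassandra ring assigns token ranges to nodes rather than using a modulo, so treat this as an illustration only.

```python
import bisect
import hashlib
import string

# Nine hypothetical nodes, each owning three letters: a-c, d-f, g-i, ... y-z.
LETTERS = string.ascii_lowercase
RANGES = [LETTERS[i:i + 3] for i in range(0, 26, 3)]   # ['abc', 'def', ..., 'yz']
BOUNDARIES = [r[0] for r in RANGES[1:]]                 # ['d', 'g', ..., 'y']

def ordered_node(key):
    """Order-preserving placement: keys that sort together live together."""
    return bisect.bisect_right(BOUNDARIES, key[0].lower())

def random_node(key, n_nodes=len(RANGES)):
    """Hash-based placement: spreads keys evenly but loses key order."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes

keys = ["apple", "dog", "hat", "zebra"]
ordered = [ordered_node(k) for k in keys]   # node index grows with the key
hashed = [random_node(k) for k in keys]     # node index is unrelated to order
```

With the ordered placement a range of keys maps to a contiguous run of nodes, which is what makes an ordered row scan possible; the hashed placement trades that away for an even spread.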

A

So you can make those writes as fast as you want. Here's what a write is in Cassandra: it keeps what's called a memtable, which is an internal sorted set of the writes that come in. Every write that comes in, every insert, gets sorted into the current memtable and written to the log on disk, and once that memtable gets to a certain size, it writes it out to disk as an SSTable.
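The write path just described can be sketched as a toy Python class. The name `ToyStore` and its fields are assumptions made for illustration; the real implementation keeps the memtable sorted as writes arrive and writes real files, whereas this sketch sorts only at flush time and keeps everything in lists.

```python
class ToyStore:
    """Commit log + memtable + immutable sorted tables, greatly simplified."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []    # append-only record of every write
        self.memtable = {}      # recent writes held in memory
        self.sstables = []      # each flush produces one immutable sorted list
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # sequential append, no seek
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out in key order and start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

store = ToyStore(flush_threshold=2)
for k, v in [("b", 1), ("a", 2), ("c", 3)]:
    store.write(k, v)
```

The only work on the hot path is an append and an in-memory update, which is why the write cost stays close to the cost of a sequential disk write.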

D

C

A

On the read side there are three pieces (I can go through the write side again if you want). For every SSTable there's a Bloom filter, there's an index file, and then there's the actual data.

A

So when you're doing a read, first it checks whether the row is in its row cache. If it's not, it checks the Bloom filters to find which SSTable the key is in, then it uses the index file to find the offset within that SSTable, and then it reads the actual data.
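Here is a small Python sketch of that read sequence: row cache, then Bloom filter, then index, then data. The `BloomFilter`, `SSTable`, and `read` names are hypothetical, the filter uses SHA-1 only because it is in the standard library, and sorted in-memory lists stand in for the index and data files.

```python
import bisect
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    def __init__(self, rows):
        self.keys = sorted(rows)                   # stands in for the index file
        self.data = [rows[k] for k in self.keys]   # stands in for the data file
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)     # index lookup gives the offset
        if i < len(self.keys) and self.keys[i] == key:
            return self.data[i]
        return None

def read(key, row_cache, sstables):
    if key in row_cache:                           # 1. row cache
        return row_cache[key]
    for table in reversed(sstables):               # newest table first
        if table.bloom.might_contain(key):         # 2. Bloom filter
            value = table.get(key)                 # 3. index, 4. data
            if value is not None:
                return value
    return None

tables = [SSTable({"a": 1, "b": 2}), SSTable({"b": 3})]
```

The Bloom filter lets a read skip most tables without touching their index or data at all, at the price of an occasional wasted lookup on a false positive.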

A

That's how reads and writes work. If I'm going too fast, tell me; I'm just focused on time. So these are my keys. If I write, say, "live" as a key and I write it to this random node, it'll get written to the appropriate spot. With the random partitioner,

A

it knows which key belongs on which part of the ring, and the node will forward the write there.

A

One of the cool things is that as you add nodes and remove nodes from the system, Cassandra will manage that for you. It will redistribute the data appropriately to keep up with the replication factor you set, as well as keep track of what data should be on which node.

A

It uses an approach called gossip, which is described in the Dynamo paper. When you want to join the cluster, you specify a seed node, and any node can be a seed. You point the new node at the seeds, and then that seed will tell a couple of the other guys in the ring, "Hey, this new guy has joined," and they tell a couple of other guys, who tell others.

A

Eventually you end up with the ring being in sync. If you want to remove a node from the environment, you can do that as well. You can also move tokens for any given node. Let's say you have a hotspot on certain data and you want to increase the replication factor, or you want to move some nodes around because your tokens aren't evenly distributed.

A

You can do that as well. So it manages all the cluster problems for you, and I think one of the really cool things is being able to scale the system down. We always talk about scaling up, but especially in a world where you can buy on-demand hardware on EC2, it's really nice to be able to take nodes out of the cluster.

D

So replication.

C

A

This is another configurable thing. One of the powerful things about Cassandra is that you can write across data centers. There's something called the endpoint snitch, which tells Cassandra where one rack ends and the next begins. You can write your own snitch; there are ones for EC2 that put replicas in different regions,

A

or if you have your own racks, you can define the layout in a YAML file. What it's really for is replication.

A

If you're doing a read within a certain data center, Cassandra will make sure there are replicas in your data center and will do its read from that node. As for writes, with the rack-unaware strategy it just writes to the nodes next to it on the ring.

A

If you have a replication factor of three, it'll write to three consecutive nodes on the ring, and with the rack-aware strategy it'll make sure it puts at least one or two copies in the other data center.
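The two placement strategies can be illustrated with a short Python sketch. This is not Cassandra's actual algorithm: the function names, the six-node ring, and the `dc_of` mapping are all assumptions chosen to show the idea of "consecutive nodes" versus "force a copy into another data center".

```python
def rack_unaware(ring, primary, rf):
    """Primary node plus the next rf-1 nodes walking around the ring."""
    start = ring.index(primary)
    return [ring[(start + i) % len(ring)] for i in range(rf)]

def rack_aware(ring, dc_of, primary, rf):
    """Like rack_unaware, but place one replica in another data center if possible."""
    start = ring.index(primary)
    walk = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    replicas = [primary]
    # First node on the walk that lives in a different data center, if any.
    remote = next((n for n in walk if dc_of[n] != dc_of[primary]), None)
    if remote is not None and rf > 1:
        replicas.append(remote)
    for node in walk:                 # fill the remaining slots in ring order
        if len(replicas) >= rf:
            break
        if node not in replicas:
            replicas.append(node)
    return replicas

ring = ["n1", "n2", "n3", "n4", "n5", "n6"]
dc_of = {"n1": "east", "n2": "east", "n3": "east", "n4": "east",
         "n5": "west", "n6": "west"}

plain = rack_unaware(ring, "n1", 3)        # all three copies end up in "east"
spread = rack_aware(ring, dc_of, "n1", 3)  # one copy is forced into "west"
```

In this example the rack-unaware placement would lose every copy if the "east" data center went down, while the rack-aware placement keeps one copy in "west".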

A

Okay, so that's sort of my Cassandra spiel, I guess. Oh sorry, I didn't go through the data model. Let me talk about the data model. The data model is nested. You have a keyspace, and a keyspace is sort of like a namespace. If you have an application and you want to share the same cluster for your dev environment and your testing environment, let's say, you can create two different keyspaces.

A

If you want to run multiple applications in the same cluster, they can be separated by keyspace. In a database, it's basically like a separate schema.

A

Now, within a keyspace you write keys, but each key lives within a column family. So this is a very hierarchical design, and you define it in the config. There are two types of column families. You can have a regular column family, where you have a key and then underneath it basically a list of key-value pairs.

A

And then you have this thing called a super column, which gives you another level of keys and values, so you can have a second dimension.

A

Think of it like a map of maps of maps, or, if you're in Python, a dictionary inside your dictionary. You can create these multiple levels of data and you can query within that tree.
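Taking the "dictionary inside your dictionary" comparison literally, the hierarchy can be written down as nested Python dictionaries. All keyspace, column family, and column names below are invented for illustration, and the `get` helper is a hypothetical convenience, not a Cassandra call.

```python
# keyspace -> column family -> row key -> column name -> value
cluster = {
    "dev": {
        "Users": {                       # regular column family
            "user1": {"name": "Ann", "city": "Boston"},
        },
        "Bookmarks": {                   # super column family: one extra level
            "user1": {
                "2010-10-21": {"title": "Cassandra and Lucene", "tag": "search"},
            },
        },
    },
    "test": {},                          # a second keyspace sharing the cluster
}

def get(keyspace, column_family, key, *path):
    """Walk keyspace/column family/key, then any super column or column names."""
    node = cluster[keyspace][column_family][key]
    for name in path:
        node = node[name]
    return node

city = get("dev", "Users", "user1", "city")
title = get("dev", "Bookmarks", "user1", "2010-10-21", "title")
```

A regular column family stops at one level of columns under the key; a super column family adds exactly one more level, which is the "second dimension" mentioned above.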

A

So as an example of this, I'll talk about the Lucene stuff. The way that Lucene works now, out of the box, is you have a searcher, you have a reader, you have a writer, and you have your disk.

A

B

A

Solr is the white box that wraps it in an HTTP layer. The way that this Lucandra stuff works is that instead of storing the index on disk, I'm actually storing the inverted index in Cassandra, using the Cassandra data model. So any write or any read goes to Cassandra to get the data. And the way this works is, there are two parts to a Lucene index.

A

There's your actual document data. A document in Lucene has fields, and a field is just a key-value pair. So you could say field "date" is this, field "title" is this, field "url" is this, and you specify how you want each field to be indexed, meaning how it's parsed: whether it's just regular English or a UTF-8, internationalized language. There are specific analyzers for that.

A

When it parses that data, it creates what are called your term vectors. It's broken up into a number of pieces. There's a term frequency: for any given field, if you have a bunch of text, say a Wikipedia article, the number of times that the word "the" occurs will be represented in the term frequency. The key for Cassandra is a combination of a few fields: the index name, the field name, and then the term itself. So let's say my index name is "index1", my field name is "text", and the term is "the". Then, within that, the value under that key is going to be the document ID.

A

Every document has a unique ID, and underneath that ID you have a number of different pieces of information: the term frequency, which is the number of times that "the" occurred; the term positions, which is where in that field "the" occurred; the term offsets, like how...

C

It's escaping me: what are the offsets used for?

A

Oh sorry, yeah: the offsets are in terms of characters, what characters come before or after, including words that get thrown away because they're stop words. And then there's the normalization factor, which goes into your scoring: how important is this word to this document?
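The layout just described (row key built from index name, field name, and term; document ID underneath; term information under that) can be sketched in Python. This is a toy under assumptions: the "/" separator, the `index_field` function, the dictionary field names, and the tiny stop-word list are all made up here, and the normalization factor is left out entirely.

```python
from collections import defaultdict

def index_field(index, index_name, doc_id, field, text,
                stop_words=frozenset({"a", "of"})):
    """Add one field of one document to a toy inverted index."""
    lowered = text.lower()
    cursor = 0
    for position, token in enumerate(lowered.split()):
        start = lowered.index(token, cursor)    # character offset of this token
        cursor = start + len(token)
        if token in stop_words:
            continue                            # dropped, but offsets still advance
        row_key = "/".join((index_name, field, token))
        entry = index[row_key].setdefault(doc_id, {"positions": [], "offsets": []})
        entry["positions"].append(position)
        entry["offsets"].append((start, cursor))
        entry["frequency"] = len(entry["positions"])

# row key -> {document ID (the super column) -> term info (the sub-columns)}
index = defaultdict(dict)
index_field(index, "index1", "doc42", "text", "The cat of the house")

the_info = index["index1/text/the"]["doc42"]
```

Because the stop word "of" still consumes a position and some characters, the second "the" gets position 3 and offsets (11, 14), which is the bookkeeping the offsets answer above refers to.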

A

Okay, so what you get out of this is that once you deploy your index in Cassandra, you can do some cool things. You no longer have a single writer and a single reader.

A

You can write to any node. You can read from any node. You can have any number of indexes. You don't have to worry about replication or optimization or any of the operational stuff that comes with Solr and Lucene, and you can let Cassandra do the work for you. I don't know how much time I have. How many minutes do I have? (I think you have 15 more minutes.) Oh good, cool. So I was going to do a demo, if you guys want to see it.

A

But before that, are there any questions on Cassandra or Lucene that you guys want to talk about first? I know this is probably a lot for the amount of time I had.

C

You guys just want to see it.

C

You guys want to see that.

A

Yes, yes, all right, so um all right so kind of fun.

A

So first thing I'm going to do is start cassandra.

A

I'm just using the newest beta. I think I've already copied in the config. The Lucandra stuff comes with a sample config.

A

Basically, the config is actually really simple in Cassandra. You say: where am I going to write my data to? What's my seed or list of seeds? How many concurrent readers and writers do I want? What port do I want to run on? And down here is where I define the actual column families. If you look here, it's really simple: you've got Documents and you've got TermInfo. I'm just comparing each column by bytes, and the super column is for the term info, as I described.

D

So you don't create a schema for Lucene or Solr, you just create the schema in Cassandra?

D

A

If you remember from the other slide, the keys are composite, and they start with the index name, so you're putting your index in as just a part of the key. If I want everything under a particular index, I just do a row scan for everything starting with that part of the key, and it's an ordered scan.
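The ordered scan over composite keys amounts to a prefix range query on sorted keys. Here is a minimal Python sketch; the "/" separator, the sample keys, and the `prefix_scan` helper are assumptions for illustration, and the `"\uffff"` sentinel assumes no real key contains that character.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

keys = sorted([
    "index1/text/the",
    "index1/title/cassandra",
    "index2/text/the",
    "index10/text/lucene",
])

# Including the separator in the prefix keeps "index10" out of the result.
index1_rows = prefix_scan(keys, "index1/")
```

This only works when keys are stored in order, which is why the scan depends on the ordered partitioner rather than the random one.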

A

All right, so I'm just going to start Cassandra.

A

A

The default Java version comes up with a strange message. Okay.

B

A

So there's the Thrift service running on localhost. I'm actually a Thrift committer, which is part of the reason why I got into Cassandra in the first place. Okay, so now Cassandra's running in the background.

A

Oh, and actually there's one thing I've got to do. This is one of the new things with 0.7, the new version of Cassandra. Most of you guys probably haven't used it.

A

In the older version of Cassandra you had to define your keyspace information up front, but now, in the new version, you can create as many column families as you want, and you can drop, recreate, or modify them. It also supports secondary indexing, so instead of just indexing on column names, you can also index the values.

A

So what I'm going to do is, there's this thing called schematool, which is just something temporary. You specify localhost and the port it's running on, and then you say import, and it will import the schema that's defined in the config.

A

B

Now I've added my schema.

A

Now what I can do is show you what the what the ring looks like uh schematically.

D

Imported what to what, again?

A

If you remember, in the YAML file at the bottom there's the schema config. It takes that config from the YAML file. But I could also create the schema from whatever language Thrift supports. It's just like in any database: you need to run the DDL or DML.

A

So if I do ring here, I can see what the ring is. There's only one node, but obviously you could get a whole list of nodes. It just picks a random token by default.

A

Okay, so I was going to do workspace.

A

And now what I'm going to do is, there's a demo application that comes with it.

A

All right, so I'll run the demo. What this does is, it's basically a Delicious search engine. I took a tab-delimited text file with bookmark information from Delicious. It's just the URL, the title of the URL, and then a list of tags.

A

So what I can do is.

A

All right, so there, I've indexed my data, and now I can do a search.

A

Right, so I just queried for everything that matched "linu*". And this uses the exact same Lucene API, so there's no magical voodoo that you have to do. You just have to use the Lucandra index reader, which just takes your Cassandra connection.

A

It's very simple! If you want to see the code.

D

Stuff on top right.

A

So this is just regular old Lucene, but I have a Solr example in here too.

A

So if I go and start up my.

A

So Solr comes with an example, so I just took their example and

A

This starts up jetty and all that stuff. Okay, so.

D

Now it's running on.

A

A

So they give you this tool to post a bunch of random data to it. So there, I just posted a bunch of data into it, and then what I can do is go to

A

So here's a Solr query.

A

So it just searched for the word "solr", and yes, it does work with faceting and everything else. Let me just go back to the presentation. So there's the thing working, but what I wanted to show real quick is some examples of this in use. There are companies that use this. I wrote this initially just for this:

A

I built a toy to show how it's used. It's called sparse.ly. Sparse.ly is a Twitter search engine kind of thing. This is from over a year ago, but you go to sparse.ly and you log in, and it creates an index of just your Twitter stream.

A

So if you follow 500 people, you can search just across those 500 people, and you don't have to search all the other folks. I didn't like the Twitter search because you always have to search across everyone, and there's always lots of spam and garbage in there.

A

Yeah, I probably did, yeah. This was back when it was hip to have that ".ly".

A

A

This is running on a server in my basement, so my wife may have kicked the cord. If you typed in www.

A

A

See if I jump in here.

D

Could you explain a little more about when Solr and Lucene actually do get...

D

A

Sure. So in Lucene, you need to reopen your index every time you do a write, if you want that write to be found by the readers. What Lucene does is keep track of all of its terms and information across all the sub-readers. Per file there's a sub-reader, and a reader just has a list of sub-readers, and each of those sub-readers has a list of the terms it has available and the last time it was updated.

A

So it knows, for this term, the most recent version is in this file. If you want your new writes to show up, you have to reopen and reread that data. Is it, like, broken? Okay, let's forget sparse.ly; that never happened.

A

So all of that goes away with this, because the data is stored under the same key. There are no longer multiple copies; there's the eventually consistent copy. Depending on how consistent you want your data to be, you can change that factor on your writes.

A

By default it writes at consistency level ONE: it has to write to one replica for the write to succeed. But the minute that write is written on that one replica, it'll replicate to the others, so your data will eventually show up there and you don't have to worry about it. Every read is run against the cluster, so you don't need to reopen any data; you just re-query the same data from the cluster. And that brings me to my last point.
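The consistency trade-off mentioned above can be made concrete with a little arithmetic. ONE, QUORUM, and ALL are Cassandra consistency level names; the two helper functions below are hypothetical and only illustrate the usual rule that a read is guaranteed to overlap the latest write when the read and write replica counts add up to more than the replication factor.

```python
def replicas_required(level, replication_factor):
    """Number of replica acknowledgements needed for a request to succeed."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")

def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor

# With three replicas, ONE/ONE is only eventually consistent,
# while QUORUM/QUORUM always reads at least one up-to-date replica.
eventual = read_sees_latest_write("ONE", "ONE", 3)
strong = read_sees_latest_write("QUORUM", "QUORUM", 3)
```

For a search index, writing at ONE means a freshly indexed document may take a moment to appear on every replica, which is usually an acceptable price for fast writes.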

A

The really strong thing about this is that the sparse.ly use case works really well. At one point it was popular, so there were probably five or six thousand users, which means five or six thousand indexes running on a single box. There aren't five or six thousand directories, and each individual index has under a hundred thousand documents. That's a good use case for the current version of Lucandra.

A

The newer version, which... oh, here it is. So the new version that's being worked on is meant to really take advantage of Solr's sharding and make it auto-shard for you, using Cassandra underneath. Since the ring in Cassandra knows where what data lives, it will create indexes with a maximum size of...

C

A

Let me describe the problem with the current version better. In the current version you have this problem where, let's say you have a million documents and you search for the word "the", which is contained in 90% of those documents. That means you've got to pull 900,000 document IDs over the wire. That's a performance problem.

A

With the newer version, what it does is embed Cassandra as part of Solr, so Solr becomes part of the ring, and each index has a maximum size of, let's say, 100,000 documents. So when I search for, you know, the word "love", the Cassandra part of Solr says, all right, where are the other shards in the ring? And it uses the Solr APIs to talk to the other instances of Solr.
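The numbers in this part of the talk, and the scatter-gather idea behind the sharded version, can be sketched in Python. This is not Lucandra or Solr code: the function names and the tiny in-memory shards are assumptions, and each shard is just a dictionary mapping a document ID to its terms and a precomputed score.

```python
import heapq
import math

def shard_count(total_docs, max_docs_per_shard):
    """How many shards a capped index size implies."""
    return math.ceil(total_docs / max_docs_per_shard)

def search_shard(shard, term, k):
    """Each shard scores locally and returns only its k best (score, doc_id) pairs."""
    hits = [(score, doc_id)
            for doc_id, (terms, score) in shard.items() if term in terms]
    return heapq.nlargest(k, hits)

def search_all(shards, term, k):
    """Scatter the query to every shard, then merge the small per-shard results."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, k)]
    return heapq.nlargest(k, partial)

# The talk's numbers: one million documents, 90% of them matching the term.
ids_over_wire_unsharded = 1_000_000 * 90 // 100       # every matching ID is fetched
n_shards = shard_count(1_000_000, 100_000)
ids_over_wire_sharded = n_shards * 10                 # top 10 from each shard

shards = [
    {"d1": ({"the", "cat"}, 0.9), "d2": ({"the"}, 0.2)},
    {"d3": ({"the"}, 0.5), "d4": ({"dog"}, 0.7)},
]
top2 = search_all(shards, "the", 2)
```

The point of the design is that each shard only ships its best few hits, so the traffic grows with the number of shards and the page size rather than with the number of matching documents.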

A

So you end up getting much, much better scalability in the large case. It's all working; I haven't actually updated my public GitHub yet, but this is something that's going to really take it to the next level, which will address the other use cases, like: what if I have a billion documents and I need to search them all at once? Okay, so I think that's it. Any questions or things you guys want to chat about, I'll be around all day. Thanks very much.