Apache Cassandra Cassandra Summit 2013, 26 Jun 2013

From YouTube: C* Summit 2013: Searching for a Needle in a Big Data Haystack

Description

Speaker: Jason Rutherglen, Senior Big Data Engineer at DataStax
Slides: http://www.slideshare.net/planetcassandra/dse-solr-realtimeanalytics
The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, DataStax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A real-time financial application will run for the audience, followed by a detailed look at how the application was built. An overview of DataStax Enterprise Solr features will be given, along with how the many enhancements in DSE make it unique in the marketplace.

A

Alright, hi, welcome. This is DataStax Enterprise real-time analytics. The official title was searching for data in the big data haystack, or something like that. My name is Jason Rutherglen. I work at DataStax.

A

I basically work on DataStax Enterprise. I've worked on the Solr integration for the most part, so I do a lot of support and development. I did one book, Programming Hive, for O'Reilly, and I'm doing an introduction to Solr for O'Reilly. So, DataStax: we kind of know what DataStax is, but we do Cassandra and we do DataStax Enterprise. We've got DataStax Enterprise 3.x, so that's our main version that we've got going on.

A

It's a single stack, so everything is integrated in one process: Cassandra, Solr, and Hadoop. We also do some consulting and support around the product.

A

So, big data. I just want to know, how many people have used Solr? Okay, nice. And Cassandra? More, okay, a little bit more, good. And how many people have used Hadoop?

A

Okay, nice. One of the useful things, one of the reasons to use Solr, is to do real-time analytics. In this case it's really near real time, not so much real time.

A

Cassandra is, I would say, real time, but in Solr it's not efficient to do that. When we're talking about near real time, we're talking about a second of latency. You submit a document and you'll be able to search on it within about a second or so. So why not a relational database? Basically, Solr provides horizontal scalability, just like Cassandra, just like Hadoop. A lot of our customers are going from relational to big data.

A

Solr is really useful for converting SQL applications over to the big data space. Typically those applications are not the batch-based kind; it's the kind where there's a user interface and people want to do interactive, ad hoc queries. With Solr the query latency should be around 100 milliseconds or so, so it's pretty fast, just like Google, and the costs are a lot less.

A

You're probably familiar with SolrCloud and Elasticsearch and stuff like that, so you might ask: why Cassandra? Why are we integrating Solr with Cassandra? Basically, we leverage everything that's good about Cassandra, everything that's been built into it to do distributed stuff, multiple data centers and all that, and then we let Solr do the queries and the indexing, and that's it. We don't use any SolrCloud or anything like that, because we don't need to. We get all that with Cassandra.

A

Cassandra, as you're probably familiar, is a simple Dynamo model that works really well in distributed environments, and it's automatic. The Dynamo model takes care of where the documents go and all that stuff, so it's a fairly simple architecture in that regard. You're probably familiar with the whole Cassandra versus HBase thing, but Cassandra's just a lot easier to use.

A

Let's see, batch analytics. The way I look at things, there are two use cases. There's batch analytics, where you wait a little bit and you're probably going to do a join. Otherwise you can pretty much just use Solr, or you can use Cassandra, and then you're going to get the near real time.

A

Typically, if you're doing batch analytics, you're going to use Hive and you're going to have to do UDFs and stuff like that, and none of that really ports over to Solr. So there's not really any relationship there.

A

So, real-time analytics. I think you can use Solr, you can do complex event processing, you can do Cassandra, and then there's newer stuff, which I kind of hesitate to put in here, like Impala and Stinger for Hive. The latency there is typically 30 seconds, or like five seconds, so I wasn't sure where to put that. Complex event processing is like doing all the calculations in real time. It's a little different, because the data is never really at rest.

A

Or the data is at rest, but all the queries are essentially computed as the data is streamed through. So it's a different architecture, whereas Solr actually iterates over the data at rest, typically in RAM. Lucene is a Java library, and Lucene and Solr are one Apache project. In its basic form it's an inverted index, but it's grown to be a lot more. It was originally for text analytics.

A

It's very high speed. It's highly optimized for today's

A

computers, basically. So what is an inverted index? It's important to know, because that's the basis for the whole Solr and Lucene ecosystem. It's very simple, really. It's just a terms dictionary, a sorted list of terms or words, and each word points to a posting list. A posting list has some metadata, but in its basic form it's a set of document IDs, which are integers. That's the inverted index, basically.
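The terms dictionary and posting lists described here can be sketched in a few lines of Python. This is only a toy illustration, not Lucene's actual data structures: the whitespace tokenizer and the function names `tokenize` and `build_inverted_index` are assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a sorted posting list of integer document IDs."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Keep the terms dictionary sorted, as in the description above.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = [
    "Cassandra stores the data",
    "Solr indexes the data",
    "Lucene powers Solr",
]
index = build_inverted_index(docs)
print(index["solr"])  # document IDs whose text contains "solr"
```

Looking up a word is then a dictionary lookup followed by a walk over a list of integers, which is a large part of why this structure is fast.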

A

Lucene will tokenize text. A lot of our customers tend not to do too much text analytics and are more focused on what I would call relational-database types of queries. So it's a little different from your typical text query type of thing, but we do support that, of course.

A

So Solr is built around Lucene. It's kind of like the wrapper that Lucene needs; people always kind of reinvent the wheel if they're just using raw Lucene. It adds faceting and distributed search, and we implement both of those in DataStax Enterprise.

A

What's been missing for a number of years is the whole distributed cloud type of capability, so in the Apache Solr community they started working on SolrCloud. In my opinion it's got a little ways to go to be totally useful, but it uses ZooKeeper and provides the missing cloud piece. In DataStax Enterprise, of course, using Cassandra, we get the cloud piece really easily. So SolrCloud uses ZooKeeper. I think that's kind of a fatal flaw.

A

ZooKeeper is yet another system you have to manage. Cassandra is peer-to-peer, so it's a lot easier to manage, and you get the multiple data center replication. I think SolrCloud is playing catch-up. Elasticsearch has been out there a little bit longer, and they focused on the things that Solr didn't provide for a number of years: near real-time search and distributed stuff, the cloud types of capabilities.

A

Its features are not as robust as Cassandra's, but I think it's a little bit better than SolrCloud. So, basic Cassandra concepts. This is probably somewhat redundant, but: columns, column families, keyspaces. It's peer-to-peer, and it's eventually consistent.

A

That's a little different from SolrCloud, which elects a leader and does leader-based replication. In Cassandra we use the Bigtable model for the basic data modeling.

A

One of the things about Lucene and Cassandra that's really interesting is that they both implement the same type of log-structured merge tree. That means that if you're buying hardware for a DataStax Enterprise installation, whatever is going to work for your Cassandra nodes is also going to work for your Solr nodes just the same. So, for example, SSDs are really good.

A

You want to be able to tune the heap, so there are a lot of similarities in using the two when you have Solr nodes in your system. Basically, all we do with DataStax Enterprise and Solr is store the data in Cassandra and index the data in Solr, and that's it. We let Cassandra place the data on given nodes and things like that.

A

Cassandra does all the replication, and we let Cassandra take care of which nodes are online, and things like that. So there's a very clean delineation between Solr and Cassandra. Basically, Solr is only a secondary index, and that's benefited us greatly in terms of not having to build distributed SolrCloud features into Solr. I call it a separation of church and state, basically. Now, indexing. Typically you're doing a lot of indexing when you're putting data into Solr. It's a CPU-intensive task.

A

It's not an I/O-bound task, typically, and we've done a lot of optimizations to make that fast with DataStax Enterprise. Queries, on the other hand, are typically I/O bound. If the index is not in RAM, basically everything is going to slow down by an order of magnitude, both in terms of how many queries per second you get and the overall query latency, which is the raw query time.

A

Lucene does multi-threaded queries, but with Solr it's just not something that we're looking at putting into DataStax Enterprise.

A

Unlike SolrCloud and Elasticsearch, we always index on each node, so we're not replicating Lucene segments and indexes around. That keeps things very clean, and we kind of have to do that because of the eventual consistency model.

A

Another thing that we do, as you would any time you have distributed search, is automatically round-robin queries to different nodes. If you have a replication factor greater than one, like two or three, then you just hit a node and it's going to automatically balance the distributed query across nodes for you, so you don't have to worry about that, which is a nice feature. So 3.0.1 and 3.0.2 are the current releases of DataStax Enterprise.
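As a rough picture of what round-robin routing across replicas means, here is a minimal Python sketch. It is not how DataStax Enterprise implements it; the class name `RoundRobinRouter` and the node names are hypothetical.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hand out replicas in rotation so that load is spread evenly."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._rotation = cycle(list(replicas))

    def pick(self):
        """Return the replica that should serve the next query."""
        return next(self._rotation)


# With a replication factor of three, any of these hypothetical
# nodes can answer a query for the same slice of data.
router = RoundRobinRouter(["node1", "node2", "node3"])
picks = [router.pick() for _ in range(5)]
print(picks)
```

In the setup described in the talk this balancing happens on the server side, so the client just sends its query to any node.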

A

We added a lot of cool features, like re-indexing. Because we have all the data stored in Cassandra, you can just run a command and it'll reindex all your data. If you change your schema, which is something people do in Solr a lot, say you want to change a string to an int or something like that, then you have to recreate the entire index. In typical Solr, SolrCloud, and Elasticsearch,

A

it's not easy, but in DataStax Enterprise we make it very, very easy. You just run a command, a REST command, and it's done.

A

So there's no custom code required and that sort of thing. We also added some interesting features, like you can view the heap usage. People typically run out of heap space when they're using Solr; doing support, I would say that's a very common problem. So we allow you to view the memory usage there, and that allows you to do capacity planning, which is something people usually miss, and then there's a problem and we have to come in and try to fix it.

A

We also do multi-threaded re-indexing, including for repair. That was a new feature we added. It means that when you add more nodes and you do a repair, it's pretty much going to max the CPU out so the node will come up as fast as possible.

A

Just some other features we added: we have full security in Solr, so Kerberos, SSL, and password authentication between nodes and when your client application talks to a given node. Those are some cool features that we added. Next, DataStax Enterprise 3.1.

A

We added something that Solr doesn't have, which is per-segment filters and facets.

A

We also have multi-value facets. We're including Solr 4.3, which has something cool called DocValues. I'll probably do a blog post on that soon, and I'll probably do blog posts on the per-segment filters and things like that.

A

I'm not sure how much sense this makes, but I consider this a major pain point for Solr, and we needed to do it because we do range queries in Solr that correspond to the ring. If you're familiar with Cassandra, you know it's a ring model, and we need to narrow down the query to the part of the ring that the query should apply to, and then we need to cache those queries.

A

So we use the per-segment filters for that with the near real-time search. We also support vnodes and composite keys. 3.1 is really going to be a fairly good achievement, I would say, because a lot of the problems that have been nagging Solr, and maybe DataStax Enterprise, for a while are totally going to be fixed.

A

Everything will be pretty smooth, I think. One feature that will be really good, and that we'll be adding really soon, is multiple data center re-indexing. I call it live re-indexing. If you've got a production app and you want to change the schema, you don't have to take any downtime.

A

If you have multiple data centers, you just take one data center out, reindex that, take another data center out, reindex that one, and you should be able to stay online all the time. I would say that's fairly advanced, and nothing else is going to offer that. Elasticsearch and SolrCloud are not going to offer that, I don't think. We're also looking at making a way so you can actually write CQL and it'll

A

just translate it into Solr for you and run it. I think that'll be pretty cool.

A

One of the things I'd like to go over, that I find people don't talk about enough, is this: if you've got an existing application that's in SQL, Oracle or something like that, how do you convert that into Solr? One of the things I've found is that there aren't really good guides for that, so we're just going to go over some basic SQL queries and how they look for Solr.

A

Basically, you start with the solrconfig. It's just an XML file with a bunch of crazy options.

A

It could be a little bit easier to use right now, but maybe we'll fix that in the future. There are soft commit times. A soft commit is basically committing an index into RAM first, and then later on it goes to disk. In DataStax Enterprise, the transaction log is held in Cassandra. So if a node blinks out, you don't lose any data, not only because of the whole replication and the quorums and stuff like that, but because of the Cassandra commit log.

A

If a node blinks out, when it pops back up again we just go through the commit log and it reindexes. So that's pretty good. A hard commit is just a concept that's good to know: it's a hard fsync of the in-memory Lucene index to disk or SSD.

A

This is what the auto soft commit looks like. I'm going super fast through stuff because there's not a lot of time, but the slides will be available later. The field cache is a really important concept to know about. Any time you do a sort or a faceted query, typically it's going to load these heap-based data structures into RAM. In Solr 4.3 there's an option for keeping it on disk or on the SSD, whatever.

A

But this is an important concept to know about, because customers will typically try to run a sort or a facet query and then all of a sudden their whole system goes out of memory, and that's bad in production and bad in general. So, SolrJ and HTTP. Basically, everything with Solr is HTTP-based. Of course, we also support inserting data using CQL.

A

There's no way to do CQL queries to Solr, or I should say there is, but you shouldn't use it. If you're doing queries it's always HTTP-based, and you can insert data via CQL or you can use the native Solr API. So basically, if you've already got a Solr application, you can drop in DataStax Enterprise and everything will basically just work.

A

Let's see, inserting data with CQL. We support copy fields via CQL. We support dynamic fields. It's kind of odd, but we do support that fully now.

A

You can do Solr queries via CQL. I don't recommend it. I've only used it for debugging, because it only hits one node; it doesn't distribute the query out, and it's very limited. We may address that in the future, but probably not in the near future. This is what a typical CQL insert looks like, if you're familiar with CQL.

A

It doesn't look that special, probably.

A

So we're just going to run through some SQL stuff really quick. This is your most basic SQL query, right? It's SELECT * FROM a table WHERE, you know, type equals PDF, whatever. So what does that query look like in Solr? Solr uses HTTP, so here we've got the HTTP parameter q equals type colon PDF. Type is the field as defined in the schema; it will also be a column in Cassandra. And then we're looking for PDF.
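For readers following along without the slides, the mapping can be sketched as a small Python helper that builds the HTTP request. The host, port, and core name in `BASE_URL`, and the helper name `solr_select_url`, are assumptions for illustration; `q` and the `/select` handler are standard Solr.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, params):
    """Build a Solr /select URL from a dict of HTTP query parameters."""
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name; substitute your own.
BASE_URL = "http://localhost:8983/solr/demo.docs"

# Rough equivalent of: SELECT * FROM docs WHERE type = 'pdf'
url = solr_select_url(BASE_URL, {"q": "type:pdf"})
print(url)
```

The colon gets percent-encoded in the URL, but Solr sees `type:pdf`, meaning field `type` matches the term `pdf`.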

A

There's no need to create a specific index like in SQL, where you would create something like a B-tree index. Lucene provides the indexing pretty much out of the box. I think Lucene can probably support more indexes, more fields having an index, than a SQL database probably can. I haven't tested that, but just indexing anything, even over-indexing, is fine with Lucene, I think. So what if you want to select columns?

A

This is what it would look like if we want title and text only and we want to get all the data. In Solr the query would be q equals asterisk colon asterisk, which returns everything, and then you just put in the fl parameter, which stands for fields, and you do title comma text. So, some pretty basic stuff. If you want to do a COUNT(*), you just run the query and Solr returns

A

the total number. And what does an ORDER BY look like? ORDER BY is just sorting, so you can see there we have sort equals price ascending. You can have multiple sort fields, and it's going to execute the query, I would say, pretty fast.
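The column selection, count, and ORDER BY mappings just described can be sketched as parameter builders in Python. The function names are made up for this example, and `rows=0` is an addition not mentioned in the talk: it asks Solr to skip returning documents when only the total is wanted. The sample response at the end is hand-written to show where the total lives in Solr's JSON output.

```python
def select_columns(fields, query="*:*"):
    """SELECT title, text FROM t  ->  q=*:*&fl=title,text"""
    return {"q": query, "fl": ",".join(fields)}


def count_all(query="*:*"):
    """SELECT COUNT(*) FROM t  ->  run the query and read numFound."""
    return {"q": query, "rows": 0}


def order_by(query, *sort_specs):
    """ORDER BY price ASC  ->  sort=price asc (comma-separated if several)."""
    return {"q": query, "sort": ",".join(sort_specs)}


def total_hits(response):
    """Pull the total match count out of a parsed Solr JSON response."""
    return response["response"]["numFound"]


print(select_columns(["title", "text"]))
print(order_by("*:*", "price asc", "title desc"))

# Hand-written sample response, for illustration only.
sample = {"response": {"numFound": 42, "start": 0, "docs": []}}
print(total_hits(sample))
```

Each dict can be URL-encoded and appended to the `/select` endpoint of the core being queried.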

A

If you want to do an average price, you can actually do that as well. That would be considered an aggregate function. What you do is define stats equals true and stats.field equals price, and it just computes the average for you.

A

If you want to add a GROUP BY, then you add stats.facet, and that's going to do the same thing as a SQL GROUP BY type of function. And the most basic thing of all, I'd say: if you're converting a text-based app, you may use the LIKE operation, and the way to map that is, instead of using a percent sign, you just use an asterisk.
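A minimal Python sketch of these last two mappings follows. The helper names and the `category` field are hypothetical; `stats`, `stats.field`, and `stats.facet` are the parameters named in the talk, and the stats component reports the mean along with other statistics.

```python
def average(field, query="*:*", group_by=None):
    """SELECT AVG(field) ... [GROUP BY group_by] as Solr stats parameters."""
    params = {"q": query, "rows": 0, "stats": "true", "stats.field": field}
    if group_by is not None:
        # stats.facet breaks the statistics down per value of this field.
        params["stats.facet"] = group_by
    return params


def like_to_wildcard(pattern):
    """Translate a SQL LIKE pattern's % into the * wildcard."""
    return pattern.replace("%", "*")


print(average("price"))
print(average("price", group_by="category"))
print(like_to_wildcard("cassandra%"))
```

Only the percent sign is handled here because that is the case covered in the talk; other LIKE features would need their own handling.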

A

All right, thanks. Okay, so I guess we just jammed through that really fast, but hopefully you're not totally tired and bored. Does anybody have any questions?

A

No? Okay. Yeah, oh, sorry.

B

So is Solr on a different node, or is it on the same one, or do you put it on a different server?

A

Yeah, so Solr is fully integrated. It's in the same process as Cassandra and Hadoop and all that stuff. We fully merged everything.

B

In a way, too, there's no, like...

A

Right. So the question is: is there a different process or a different node for Solr? There are different nodes; you can have Cassandra nodes or Cassandra-plus-Solr nodes, but there's always Cassandra there. And then the question is, is there a different process? No, everything's in one process. We made a conscious effort to do that because the two are just totally tied together, basically.

B

Question: if you're trying to do some kind of time-based roll-up, like exceptions or counts over time, how does using something like a roll-up compare to just indexing that in Solr and then doing time facets?

A

I'm sorry, wait. So you want to do a time roll-up?

B

Yeah.

A

You don't want to use facets, or you do?

B

I'm saying, how does it compare to just keeping a count?

A

In terms of what? Performance?

B

Or scalability.

A

Facets in Solr are pretty fast, so you'd have to test it out. You could keep the count or you could use the facets. I think it depends on what works best. If you want to enable more ad hoc queries over different ranges, for example, then it's probably better just to keep the raw data in Solr and then execute the counts that way. Hi.
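To make the two options in this exchange concrete, here is a hedged Python sketch. `rollup_counts` stands for the "keep a count" approach with one pre-aggregated counter per day, and `time_facet_params` builds Solr range-facet parameters over the raw data. The function names and the `event_time` field are assumptions; the `facet.range` parameters are standard Solr.

```python
from collections import Counter
from datetime import datetime


def rollup_counts(timestamps):
    """Pre-aggregated roll-up: one counter per calendar day."""
    return Counter(ts.date().isoformat() for ts in timestamps)


def time_facet_params(field, start, end, gap, query="*:*"):
    """Ask Solr to bucket matching documents by time at query time."""
    return {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }


events = [
    datetime(2013, 6, 26, 10, 0),
    datetime(2013, 6, 26, 11, 30),
    datetime(2013, 6, 27, 1, 15),
]
daily = rollup_counts(events)
print(daily)

# Hypothetical field name; Solr date math sets the range and bucket size.
params = time_facet_params("event_time", "NOW/DAY-7DAYS", "NOW/DAY+1DAY", "+1DAY")
print(params)
```

The roll-up fixes the bucket size when the data is written, while the facet version lets each query choose its own range and gap, which is the ad hoc flexibility mentioned in the answer.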

C

When repair is running in Cassandra, are the indexes also rebuilt? Say, for example, that during the repair data is fixed in Cassandra. How is the Solr index reconciled?

A

Yeah, so we're keeping an independent index per node, and indexing is typically fairly fast these days, especially given that it's multi-threaded. So when a repair happens, a reindex occurs. Not a full reindex, but the data that's moved is reindexed; a new index is created for that data.

A

If I understood your question right. Yeah, right.

A

It's actually reindexed during the repair, yeah, while the repair is happening.

D

Early on, you said that DataStax supports faceting, but then, when you were showing the future slide, you said that in the future faceting will be supported. So...

A

Well, it's really faster and more efficient faceting. We made a conscious effort, because most of our customers do near real-time search, and we want faceting to also be near real-time. So we made sure that every type of faceting you want to do, and everything else, is tuned and optimized for near real-time search, which is typically what people do when they're using Cassandra or NoSQL systems.

E

So you mentioned that Cassandra and Solr are going to be running in the same process.

A

Yeah.

E

Is there any way to monitor the effect of the Solr searches versus the Cassandra work? Are there different threads running, or is it all integrated?

A

I mean, typically I don't think there are a lot of issues around that. I wouldn't say there's a lot of CPU competition.

A

I would say typically the issues are around memory. The memory footprint and the memory usage of Solr are completely different from Cassandra's, so yeah, they probably compete a little. I think we need to do more work around monitoring, allowing people to monitor the memory situation of Cassandra and the memory situation of Solr, or maybe alerting people: hey, look, the heap's totally fried, increase it, or add more nodes, or do something,

A

the whole system's failing due to memory. But CPU, I typically don't see a lot of CPU problems. You would think, logically, there would be, but we just haven't seen it.

E

There aren't any Solr queries you could do that were inefficient enough to hog the CPU?

A

Well, even if it did, it would only be a single thread, and I would say typically it's very difficult to do that. That may only happen if you're doing something crazy, like faceting with the older Solr stuff, where it's trying to load these big data structures into RAM and it's just out of memory constantly.

A

That will fry the system. But if you've got a query that's just taking a long time, that would mean that the rest of your system is hosed anyway, because it's probably going to be hitting disk. That would be the only way, really.

E

Well, that's my point: if that was occurring, it would be nice to know that it was Solr, not Cassandra, to have some way to determine it. That's kind of what I'm getting at.

A

Yeah, yeah, well, yeah.

E

If something is hosed up, how do you know what it is?

A

Yeah, that's a good point. I think we need to provide better monitoring of low-level details. The question would be how. The simple answer is that it ends up looking like a real-time log analysis tool, which would be like Splunk, or, you know, Elasticsearch is actually investing in some of that space right now. What's that?

E

So there's nothing in place right now?

A

Well, I think there is. Basically, you look at the log data.

B

Okay, thank you very much.

A

Okay.