Description
This topic will introduce the Cassandra native protocol, native drivers and Cassandra Query Language (CQL). It is important for developers to be aware of this new way of integrating with and querying Cassandra, without using Thrift or RPC. There are various ways of tuning that integration and modeling your data, all intended to make it easier and more productive to build against Cassandra, with some additional performance benefits. This is a technical session with code excerpts using the Java driver.
Hello everybody, thanks for coming. Just in case you're thinking you're in the wrong room: this is what we're talking about today, going native with Apache Cassandra. About me: my name is Johnny, I'm a solutions architect at a company called DataStax. If you want to stalk me, there's all my contact information.
DataStax is a company that was set up by Jonathan Ellis and Matt Pfeil a few years ago, really to provide commercial support and enhanced products around Cassandra. And it's been a pretty interesting couple of years. We have quite a few customers now, we contribute roughly eighty percent of the code to the Apache project, and we're headquartered out of San Francisco. Our European office is in Stockley Park, which is definitely not as nice as San Francisco. I wish. And obviously, we are hiring.
So if anybody is very keen, please have a look at our site. So what do I mean by going native with Cassandra?
Traditionally, when Cassandra was first starting out, really the main way of integrating with it and using it was through Thrift, and there were a bunch of Thrift clients around: Hector, Astyanax, etc. What happened with Cassandra 1.2 is that CQL and the native protocol got to a point where I think they were pretty good, and they're becoming pretty much the de facto way of developing against Cassandra.
But why did we move from one to the other? Not that there's anything wrong with Thrift: if you're still using Thrift and you're comfortable with it, do use it. But what we found from the community and from developers was, first, the tooling around it, and second, from a product point of view, adding new stuff to Cassandra meant keeping the different Thrift clients compatible with what was happening in Cassandra and maintaining that going forward. It was getting hard. So we introduced CQL and the native protocol.
I should point out as well that Netflix are introducing the native protocol into their Astyanax driver too, because they want to make sure that people know about this.
So just to start, what I'll do is give you a quick run-through of CQL. How familiar is everybody here with Cassandra? Have you used it before? Great.
So hopefully you know what Cassandra is: it's a distributed database. CQL is a standard query language. It was really intended to make it much simpler for developers and other people to query and model, and you probably already know it if you've come from an SQL background.
There are simple statements, like SELECT * FROM users; this is the same whichever way you're doing it. It has all the usual statements in there: creating, dropping, altering, user permissions, all that kind of stuff. So it should give you a very comfortable way of getting working with Cassandra.
So, an example here: this is creating a keyspace in CQL. What you're doing there is saying: create my keyspace, 'johnny', give it the NetworkTopologyStrategy, and replicate my data between different data centers. Simple as that.
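As a minimal sketch of that statement, run here through the Java driver that this talk covers later; the keyspace name, data center names and replication factors are all illustrative:

    import com.datastax.driver.core.Session;

    public class KeyspaceExample {
        // Create a keyspace replicated across two data centers.
        // 'DC1' and 'DC2' must match the names your snitch reports.
        static void createKeyspace(Session session) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS johnny WITH replication = " +
                "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2}");
        }
    }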
I'm going to rush through this, because there is a lot to cover. From the basics point of view, what you have here is a statement to create a table, which should be quite easy to understand, a SELECT query, which is very similar, and an INSERT statement.
The only thing to really point out here is that it's worth understanding what that primary key definition means. There is the concept of the partition key and the clustering columns: the blue bit is the partition key, the orange bits are your clustering columns. Now the partition key, if you know Cassandra you will know this, really is the value that's used to give affinity to certain nodes and replicas in your cluster, and the clustering columns are really about ordering your data on that node. So it's very straightforward stuff. It's also worth pointing out that you can have more than one clustering column in there, and you can have composite partition keys as well, so you could have team name and player name together as your partition key.
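To make that concrete, here is a hedged sketch of a table with a composite partition key and two clustering columns; the schema is invented for illustration, not taken from the slide:

    import com.datastax.driver.core.Session;

    public class TableExample {
        static void createScoresTable(Session session) {
            // (team_name, player_name) is the composite partition key: it
            // decides which nodes own the row. year and month are clustering
            // columns: they order the data within that partition on the node.
            session.execute(
                "CREATE TABLE IF NOT EXISTS scores (" +
                "  team_name text, player_name text," +
                "  year int, month int, score int," +
                "  PRIMARY KEY ((team_name, player_name), year, month))");
        }
    }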
So you can get quite clever with it. It's also worth pointing out that there are some data types you might not be aware of. One of the new ones that came in recently was collections as a data type. What you get with collections is sets, lists and maps, and if you wanted to create a table using those columns, it's very simple to set up. You've got a set example, a list example and a map example in there.
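As a sketch, a table using all three collection types might look like this; the column names are illustrative:

    import com.datastax.driver.core.Session;

    public class CollectionsExample {
        static void createUsersTable(Session session) {
            session.execute(
                "CREATE TABLE IF NOT EXISTS users (" +
                "  username text PRIMARY KEY," +
                "  emails set<text>," +           // set: unique, unordered
                "  top_places list<text>," +      // list: ordered, allows duplicates
                "  todo map<timestamp, text>)");  // map: key/value pairs
        }
    }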
There are some performance considerations around using collections, so just be conscious of what you're doing: sometimes it is still more efficient to denormalize your data than to use a collection column. Also, do favor sets over lists; they are more performant. And I'd tell you to look out for Cassandra 2.1: it's still in beta, so not necessarily for production yet, but they are putting a lot of indexing in around collections as well. So this is quite a nice way of modeling stuff in Cassandra.
The other thing you can do is query tracing. You can turn this on and off if you're using the command line, cqlsh: you turn it on in your session, and when you execute a query you get back a bunch of diagnostic information on what it did, so which nodes it went to, the time it took, etc. This is your friend. Query tracing is going to help you a lot if a query is running slow; it will tell you why it's running slow. And I'd always advise using it, even when you're just wanting to learn; it's a very helpful tool. You can actually go through all the steps that query took, what nodes it went out to, how long it took to come back, etc. So when you're looking at tuning your queries, optimizing, debugging, this will help you a lot.
There's also a lot more that you can do with CQL. We have lightweight transactions. Lightweight transactions are, on a per operation basis, a way of essentially doing a compare-and-set on a statement: you're saying, insert this data if this condition matches. You've got counters; it's probably pretty obvious what counters do: they give you the ability to increment and decrement values in columns. We also have time-to-live: on a per-insert basis you can set a TTL for that data, and once that TTL has expired, that data will no longer be returned.
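Hedged sketches of all three features; the table and column names are illustrative:

    import com.datastax.driver.core.Session;

    public class CqlFeaturesExample {
        static void examples(Session session) {
            // Lightweight transaction: compare-and-set, only applied
            // if the row does not already exist.
            session.execute(
                "INSERT INTO users (username, email) " +
                "VALUES ('johnny', 'johnny@example.com') IF NOT EXISTS");

            // Counter column: can only be incremented or decremented.
            session.execute(
                "UPDATE page_views SET views = views + 1 WHERE page = '/home'");

            // Time-to-live: this row stops being returned after 24 hours.
            session.execute(
                "INSERT INTO login_tokens (username, token) " +
                "VALUES ('johnny', 'abc123') USING TTL 86400");
        }
    }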
So that's a very quick overview of CQL, but really the point to get across is that it gives you a very familiar, mature tool to connect up, query and model your data. There's a lot of information out there which I suggest you look at, and a bunch of talks out there that would help, but I could talk about this alone for an hour rather than everything else. So what's probably quite interesting is how we've built this.
You've got CQL, which gives you the query language, and then we've built a whole different way of connecting into Cassandra. So part of what you get when you're using CQL is that you can use the native protocol, as opposed to the traditional Thrift RPC protocol, for connecting to Cassandra. What this gives you, essentially, is that we support request pipelining. So your client connects up to your Cassandra cluster.
It opens up a certain number of persistent connections, and what we're doing there is sending a multitude of requests concurrently down the same connection and getting them back asynchronously on the same connection as well. And this is a two-way conversation: we're not just getting request-reply, we're also getting push events back from Cassandra.
Just to be quite clear on the type of notifications you get back from Cassandra: previously, if you wanted awareness of the topology, or of what's going on, your client would be polling your Cassandra cluster, going out with a request and getting a reply. What we do now, when you're using the native protocol, is that any changes, any kind of topology events that happen on your cluster, your clients are made aware of them, and this type of data informs the client driver as to what it needs to be doing around load balancing, etc. Now, the types of change events you get back are topology changes: a node has gone down.
A node has come up, changes to your schema: these are the types of things that get notified to the client driver, not data mutations. You don't get told when a row changes; it's just the technical stuff for the driver, but it's quite a handy thing to have.
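As a rough sketch of how a client can observe those pushed events with the Java driver (driver 2.0 era API; the exact listener methods vary slightly between driver versions, and the logging here is illustrative):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;

    public class EventsExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            // Topology and status changes are pushed to the driver;
            // this listener just prints them as they arrive.
            cluster.register(new Host.StateListener() {
                public void onAdd(Host host)    { System.out.println("added: " + host); }
                public void onUp(Host host)     { System.out.println("up: " + host); }
                public void onDown(Host host)   { System.out.println("down: " + host); }
                public void onRemove(Host host) { System.out.println("removed: " + host); }
            });
        }
    }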
And then the driver itself is built from the ground up to be a completely asynchronous architecture. In the case of the Java driver, we've used Netty extensively to give you non-blocking I/O, and it's really handy, because, as I'll get into when we start looking at some code examples, given that we now have this very asynchronous way of multiplexing requests over the sockets, we're not blocking any threads. It really does change your ability to scale, whether that's internally within your clients, and also the amount of resources it takes to work with Cassandra is quite different.
So that's it very much in a nutshell. There's a lot more to this, but what I'm going to do now is talk about it more from a development point of view: talk about the drivers and how they work.
So, the native drivers: at DataStax we've produced a bunch of them, the ones there in bold. We have DataStax native drivers written in Java, C# and Python; you've got C++ in beta, and I'm not quite sure when that's going to go final; and we've got ODBC as well. And there's also a bunch of community drivers built in addition to this: Clojure, Erlang, Node.js, Ruby, etc.
There are lots and lots of different ones out there; it depends on the language you're using, but the tools are there for you to connect up with. So what I'll do now is focus on the Java driver. I'll take you through some ways of using it and the cool things in there, and then at the end we can have some questions.
So, the first thing, from a client point of view, when you decide to connect up to Cassandra, is that you build a Cluster. What's quite important there, the first thing you'll observe, is that it's using the builder pattern; it's a very fluent interface for building this stuff up.
The first thing you have there is these contact points. The contact points are really telling the client how to discover the cluster: it says, connect up and tell me everything about this Cassandra cluster, I need to use it; very much in the same way that seed nodes work when you're configuring Cassandra itself. We're not saying connect to these nodes and query them; it's just: I need to go and discover my cluster, I need to know the topology of my cluster, the partitioning, everything you want to know about your cluster. And then you build that. You then create a Session off that Cluster, and you've optionally got the ability to create the session on a keyspace, or just against the cluster itself. Obviously, if you don't specify the keyspace, you can qualify it in your CQL statements, exactly as you would do in SQL. And then, once you've got this session back:
You can then execute queries against your data, and here's an example, just doing a fairly straightforward insert. And no, that's not my password for anything. It's also worth noting that your Cluster object and your Session object are long-lived objects in your application. These aren't things you would be instantiating frequently, so reuse them and keep them alive for a long time. In the same way, there are shutdown methods, like you would have on a connection pool, for when you are finished with them.
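Putting that together, a minimal sketch of the whole lifecycle; the contact points, keyspace, table and values are illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ConnectExample {
        public static void main(String[] args) {
            // Long-lived objects: build once, reuse everywhere.
            Cluster cluster = Cluster.builder()
                    .addContactPoints("10.0.0.1", "10.0.0.2") // discovery seeds, not the only nodes used
                    .build();
            Session session = cluster.connect("johnny"); // optionally bind a keyspace

            session.execute(
                "INSERT INTO users (username, password) " +
                "VALUES ('johnny', 'notmyrealpassword')");

            cluster.shutdown(); // driver 1.x/2.0 name; newer versions use close()
        }
    }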
Okay, so the next example here is reading from a table. We call session.execute with a bit of CQL, SELECT * FROM users, and you get back a result set. I don't think there's anything we need to explain about this one: you get back some rows, you iterate through them, and you do stuff with them. Very simple, very straightforward.
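In code, that read path is roughly this; the column name is illustrative:

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class ReadExample {
        static void readUsers(Session session) {
            ResultSet results = session.execute("SELECT * FROM users");
            // ResultSet is iterable over its rows.
            for (Row row : results) {
                System.out.println(row.getString("username"));
            }
        }
    }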
However, there are cooler things you can do with this, and this comes down to the way we've adopted this asynchronous architecture.
What you can do is use that executeAsync method up there. This gives you the ability, as the name implies, to asynchronously execute queries. It's really handy, and I'll show you some examples why: what this does is give you back futures, and these futures implement Guava's ListenableFuture interface, which, as a consequence, means all of the very cool things you can do with Guava's future methods:
You can now do with your queries. This is very nice stuff, and the reason it's quite nice is that it lends itself to working at large scale without blocking your client. You can essentially execute your method, and if it's a horribly big, long query, or a big bunch of inserts, you say: come back and tell me when you're done.
What you can do there is register callbacks. So in the example here, imagine that SELECT * FROM users is a very heavy query that gives you back a large amount of data. What I can do from a client point of view is say: when you come back, execute my listener, that Runnable there, call its run method and do something. And it's really anything you want to do; the example here is just, when you get back the rows, iterate through them and do your stuff. But this is a very, very handy thing to be able to do, because it also enables you to parallelize your calls. So, a very simple example, but what it's trying to show is that for a similar query I've created a bunch of calls and had the session execute them asynchronously.
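A hedged sketch of that callback pattern, using Guava's Futures helper; the two-argument addCallback shown here is the form from the Guava versions of this era:

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    public class AsyncExample {
        static void readUsersAsync(Session session) {
            // Returns immediately; no thread blocks waiting on Cassandra.
            ResultSetFuture future = session.executeAsync("SELECT * FROM users");
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet results) {
                    for (Row row : results) {
                        System.out.println(row.getString("username"));
                    }
                }
                public void onFailure(Throwable t) {
                    System.err.println("query failed: " + t);
                }
            });
        }
    }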
This lets you really optimize how you're doing things. If you are having to do lots and lots of big queries, where you want to insert lots of data or select lots and lots of rows, then rather than issuing one big select, break it down into smaller queries, because this gives you a lot of advantages. First of all, it's very easy to do with executeAsync and the futures. But there is also a problem with doing large queries, which is that they can result in your coordinator node doing a lot of work, and that introduces a certain amount of hotspotting on that node, which can have an effect on general throughput. So breaking things down into smaller queries is often better than having just one big query. Also, if you're using latency awareness for your load balancing, which I'll talk about, a big query can often skew it: because you're giving one node a very expensive operation, your 99th percentile is now thinking, oh, this node is slower, when actually it's just doing a lot of work. And the other big advantage of breaking things down into smaller queries is that if one query fails, it's fine, I can retry it; it's not an expensive operation.
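A sketch of that fan-out pattern: many small per-partition queries fired concurrently instead of one big query. Here selectByUser is a hypothetical prepared statement (prepared statements are covered next):

    import java.util.ArrayList;
    import java.util.List;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class FanOutExample {
        static void readEach(Session session, PreparedStatement selectByUser,
                             List<String> usernames) {
            // Fire all the small queries without blocking; each one hits the
            // replicas for its own partition instead of one hot coordinator.
            List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
            for (String username : usernames) {
                futures.add(session.executeAsync(selectByUser.bind(username)));
            }
            // Collect the results; any single failure is cheap to retry.
            for (ResultSetFuture future : futures) {
                for (Row row : future.getUninterruptibly()) {
                    System.out.println(row.getString("username"));
                }
            }
        }
    }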
The other thing you've got with the native protocol is prepared statements, and these are just like prepared statements in your traditional JDBC world. These statements get compiled, they're intended for multiple executions, and they're really straightforward to use: if anybody has ever written any Java code connecting to a database, you will understand exactly what's happening here. There are some considerations around prepared statements; I mean, don't prepare a statement unless you are going to be reusing it. I think that's really obvious, but yeah.
These are your friends; this is something you will predominantly use. And there's more than one way of binding your variables in there: you can do it by index, you can do it by name, and I think you can also define named variables now and substitute stuff in. So it's fairly easy to use.
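A minimal sketch; the statement and values are illustrative:

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class PreparedExample {
        static void insertUsers(Session session) {
            // Prepare once...
            PreparedStatement insertUser = session.prepare(
                "INSERT INTO users (username, email) VALUES (?, ?)");

            // ...then bind and execute many times, positionally...
            session.execute(insertUser.bind("johnny", "johnny@example.com"));

            // ...or via explicit setters on a BoundStatement.
            BoundStatement bound = insertUser.bind();
            bound.setString(0, "alice");
            bound.setString(1, "alice@example.com");
            session.execute(bound);
        }
    }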
And then the other thing we have: whether it's with the previous examples or not, if you want to set your consistency level, you just call setConsistencyLevel on that query and pass in whatever you want. Worth pointing out: the default is a consistency level of ONE, so in that case the call is somewhat redundant, but just be aware that if you don't set it, it will be using a consistency level of ONE. And then, in the same way, you build up that query, you call session.execute, and you get your results back.
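For example (QUORUM is just an illustrative choice):

    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ConsistencyExample {
        static void readAtQuorum(Session session) {
            SimpleStatement statement =
                new SimpleStatement("SELECT * FROM users WHERE username = 'johnny'");
            // Per-statement override; without this the driver defaults to ONE.
            statement.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(statement);
        }
    }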
So really, if you're going to take away anything from these previous slides, it probably should be the asynchronous nature of how you can use the driver. It is really cool and very efficient, and it lends itself to a lot of use cases. It's all non-blocking I/O, it's very nice, and you will see a lot of performance differences when you start breaking things up: bulk reads, bulk writes. It's a very handy way of splitting the workload more evenly around your cluster, so use it and have fun.
Alright. So what you've also got with the drivers and the native protocol are a bunch of policies, and the policies are for load balancing, for reconnection, for retries, and there are a bunch of pre-written policies for you. The first one here is a multi data center load balancing policy, and what this is saying to your client is:
Route your requests to your local data center, and only fail over to the remote one when you need to. And that's quite nice: it gives you a lot of tolerance to failure, to nodes flapping, stuff like that. Or if you're basically hammering that local data center and it's having problems: fine, let's start switching some of my requests over to my other data center.
Another one you've got is the token aware load balancing policy. With a token aware load balancing policy, what it's doing is this: when your client connects to Cassandra, it chooses a coordinator node, and as you all know, Cassandra is entirely masterless, so it doesn't matter which node it chooses. What the token aware policy means is: I will connect to one of the nodes that owns this partition of data, and what that means is I don't have to incur, for instance:
A network hop from the coordinator to the other nodes, because my data is already residing on the node I'm connecting to. And that's quite nice, and what makes it even nicer is that, because the client driver is getting these events pushed to it, when you add nodes or the partition ranges change across your nodes, your client is told about this.
It's always aware of which nodes own which token ranges. Really, the way to use this is to instantiate it with a child policy; you have to instantiate it with a child policy, and the example here is actually a very common way of setting up your cluster: you have your token aware policy, and then you say, once I'm token aware, also make me DC aware, so I'll connect to my local data center.
I will choose a coordinator, but one that has an affinity to that partition of data, and when I connect to a remote data center I'll do the same thing. It's very nice. And I probably should say that there is a default policy as well, as a lower bound, which is just round robin.
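That common token-aware-wrapping-DC-aware setup looks roughly like this; the local data center name is illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class LoadBalancingExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    // Route to a replica that owns the partition, and within
                    // that, prefer nodes in the local data center.
                    .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
                    .build();
        }
    }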
Then, obviously, there's the default retry policy, which is quite conservative: it will only retry in the narrow cases where that's safe; it just doesn't do much. The next one there, which we're using in that example, is the downgrading consistency retry policy. You should use this with caution; don't use it unless you genuinely know what you're doing. The idea here is this:
Imagine, for example, I've asked for a consistency level of ALL and I'm executing my query against Cassandra, and, say I'm in a multi data center deployment, the connectivity to the other data center is unavailable. I can't honor that consistency level, because I can't get the acknowledgements back from all of my data centers. So what do you do? You can fail the request and say, well, I can't honor ALL; or I can say, you know what, okay, I'll drop down to a weaker level of consistency.
So, for instance, I go from ALL down to, say, LOCAL_QUORUM, and then, when the connectivity comes back up, it will go back to honoring the original level. It's a very handy tool in your box, but I'd probably argue that if you've gone for ALL, you probably have a really good reason to go for ALL. So don't use it just thinking, by default I'll just use this downgrading retry policy. Understand what it means for your application, understand what it means for your data; don't just use it out of the box.
You've got the fall-through retry policy, which basically doesn't do anything: it just lets the failure bubble up to your business logic, and you decide what to do in that situation. And then you've got the logging retry policy. With the logging retry policy, as with the token aware policy, you nest a child policy within it, and all it's doing is basically logging out:
When am I retrying. I can't see why you would ever have a retry policy and not want to know when you're retrying, because something's really going wrong. So use the logging retry policy; it will be fun. And there's a link down there to the Javadocs on this, which you can have a look at.
retry
policies
are.
You
know
it's
important
to
really
really
I'll
say
it
is.
It
is
quite
important
to
understand
what
you're,
what
you're
doing
there
you're.
You
know
you're
retrying
on
failure,
so
something
has
gone
wrong.
A
Hence
your
you're
doing
this.
This
isn't
like
on
load
balancing
this,
isn't
saying:
I'm
going
to
be
routing
requests
where
it's
efficient,
something
that
something's
gone
wrong
and
you're
now
retrying.
So
no,
what's
the
right
thing
to
do
when
that
happens,
then
it
don't
just
a
particularly
downgrading
consistency
policy.
That's
you
know
that
that
that's
a
choice
you've
made
for
a
functional
reason
for
a
requirement
to
choose
that
level
of
consistency.
So
downgrading
is
not
not
might
not
necessarily
be
the
right
thing
to
do.
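If you do decide downgrading is right for your application, a sketch of pairing it with logging, per the advice above:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;
    import com.datastax.driver.core.policies.LoggingRetryPolicy;

    public class RetryPolicyExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    // Log every retry decision the downgrading policy makes,
                    // so a consistency downgrade never happens silently.
                    .withRetryPolicy(
                        new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
                    .build();
        }
    }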
We then have a bunch of reconnection policies, and these govern how often we will attempt to reconnect to a dead node. It's fairly simple; there are really two choices there. You've got a constant reconnection policy, which says: if my node is marked as down, try again every X number of milliseconds, again and again. And then you've got an exponential reconnection policy, which backs off between attempts, up to a maximum delay.
The constant attempts would worry me in a system, because of the type of workload it creates: when a node goes down, it's quite likely that lots of your clients have noticed it, or are experiencing an issue and retrying, and they're all going to start hitting it at once. So having a bit of variability in those retry attempts is nice, and I personally usually go for the exponential one. I'll also point out that for all of these policies you can write your own; it's quite extensible.
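A sketch of the exponential option; the delays are illustrative, starting at one second and capping at ten minutes:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

    public class ReconnectionExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    // Back off between reconnection attempts to a downed node:
                    // 1s, 2s, 4s, ... up to a 10 minute ceiling.
                    .withReconnectionPolicy(
                        new ExponentialReconnectionPolicy(1000L, 10L * 60L * 1000L))
                    .build();
        }
    }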
This is all up on our GitHub, github.com/datastax. It's very extensible, and people do write their own policies, for instance maybe a time-based one: at midnight, don't use this data center, use this one, stuff like that. There are lots and lots of ways you can extend this for whatever your use cases and needs are. Another thing, and the reason this next one is in here is because it was probably one of the most painful points with early versions of Cassandra:
Paging. What you've got here isn't anything fancy, but in comparison to what we had before, when paging wasn't built in, this is very, very nice. Historically, getting large data sets out of Cassandra into your client was a problem, because Cassandra would load that result set into memory and then drop the whole thing over to your client. So you'd get out-of-memory exceptions a lot; you'd have problems.
The slide here is just showing you how it works, but the one thing that's quite cool about this is the concept of state as you're paging your data. Think about it: in that example, the client queries that node and gets back the first page of the data. Well, what happens if that node goes down? Do I have to start again from the other nodes? You don't. What you actually have is, essentially, a cookie:
If you will, one that tells you what point you've reached. So if that node goes down, the next page request can simply go to another node in your cluster without having to start again. So it is tolerant to failure, and from a personal point of view this is, for me, one of the nicest things. There are loads of nice things, but this is a very nice one, because I hated how ugly it used to be to get large result sets back. So this is your friend.
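A sketch of automatic paging as it appeared with driver 2.0 and native protocol v2; the fetch size is illustrative:

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class PagingExample {
        static void scanUsers(Session session) {
            SimpleStatement statement = new SimpleStatement("SELECT * FROM users");
            statement.setFetchSize(100); // rows per page
            ResultSet results = session.execute(statement);
            for (Row row : results) {
                // The driver fetches the next page transparently as the
                // iterator advances, carrying the paging state with it.
                System.out.println(row.getString("username"));
            }
        }
    }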
You won't even notice it's there. The other thing, which we discussed, and I showed you the example before, is tracing. So yes, in exactly the same way as from the CQL command line, you can enable tracing and get results back. You can also do this within the driver, and all you simply do, whether you're using the query builder or just your own statements, is call enableTracing on the statement, and when you execute that query:
You get back an ExecutionInfo object, and from there you can get a QueryTrace object. This basically gives you all the diagnostic information that you would get if you'd gone through the command line: which nodes were attempted, the latency, whether the coordinator had to go out to the other nodes to find the data.
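A sketch of per-statement tracing; the query is illustrative:

    import com.datastax.driver.core.ExecutionInfo;
    import com.datastax.driver.core.QueryTrace;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class TracingExample {
        static void traceQuery(Session session) {
            SimpleStatement statement =
                new SimpleStatement("SELECT * FROM users WHERE username = 'johnny'");
            statement.enableTracing(); // per statement, not global

            ResultSet results = session.execute(statement);
            ExecutionInfo info = results.getExecutionInfo();
            QueryTrace trace = info.getQueryTrace();

            System.out.println("coordinator: " + trace.getCoordinator());
            System.out.println("duration (us): " + trace.getDurationMicros());
            for (QueryTrace.Event event : trace.getEvents()) {
                System.out.println(event.getDescription() + " @ " + event.getSource());
            }
        }
    }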
You get a lot of diagnostic information in there. But, rather crucially, this is something you are doing programmatically: you're not doing it at a global level, you're not going in and saying, okay, now enable tracing on all my queries; you're doing it on a per statement basis. That can be a challenge, because by the time you know something's gone wrong, you need to have already enabled tracing, so think carefully about that. The other thing to think carefully about with tracing is that there is a performance cost to enabling it: it is persisting data in Cassandra, it is consuming resources. So use it when you need to diagnose stuff; don't run it in production, and obviously don't release code with tracing enabled, because you're going to be hammering everything. I've seen people add some randomness to turning it on, so that, say, five percent of calls enable tracing.
Now, the tools you have around CQL and the native protocol: you've obviously got the command-line utility, cqlsh, that will let you connect up and do what you like. We've also built DevCenter. This is completely free: go download it and have a play with it. You get all the kinds of things you'd want from a tool like that: syntax highlighting, auto-completion, pretty colors, connecting up to clusters and managing all that stuff. It's using the native drivers as well, and it's pretty new.
We launched it back in, I think, September last year, so it's at version one at the moment, and we're very keen for people to try it out and tell us what they think: anything you don't like, anything you'd like to see changed, or any extra things you want to see in there.
So, finally, there's a bunch of links here that will help you a lot. The first one there is obviously datastax.com; that's where I'd say to go for all your Cassandra stuff.
The second link up there has a bunch of stuff to help you if you've not used Cassandra before: downloadable VMs, getting yourself up and running. We also have quite a comprehensive amount of training, all online and completely free; just go off and start using it and learn. It is pretty geared up to the native driver, Java and CQL, so it's quite good for this, and from our downloads page you'll be able to download everything.
The developer blog has lots of interesting things that I would suggest reading; loads of cool stuff. And our community site for Cassandra is Planet Cassandra, planetcassandra.org. My favorite thing on there is that we have a bunch of use cases and interviews with people: why they've used Cassandra, how they've used Cassandra, stuff like that, and just tons of webinars over there. So that's me. Thank you.
There was a lot to cover and it went quite quick; I could have stood here for a day and talked about this, so it's hard to fit it all in. Oh, yes: I'm here all week. Great, we've got questions coming in already. We're on the fifth floor; come over, speak to me, speak to any of my colleagues, with any questions you have, anything you want to know. I'm here to help.
Cool, so we have some questions. "Are you planning to introduce data update notifications?" No, and arguably that's not really the driver's responsibility. If you're asking me how I would do data update notifications: the only thing you've got in your toolbox at the moment is triggers, and I will definitely say this, triggers are experimental. We put them out with Cassandra really to see how people are going to use them, how they want to use them.
"Can Thrift and CQL be mixed?" I believe they can. I've not tried it myself, and I have seen a couple of community drivers where they are using CQL over the Thrift RPC transport, but I wouldn't advocate it. Really, CQL and the native protocol are made to work together. I wouldn't advocate it personally; if it works for you and it's going to do what you want to do, go ahead, but really, CQL over the native protocol:
That's the way to go, if you want to go down that route. Next question: "Is there a Scala client supporting CQL with an async API that you would recommend?" I'd recommend looking at the client drivers download page on our site; there are some Scala ones in there. Planet Cassandra also links to some Scala libraries. Is there a specific DataStax supported Scala driver we have built? No, but:
I think the demand is there, and it wouldn't actually be a difficult thing to do, because our Java driver, being all asynchronous, all non-blocking I/O, would fit very simply into this. Really, all you'd be doing is:
I think, if you built a Scala driver, making it a little bit more friendly to use; I think you'd ultimately just be delegating out to the Java driver anyway, but you'd make it nicer. So yeah, Scala: I do know lots of people using Scala with the native drivers already. And you know, this is all open source, it's all up on GitHub; we'd be quite happy:
If you guys want to build a Scala driver, build a Scala driver. Another one there: "Usually when using Cassandra, you start from the initial requirements and build the keyspaces, etc., but adding other info usually implies creating other keyspaces. This can result in a performance problem, querying multiple keyspaces sequentially. What is the power of Cassandra with CQL instead of using relational DBs?" That's quite a big question.
Let's start with the first part: usually, when you're using Cassandra, you start with an initial requirement and you build your keyspaces. I'm not quite sure I understand the question there; I mean, if you're talking about data modeling... who asked that question? Yeah, I don't quite know what you're asking.
But if you are asking about the approach to modeling data in CQL, understand: it is a denormalized, query-based approach to modeling your data, and the implication of that is that you need to know how you are actually going to be querying, and you're really optimizing for that. There is no secret special sauce that's going to make this any less challenging than it would otherwise be. You're going to have issues if your requirements change or you get new requirements: how do I adapt my model to meet these new requirements? And really, to read into the question:
What people always ask me, if this is the question you are asking, is: I've got this massive table with a bunch of data in there that I've optimized around how I'm going to be querying it, and I've now got a new requirement to query it a different way. What do I do?
I create the new table, but I've got all this data sitting over here; how do I backfill it into my new table so that my data is in there and can be queried? It's doable, but, not going to lie, I don't think it's a particularly easy thing to do; there's no click-a-button for it. But there are a bunch of approaches. You've got a very low level one:
You can start writing SSTables yourself: you read in the SSTables from the other table, parse them, transform them, write them out in the new table's SSTable format, and then you just stream them back into the cluster. That's one approach. You would also probably want to start writing to both tables in parallel before you do that, so you're just catching up from the point when you took the SSTables. And you've also got, you know, brute force:
You can basically write something that reads stuff in and writes stuff back out; it depends on the type of data. This is the nice thing with the async stuff: from a client point of view it's pretty quick. On my laptop I can pump millions of writes into Cassandra, just purely by pumping them through. But that might not suit you.
There's also a whole ecosystem of ETL tools around this that's evolving a lot. We've partnered with Jaspersoft and Pentaho, building these nice kinds of ETL load-and-transform tools; certainly Kettle is CQL compliant and will let you do this kind of stuff. And the other approach you have for doing:
This population is, certainly with DataStax Enterprise, which is our uber product around Cassandra: it gives you an analytics part, and what that is, essentially, is an HDFS compliant file system sitting on top of your Cassandra column families, which means you can start writing Hive queries, Pig queries, MapReduce jobs. So you could write something on there to read, transform and write to the new table, and this would be a batch thing: it would just go off and do it and get through it.
So, if that was the question, there's a bunch of approaches to it. And I think I've actually finished a bit early, which is... I was worried about overrunning, so yeah. Thank you very much. If there are no other questions, we'll call it a day. As I said, I'm down at the booth; come and speak to me, ask anything you want to know, and you've got my Twitter and email.