Apache Cassandra Cassandra Community Webinar Series, 12 Sep 2011

From YouTube: Cassandra Community Webinar | What and Why NoSQL?

Description

Speakers | Aaron Morton (Apache Cassandra Committer), Robin Schumacher (VP of Products, DataStax)
Date | Wednesday, September 12 @ 11AM PST

In the first of our bi-weekly C*ollege Credit series Aaron Morton, DataStax MVP for Apache Cassandra and Apache Cassandra committer and Robin Schumacher, VP of product management at DataStax, will take a look back at the history of NoSQL databases and provide a foundation of knowledge for people looking to get started with NoSQL, or just wanting to learn more about this growing trend. You will learn how to know that NoSQL is right for your application, and how to pick a NoSQL database. This webinar is 101 level.

A

Referential integrity through the use of foreign keys and primary keys, to make sure that if you've got an order item record and it has the order ID of an order, you can't delete the order record.

A

You've got join semantics for inner and outer joins, left joins and right joins, and all the other types that we have, and you have transactions: ACID transactions with different levels of isolation.

A

So in the second half of the 1990s, these ideas advanced in the sort of commodity databases that people were using: not the mainframe things that IBM was creating, but the commodity databases. Probably the most famous one of those in the open source world has been MySQL. If we look at the history there, it had its first public release in '96, so by then Btrieve was already an established product.

A

So we had this time when, if you were doing small to medium enterprise development, you were using something like Btrieve, maybe using something that didn't support SQL at all. In '99 they had the ISAM engine, and we still don't have transaction support. In 2001 InnoDB becomes the default and we get ACID transactions, and we get foreign key constraints for referential integrity. Jump over to the Microsoft SQL Server world: in '95

A

They had their first big release, which was version six, and they were working with Sybase then. It came with primary keys and foreign keys, and in version 6.5 they enhanced the join semantics, so we got both inner and outer joins. In '98 they added some replication and support for Unicode, and then in 2000 we got the full set of referential integrity controls.

A

So these allow you to do things like say: when you delete a record and there's a foreign key somewhere, delete the record that has the foreign key (a cascading delete), or set null or set default when referential integrity is going to be broken. And finally Postgres, another well-known open source relational database, had an initial release late in the '80s, went through some issues, and came out again as a public open source project

A

In the mid to late '90s. You'll see that in 1999 they added support for transactions, and they use multi-version concurrency control rather than the locking systems that MySQL and MS SQL use, and then in 2001 they added foreign keys and joins.

A

If we view those three platforms, we can see that around 2000 to 2001 they started to look like we think relational databases should look. They had support for transactions, referential integrity and the enhanced join semantics, and there was a time there where they didn't have exactly that. Those features were added to get ANSI SQL compliance.

A

So as we move into the new millennium, people that were creating applications and creating websites started to run into issues of scale, for capacity and throughput, and availability. The tools that they used to solve these problems often lived outside of the database, or were added on to the database platform later on. So we look at the first one, caching. Memcached came out around 2001, and the idea is you put the data that you use frequently outside of the database and pull it from memory.

A

If you miss the cache, then you go back to the database. It adds application complexity: now you have to look in the cache, then go look at the database. We've got operational complexity: now you're managing two servers.

A

There's the problem of thundering herds where, if your cache goes down, you've got a lot more requests going to access the database than we had previously, or if your cache is warming up it's going to get a lot of cache misses. And as they say, there are only two hard problems in computer science: cache invalidation, naming, and off-by-one errors. The problem with caching is that you have to invalidate the cache at some point, and so you have a consistency problem here between the cache and the database.
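The caching approach described above can be sketched roughly as follows. This is a minimal illustration, not anything shown in the webinar: the `CacheAside` class and its plain dictionaries are hypothetical stand-ins for a cache server and a database.

```python
class CacheAside:
    """Check the cache first, fall back to the database on a miss."""

    def __init__(self, database):
        self.database = database   # stand-in for the real database (a dict here)
        self.cache = {}            # stand-in for an in-memory cache server
        self.db_reads = 0          # counts how often we had to hit the database

    def get(self, key):
        if key in self.cache:          # cache hit: no database work
            return self.cache[key]
        self.db_reads += 1             # cache miss: go back to the database
        value = self.database[key]
        self.cache[key] = value
        return value

    def update(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)      # invalidate, or readers see stale data


db = {"user:1": "Aaron"}
store = CacheAside(db)
first = store.get("user:1")    # miss, reads the database
second = store.get("user:1")   # hit, served from memory
store.update("user:1", "Robin")
third = store.get("user:1")    # miss again after invalidation
```

Forgetting the `pop` in `update` is exactly the consistency problem mentioned above: the cache would keep serving the old value while the database holds the new one.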

A

Early in the 2000s, some clever person came up with the idea of sharding, and the idea was: let's take half a million users and put them on one commodity database server, then take the next half a million users and put them on a different commodity database server with all of their data. So it's a horizontal partition through your schema, and we'll keep on going and we'll keep adding servers

A

As we add users. And this meant that if you lost one server, you only lost availability for that half a million users. Again, this added application complexity, because now you had to understand where the user was in the cluster, and operational complexity definitely increased.

A

These are independent servers, so you have to manage schema across multiple servers, which adds to your operational complexity. It makes life harder. You've still got a single point of failure for that shard.

A

So if you're one of those half a million users who gets stopped because your server is down, in the grand scheme of things that might not matter too much, but if you're one of them, you'd probably be upset. And it's hard to grow and keep the system balanced. As you go from, say, four to eight machines, how do you make sure that the capacity and the throughput are evenly distributed? Your new customers might be doing more work or less work than your old ones.
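The sharding scheme described above, where each block of half a million users lives on its own server, might look something like this. It is only a sketch under stated assumptions: the server names and both function names are hypothetical, and routing by user ID range is one simple way to implement the idea.

```python
USERS_PER_SHARD = 500_000  # "half a million users" per server, as in the talk

# Hypothetical server names, for illustration only.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(user_id: int) -> str:
    """Return the server that holds all of this user's data."""
    index = user_id // USERS_PER_SHARD
    if index >= len(SHARDS):
        raise LookupError(f"no shard provisioned yet for user {user_id}")
    return SHARDS[index]


def users_affected_by_outage(shard: str, total_users: int) -> int:
    """How many users lose availability if one shard goes down."""
    index = SHARDS.index(shard)
    first = index * USERS_PER_SHARD
    last = min(first + USERS_PER_SHARD, total_users)
    return max(0, last - first)
```

Every query in the application has to call something like `shard_for` before it can talk to a database, which is the added application complexity the speaker refers to. Nothing here rebalances load if the newest users turn out to be busier than the oldest ones.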

A

So to get some availability, we have replication. We saw that Microsoft SQL had it, and MySQL has it. In typical master-slave replication, replication occurs outside of the transaction: the master does something and then it gets replicated out to the slaves. The failover scenario there can add application complexity; it might be managed by infrastructure, or it might be managed in the application. You've got an unknown asynchronous delay between the master and the slave, and again you've got consistency problems

A

There, like we saw with caching in front of the database. And potentially you're wasting resources on those slave servers: if they're just there as a passive failover, you've got CPU and memory and disk that you could be getting more value from. And finally, the reliability of that slave server is unknown. It might not be there when we fail over.

A

Another thing that someone said is that there's infrastructure you rely on and use, and stuff you can't rely on.

A

So you might decide to do some things with that slave, so you might have a write master and a read slave, again adding to the application complexity. The application has to know: well, this is a write, and writes go to this machine, and reads go to those machines. You're still managing this asynchronous delay in replication, so you've got consistency problems between the two, and you've still got a single point of failure for the writes. Maybe you've got a failover scenario where you can pick a new master.
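The write-master and read-slave arrangement described above can be illustrated with a small sketch. The `MasterSlavePair` class is hypothetical, and replication is modeled as an explicit `replicate` call so that the asynchronous delay is visible.

```python
class MasterSlavePair:
    """Writes go to the master; reads go to a slave that lags behind."""

    def __init__(self):
        self.master = {}
        self.slave = {}
        self.pending = []   # changes the master has not shipped to the slave yet

    def write(self, key, value):
        self.master[key] = value
        self.pending.append((key, value))   # replication happens later

    def read(self, key):
        return self.slave.get(key)          # may be stale or missing

    def replicate(self):
        """Apply queued changes to the slave, in the order they were written."""
        for key, value in self.pending:
            self.slave[key] = value
        self.pending.clear()


pair = MasterSlavePair()
pair.write("balance", 100)
stale = pair.read("balance")    # the slave has not caught up yet
pair.replicate()
fresh = pair.read("balance")    # visible once replication has run
```

Between `write` and `replicate` a reader gets an answer that disagrees with the master, which is the consistency problem the speaker describes. The single `master` dictionary is also still the only place a write can land.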

A

So if we keep going down this path, we have multiple machines: we have caching, multiple machines with shards, and master-slave replication. We also have to deal with our database schema. In the relational database, the schema is a really useful thing, and it helps the query engine understand how to run your complicated SQL query as fast as possible.

A

But it can start to be an issue when you want to make changes. You have to run ALTER TABLE, and ALTER TABLE locks the table and blocks out readers and writers. That's a problem if you're running a 24/7 operation. You've got lots of machines, and you have to apply that to individual servers. And also, as you grow in a fast-changing problem domain, you often tend towards the situation where you have a lot of columns that just by design default to null. So a tight, well-defined schema starts to become a problem.

A

In the second half of that decade, we had some really interesting papers published. We start with the Google Bigtable paper. It came out in 2006 and talked about a distributed, fault-tolerant database platform that they built. One of the interesting things in it was the data model: they'd broken away from the idea of a row-orientated database model that you have in a relational database and used these things

A

They called column families, where the columns for a row are broken up, both as a storage container and as a querying container, and that works well for the sorts of problems that they were trying to solve.

A

The next year, Amazon published their Dynamo paper, again about an internal clustered database they created, and the interesting thing that Dynamo talked about was this idea of eventual consistency as a way for a cluster of machines to return consistent data to client requests, even if some of the machines have been down and missed some of the write activity. They might have different physical data, different data on their disk, but you can still have the cluster give you a consistent result, and you can deal with nodes failing after a while.

A

You can pretty much assume you've always got some sort of failure going on once you get enough machines. And then the next year Facebook published their Cassandra paper, where they talked about how they brought together the ideas of Bigtable and Dynamo and several other ideas and created a clustered database to solve their problem of search for their email inbox. And again, they were dealing with the situation where they had data that didn't fit well into a relational database model.
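The eventual consistency idea mentioned above, where a cluster still returns a consistent answer even though one machine missed a write, can be sketched like this. It is a deliberate simplification: the Dynamo paper reconciles versions with vector clocks, whereas this sketch assumes a plain write timestamp, and `Versioned`, `read_latest` and `repair` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int   # write time; a larger number means a newer write


def read_latest(replicas, key):
    """Ask every replica and return the newest value any of them holds."""
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    return max(found, key=lambda v: v.timestamp).value


def repair(replicas, key):
    """Copy the newest version onto replicas that missed the write."""
    found = [r[key] for r in replicas if key in r]
    if not found:
        return
    newest = max(found, key=lambda v: v.timestamp)
    for r in replicas:
        r[key] = newest


# Three replicas; the third was down and missed the second write.
a = {"email": Versioned("new@example.com", 2)}
b = {"email": Versioned("new@example.com", 2)}
c = {"email": Versioned("old@example.com", 1)}
replicas = [a, b, c]

answer = read_latest(replicas, "email")
repair(replicas, "email")
```

The replicas hold different physical data at the moment of the read, yet the caller gets one answer, and the lagging replica converges once `repair` has run.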

A

They had problems with scale; at the time they had half a billion users. And this became a very popular platform.

A

They released it shortly after onto Google Code, and it became an Apache Incubator project, and then we ended up having web chats like the one today. Because of these three papers that came out and pointed the direction that we could go in, in the second half of that decade we had a lot of new ideas in database platforms.

A

They sort of fall into four broad categories. We have key-value stores. There's a platform called Tokyo Cabinet. Redis is a very popular key-value store. Voldemort from LinkedIn is a key-value platform, and so is Riak. Remember that in key-value systems, just as it says on the tin, there's a key and a value. Maybe those values have some structure to them, like in Redis, or maybe they are just blobs of data. And we have document-orientated stores, Apache CouchDB and MongoDB, and the idea is that the client sends a document of key-value pairs.

A

The server stores that and maybe indexes some of it, so it's a very flexible sort of schema. Graph databases allow you to lay out things as a graph, and when we look at the internet we see graphs everywhere, so they've become quite popular as well. And column family stores: that's the category that Cassandra falls into.

A

There's a database called Apache HBase, which is now part of the Hadoop platform and requires all the infrastructure of Hadoop to run, and the first reference I can find about it is from 2007. It was part of the Apache Lucene project.

A

Google then took their Bigtable infrastructure and made that available to the public via Google App Engine. I've put two dates there because, in typical Google fashion, it was released in 2008 and taken out of preview or beta in 2011. We have Apache Cassandra, which entered the incubator at Apache in 2009 and became a top-level project in 2010, and recently Amazon has made their Dynamo database available to the public through the Amazon Web Services system.

A

So there's a lot of activity in the column family stores, and they're used to hold large amounts of data and provide high transactional throughput in general.

A

The systems that we've seen developed in the last five or six years try to solve some common problems that perhaps weren't so evident, or weren't issues, in the second half of the '90s when MySQL, MS SQL and Postgres started. The common ideas that they revolve around often are that they're a cluster of machines, that replication is built into the core of the system, and that they have either no schema or a very flexible schema.

A

So they can support fast iterations on platforms and support semi-structured or unstructured data. And they're built around the idea that nodes fail, and if you want to create a highly available, always-up application, you have to have a platform that accepts that and can continue to operate in the face of individual server failures.

A

So with that by way of background on how I think we've got to having Cassandra, I'd like to hand over to Robin, and he can talk in a bit more detail about how Cassandra works.

C

Okay, great, thanks very much Aaron. So let me go ahead with what I'm going to be covering for you today. Aaron's done a great job in terms of telling us what NoSQL is and has identified many of the common players. So one of the things that I want to do now is take you through why you should be using a NoSQL database. When is the relational database

C

Okay, and when should you be looking at a NoSQL database? So I'm going to provide a number of different reasons why you want to go to NoSQL, and then I'm going to show you how Apache Cassandra meets those particular use cases. Before I do that, though, I want to draw your attention to the fact that there are some pretty big claims right now being made about NoSQL. For example, here's a quote from an article in InfoWorld that says NoSQL is the stuff of the Internet age.

C

It's a pretty big claim, and really what does that even mean? If we start to dive down into that a little bit, what really characterizes a modern database in this internet age? Well, there's a number of different factors you could look at, but I've listed three for you here on the screen. Number one, big data, and this is not just hype. This is real. I'll cover this more in a few moments, but here you're dealing with being able to scale for how fast data is coming in, the variety of data, and the volume.

C

So that's one of the things that really characterizes, I think, today's data age. Second would be the cloud. A lot of people are looking to the cloud, trying to move their databases to the cloud for a variety of reasons: maybe operational efficiency, maybe cost, something like that. Cloud makes a lot of promises. It promises that you're going to get transparent elasticity, scalability, all these types of things. Is it legit? Is it true?

C

Well, it depends really on the database that you're using in the cloud. Just taking an Oracle database and running a single instance of Oracle is not going to give you these things.

C

You really want to be able to understand what really constitutes a cloud database. And then finally, data everywhere is what I call it, and this is really needing to support data across multiple different locations, physical locations. Those might be different data centers, geographies, different cloud geographies or zones, that type of thing. And so you'll find these characteristics coming up over and over again when the talk is about what's really a modern database engine.

C

What does it look like, and what are the things that it deals with? Keeping those in mind, let's go ahead and turn our attention to a number of different reasons why you want to use NoSQL specifically. The first really just mirrors the very first reason I gave you under internet age databases. First, you have

C

Big data use cases, right? And what characterizes big data is the three Vs: the volume, variety and volume. There's just a little error on the slide; it should be velocity there. So data velocity: it's coming in very, very quickly, and perhaps from different locations. You've got a variety of data, maybe structured, semi-structured, unstructured data. And then the volume of data: oftentimes people hear big data and they think, well, that just naturally means terabytes and petabytes. Not so.

C

We have a number of customers here at DataStax who have very big data use cases, but they don't have petabytes that they're managing, so volume is not the only characteristic that really constitutes a big data use case. Something else is the complexity of data distribution, because you've got all this data that's coming in, maybe at high speed, different types, lots of it, and you've got to distribute it around different locations.

C

That's something else that comes into play here, and so what people are really looking to do is to try to what I call future-proof their apps. I can't tell you how many different people I've talked to here at DataStax, customers of ours, that have tried a particular database, a relational data model, and hit the wall with it. They found that they perhaps couldn't add as many patients as they needed per hour for their new online medical portal.

C

Maybe they couldn't add new online subscribers as fast as they needed to, and they never want that to happen again, so they're looking for something to ensure their success in the future. This is where, again, big data technologies come into play, and one thing that analysts tend to agree on is that you need something other than a relational database. So I have a quote here on the screen from IDC that says really big data technologies, it's not relational, it's really something else, and this is where NoSQL comes into play, and specifically Cassandra.

C

Cassandra is a massively scalable NoSQL database that was architected from the ground up to handle big data workloads. This means that it's going to give you very strong write performance for data velocity. It's going to support the various data types that you need, unstructured, semi-structured, all of that, and it's going to offer you linear scalability for your data volumes and for handling your concurrent users.

C

So, for example, if you are currently seeing around two million transactions per second with two nodes, you're going to get four million with four nodes, and continue to work up from there. It is a true linear-scale

C

Database, and Cassandra is good for both reads and writes, very fast for both. Cassandra used to be known as a very strong write database; in other words, it would accept data very quickly, but it didn't offer very strong read performance. With version 1.0 of Cassandra, that's no longer the case: basically reads and writes are just about on par. Here I've got a quote on the screen from one of our customers, SourceNinja. They were using a typical relational database, needed to scale, and since moving to DataStax Enterprise, which is our production-ready version of Cassandra along with Hadoop and Apache Solr for enterprise search,

C

You can see the quote. They said, you know, we've seen basically a seven hundred percent performance improvement, while at the same time our database grew over five hundred percent, and we've cut costs forty percent. So not bad. So the first reason you want to go to NoSQL: you have a big data use case. When big data is talked about, quite a bit, performance comes up. You know, how fast really will the database run?

C

How will it perform under these types of workloads? And, you know, the benchmarks now for NoSQL databases are starting to come out, and at least in the ones that we've seen, there are some clear indicators that Cassandra is not going to disappoint you where performance is concerned.

C

So at the top of the screen here, I've got a quote from a recent academic benchmark that was presented at the Very Large Data Bases conference this year. They tested a number of different NoSQL contenders, and in the end they said, look, in terms of scalability there's a clear winner: Cassandra achieves the highest throughput for the maximum number of nodes in all experiments, with linearly increasing throughput. So it doesn't matter whether it's in the cloud, and you can see a benchmark there done by Netflix, or whether it's in web apps.

C

You can see an external benchmark there, done against one of Cassandra's NoSQL rivals. Cassandra really doesn't disappoint where performance is concerned, whether it's big data workloads or not. Okay. Secondly, why NoSQL? Why do you want to go to a NoSQL database? Number two: you need continuous availability, and this is different than high availability. What we're talking about here are applications that simply can't go down. So that means whether you're doing maintenance, that means whether there is a particular disaster,

C

Hardware failures, those types of things. No, your application cannot go down, and it may involve one or more locations. So maybe you have multiple data centers that are serving up different clients all over the globe. So you need that continuous availability and you need it everywhere. Well, here again, Cassandra really shines. It is a continuously available NoSQL database. It was, again, architected to overcome the fact that hardware failures can and do occur. So built into Cassandra,

C

What you're going to find is no single point of failure in how it manages data and function. So you get out-of-the-box redundancy of function and data. With its built-in replication, and with its architecture where every node is the same, you're not going to have an issue in terms of availability when it comes to Cassandra.

C

Here we have a quote on the screen from one of our customers, RightScale, and the primary reason they say that they chose Cassandra was because they needed that continuous availability. Their app can't go down, and they have multi-data-center support, so that when they write data there is no worry about whether it's going to be written. So that's point number two, why NoSQL: you need a continuously available database. Number three: you need true location independence.

C

This may sound like a funny term, but what it basically means is you need to be able to read and write your data anywhere. Now, as Aaron was talking about earlier, there's a number of different architectures that can perhaps get you reads anywhere, but few can get you writes as well, and that is a problem that NoSQL databases like Cassandra overcome. So if you need to read and write data anywhere, in multiple locations, you have one logical database that perhaps is made up of many different physical locations.

C

This is where a NoSQL database like Cassandra comes in, and how it handles the various operations. The data itself is going to be eventually synchronized in all locations, and you want to keep data local, perhaps for very fast access. So if I'm here in the United States and I need to do some queries or look up some data, do a search or whatever, I don't want to have to wait for a query to be satisfied by a server that's tens of thousands of miles away. I keep my data very local for fast access. And again,

C

This is where Cassandra can help out, because with Cassandra you get out-of-the-box multi-data-center support. Really, it is the standard in the NoSQL database industry for multi-data-center support. It's multi-directional capable, and it allows you to create clusters that are hybrid in nature. Perhaps you need some data on premise and some data in the cloud, those types of things. You can do that with Cassandra very easily.

C

It also offers something called tunable data consistency, and I'm going to go into this a little bit further in just a moment, but it plays into this whole idea of Cassandra being a location-independent database, and this particular feature helps customers like Netflix to be able to create their systems very, very quickly and service their customers all over the globe.

C

Another reason, point number four, why you want a NoSQL database: you need real-time transactional capabilities. Now, some of you may think, wait a minute, I thought NoSQL didn't do transactions. Let me clarify this a little bit. With transactions, you have an acronym called ACID that is typically applied to relational databases, and if you need true ACID-level compliance, you can get that with a relational database, typically. Now, there are some NoSQL databases that will try to give you ACID and what have you.

C

But by and large NoSQL databases don't look to really support ACID in the sense that it's defined in the relational world. What I mean by that is that the C in ACID does not really apply to NoSQL databases like Cassandra. It refers to referential integrity and foreign key constraints, and with databases like Cassandra you don't have those types of mechanisms. You don't have joins and what have you, where you're going to need to support that C in the relational database

C

ACID definition. In fact, some people don't think you really need ACID-style transactions for many of today's modern applications. I have a quote here from Dan McCreary on the screen, where he asserts ninety percent of the apps that he sees right now don't need ACID transactions. Now, that doesn't mean that NoSQL can't support transactions for you. Indeed, Cassandra can. It excels in real-time NoSQL transactions. It supports the AID portion of the transactional definition, so you're going to get an atomic, isolated and durable transaction.

C

What it does a little differently is, again, dealing with that C part of the ACID definition. So again, you're not going to have foreign key constraints and referential integrity to deal with. Instead, you're dealing with consistency in terms of how data is made consistent across many different machines, many different database clusters that perhaps are in multiple locations. And again, as I mentioned earlier, this is where Cassandra offers something that's kind of unique: tunable data consistency.

C

What this means is you have the flexibility to choose, on a per-operation basis, how consistent a particular operation that you're performing is going to be in the database. So, for example, if you want a write to be propagated across all nodes, across all different locations, and all must respond back before that transaction is complete, you can do that. You can specify that on a single insert, a single update. Perhaps for other writes, though, other inserts and updates, you don't need that type of assurance that it made it to all nodes.

C

Maybe you just want a majority of the nodes to respond, or maybe you just want one node to respond. The whole thing, though, is you get to choose. You're in charge on a per-operation basis. You have a lot of flexibility from a development perspective to make this happen. Okay, so why would
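The three choices described above (all nodes, a majority, or a single node must respond) can be expressed as a small calculation. This is a sketch, not a driver API: the level labels `ONE`, `QUORUM` and `ALL` and the three function names are chosen here for illustration, and the replication factor is an assumed input.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond before the operation completes."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def write_succeeds(level: str, replication_factor: int, responding: int) -> bool:
    """Would a write at this level complete with this many live replicas?"""
    return responding >= required_acks(level, replication_factor)


def reads_see_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when every read set must overlap every write set in one replica."""
    return required_acks(write_level, rf) + required_acks(read_level, rf) > rf
```

With three replicas and one of them down, `write_succeeds("ALL", 3, 2)` is false while `write_succeeds("QUORUM", 3, 2)` is true, which is the trade-off between assurance and availability that the speaker says you get to pick per operation.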

C

You choose NoSQL, something like this? Right, you can handle real-time NoSQL transactions. And I've got a quote here from Wikibon. It says really Cassandra stands at the front of the NoSQL pack when it comes to supporting things like this.

C

Another reason why you want a NoSQL database, and you heard Aaron talk about this a little earlier: you need a more flexible data model. The relational data model is the Codd-Date data model. Very good, serves its purpose, but it is rigid. Perhaps particular applications you're developing need a little bit more flexibility, a little bit more agility. You don't want to have to worry if you have what are called wide rows of data, maybe data that's made up of hundreds or thousands or even tens of thousands of columns.

C

Well again, a database like Cassandra handles this very, very well. Now, the good news if you're coming from the relational world like I was is that you're going to see some things that are familiar in Cassandra. Cassandra uses the Bigtable data model, a row-oriented column structure very similar to a relational table, but it's going to give you more flexibility and agility. So, for example, I could insert a row into what's called a column family in Cassandra, and maybe it contains metadata about myself, and it only contains a couple of columns.

C

Maybe ten columns. Then I need to insert a second row into that same column family about you, and maybe you have a thousand different attributes that I have to track. Well, the good news is I can do that. I can insert that particular data about yourself and keep the one about me, all in the same column family, with no storage impact, no storage overhead issues, no query issues that you need to deal with. It's all handled by Cassandra very, very well. There are other things that are very familiar to you.
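The narrow row and wide row sharing one column family, as described above, can be pictured with nested dictionaries. This is only a mental model under assumptions, not how Cassandra stores data on disk, and the `insert` helper and row keys are hypothetical.

```python
# A column family modeled as: row key -> {column name -> value}.
# Each row stores only the columns it actually has.
people = {}


def insert(column_family, row_key, columns):
    """Add or extend a row; other rows are unaffected."""
    column_family.setdefault(row_key, {}).update(columns)


# A narrow row with a handful of columns.
insert(people, "robin", {"name": "Robin", "role": "VP of Products"})

# A wide row with a thousand attributes, in the same column family.
insert(people, "you", {f"attr_{i}": i for i in range(1000)})

narrow = len(people["robin"])
wide = len(people["you"])
missing = people["robin"].get("attr_7")   # absent column, not a stored NULL
```

Contrast this with the relational case mentioned earlier in the talk, where every row would carry all thousand columns and most of them would default to null.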

C

So, for example, you have primary keys. You have secondary indexes that you can create for faster access. So you have some very nice things that you don't get in the relational world, but at the same time the learning curve is not very high at all in terms of understanding the data model and understanding how things work. One of our customers here, NASA, went to a NoSQL solution from a relational database they had, and were very pleased to see that they were able to do things much more

C

Naturally, they said, than the relational database was forcing them to, and the data model also delivered much faster performance than they were getting from the relational database. So the very last reason why you might want to choose a NoSQL database: you just need a better architecture. Now, we've talked about some of these things already, but I really wanted to bring it out explicitly. So again, earlier you heard Aaron talk about the different types of architectures that you might be using with relational or non-relational and such.

C

You have master-slave, and if you're like me, with a background in databases, I'm sure you've done all of these. So master-slave: there are issues that you have to deal with, most notably a write bottleneck. There's latency, replication issues between the slaves and the master. Then, if the master fails, the failover has to occur. Then, what you don't hear talked about often is failing back to the master when it's brought back online.

C

That's usually very tough to do, but you have those things to take care of. Manual sharding: very difficult, oftentimes done in the application, and it requires quite a bit of elbow grease on the part of the developer to make it happen. Then you have shared-storage model architectures that have availability concerns, since that storage area could be a single point of failure for you. Well, again, a database like Cassandra overcomes these limitations, overcomes these issues, because it's a masterless architecture, a peer-to-peer design where every node is the same.

C

What that means to you is you're not going to have those write bottlenecks. You're not going to have to do manual sharding, because it's automatically taken care of for you. You're not going to have the shared storage issues, because you're using local storage that's replicated, and you have redundancy automatically built into the system. And this equates to less operational overhead, so you're not going to have to have as many people taking care of Cassandra clusters as you might for a sort of manually put together, sharded relational system. And the quote here from Backupify says, you know, when we were looking at different systems to solve it,

C

Cassandra had all this built in, and we thought, yeah, that's how you do it. So again, just to summarize: why might you come to a NoSQL database like Cassandra? Well, you need to handle big data use cases. You need continuous availability. You need a real location-independent database. You want a real-time, modern, transactional database. You need more flexibility and agility in your data model. And you just need a plain better architecture to take care of things.

C

So what are some of the types of use cases, real-world practical use cases, that a database like Cassandra can tackle? I list a number of them here. It's certainly not exhaustive. Real-time big data workloads: Cassandra excels at time series data management. So if you have financial data, you have web clickstream data, you have data that's coming off various devices, oftentimes called data exhaust, it really does a wonderful job with these types of systems. Social media, real-time data analytics, online portals, write-intensive systems, and on and on.

C

These are the types of use cases that Cassandra really excels at. So I have a quick screen pic here of the CIO's Guide to NoSQL, which I thought was a pretty good study. I would definitely recommend you take a look at that. McCreary asks, when you really get down to it, what sort of benefit are you going to derive outside of the technical benefits that I outlined and some of the technical reasons why you'd want to choose

C

NoSQL? He asserts that you can really build systems much faster because you really don't need logical data modeling or any entity-relationship diagram. I would caution there: you certainly need to put thought into your data model, so I might disagree with him a little bit on that. But I certainly agree that you can definitely scale more automatically, you're going to have much lower failure rates with the continuous availability features and characteristics that are built into Cassandra, and it's definitely more extensible.

C

So with that I will turn things back over to Christian and we'll go ahead and wrap up.

B

Thank you very much indeed, Robin and Aaron. We have a couple of questions so far, and I will go ahead and ask those. If you do have a question, please feel free to either use the Q&A within WebEx, and we will monitor that and read them out, or use the hashtag Cassandra QA, and we will monitor that and read them out. So Robin, I'm going to direct the first question to you.

B

It is from, and apologies if I pronounce this incorrectly, Payton Tada, and he asks: how can I use Hadoop and Cassandra together, and where would I go to get it and get information about that?

C

Sure. There's a couple of different routes you can take. If you just want to use the open source version of Cassandra, it does have manual integration with Hadoop that is offered with it, and it would require some manual development effort to go ahead and construct an integration path between Cassandra and Hadoop, but you can certainly do it. What I would recommend, though, is that you go to datastax.com. We have our DataStax Enterprise Edition, which integrates Apache Cassandra, Hadoop, and Apache Solr all together in the same set of software.

C

That allows you to build a database cluster that automatically integrates Cassandra with Hadoop and with Solr, and it's automatically built in. There's nothing special you have to do. You basically just install the software and start up the different nodes in the mode that you wish them to be in. So let's say you wanted a ten-node cluster, and half might be Cassandra nodes and half might be Hadoop nodes. You could certainly do that.

C

You would just start those nodes up in their respective order and their respective quote-unquote personality, whether it's real time with Cassandra or analytics with Hadoop. Then, when you insert data into those various nodes, it's automatically going to be replicated to the different subsystems. So when you insert data on the Hadoop side, it's going to be replicated to Cassandra, and when you insert data into Cassandra, it's automatically going to be replicated to Hadoop.

C

So if you just go to the downloads page on datastax.com, you can download DataStax Enterprise Edition. It's a completely free download there and completely free to use for development purposes. So feel free to download and develop to your heart's content with Enterprise. That's what I would really recommend in terms of getting started with a bundled or integrated Cassandra and Hadoop distribution.

B

Great, thank you very much, Robin. Next question is from Jose Mendez. Aaron, you take this one, and just really short on this one, I think, as we covered it in the webcast: do you think that the future of databases is columnar, as in its time was relational?

A

Yeah, one of the things that I think we have to deal with now is fast-changing problem spaces and different sorts of data. In the 90s, database books talked mostly about slowly changing problems: accounting systems and banking systems. The modern sorts of data that we deal with, information from sensors and health and things, don't fit too well into a tightly structured, row-based relational model.

B

Great, thank you very much. Aaron, you take this one as well. This is a follow-up question from Payton, and he asks: if I'm in production with 10 column families, how do I add more column families without harming my infrastructure?

A

So the story of schema changes in Cassandra is one of it having been a little bit painful, and now being essentially the same as you would have with a single relational database. Coming up in version 1.2, we have support for concurrent online schema changes. Nowadays, you can make online schema changes.

A

But we don't advise them to be concurrent, meaning that you have to take a little bit of care when you're doing them: don't have two people hit the button at the same time. In 1.2, we'll manage all that for you in Cassandra, and you can add column families to your heart's content.

A

Just the same, though, as in a relational database, where you add more tables and your database has to do a bit more work, in Cassandra, if you add lots of column families, say getting into the hundreds, then the Cassandra server has to do a bit more work. So if you're at 10, you can just throw a query at Cassandra and it will get to work and create that column family for you.

B

Thanks, Aaron, and one more for you; we'll stick with you. I have a four-node cluster. What should be my replication factor?

A

Good question. Replication factor depends on two factors, I guess. Number one is your aversion to risk, and number two is your expectation, I think, about the data that you're going to get back. So aversion to risk means something different when you're sending a query, and we talked about this consistency idea.

A

Cassandra generally works as a quorum-based system, and if you have a replication factor of three, then the quorum is two, because the quorum is half the nodes plus one, and at that level we will always give you consistent results back if you use quorums for reads and writes. So I would normally suggest that you start with a replication factor of three.

A

If you do that and use quorum consistency, that will mean that you have these consistent reads and writes, and it really looks the same as when you're on a single relational database server. When you get into things a bit more and you're scaling, you might increase the replication factor, because you need to scale, because you want to spread your data further.

A

Say you've got a very, very high read load, and you can tell Cassandra, hey, I need the thing to be on six nodes instead of three, and now I can scale a bit more for my reads. I would start with three and use the consistency level of quorum.
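
The quorum arithmetic described in this answer can be written down as a minimal sketch. The function names below are illustrative assumptions and are not part of any Cassandra API; the only rules taken from the talk are that a quorum is half the replicas (rounded down) plus one, and that quorum reads combined with quorum writes give consistent results.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while quorum operations still succeed."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 3, 6):
        q = quorum(rf)
        print(f"RF={rf}: quorum={q}, "
              f"consistent={reads_see_latest_write(q, q, rf)}, "
              f"tolerates {tolerated_failures(rf)} replica(s) down")
```

With a replication factor of three the quorum is two, so one replica can be unavailable while quorum reads and writes still succeed. Reading and writing at a single replica each would not guarantee that a read overlaps the latest write.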

B

Thanks a lot, Aaron. Robin, so you don't feel left out, here's one for you, but feel free to, you know, flip it if you need to. This one is from Mike C.: can Cassandra efficiently support data with thousands of columns, with secondary indexes on hundreds of columns, for searching across a wide variety of data attributes? If not, what is the practical upper limit to secondary indexes supported?

C

It's a good question. I'm not aware of any hard limits. I don't know, Aaron, are there any that you know of?

A

I can't think of any hard limits in the code base, no.

C

Yeah. So I think there's really nothing for Mike to worry about. Again, Cassandra really is architected to support thousands or more columns, and secondary indexes at least are there, and outside of perhaps a little storage or something like that, there's really no limit that I'm aware of that he'd have to worry about.

B

Thank you, and last question right now, and then we'll wrap up unless anyone asks anything additional. This one is from Alvin [surname unclear], and again apologies for pronunciation there: is there any way to fine-tune the hash algorithm that Cassandra employs, or is it fixed in terms of how it distributes data?

C

Well, I can start and maybe Aaron can finish. Maybe he's getting at the various modes that you can operate in in terms of the data distribution. Random is the default and recommended, in terms of a randomized method of Cassandra distributing the data across the nodes in the cluster. There is an ordered partitioner that is also available. However, it's typically not recommended for a variety of different reasons, and I'll let Aaron expand on that if he'd like.

A

Yeah. So when it comes to distributing data, there are two features in Cassandra that we use. There's the partitioner, with the random partitioner and the ordered partitioner, and random gives you a good random sampling of data across all the nodes, which allows you to partition capacity and throughput. But then on top of that is the idea of the replication strategy, which determines how data is distributed as well, and nowadays the default replication strategy

A

in Cassandra is called the NetworkTopologyStrategy. That's the one that allows you to say: use a replication factor of three in my East Coast data center, a replication factor of three in my West Coast data center, and in the middle of the country I've got my on-premises cluster, and I want a replication factor of one in there, because I just want the data in there so that my developers can come along and touch that, or it's just for backup purposes.
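
The two ideas in this answer, a hash-based partitioner and a per-data-center replication strategy, can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: the node and data center names, the evenly spaced tokens, and the `2 ** 127` ring size. It also ignores racks, which the real strategy takes into account, so it should be read as a toy model rather than Cassandra's implementation.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 127  # assumed token range for this sketch


def token_for(key: str) -> int:
    """Hash a row key to a ring position, in the spirit of a random partitioner."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(nodes):
    """nodes: list of (name, datacenter). Assigns evenly spaced tokens in list order."""
    step = RING_SIZE // len(nodes)
    return [(i * step, name, dc) for i, (name, dc) in enumerate(nodes)]


def replicas_for(key, ring, rf_per_dc):
    """Walk the ring clockwise from the key's token until each
    data center has the number of replicas requested for it."""
    tokens = [token for token, _, _ in ring]
    start = bisect_left(tokens, token_for(key)) % len(ring)
    needed = dict(rf_per_dc)
    chosen = []
    for offset in range(len(ring)):
        _, name, dc = ring[(start + offset) % len(ring)]
        if needed.get(dc, 0) > 0:
            chosen.append((name, dc))
            needed[dc] -= 1
    return chosen


# Hypothetical topology: two coastal data centers plus one on-premises node.
NODES = [("east1", "east"), ("west1", "west"), ("east2", "east"),
         ("west2", "west"), ("onprem1", "onprem"), ("east3", "east"),
         ("west3", "west"), ("east4", "east"), ("west4", "west")]
RING = build_ring(NODES)
RF = {"east": 3, "west": 3, "onprem": 1}

if __name__ == "__main__":
    for key in ("user:42", "user:43"):
        print(key, [name for name, _ in replicas_for(key, RING, RF)])
```

Because the key is hashed before it is placed on the ring, neighbouring keys land on unrelated nodes, which is what spreads capacity and throughput. The per-data-center counts then decide how many copies each location holds.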

B

Thank you very much, and actually there is one more question, which, Aaron, you take this one: does Cassandra offer snapshots and clones?

C

Wrong Dennis.

B

Gardein, by the way.

A

So there's a utility in Cassandra called nodetool, and it can do snapshots. When we do a snapshot, we flush everything from memory onto disk and then use disk-level hard links in Linux to do a snapshot of the data on disk. People use that to then run their backup systems. It might be that, when you're doing an upgrade, you do a snapshot so that you've got your data from before you upgraded. You upgrade, check the box, everyone's happy, and you delete those snapshots.

A

You can also use those as part of your disaster recovery planning to move data off the node onto something like Amazon S3. Netflix do this using their open source platform called Priam, which they use to manage their Cassandra clusters, and it also lets them take those backups and spin up a cloned cluster.

A

So they use this when they move into a new region. They used this when they launched their European service and they needed to, essentially, take the cluster and stamp out a new data center over in England. So they snapshotted it, it went into S3, and they got it over there. So you can do snapshots for various reasons. Wherever you want to start off your data, you can take those snapshots and use those to bootstrap new clusters or new data centers in the cluster.
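
The hard-link mechanism mentioned in this answer can be sketched in a few lines. This is not how nodetool is implemented; the directory layout, the file name, and the `snapshot` helper are assumptions for illustration. It shows only why a hard-link snapshot is cheap (no data is copied) and why the snapshot survives deletion of the live file.

```python
import os
import tempfile


def snapshot(data_dir: str, name: str) -> str:
    """Hard-link every data file into data_dir/snapshots/<name>; return that path."""
    snap_dir = os.path.join(data_dir, "snapshots", name)
    os.makedirs(snap_dir)
    for entry in os.listdir(data_dir):
        source = os.path.join(data_dir, entry)
        if os.path.isfile(source):  # skips the snapshots directory itself
            os.link(source, os.path.join(snap_dir, entry))
    return snap_dir


with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical data file standing in for data flushed from memory to disk.
    live_file = os.path.join(data_dir, "users-1-Data.db")
    with open(live_file, "wb") as handle:
        handle.write(b"row data flushed from memory")

    snap_dir = snapshot(data_dir, "before-upgrade")
    linked_file = os.path.join(snap_dir, "users-1-Data.db")

    # Both names point at the same bytes on disk; nothing was copied.
    same_file = os.path.samefile(live_file, linked_file)

    # Removing the live name leaves the snapshot's name, and the data, intact.
    os.remove(live_file)
    with open(linked_file, "rb") as handle:
        survived = handle.read()

    print(same_file, survived)
```

A hard link is only a second directory entry for the same file, so the approach relies on data files not being modified in place after they are written, and on the snapshot living on the same filesystem as the data.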

C

One thing that I'll add about that is there is a web-based management tool from DataStax called OpsCenter. You can go to the downloads page and download OpsCenter, and it supports doing visual snapshots. So you can actually point and click your way through creating a snapshot, scheduling a snapshot to run on a repetitive basis, things like that. So that's available for you.

B

Thank you both very much, and I love it when I can answer a question. This one is from Mike C.: will these presentation slides be available for download? Yes, they will, from datastax.com. We will put up the archive of this webcast, so you can go through and hone in on the areas you want to, and then also make the slides available alongside that. We will aim to have those up by this time tomorrow, and we will send out a link to all the registrants and attendees when they're available.

B

That is it for today's webcast. Please join us in two weeks' time, when Billy Bosworth, the CEO of DataStax, is going to talk about transitioning from relational databases to NoSQL. Then, two weeks after that, we welcome back Aaron Morton, who will give a deeper dive into Apache Cassandra as an introduction. And some of these questions that we've had today, Aaron, it would be great actually if we can incorporate those into your introduction presentation.

B

Sure, okay. Thank you, everyone, very much, and we look forward to seeing you and your friends back in a couple of weeks' time.