Apache Cassandra Cassandra Summit 2013, 26 Jun 2013


From YouTube: C* Summit 2013: Can't We All Just Get Along? MariaDB and Cassandra

Description

Speaker: Colin Charles, Chief Evangelist at Monty Program Ab
Slides: http://www.slideshare.net/planetcassandra/5-colin-charles
The Cassandra Storage Engine allows access to data in a Cassandra cluster from MariaDB. Learn what the Cassandra Storage Engine is and how to make use of it, how we implemented it using dynamic columns in MariaDB. Also, we'll look at CQL, data and command mapping, use cases and benchmarks.

A

Alright, so thank you for coming. I'm here to talk about MariaDB and Cassandra interoperability. So if you are using MySQL today, this is a good path for you to also migrate to Cassandra from your existing installations.

A

MariaDB is a branch of MySQL, and we'll go into a little bit about it. I'm Colin Charles, and a little bit about me: I work on MariaDB today. I used to be at MySQL, I used to hack on MySQL code, and then I joined Sun when they acquired us. I've been working with open source for a very long time, from the Fedora Project at Red Hat as well as OpenOffice.org.

A

It's probably important to note that Monty Program, the company that you would see me representing today, is a major sponsor of MariaDB, but MariaDB is governed by a foundation. So it is an open source project with an open source foundation backing it. And the reason why I say that I now work at SkySQL is because SkySQL and Monty Program agreed to merge towards the end of April. So we're going to be a much larger company offering services as well as engineering; Monty Program was a completely engineering-oriented organization.

A

We focused on making a better MySQL, with lots of links to other databases. So today's agenda is pretty simple: I'm going to focus a little bit on what MariaDB is and a little bit on the MariaDB architecture, and you'll understand why I don't say MySQL after a while, because there's one particular feature that we've extended MariaDB with that isn't in MySQL. So for you to then fully migrate off MySQL,

A

you probably have to migrate to MariaDB first, because of this one particular feature that allows us to connect to other things. I'll talk a bit about mapping and use cases. We've run some benchmarks as well on Amazon, and then some conclusions. How many of you here use Amazon for your deployments? EC2? Okay.

A

So, since only one person here has heard of MariaDB before: what is it? MariaDB is a community-developed, feature-enhanced, backward-compatible version of MySQL. It is a one hundred percent drop-in replacement for MySQL. If you are running Linux, like Fedora or Ubuntu or SUSE, lately with many of them, when you just do a yum install or a zypper install and ask for MySQL, you're getting MariaDB by default. So it's the new default, and we have a whole bunch of enhanced features.

A

You don't have to deal with Oracle in terms of getting Oracle's enterprise edition. You can get the thread pool for free. A thread pool is very useful for the many short-running queries that typically happen with your web apps. So you can just open up a few threads to run many queries and get results coming back in the same thread. We have things like table elimination, which is the basis of anchor modeling.

A

If you are a MySQL user, you know that you cannot use subqueries. However, with MariaDB, subqueries are now materialized, and there are DBT-3 benchmarks showing that we can now run the full DBT-3 data set, including subqueries.

A

We've done a huge amount of changes in replication, which still don't match the ease of use that you get out of Cassandra, right? So you still can't do things like multi-master replication yet, but we've made things like group commit in the binary log happen, which means that if you have more than three parallel running queries, instead of calling fsync every time, you call fsync once, in one go, at which point you actually get great performance improvements, because fsync is expensive in Linux.
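
The group commit idea described above can be sketched in a few lines of Python. This is a toy model, not MariaDB's implementation: the `BinaryLog` class, its method names and the file name `binlog.000001` are all made up for illustration. It only shows the difference between syncing after every event and syncing once per group.

```python
import os
import tempfile


class BinaryLog:
    """Toy append-only log that counts its fsync calls (hypothetical class)."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self.fsync_calls = 0

    def _sync(self):
        # Flush Python's buffer, then ask the OS to put the bytes on disk.
        self._file.flush()
        os.fsync(self._file.fileno())
        self.fsync_calls += 1

    def commit_each(self, events):
        # No group commit: one fsync per committed event.
        for event in events:
            self._file.write(event + b"\n")
            self._sync()

    def commit_group(self, events):
        # Group commit: write every pending event, then fsync once.
        for event in events:
            self._file.write(event + b"\n")
        self._sync()

    def close(self):
        self._file.close()


def count_fsyncs(events, grouped):
    """Write the events to a throwaway log and report how many fsyncs it took."""
    with tempfile.TemporaryDirectory() as tmp:
        log = BinaryLog(os.path.join(tmp, "binlog.000001"))
        try:
            if grouped:
                log.commit_group(events)
            else:
                log.commit_each(events)
            return log.fsync_calls
        finally:
            log.close()
```

With four parallel commits, `count_fsyncs(events, grouped=False)` performs four syncs while `count_fsyncs(events, grouped=True)` performs one, which is where the saving comes from when fsync is the expensive step.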

A

We've also been playing a lot with NoSQL links. HandlerSocket is a NoSQL interface to InnoDB, the storage engine, which allows you to do simple create, read, update, delete operations. It completely bypasses the SQL layer, so it just goes direct to the engine, and it's very, very fast. We also integrate with the Sphinx storage engine, so that you can now do full-text search using Sphinx, because MySQL or MariaDB isn't really made for full-text search. We also allow multi-source replication.

A

So if you've come from a MySQL world, you probably have many groups of little masters and slaves, but you maybe want to aggregate all the data from all those masters, because they're running separately. That's what multi-source replication is useful for. And dynamic columns we will talk about when I get to the slide. So this is the MySQL/MariaDB architecture diagram.

A

You have your application sitting all the way at the top. You then connect to it via the myriad of languages that are available, so take your pick: Perl, Python, Java, etc. It then goes to a connection pool. The connection pool will then do authentication. We've extended authentication, so now you can also do authentication against PAM. You can also do authentication against LDAP. You can also do authentication against Active Directory. After you go through the connection pool you hit the SQL interface. Then it parses, and it hits the optimizer.

A

It may already pick stuff up from a cache, but if it doesn't already have it in cache, it goes straight down to the pluggable storage engines, and the pluggable storage engines sit right on top of the file system. Things like MyISAM do not offer transaction support, but it's very good for quick inserts. InnoDB is fully transactionally aware, but we ship something called XtraDB, which is the InnoDB that generally runs at Google and Facebook.

A

So it's the InnoDB that runs at scale. If you're familiar with MySQL, you're probably familiar with the Percona toolset, and XtraDB is a Percona-based tool. But we've always had engines that spoke not only to the local file system, but over the network. NDB is an engine that has always been a network database; it's commonly referred to as MySQL Cluster. FederatedX was commonly embedded inside Cisco routers; that would allow you to have SNMP log data sent across the wire to your MySQL server. We have an engine that now integrates directly with LevelDB.

A

LevelDB is a key-value store. How many of you use the Chrome web browser? Okay, most of you. The version of LevelDB that sits inside Chrome is what implements IndexedDB. So technically you've been running LevelDB without generally even knowing you have this database sitting there. It's part of the HTML5 spec. Each browser implements it differently as well: Firefox uses SQLite. And Cassandra SE, that's the engine I'm here to talk to you about.

A

Instead of hitting the file system, it actually talks directly to a Cassandra cluster, which you are very familiar with, I'm sure. Most of you here use Cassandra. Who here does not use Cassandra? Okay, a bunch of you! Hopefully after this you will use Cassandra.

A

So with the storage engine layer, we've been extending it, and we're looking at many different storage engines, including things like engines to MongoDB and so forth. So you can use the same SQL interface, which allows you to now speak to other databases. But then there's also the replication API. So now there's a replication API in MySQL, off the binary logs, which we call binlogs.

A

So there is now an API, and you can have a Hadoop applier that writes directly to HDFS, which, from what I understand, is also kind of useful to folks. And this is not something we ourselves developed; Oracle has worked on this as well, to make the applier happen. So now you have two ways to connect to different engines: you have the replication API as well as the pluggable storage engine API, both of which are generally quite unique to the MySQL world.

A

The MongoDB world is getting the idea of having a pluggable storage engine as well, but the only other open source database that has something pluggable is Postgres, with a pluggable optimizer. And we're really focusing on becoming a data platform. Yes? Yep.

A

Okay, so actually I tried to explain the whole MySQL architecture. You don't go through any layers. You basically still write SQL to connect to your Cassandra cluster, and there are some use cases for why this may actually be useful, especially in terms of user tracking and so forth. So you don't actually go through all of this.

A

This happens in less than a microsecond, possibly, but I kind of needed to explain what the architecture looked like so that you know how we're connecting to it and that there's no black magic or voodoo happening. Because if I told you all your applications would just write SQL and connect directly to Cassandra, then you may presume I'm lying to you. Yeah, that's the answer. I wanted to show you

A

How we are doing it, as opposed to "hey, magic happens in the background". And we're really focusing on this whole idea of becoming a data platform. So we want to go beyond MySQL, but still remain one hundred percent compatible with it. So we have not only HandlerSocket.

A

We have memcached access directly to InnoDB. So if you're using memcached, which is really, really common if you have a web app nowadays, you might want to make it persistent, and you can save your memcached information inside InnoDB. So there's the Hadoop applier, LevelDB, Cassandra, and this is planned to go on and on. We plan to integrate with other storage engines, so you can continue writing SQL, and your current knowledge of SQL will not go away.

A

So this is the reason why I did not mention MySQL: we have extended MariaDB to include something called dynamic columns. Dynamic columns allow you to store a different set of columns for each and every row in the table. It's arbitrary storage, and it's like a blob: it stores it in a blob, but it comes with lots of handling functions, so you can do dynamic column get, create, add, delete and, most recently, we can also give you the rows in JSON format. And JSON seems to be a relatively good interchange format

A

That's very commonly used for many new systems. Many people like to write JavaScript from the get-go, so now you can get stuff in JSON as well. You can nest dynamic columns as well, and you can also name dynamic columns: you give the column a name. Previously you could not name dynamic columns; these were actually given to you by MariaDB itself. Now, this particular dynamic column feature is not available inside of MySQL, so the connection to Cassandra would not be available via MySQL. You actually have to use MariaDB, but lucky for you,

A

If you're already using MySQL, the upgrade to MariaDB is really easy. You can just do yum install MariaDB-server and it will just replace it in situ. It reads the same data files. It has the same socket and the same port number, so your application doesn't actually change per se. So your upgrade is in situ; you just generally get all the additional benefits and performance fixes that we have. So, for the few people that don't use Cassandra, this is kind of how we mapped it. Column families are exactly like tables.

A

We have the row key to column mappings. We do not support super columns, but from what I gather in the Cassandra world as well, super columns are going away. So it's not really a big deal that we do not support super columns.

A

Here's a quick example. I know it's really, really small, but this is what CQL is, basically, which you should be quite familiar with. I created the keyspace mariadbtest, and then I could also select from CQL, but then you'd realize that you can't do everything with CQL.

A

There are some select operations that do not work. CQL 3 is likely to improve this. But I have a quick question here for the audience: how many are using Cassandra 1.2? How many still use Cassandra 1.1 and before?

A

Okay. So more are still on Cassandra 1.1. So that's good, because this implementation is based against Cassandra 1.1.x. So CQL looks like SQL at first glance. It, however, doesn't do joins. It doesn't do subqueries; however, MySQL doesn't do subqueries either, so you're probably already used to not having subqueries. But MariaDB does, which is why I bring that up. You don't have GROUP BY or ORDER BY inside of CQL. As for CQL 3, I'm going to attend the talk later to learn more about it.

A

It's relatively new, because I think it only got released in February of this year, so there are some changes that we haven't made yet to make it more CQL 3 ready. WHERE clauses need to be represented as index lookups. Our simple goal today for the Cassandra storage engine is to provide a view into Cassandra's data from MariaDB.

A

That means inserts, reads, etc. But we don't want to replace CQL, and we want this to be a good path for you to use and access Cassandra today without having to use CQL yet. But maybe down the line, if you're re-architecting your application or your data model, you can then use this as a stepping stone to migrate as well. Now, getting started is really, really easy. We released MariaDB 10.0.3 yesterday, so this slide used to say 10.0.2 right up until yesterday. So you just download it there.

A

We have binaries available for all forms of Linux; I'm presuming you are going to be testing most of this on some form of UNIX. You need to load the Cassandra plugin; all storage engines are plugins at the end of the day. This is even true for InnoDB and so forth. You can do INSTALL PLUGIN cassandra SONAME 'ha_cassandra.so'. This will install it. You can also load it in my.cnf: under the [mysqld] section, you make sure the plugin-load line

A

Is there. Or, if you install it via your Linux distribution, make sure that you install the MariaDB Cassandra storage engine package, because if you don't install that particular package, when you run SHOW ENGINES it will not be there. Also make sure that the Cassandra storage engine is there. You can do that by doing something like SHOW PLUGINS or SHOW ENGINES, both of which will work just fine.
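
As a rough illustration of the two loading routes just described, here is a small Python sketch. The SQL statement and the `plugin-load` option are the ones named in the talk; the helper names `render_option_file` and `engine_available` are hypothetical, and the row shape passed to `engine_available` (engine name first, support status second) is an assumption about what a client library would hand back from SHOW ENGINES.

```python
import configparser
import io

# Statement quoted in the talk for loading the plugin at runtime.
INSTALL_SQL = "INSTALL PLUGIN cassandra SONAME 'ha_cassandra.so';"


def render_option_file(plugin_library="ha_cassandra.so"):
    """Render a my.cnf fragment that loads the plugin at server start."""
    parser = configparser.ConfigParser()
    parser["mysqld"] = {"plugin-load": plugin_library}
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


def engine_available(show_engines_rows, engine="CASSANDRA"):
    """Check rows shaped like (Engine, Support, ...) for a usable engine."""
    for name, support, *_ in show_engines_rows:
        if name.upper() == engine and support.upper() in ("YES", "DEFAULT"):
            return True
    return False
```

Calling `render_option_file()` yields a `[mysqld]` section containing `plugin-load=ha_cassandra.so`, and `engine_available` can be fed whatever rows your client returns for SHOW ENGINES to confirm the engine is really there before creating tables.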

A

If you don't see it here, it means that you have a problem, and you cannot access Cassandra.

A

Now you can create a SQL table, which is basically a view into a column family. You need to set the global thrift host; you can also set the thrift host per table. We did this on Amazon, so you can also try this on Amazon, and I'll show you how later. I'm not going to do a live demo on Amazon, because the internet's kind of flaky. You need to create a table, specify that the engine is Cassandra, and make sure there is a keyspace.

A

You must have a thrift host, because this uses the Thrift API, and you must have a column family name as well. The cassandra_default_thrift_host variable, which is right up there, allows you to re-point the table to different nodes dynamically, and not change the table DDL, when Cassandra's IP changes. So this is also similar to how you'd connect to FederatedX or SphinxSE or any network-related database from the MySQL world: you always end up specifying addresses or pools of addresses, so to speak.
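
The table definition described above can be sketched as a pair of Python helpers that render the SQL. The engine name, the `keyspace`, `thrift_host` and `column_family` table options and the `cassandra_default_thrift_host` variable come from the talk; the table name `t1`, the column family `cf1`, the column list and the helper names are placeholders for illustration only. Identifiers are not quoted here, so this is only suitable for trusted names.

```python
def sql_quote(value):
    """Quote a string literal for SQL by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def set_default_thrift_host(host):
    """Render the statement that sets the server-wide default Thrift host."""
    return f"SET GLOBAL cassandra_default_thrift_host={sql_quote(host)};"


def create_cassandra_table(table, columns, keyspace, column_family,
                           thrift_host=None):
    """Render CREATE TABLE for a table that is a view into a column family."""
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    options = [f"keyspace={sql_quote(keyspace)}"]
    if thrift_host is not None:
        # Per-table host; when omitted, the global default is relied upon.
        options.append(f"thrift_host={sql_quote(thrift_host)}")
    options.append(f"column_family={sql_quote(column_family)}")
    return (f"CREATE TABLE {table} (\n  {column_sql}\n) ENGINE=CASSANDRA\n  "
            + "\n  ".join(options) + ";")


ddl = create_cassandra_table(
    "t1",
    [("pk", "VARCHAR(36) PRIMARY KEY"),
     ("data1", "VARCHAR(60)"),
     ("data2", "BIGINT")],
    keyspace="mariadbtest",
    column_family="cf1",
    thrift_host="localhost",
)
```

Leaving `thrift_host` out of the table definition is what makes the re-pointing trick work: the DDL stays untouched and only the global variable changes when the cluster's address does.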

A

There are potential issues that you may face. Oh, to be fair, I ran this against 10.0.3 yesterday as well, just to make doubly sure that you would not have a problem. Potentially, if you run on Fedora or RHEL, you have SELinux issues, or if you're on Ubuntu, you have auditd issues, so you may see a permission denied error. You can turn SELinux off or stop auditd.

A

Okay, none of this happened to me, but if it does happen to you, which is occasionally reported, you can turn SELinux or auditd off. With regards to Cassandra 1.2, which was released sometime in February: column families without the COMPACT STORAGE attribute are generally not supported. You'll get an error. So this is pre-CQL 3; you need to use compact storage, and these are referred to as legacy tables currently in the documentation. So my suggestion is to continue making legacy tables with compact storage.

A

We will fix this in future releases, but the time frame from February to June is still pretty short. Also, we noticed that Thrift-based clients can no longer work. It also broke Pig in 1.2, and we're looking forward to the patch. The issue is CASSANDRA-5234, and that should be fixed, probably in the next release as well. So Pig 0.11, I believe, is also broken against Cassandra now. So, for all intents and purposes, I have used Cassandra

A

1.1, DataStax Cassandra 1.1.4, for this demo. And the best part is, now you should be able to access data. You can get data from Cassandra just by doing a SELECT, and you will actually get the data pulled out of Cassandra. You can insert data into Cassandra, and then you can double-check with cqlsh to see that the data has been inserted as well. So you now have a complete window, or view, into Cassandra. All these commands will work in the examples.

A

In the example AMI that I'll show you later, you can download it yourself. Okay, so let's talk a little bit about data mapping. A MariaDB table will represent a Cassandra column family. You can use any table name, and you use column_family equals something to specify the column family, and this is highlighted up there.

A

All tables must have a primary key, and the name and the type must also match Cassandra's row key. Also, columns will map to Cassandra's static columns, as highlighted up there. So don't forget: the name must be the same as in Cassandra, the data types must match, and it can also be a subset of the column family's columns. That also works.
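
The mapping rules just listed (a primary key that matches the row key, matching names and types, a subset being acceptable) can be written down as a small checker. This is a toy restatement of the rules from the talk, not the engine's actual validation code: the function name, the argument shapes and the `ILLUSTRATIVE_COMPATIBLE` type table are assumptions, and the type table in particular is a tiny illustrative subset rather than the engine's real mapping.

```python
# Illustrative only: which SQL types this sketch accepts per Cassandra type.
ILLUSTRATIVE_COMPATIBLE = {
    "text": {"VARCHAR"},
    "int": {"INT"},
    "bigint": {"BIGINT"},
    "double": {"DOUBLE"},
    "timestamp": {"TIMESTAMP"},
}


def check_mapping(table, row_key, static_columns,
                  compatible=ILLUSTRATIVE_COMPATIBLE):
    """Return a list of problems; an empty list means the mapping is accepted.

    table:          {"primary_key": name, "columns": {name: sql_type}}
    row_key:        (name, cassandra_type) of the column family's row key
    static_columns: {name: cassandra_type} declared on the column family
    """
    errors = []
    key_name, key_type = row_key
    primary_key = table.get("primary_key")
    if primary_key is None:
        errors.append("table has no primary key")
    elif primary_key != key_name:
        errors.append(f"primary key {primary_key!r} does not match row key {key_name!r}")

    known = dict(static_columns)
    known[key_name] = key_type
    for name, sql_type in table["columns"].items():
        if name not in known:
            errors.append(f"column {name!r} is not in the column family")
        elif sql_type not in compatible.get(known[name], set()):
            errors.append(f"column {name!r}: {sql_type} does not match {known[name]}")
    return errors
```

A table that declares only some of the column family's static columns passes, which mirrors the "subset" remark, while a renamed key or a mismatched type is reported instead of being silently accepted.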

A

This is the data mapping.

A

We support pretty much everything, including timestamp, and we support microseconds in timestamps. MySQL doesn't support microseconds; we do. So that's probably one additional little feature there. So, what are dynamic columns, going back again? Mainly: why do we have dynamic columns? So that we can access Cassandra's dynamic column families and access ad hoc columns. This is how you use dynamic columns as well inside of MariaDB. You don't have to use them with Cassandra.
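
To make the dynamic column idea concrete, here is a toy Python model of the handling functions mentioned earlier in the talk (create, add, delete, get, JSON output). The function names deliberately echo MariaDB's COLUMN_CREATE, COLUMN_ADD, COLUMN_DELETE, COLUMN_GET and COLUMN_JSON, but this is only a sketch of the concept: it packs values as JSON bytes, which is an assumption made for readability and not MariaDB's on-disk blob format.

```python
import json


def column_create(**columns):
    """Pack named values into one opaque blob (JSON bytes in this toy model)."""
    return json.dumps(columns, sort_keys=True).encode("utf-8")


def _unpack(blob):
    return json.loads(blob.decode("utf-8"))


def column_add(blob, **columns):
    """Return a new blob with the given columns added or overwritten."""
    merged = _unpack(blob)
    merged.update(columns)
    return column_create(**merged)


def column_delete(blob, *names):
    """Return a new blob without the named columns."""
    kept = {k: v for k, v in _unpack(blob).items() if k not in names}
    return column_create(**kept)


def column_get(blob, name):
    """Fetch one column's value, or None when this row does not have it."""
    return _unpack(blob).get(name)


def column_json(blob):
    """Render the whole set of dynamic columns as a JSON string."""
    return json.dumps(_unpack(blob), sort_keys=True)


# Two rows of the same table carrying completely different column sets.
rows = {
    1: column_create(color="blue", size="XL"),
    2: column_create(price=500, warranty="2 years"),
}
```

Each row carries its own set of names, so asking row 2 for `color` simply returns nothing, which is the same shape of data you get from a Cassandra column family where rows have ad hoc columns.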

A

You can use this with your regular app as well, but this is an example of how I use it with Cassandra. So I've apparently only got about five minutes left, so we're going to go a little quicker. All data mapping is safe: Cassandra SE will refuse incorrect

A

Mappings. It'll actually spit errors out at you, and these are common errors that you can get spat out at you. We've mapped most commands. Cassandra has put, get and delete, and then there are SQL commands: SELECT is basically equivalent to a get or a scan, INSERT is basically a put, and an upsert is an update or an insert. "Upsert" is a valid term nowadays, it seems. So with regards to SELECT command mapping: MariaDB has a SQL interpreter, and Cassandra SE will obviously support the lookups.
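
The command mapping above can be illustrated with a toy in-memory column family. Everything here is hypothetical scaffolding (the `ColumnFamily` class and the `sql_*` helpers are not part of any real API); it only shows which key-value operation each SQL verb lands on, and why a second INSERT on the same key behaves like an upsert. The scan is sorted purely to keep the output deterministic.

```python
class ColumnFamily:
    """In-memory stand-in for a column family with put/get/scan/delete."""

    def __init__(self):
        self._rows = {}

    def put(self, key, columns):
        # A put never fails on an existing key: it writes over it.
        self._rows.setdefault(key, {}).update(columns)

    def get(self, key):
        return self._rows.get(key)

    def scan(self):
        return sorted(self._rows.items())

    def delete(self, key):
        self._rows.pop(key, None)


def sql_insert(cf, key, columns):
    """INSERT maps to put, so inserting an existing key acts as an upsert."""
    cf.put(key, columns)


def sql_select(cf, key=None):
    """SELECT with a key maps to get; without one it maps to a scan."""
    if key is not None:
        row = cf.get(key)
        return [] if row is None else [(key, row)]
    return cf.scan()


def sql_delete(cf, key):
    """DELETE maps to delete."""
    cf.delete(key)
```

Inserting the same key twice leaves one row holding the latest values rather than raising a duplicate-key error, which is the behaviour to keep in mind when coming from a purely relational table.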

A

You can now join between Cassandra tables as well as MariaDB tables, and we have something called batched key access joins available. Batched key access will actually make sure the join buffers are accumulated and only the interesting columns and rows are transmitted to the optimizer. Turning on batched key access gives you great performance when the query size goes from one to three. So with regular joins versus batched

A

Key access joins, the batched ones are amazing, so I'd always turn this on, especially if you're using Cassandra, because it's accessing stuff over the network. So with regards to DML: INSERT does overwrite rows, and UPDATE is read-then-write. So let's just make it clear that Cassandra SE doesn't make CQL into SQL: it's SQL-like, but it's not SQL per se. So, a few use cases with Cassandra SE: log collection and analysis is amazing.
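
Why batched key access helps over a network can be shown with a toy comparison. This is a conceptual sketch only, not MariaDB's join code: `RemoteTable`, its `multi_get` method and the `buffer_size` parameter are invented for illustration. In MariaDB itself the behaviour is governed by server settings such as `join_cache_level` and `optimizer_switch`; check the documentation for the exact values to use.

```python
class RemoteTable:
    """Stands in for a table behind the network and counts round trips."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self._rows.get(key)

    def multi_get(self, keys):
        # One request carrying many keys counts as a single round trip.
        self.round_trips += 1
        return {k: self._rows[k] for k in keys if k in self._rows}


def nested_loop_join(local_rows, remote):
    """Regular join: one remote lookup per local row."""
    out = []
    for key, local_value in local_rows:
        remote_value = remote.get(key)
        if remote_value is not None:
            out.append((key, local_value, remote_value))
    return out


def batched_key_access_join(local_rows, remote, buffer_size=3):
    """Batched join: fill a buffer of keys, then fetch them in one request."""
    out = []
    for start in range(0, len(local_rows), buffer_size):
        batch = local_rows[start:start + buffer_size]
        found = remote.multi_get([key for key, _ in batch])
        for key, local_value in batch:
            if key in found:
                out.append((key, local_value, found[key]))
    return out
```

For four local rows and a buffer of three keys, the regular join makes four round trips and the batched one makes two, while both return the same joined rows.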

A

It's definitely used inside of Cassandra. In the old days, maybe in 2007 or so, you'd say: hey, grab log data and keep this inside of MyISAM or the Archive storage engine. Cassandra is better at it. You want to collect web page hits. You want to collect data from sensors; you saw data from sensors in the previous talk. Cassandra is awesome for this. So collecting the time-series data in Cassandra and then querying it using MariaDB is fine.

A

So if you are getting time-series data, you're doing user activity tracking: this web page was last viewed by foo, or the last known position of this user inside this web page was this. So if you're an e-commerce shop, knowing the last known position of the user and keeping data on the user is very important for you. You want to keep all this data in Cassandra; it's not so good being kept inside a relational database like MySQL or MariaDB. "You are user five out of 1,000,000."

A

The old adage was people who do SELECT COUNT inside of MySQL inside forum software, and that makes forum software notoriously slow. So nobody does that. This is the kind of stuff you can do with Cassandra. And if you're coming from MariaDB and you want a table that is auto-replicated (MySQL and MariaDB do not do auto-replication), you want fault tolerance, and you want something that's really, really fast, get Cassandra with a Cassandra SE table. The other thing that's pretty unique is you can get a globally replicated table.

A

Cassandra allows this, and you can't do this even with something like Galera Cluster, which is another product line that we have inside of MariaDB. Another unique use case is that we have a CONNECT storage engine, so you can now connect and join data between an Oracle database, via ODBC, and your Cassandra cluster as well, and use InnoDB as an intermediary and store data inside of MariaDB.

A

You want to turn on something called engine condition pushdown, which basically avoids sending non-matching rows from the storage engine to the SQL layer, and it does avoid round trips over the network. Basically, the filtering is done on the remote data node as well. This is kind of useful, especially since I heard people wanting to migrate out of SQL Server. SQL Server supports ODBC as well, and this is a great use for being middleware software that helps you migrate. Now, non-use cases.

A

Things like huge sift-through-data joins: Pig is better. You want to do a bulk data transfer: Sqoop is better. You want a replacement for InnoDB: Cassandra SE is not quite your replacement for InnoDB. That would mean rethinking the data model entirely to make it happen for you. Here's a quick, tiny benchmark. We did this on Amazon EC2 with m1.large nodes. The bottom is InnoDB, in blue. You can see that the moment we even add two nodes for Cassandra, you start seeing amazingly good throughput, which is the kind of black one.

A

There's amazingly good throughput for data with next to no tuning when you start having a Cassandra cluster in the back end. Cassandra is really, really fast. Same setup as before, with some tuning done with InnoDB, and again with one Cassandra node: the red beats the blue in terms of transactions, even when hit with many client threads. So Cassandra is fast. With Cassandra SE, the interface is really, really fast.

A

Oh, this is for single-row inserts. Both are for single-row inserts using sysbench. So you can basically pipe data from MariaDB into Cassandra. It's really, really easy to set up and use. We want to see if you want other features like table discovery: we can actually have assisted table discovery of a Cassandra cluster, like we do for FederatedX and CONNECT. So we're definitely looking into doing that.

A

If that's something that could be useful for you, if you want to automatically access a Cassandra cluster, we're happy to actually start looking at that; secondary indexes, possibly, as well. There's a huge chunk of resources, and these slides will be online. Thank you for listening. We won't go through the internals, but if you want to actually try this on your machine, you can download this VirtualBox image using Vagrant.

A

You can play with this on your localhost, including on a Mac or Windows machine, but if you really want to do this properly, do this on Amazon and use the AMI that comes from DataStax. It's really, really easy to get started.

A

Alright, so thank you. If you have any questions, you can ask them. Oh, we don't have time for Q&A, so I will be right at the back. You can ask me questions or you can email me. Thank you for listening.

A