From YouTube: Apache Cassandra and Python
Description
Jeremiah Jordan
Using Apache Cassandra from Python is easy to do. This talk will cover setting up and using a local development instance of Cassandra from Python. It will cover using the low-level Thrift interface, as well as using the higher-level pycassa library.
What am I not going to talk about? I'm not going to talk about setting up and maintaining a production instance of Apache Cassandra. That's a whole other talk, so I'm just going to talk about using it locally from Python. If you want, you can get a copy of the slides at this address, or PyCon will have them; they'll be linked from the PyCon website later.
All right, so what is Apache Cassandra? Here's the description from the Cassandra wiki on their website: Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It brings together the distributed systems technology from Dynamo and the data model from Google's BigTable. So it's eventually consistent like Dynamo, and it has a column-family-based data model, with keys, columns, and values, like BigTable.
A
Alright,
so
many
things
its
column
based
key
value
store
basically
looks
like
a
multi-level
dictionary
and
you
know
dynamo
from
Amazon
BigTable
from
Google
and
then
the
other
nice
thing
about
its
schema
optional.
So
if
you
want
to
tell
Cassandra
how
your
data
Stipe
tanned,
how
your
data
stored
in
the
tables
you
can
and
there's
some
extra
stuff
you
can
get
out
of
that.
But
if
you
just
want
everything
to
be
bytes
and
you
can
insert
whatever
you
want,
you
can
do
that
too.
So here's the basic structure of how data is laid out in Cassandra. You have a keyspace at the highest level, which is kind of like a schema in a normal database. Inside the keyspace you have column families, and inside column families you have rows, which have keys, and then column names and values. So here's an example set up with some more realistic names for things: you have your application data keyspace with a column family, UserInfo, and it's got keys in it.
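As a rough mental model, here is that layout sketched as nested Python dictionaries (the keyspace, column family, row keys, and columns are all made up for the illustration):

```python
# Illustration only: Cassandra's layout viewed as nested dictionaries.
app_data = {                                # keyspace (like a schema)
    'UserInfo': {                           # column family
        'jsmith': {                         # row key
            'email': 'jsmith@example.com',  # column name -> column value
            'state': 'IL',
        },
        'jdoe': {
            'email': 'jdoe@example.com',
        },
    },
}
```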
The keys are not sorted; they're in a random order, for the most part, and that's the recommended way of using Cassandra. Basically, the ordering of the keys is how Cassandra decides to store data across a cluster of machines. It's a distributed system, and the easiest way to do that is to have a random ordering of keys, so that stuff gets evenly distributed across your cluster.
A
If
you
use
one
of
the
ordered
methods
of
storing
keys,
then
you
have
to
be
careful
to
put
putting
data
hotspots
on
your
cluster
because
the
the
data
is
put
in
an
order.
So
if
everything
is
grouped
together,
you're
gonna
have
one
server,
that's
getting
hit
by
every
single
one
of
those
requests
for
the
data.
A
Alright
and
then
your
column
names,
column
names
are
sorted
and
column
names.
Art
can
be
typed
so
like
in
the
first
column,
the
first
column
family.
Here,
the
user
info
column,
family,
the
column,
names
are
strings
and
so
they're
sorted
in
string,
sort
order
and
the
second
column
family,
the
column
names
are
time-based
unique
identifiers,
and
so
those
are
sorted
in
time
order.
So time one, time two, time three are going to be sorted in time order, and we'll get into how that's useful later. Then every column has a value associated with it as well, and you can also tell Cassandra, or not tell Cassandra, the types for the values; we'll talk a little more about that. If you type the values, then you can use some of the extra indexing features Cassandra has. Every column-name/value pair also has a timestamp associated with it, and that timestamp is used to do conflict resolution. So if multiple people write to a given key with the same column name, the one with the higher timestamp wins. The timestamp is client-provided, so when you're using Cassandra, if you have multiple clients writing to the system, make sure their clocks are in sync if they have any chance of writing to the same spot.
A
Really
it's
the
at
the
lowest
level.
It's
an
ordered
dictionary.
So
your
your
columns,
you
can
go
through
them
in
sorted
order,
all
right.
So
now
we
know
a
little
bit
about
it.
Where
do
you
get
it?
You
get
it
from
Cassandra,
Apache,
org
or
if
you
want
there's
a
company
out
there
called
data
stacks
that
provides
Debian
and
Red
Hat
packaging
for
Cassandra,
so
once
you've
downloaded
it
extracted
installed
it
whatever.
For a development instance, you're going to want to change the Cassandra configuration files somewhat, away from the default locations. So in the YAML file you're going to change where the data is stored and change where the logs are stored. The other thing you're going to want to change on a development instance is that you probably don't want it using all your RAM; by default it's going to take up half the RAM on the system. You probably don't want that for a dev instance you're running unit tests on. All right.
So the first thing you need to do is connect and set up some keyspaces and column families for your code to stick stuff into. There's a command-line interface tool that ships with the distribution. So you start up the command-line tool, connect to it, and create your keyspace. When you create a keyspace, you also have something called a placement strategy, which basically says how Cassandra is going to store that data across the cluster, so where it's going to put each key.
A
Is
it's
going
to
take
your
ring
of
machines
and
chop
them
up
into
equal
space
chunks
and
stick
the
data
across
those
different
locations?
There's
other
different
things.
You
can
do.
There's
Network
aware
strategies
that
can
know
about
data
centers
and
we'll
make
sure
data
goes
into.
You
know.
One
copy
goes
into
datacenter
1
and
2
and
a
datacenter
2.
However,
you
want
to
set
it
up
and
that's
where
the
strategy
options
come
in
there
we're
for
the
simple
strategy.
A
Basically,
you
just
taught
how
many
machines
you
want
a
piece
of
data
on
for
the
more
complex
strategies
you
can
tell
it.
One
piece
of
data
over
here
over
there
3
in
the
third
place.
However,
you
want
to
set
it
up.
So
then,
after
you
create
your
key
space,
you
create
your
column.
Families
inside
that
key
space.
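The talk does this with the cassandra-cli tool that ships with the distribution; as a rough Python equivalent, here is a sketch of the same schema created with pycassa's SystemManager (the keyspace name, column family names, and replication factor are assumptions for the example):

```python
from pycassa.system_manager import (SystemManager, SIMPLE_STRATEGY,
                                    UTF8_TYPE, TIME_UUID_TYPE)

sys_mgr = SystemManager('localhost:9160')

# Simple placement strategy: just say how many replicas each row gets.
sys_mgr.create_keyspace('AppData', SIMPLE_STRATEGY,
                        {'replication_factor': '1'})

# Column names in UserInfo are UTF-8 strings, sorted in string order.
sys_mgr.create_column_family('AppData', 'UserInfo',
                             comparator_type=UTF8_TYPE)

# Column names in UserActivity are time-based UUIDs, sorted in time order.
sys_mgr.create_column_family('AppData', 'UserActivity',
                             comparator_type=TIME_UUID_TYPE)
sys_mgr.close()
```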
Like I said, when you create a column family, you have to tell Cassandra what kind of column names you're going to stick in there. If you just want to write arbitrary columns, there's a bytes type, so you can stick whatever you want in as a column name. Or you can tell it the columns are strings, and then they'll be sorted in string order; you can tell it they're UTF-8 strings and it'll take care of that. Or you can do things like make it a time-based UUID, so that you can sort things in time order.
A
You
can
make
stuff
make
them.
Integers
floats
whatever
you
want.
Basically,
a
column
name
is
really
just
another
spot.
You
can
stick
data,
unlike
most
databases.
So
then,
once
you've
got
your
schema
set
up,
you
want
to
connect
to
the
system,
so
you're
gonna
go.
You
want
to
client
Union
there's
the
Cassandra
wiki
has
links
to
most
of
the
up.
To
date,
clients
for
connecting
to
it
in
a
variety
of
programming
languages,
the
ways
to
connect
to
it
from
Python,
so
Cassandra
is
built
with
thrift
as
its
main
interface.
Thrift is another Apache project that basically gives you remote procedure call interfaces in a variety of languages. There's a Thrift compiler: you give it some IDL, and it spits out code to talk to a server that implements a Thrift interface. Thrift has code generators for probably 20 different languages, at least. But you don't really want to use the Thrift interface directly; it's not a good idea, so you want to get a native client. For a native client, pycassa is the one I'm going to talk about today.
A
There's
also
telophase
for
your
doing.
If
you
have
a
twisted
app,
you
don't
want
to
use
telophase
and
then
there's
a
new
client
called
Cassandra
db-api
2,
which
basically
implements
the
python
db-api
2o.
On
top
of
there's
a
new
CQ
l,
query
language,
that's
being
developed
as
a
new
interface
for
talking
to
cassandra
and
that
db-api
compliant
interface
uses
the
the
sequel,
CQ
l
interface,
the
CQ
l
interface
doesn't
have
all
of
the
functionality
of
the
thrift
interface
yet,
but
it's
getting
there,
but
that
exists
as
well.
A
If
you
have
something
that's
already
using
a
db-api
2o
interface.
So
this
is
why
you
don't
want
to
use
thrift.
Basically,
the
the
IDL
generated
code
has
a
whole
bunch
of
extra
objects
and
stuff
like
that
in
it.
So
you
get
a
lot
of
code
generated.
You
want
to
use
Picasa,
that's
doing
the
same
thing
connecting
and
inserting
a
value
see
it's.
You
know,
half
a
third
of
the
lines
of
code,
so
Picasa
has
very
good
documentation
up
on
github
and
then
there's
also
an
example.
There's also an example application implemented called Twissandra, which is basically a Twitter clone using Django and pycassa. It's a good thing to go look at. So let's go through a little bit of using pycassa. Connecting is very simple: you create a ConnectionPool object, you tell it what keyspace you're going to connect to, and you give it a list of the servers that are in your cluster, and pycassa will deal with it if there are any errors or anything like that.
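A minimal connection sketch (the keyspace name and server address are assumptions carried over from the earlier example):

```python
import pycassa

# One pool per keyspace; give it the list of servers in your cluster.
pool = pycassa.ConnectionPool('AppData', server_list=['localhost:9160'])
```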
A
Cassandra
is
actually
very
good
about
scaling
up
pretty
linearly
when
you
add
more
nodes,
if
you
add
more
processes
to
go
across
those
different
nodes,
Netflix
actually
published
a
pretty
good
benchmark,
very
extensive
benchmark
where
they
had
millions
of
clients
talking
to
hundreds
of
Cassandra
nodes
across
multiple
AWS
regions,
and
there
was
a
very
linear
scale
up
in
their
benchmark
is
I
was
surprised
when
I
saw
it.
It
was
pretty
impressive.
A
If
you
would
go
check
out
Netflix's
tech
blog,
they
have
a
lot
of
articles
about
them
using
Cassandra
they're
using
it
from
Java,
but
it's
a
good
at
least
what
you
can
do
with
Cassandra.
It's
pretty
interesting.
So
now,
once
you've
created
your
connection,
pool
you're
going
to
create
a
column
family
object,
you
know
tell
it
what
connection
pool
to
use
what
Collin
family
you
want
it
to
talk
to.
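A sketch of that step, continuing from the connection above (the column family, row key, and columns are made up for the example):

```python
# Point a ColumnFamily at the pool, then insert a row of columns.
user_info = pycassa.ColumnFamily(pool, 'UserInfo')
user_info.insert('jsmith', {'email': 'jsmith@example.com', 'state': 'IL'})
```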
Reads are also very simple. You say get and give it a key, and you can optionally pass in a list of columns; there are actually some other options to get that I'll get into later, for fancier things you can do. Delete is the same thing: give it a key and, optionally, a list of columns. All right.
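Reads and deletes look like this (same hypothetical row as above):

```python
row = user_info.get('jsmith')                       # whole row as a dict
email = user_info.get('jsmith', columns=['email'])  # just some columns
user_info.remove('jsmith', columns=['state'])       # delete one column
user_info.remove('jsmith')                          # delete the whole row
```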
And then there's the batch interface, which is probably what most people are going to want to use a lot. You do a batch insert, and basically you just give it a multi-level dictionary, a key to another dictionary of columns and values, and it'll insert all of that in a batch. Then there's also a streaming interface for batching. If you create a batch object from the column family, you can optionally give it a queue size; if you specify a queue size, then every so many function calls, in this case every 10, it'll batch those together and send them off to the server. Or you can not specify one, and it only sends when you call send. Once you've created this batch object, you just do your inserts and your removes just like on the regular object, and then either when the queue size is hit or when you call send, it's going to send all those off. You can actually batch up inserts and removes in the same batch and it works fine.
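A sketch of both batching styles described above, with made-up keys and columns:

```python
# One-shot batch: a dict of row keys to column dicts.
user_info.batch_insert({
    'jsmith': {'email': 'jsmith@example.com'},
    'jdoe':   {'email': 'jdoe@example.com'},
})

# Streaming batch: flushes automatically every queue_size operations.
b = user_info.batch(queue_size=10)
b.insert('alice', {'email': 'alice@example.com'})
b.remove('jdoe', ['state'])   # inserts and removes can be mixed
b.send()                      # flush anything still queued
```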
So this is good when you've got a stream of data coming in: it's always more efficient to do things in batches, but you can treat it just like you're doing one at a time. Your code doesn't have to know that internally it batches stuff up. The other thing you can do with batches is do batches across multiple column families, and this is nice because, basically, your insert is going to succeed or fail atomically as an operation across multiple column families.
So it's nice for doing things like inserting into one column family and also into a second column family, maybe as an index or a denormalized query, or some other way of doing it. To do that batch across multiple column families, you create a Mutator object, and then for every operation you specify the column family object for the column family you want that batch operation to happen on, and it basically works the same way.
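A sketch of the cross-column-family batch (the second column family, UsersByState, is a hypothetical denormalized index; user_info and pool come from the earlier snippets):

```python
from pycassa.batch import Mutator

users_by_state = pycassa.ColumnFamily(pool, 'UsersByState')

m = Mutator(pool)
m.insert(user_info, 'jsmith', {'state': 'IL'})
m.insert(users_by_state, 'IL', {'jsmith': ''})  # denormalized index row
m.send()                                        # sent as one batch
```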
A
Like
I
said,
columns
have
types
so
and
they're
stored
in
a
sorted
order,
so
you
can
do
what's
called
column
slicing,
and
so,
when
you
say
get
you
can
specify
start
and
finish
values
for
doing
a
slice,
and
it
will
return
you
all
of
the
columns
which
have
a
column
name.
That's
you
know
sorts
between
those
two
values
and
which
is
so.
A
You
know
which
you
can
do
for
addresses,
but
which
is
also
nice
for,
if
you
do
things
like,
if
you
do
the
time
you
you
IDs,
you
can
say
create
a
start
time
say
ten
minutes
ago
and
then
ask
for
all
of
the
things
that
have
been
inserted
in
the
last
ten
minutes.
You
know
all
of
the
activity
in
the
last
ten
minutes
for
something.
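A slicing sketch, assuming the UserActivity column family with a TimeUUID comparator set up earlier; pycassa accepts datetime bounds for TimeUUID slices:

```python
from datetime import datetime, timedelta

user_activity = pycassa.ColumnFamily(pool, 'UserActivity')
start = datetime.utcnow() - timedelta(minutes=10)

# All columns whose TimeUUID name falls in the last ten minutes.
recent = user_activity.get('jsmith', column_start=start,
                           column_finish=datetime.utcnow())
```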
A
You
can
actually
just
stick
an
integer
into
that
dictionary
and
Picasso
will
go.
Oh,
he
told
me
that
age
is
an
integer,
so
I'm
gonna,
you
know,
convert
32
into
a
byte
stream
and
insert
those
bytes
in
to
Cassandra,
or
he
told
me
height,
is
a
float
so
I'm
gonna
take
that
float
and
convert
it
to
an
I,
Triple
E,
float
representation
and
store
that
into
Cassandra
and
then
it'll
do
the
same
thing.
When
you
do
the
get
back
out.
So to set these up, basically, you create these objects using the special types from the pycassa types library, to say: the key is going to be a UTF-8 string; in this case, the email address is ASCII, age is an integer, height is a float, joined is a date type. Then you create a ColumnFamilyMap object, just like a ColumnFamily object, but you also pass in that object you created that says what all the column names map to in terms of their types.
Then, to write something with this interface, you instantiate the user object and just fill in all the attributes on it: the key is john, the email is john at gmail, age 32, height 6.1, they joined on datetime now. And then, when you call insert, pycassa is going to take all those things and convert them.
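A sketch of that typed-object setup, following the example from the slides (the class, the 'Users' column family name, and the values are illustrative; pool comes from the earlier snippets):

```python
from datetime import datetime
from pycassa.types import (UTF8Type, AsciiType, IntegerType, FloatType,
                           DateType)
from pycassa.columnfamilymap import ColumnFamilyMap

class User(object):
    key = UTF8Type()
    email = AsciiType()
    age = IntegerType()
    height = FloatType()
    joined = DateType()

users = ColumnFamilyMap(User, pool, 'Users')

john = User()
john.key = 'john'
john.email = 'john@gmail.com'
john.age = 32
john.height = 6.1
john.joined = datetime.utcnow()
users.insert(john)   # each attribute is packed to its declared type
```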
All right, and so then the next thing. Like I said at the beginning, all column-value pairs have timestamps associated with them. You can get at those timestamps by saying include_timestamp equals true on get. Then, on inserting, you can provide your own timestamp instead of pycassa just using now, if you want to specify what timestamp something gets stored with; you just add an extra parameter to insert to do that.
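A sketch of both timestamp features (column names and values are made up; pycassa timestamps are microseconds since the epoch):

```python
import time

# Returns {column_name: (value, timestamp)} instead of {column_name: value}.
cols = user_info.get('jsmith', include_timestamp=True)

# Supply your own timestamp instead of letting pycassa use "now".
user_info.insert('jsmith', {'email': 'new@example.com'},
                 timestamp=int(time.time() * 1e6))
```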
And then there's also something called consistency level in Cassandra.
So when you have a multi-machine cluster, the consistency level is how many machines you want Cassandra to check the data on before it returns you an answer. So if you're storing data across three machines, when you insert, you can say: I want you to return success to me when one of those machines has said it got the value. Or you can say: I want a quorum, which means a majority of those machines have gotten the value before success is returned to me. Basically, you can pick how consistent you want to be.
If you just care that it's fast, you can use ONE, and then you'll just get whatever value that machine has on it. So you can pick how consistent you want your data to be and how fault-tolerant you want your data to be.
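A sketch of setting the consistency level per call (it can also be set as a default on the ColumnFamily object; the row and columns are the same made-up ones as before):

```python
from pycassa import ConsistencyLevel

# QUORUM: wait for a majority of replicas to acknowledge the write.
user_info.insert('jsmith', {'state': 'IL'},
                 write_consistency_level=ConsistencyLevel.QUORUM)

# ONE: return as soon as a single replica answers the read.
row = user_info.get('jsmith',
                    read_consistency_level=ConsistencyLevel.ONE)
```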
All right, so, indexing. Cassandra has some native indexing built into it, and you can also roll your own indexing.
You have to have told Cassandra what type that column is going to be, and then basically it's going to build a column family in the background that's keyed by the column values, so that you can search on that index and it'll give you back the rows that go with it. You can also do filtering when you query.
You always have to have at least one equality operation, but after it's matched that equality, you can do things like greater-than, less-than, and equal-to on other columns in your query, and I'll show that real quick. But this isn't recommended for really high-cardinality values. Cassandra has a maximum of 2 billion columns in a row, so I mean really high cardinality.
A
It's
not
going
to
be
very
performant
using
the
the
native
indices
and
then
the
native
indices.
Also
slow
down
writes
a
little
bit
because
the
server
always
has
to
do
a
read
before
right
before
it
can
insert
your
data
so
to
make
sure
it
doesn't
have
to
update
an
older
value
so
to
use
it
to
add
an
index
I'm
just
going
to
go
to
in
the
command
line.
Do
an
update,
column,
family
I'm,
going
to
say,
update
it,
so
that
state
is
a
utf-8
and
it
has
an
index
on
it.
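The talk does this with an update column family command in the CLI; a rough Python equivalent is SystemManager.create_index (the keyspace and column family names are the assumed ones from the earlier snippets):

```python
from pycassa.system_manager import SystemManager, UTF8_TYPE, KEYS_INDEX

sys_mgr = SystemManager('localhost:9160')
# Declare 'state' as a UTF-8 value and put a KEYS index on it.
sys_mgr.create_index('AppData', 'Users', 'state', UTF8_TYPE,
                     index_type=KEYS_INDEX)
sys_mgr.close()
```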
The only index type right now is KEYS; there are other index types they're planning to add, but that's your only choice right now. And so, basically, once you've added that index, in pycassa you're going to create an index expression. So here I'm going to search for everyone who lives in Illinois and whose age is greater than 20.
Once I've created those two index expressions, I create an index clause out of them, and you put them in the clause in the order you want them checked. Then you can use get_indexed_slices, passing in that clause, and it's going to return you an iterator that pages through the values returned from the database to give you back all of the indexed values. It will actually default to paging batches of data back from the server; I think the default page size is a thousand items, and you can specify what you want the paging size to be.
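A sketch of that index query (the column names match the earlier hypothetical Users schema):

```python
import pycassa
from pycassa.index import create_index_expression, create_index_clause, GT

users_cf = pycassa.ColumnFamily(pool, 'Users')

state_expr = create_index_expression('state', 'IL')   # equality (required)
age_expr = create_index_expression('age', 20, GT)     # age > 20
clause = create_index_clause([state_expr, age_expr], count=100)

# Yields (key, columns) pairs, paged back from the server.
matches = list(users_cf.get_indexed_slices(clause))
```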
A
You
also
specify
a
maximum
count
of
things
to
be
returned.
The
other
thing
you
you
can
roll
your
own
indices.
Basically,
you
know,
like
I,
said
before
use
the
batching
interfaces
to
write
data
to
two
different
column,
families,
so
that
you
don't
have
to
do
the
read
before
right.
If
you
know
the
things
new,
the
other
thing
you
can
do,
if
you
do
it
yourself,
do
you
normalize
your
queries
so
that
when
you
read
something
from
the
index
row
that
has
your
data
in
it?
A
A
A
Yes, so, can the key be a composite value? Yes. Cassandra does have, and I haven't talked about it, a more advanced feature that's new in Cassandra 1.0, where you can tell Cassandra that a key or a column name is a composite value. Basically, the composite types let you say: this key is going to be a string and an integer, or this column is going to have two strings, an integer, and a date in it.
Basically, it's going to concatenate all that stuff together before it stores it to the database. And then, for the indexing: if you actually have composite values, the indexing can know about the composite values, so you can do some interesting sorting things using composite values. It'll sort by the first part of the composite, then the second part, then the third part. So, yeah.
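Here is a hedged sketch of what that might look like in pycassa, assuming its CompositeType support and made-up names (the Scores column family and its columns are purely illustrative):

```python
from pycassa.types import CompositeType, UTF8Type, IntegerType
from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_column_family(
    'AppData', 'Scores',
    comparator_type=CompositeType(UTF8Type(), IntegerType()))
sys_mgr.close()

scores = pycassa.ColumnFamily(pool, 'Scores')
# Column names are (string, integer) tuples, sorted component by component.
scores.insert('game1', {('alice', 1): 'first try'})
```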
So she's asking: how does Cassandra store the data in the key-value system? Does it store it to RAM first, or to disk first? So, it depends on what consistency level you tell Cassandra when you write data. If you use a consistency level of ANY, it's just going to write the data in RAM before it replies back to you, but it will then eventually propagate it out to the other machines.
If you use a consistency level of ONE, it means that data is in the commit log on disk on one of the nodes. So basically, Cassandra has a commit log, where it writes data really fast at the end of the commit log, in order, and then eventually, as it collects data up, it writes stuff out to SSTables on disk, which are the sorted tables it uses for actually indexing stuff.
A
The
so
writes
are
really
fast
because,
as
long
as
you
keep
your
commit
log
on
a
separate
disk
from
your
random
access
to
your
tables,
the
commit
log
is
always
written
in
sequential
order.
So
your
hard
disk
is
always
just
your
read
head
doesn't
have
your
write
head
doesn't
have
to
move;
it's
always
just
writing
to
the
end
of
this
file,
so
that
makes
the
writes
fast
and
then
yeah.
So
then,
with
your
other
consistency
level,
it's
how
many
machines
is
it
in
the
commit
log
of
before
it
replies
back
to
you.
A
B
A
Yes, that's the consistency level. Sorry, the replication: you specify replication on a per-keyspace basis, so you say this keyspace gets this many replicas on these servers. But then your consistency level specifies how many of those replicas data is written to, or read from, before you get an answer to your query.
I mean, it can be used for pretty much anything, anything that fits this data model, where you're coming in with keys and you want either ranges of values or anything like that. And because of the ordering of the columns, there are interesting things you can do with having data based on keys. But the other thing with Cassandra, and most other NoSQL key-value stores, is that basically you want to store your data so, pretty much...