Description
Speaker: Sameer Farooqui, Freelance Big Data Consultant and Trainer
Slides: http://www.slideshare.net/planetcassandra/cassandra-vsh-base
Have you wondered what actually happens when you submit a write to Cassandra? This vendor-agnostic technical talk will cover the internals of the read and write paths of Cassandra and compare them to other NoSQL stores, especially HBase, so you can pick the right database for your project. Some of the topics covered are consistency levels, memtables/memstores, SSTables/HFiles, bloom filters, block indexes, data-distribution partitioners, and optimal use cases.
Hello, everybody. My name is Sameer, and I'm a freelance trainer and consultant for Hadoop, Cassandra, and sometimes OpenStack. In the past year I've taught about 50 classes throughout the country and in India, and usually in my classes, when I talk about Hadoop, towards the end of the class I get a very common question: okay, now that we know about HBase and Hive and Pig, how does that compare to MongoDB? Why would I want to use Cassandra instead of HBase? So what I wanted to do in this presentation is talk specifically about why Cassandra is different from the other competing NoSQL architectures.
So, a really quick question: how many people here have either installed Cassandra, or done a query to retrieve data in and out of it, or are familiar with the architecture, like what an SSTable or a memtable is? Okay, cool, so about 50. What about the same question for HBase? Okay, about 10 percent.
It's really hard to know which database to choose for your solution, and every vendor claims that their benchmark is better than their competitors'.
So the way I think you should approach this problem is to first try to identify which one of these categories of databases is correct for your use case; after you've decided upon a category, then selectively choose a specific database. I'll try to tell you, in a vendor-agnostic way, what I tell my clients about how to do this. Three to four years ago it was the wild, wild west of NoSQL databases, and it was really hard to tell how to pick one.
There were no books, no YouTube videos, and no benchmarks. But these days I think there are clear leaders emerging in each one of these categories, and there are databases that are dying in these categories. So once you pick a category, you should really pick one of those leaders as opposed to one of the abandoned databases. All right, so first let me tell you about these categories, and you can be thinking about which use case is correct for you. We'll talk about key-value first.
The key-value pairs are stored in a container called a bucket; the terminology differs across databases, but that's a common term. You only have consistency per key, so you can't put two key-value pairs in and, if one of them doesn't get committed, pull them both back out. Atomicity is per key.
But in these databases you can actually put any type of data you want in the value. The value is not examinable, so you can't say, "give me back all of the keys where the value equals the color blue." You can only say, "for this key, give me the value," and the value can be an object, a blob, JSON, or XML. In some of these key-value databases you can also attach a metadata item to the value. So for a name, you could attach a metadata item saying, for example, "this is the age of the person." That's almost getting into key-document territory, but think of it as key-value for now.
The use cases for this type of database are things like content caching, web session information, user profiles, preferences, and shopping carts. Here's an ideal use case: if you're developing a website, when somebody types in their username and password to log in, maybe you have customization on the front end where the background color is specific to different users, or the language is different, or the time zone settings affect how the web page is rendered. In that case, as soon as a person logs in, you want to retrieve their preferences for your site very quickly and dynamically render that HTML. So these are the cases where you want a very quick, simple lookup.
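The login-preferences pattern just described can be sketched in a few lines. The `KVStore` class below is a hypothetical in-memory stand-in, not any particular vendor's API; real key-value stores expose the same get/put shape through their own client libraries.

```python
class KVStore:
    """An in-memory stand-in for a key-value database with named buckets."""

    def __init__(self):
        self._buckets = {}

    def put(self, bucket, key, value):
        # Atomicity is per key: each put touches exactly one key.
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key):
        # The only supported lookup is by key. The value is opaque --
        # there is no "find all keys where value == X" query.
        return self._buckets.get(bucket, {}).get(key)

store = KVStore()
store.put("user_prefs", "alice", {"bg_color": "#222", "tz": "US/Pacific"})

# On login, one fast lookup fetches everything needed to render the page.
prefs = store.get("user_prefs", "alice")
print(prefs["bg_color"])  # -> #222
```

Note that the store can only answer "what is the value for this key?", which is exactly the limitation described above.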
Key-document is the other category. MongoDB and Couchbase are probably the really popular ones here. This is kind of like a key-value database, but the value is examinable. A very common misconception is that in a key-document database you're going to store documents like Word documents or PDFs. That is not the case. The documents tend to be XML or JSON or BSON; usually I see JSON, and the structure of the JSON or the XML does not have to be identical across the different keys. They should be similar, but they don't have to be identical. So here is a use case that I think exemplifies key-document databases.
Let's say that you are developing software that is going to profile children in refugee camps in Africa. You basically go to the camp, you interview children, and you find out as much information as you can, in hopes of reuniting each child with the child's parents. So in this case we have a key that I kind of randomly made; it's called 1. I know the first name, last name, and location of this child. I know that she speaks two languages. I know the mother's and the father's names, I know which camp they're in, and I have a blob picture that I put in. Now for another child, though, perhaps I don't know all that information; this other child is younger. So for Dee, I only know that she knows some Swahili. I don't know who her parents are.
So you see how the two schemas are somewhat similar, but they're not identical, and these types of databases are very tolerant of incomplete data. Later on, maybe somebody else in the camp will recognize Dee and tell you, "hey, by the way, I do know who her parents are," or, "I know some more information about her." In that case, you can go back to that value and append only what you want to it.
So the use cases here are things like event logging, content management systems, blogging platforms, or web analytics. And keep in mind that in this case you don't have to pull the entire value out; in a lot of these databases you can reach into the value and pick out only the specific things you want.
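The incomplete-data and reach-into-the-value behavior described above can be sketched like this. `DocStore` is hypothetical and illustrative only; real document stores such as MongoDB provide equivalent projection and partial-update operators through their query languages.

```python
class DocStore:
    """A toy key-document store: values are dicts you can examine and extend."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key, fields=None):
        # Unlike pure key-value, we can project out just the fields we want.
        doc = self._docs[key]
        return doc if fields is None else {f: doc[f] for f in fields if f in doc}

    def merge(self, key, partial):
        # Tolerant of incomplete data: append new facts as they arrive.
        self._docs.setdefault(key, {}).update(partial)

store = DocStore()
store.put("1", {"first": "Amina", "languages": ["Swahili", "French"]})
store.put("2", {"first": "Dee", "languages": ["Swahili"]})  # sparse doc, parents unknown

# Later, someone recognizes Dee -- append only the new information.
store.merge("2", {"mother": "Grace"})
print(store.get("2", fields=["mother"]))  # -> {'mother': 'Grace'}
```

The two documents share some fields but not all, matching the "similar but not identical" schemas described above.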
You don't want to use these types of databases for any complex transactions that span different operations, and you can't really have a strict schema here. If an item is missing, like in the key-value pair on the right, then there's no null stored there; it's just basically blank. So sometimes when you query it, you'll get nothing back.
In a graph database, a node has properties attached to it. Here I have a phone number attached, GPS coordinates for where the phone currently is, and the IMSI number, the international mobile subscriber number; and here's another node with a different phone number. So a node is basically like an instance of an object in an application. The next concept in graph databases is edges. Edges are also called relations. So here's an edge; they can be unidirectional or bidirectional, and a call is made from one phone to another. So I called that number.
The 415 number, and here is another node with a unidirectional edge to it; and now here is a node that is contacting me, so it's in the other direction. You can see that these edges have properties, either "called" or "texted," and here's a little bit more. You can add more properties to these edges. So maybe I want to keep track of how these edges were initiated: perhaps these were initiated by mobile phones, and this one edge in the middle was made from a landline. And here are some more properties.
You can add, say, the duration of a call to the edge, and the time the call was placed. So the use cases for this are connected data like social networks, shortest-path algorithms, recommendation engines, routing and dispatch, and location services. You don't want to use these for really large-scale examples.
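The property-graph model just described (nodes and edges both carrying key/value properties, queried by traversal) can be sketched as below. This class is a toy for illustration; real graph databases such as Neo4j expose this model through query languages like Cypher.

```python
class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> properties
        self.edges = []   # (src, dst, properties); direction matters

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, dst, **props):
        self.edges.append((src, dst, props))

    def neighbors(self, node_id):
        # A one-level traversal: whom did this node call or text?
        return [dst for src, dst, _ in self.edges if src == node_id]

g = PropertyGraph()
g.add_node("me", phone="415-555-0100")
g.add_node("n2", phone="415-555-0199")
# Edge properties record how the contact happened, as in the talk.
g.add_edge("me", "n2", kind="called", duration_sec=120, via="mobile")
g.add_edge("n2", "me", kind="texted", via="landline")

print(g.neighbors("me"))  # -> ['n2']
```

A deeper traversal ("people twice removed") would simply apply `neighbors` repeatedly, which is the kind of query FlockDB, mentioned below, cannot do beyond one level.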
Titan is a graph database on top of Cassandra, and InfiniteGraph is another graph database that can scale beyond one node. But the connections between the nodes and the relations are pretty complex, and these things usually run on one machine. Neo4j is probably the most popular one here. FlockDB is another one, but FlockDB only lets you query one level deep, as opposed to Neo4j and other graph databases, which let you do a traversal, which is what querying the graph is called.
You can do a traversal that says, "give me all of the people who are twice removed from the central node." Next, a much more emerging category is real time, and there's a little squiggly before it because technically the category here is near real time. In this example I'm talking about Storm. The way Storm works is that it's used to analyze data in motion, before you actually even commit it to the hard drive. So the data comes in from a spout.
Think of this like an application that you write that connects to the Twitter firehose streaming API and is downloading tens of thousands of tweets per second into this pipeline coming into your data center. Then, in the very middle, next to the spout, you have a bolt. A bolt is basically consuming the stream, and it is picking out the hashtags every time a tweet comes in with a pound symbol.
In a word, it's picking that out and basically counting the hashtags that are coming in: every time it sees a specific hashtag, it increments a counter for that hashtag and then stores it in a database down there. So it's taking the incoming stream and making two streams out of it. One of the streams continues onward into a database where you're storing the full tweets, but the hashtags are being aggregated, or counted, and they're going into a separate database.
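The spout/bolt split just described can be sketched in plain Python rather than Storm's actual API: one incoming tweet stream becomes two output streams, the raw tweets and the aggregated hashtag counts.

```python
from collections import Counter

def bolt(tweet_stream):
    """Consume a stream of tweets and fan it out into two streams."""
    hashtag_counts = Counter()
    full_tweets = []
    for tweet in tweet_stream:
        full_tweets.append(tweet)              # stream 1: the full tweet, stored as-is
        for word in tweet.split():
            if word.startswith("#"):
                hashtag_counts[word] += 1      # stream 2: increment the hashtag counter
    return full_tweets, hashtag_counts

tweets = ["loving #cassandra today", "#cassandra vs #hbase", "just lunch"]
stored, counts = bolt(tweets)
print(counts["#cassandra"])  # -> 2
```

In a real Storm topology, each return value would feed a separate downstream database, as the talk describes.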
These things can get pretty fast, but they're very much emerging; I think most of these are in alpha or beta stages, and there are almost no books written on them. Impala and Stinger are a little bit different from Storm: in Storm you're capturing the data before it even goes to the database, but in Impala and Stinger you basically do near-real-time queries on data at rest that's already in a database.
Next we come to column family. The main concepts here are that on the left you have rows; those are the row keys, and in a database like HBase you're actually going to group certain row keys, ranges of row keys, into something called a region. So in HBase, region one, rows one through three, will be stored on a specific slave machine, say machine 10, and the next region...
Rows four through six will be stored on another machine, maybe machine 20 in your HBase cluster. Okay, so region one is those three rows and all of the column families, all three column families for those three rows. So for row one, for example, you can see that there are three column families. A column family is basically just a bunch of related columns. The reason you want to be thinking about how you do the data model here, and why you would want to use multiple column families...
...is that you want to put data that you're going to query together into the same column family. So you kind of don't want to do queries that span column families, because each column family is essentially written into a file by itself. So over here we actually have two regions; let's say this is HBase. Region one is on one machine; region two is on a different machine.
Region one, on the first machine, is going to have three files, basically one for each column family, and the other machine will have three files, one for each column family. And if you end up doing a query across column families, which these databases allow, you're going to have to read across at least two files, and it could potentially be slower. Another reason to be thinking about column families is that you usually configure these databases per column family.
So you can put data that's highly compressible in column family one, and maybe when you query that, it'll be a little bit slower because you've got to decompress; column family two is maybe where you want the really high-performance stuff, so you don't compress that. You can do this in something like HBase and Cassandra.
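The layout just described (row keys grouped into regions, and each column family within a region persisted separately) can be sketched as below. The structure is illustrative only, not HBase's actual on-disk format, and the fixed region size is an assumption for the example.

```python
REGION_SIZE = 3  # rows 1-3 -> region 0, rows 4-6 -> region 1, ...

def region_for(row_key: int) -> int:
    """Map a row key to the region (range of row keys) that holds it."""
    return (row_key - 1) // REGION_SIZE

# region -> column family -> {(row, column): value}
storage = {}

def put(row, family, column, value):
    region = storage.setdefault(region_for(row), {})
    cf_file = region.setdefault(family, {})   # one "file" per column family
    cf_file[(row, column)] = value            # absent cells simply aren't stored

put(1, "cf1", "name", "alice")
put(1, "cf2", "raw_html", "<html>...</html>")
put(4, "cf1", "name", "bob")

print(sorted(storage))     # -> [0, 1]            (two regions, two machines)
print(sorted(storage[0]))  # -> ['cf1', 'cf2']    (separate per-family "files")
```

A query touching both `cf1` and `cf2` has to read two separate structures, which is why cross-family queries can be slower, as noted above.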
Also, these cells are versioned. Now, this is kind of internal to these databases; you shouldn't really be using these versions. You can do it if you want, as there are exposed Java methods, but it's not recommended; it's more of an internal thing within the database. A lot of people do it in HBase, I guess; less so in Cassandra, where it's really discouraged. But you can.
You can version your data, and when you do the versioning, you can set the version count per column family. So column family one is compressed, with just one version maintained; column family two, we're going to keep three versions and not compress it, and if you add a fourth version, it'll overwrite the oldest version. So some people describe the data access here as kind of like key-value; you can think of a key, see that down there.
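The per-column-family versioning described above can be sketched like this: each family keeps up to `max_versions` timestamped values per cell, and adding one more evicts the oldest. This is illustrative only, not HBase's real API.

```python
from collections import deque

class ColumnFamily:
    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.cells = {}  # (row, column) -> deque of (timestamp, value)

    def put(self, row, column, value, timestamp):
        versions = self.cells.setdefault(
            (row, column), deque(maxlen=self.max_versions))
        versions.append((timestamp, value))  # when full, the oldest falls off

    def get(self, row, column):
        # A default read returns only the newest version.
        return self.cells[(row, column)][-1][1]

cf2 = ColumnFamily(max_versions=3)       # like "column family two" above
for ts in range(4):                      # four writes, only three kept
    cf2.put("row1", "status", f"v{ts}", ts)

print(cf2.get("row1", "status"))          # -> v3 (newest survives)
```

The fourth write silently displaced `v0`, mirroring the "a fourth version overwrites the oldest" behavior described above.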
There are no nulls stored here, so one of the good things about these data models is that, say, in column family one I have four columns, but maybe for row one I don't have the data for those columns. I don't put anything there; there are no nulls there. Nulls have a real cost on disk in relational databases, but here there are no nulls: I only have x and y, so it's not putting nulls all over the place.
So it's good for sparse data models; perhaps later on I'll know what the other data columns are supposed to be, and I can populate them in there later. Other things you want to know about these column family databases: you should know your queries up front, and the read and write queries will determine how you're going to do the data modeling. That matters if you're a startup and you think you're going to be changing your queries around all the time.
With a relational database you could just do the join, right? But in these databases you really want to understand the architecture, at least to the level where, when somebody gives you their data model, tells you what the architecture of the system is like, and says, "this is a query I'm going to do," a good HBase developer or administrator will be able to say things like, "well, you know what, I'm not going to allow you to run that query, because that query is going to put a lot of I/O pressure..."
"...it's going to take too many disk seeks." Okay, that's kind of the background on column family databases. But if you think this is the kind of database you want, how do you pick one? I gave you about six examples on that first slide. Well, I think you should pick it based on three different things. The first thing is: where are these databases being adopted? So, this is indeed.com.
It's a job posting website, kind of like Monster or CareerBuilder, and I typed a bunch of column family databases into it. We can see very clearly that if you pick Accumulo or HPCC or Hypertable, you're going to be in a very small club. If you pick HBase or Cassandra (the orange is Cassandra and blue is HBase), you can see that they're really getting adopted, and they're really close to each other. So this is basically saying that, well, you can't take the raw search numbers at face value.
So you want to actually go to the left side of this page, and you can narrow it down to the computer and technology field, and then it'll do a somewhat better job. But even here you can see it's saying that Cassandra was getting a bunch of queries under databases, or computers and technology, but Cassandra didn't exist in 2005, right?
So you should maybe be thinking that up to this level right here, it might just be other people searching for other Cassandra-related stuff, and that's kind of a skew; this stuff down here is kind of skewed, I think. These databases were invented around this time; this is when the world found out about them, and this is when queries started to happen. And I was trying to figure out what this spike was.
This is around the time Cassandra became a top-level Apache project, so maybe people just put out a bunch of press releases.
This is showing Google Trends search volumes in different countries. I found it kind of interesting that, first of all, there's something happening in South Korea: they're doing the most Hadoop, Cassandra, and HBase searches, which is really interesting. And then this left one is Cassandra; it's adopted more in India. And over here you can see that HBase is adopted more in China, and India is kind of after Taiwan down here. So China prefers HBase, and India seems to prefer Cassandra.
Another thing you can do to decide which database you're going to use is your due diligence on the Apache user groups. This is the activity on the Apache user mailing lists. You can see that in May 2013 there were about 567 messages on the Cassandra mailing list, and HBase is somewhat similar. So these two databases are kind of neck and neck right now. So how would you actually go a little bit deeper to pick one?
Basically, Facebook took the storage engine out of Dynamo and married it to the data model of Bigtable for their inbox search application, and they open-sourced the Cassandra database and put it into Apache. They kind of somewhat abandoned it, and these days they've adopted HBase much more heavily at Facebook. But Jonathan Ellis, the CTO of DataStax, was the first non-Facebook committer to Cassandra, and he's brought this database very, very far from its humble origins when it was first released.
This is HBase; well, this is actually Hadoop. Hadoop's origins are in 2003: the Google File System paper. About 95% of that architecture is implemented as HDFS in Apache, and the MapReduce whitepaper becomes Apache MapReduce. Google did all of this coding in C++ and released the whitepaper architectures, but no code.
It was up to companies like Yahoo and, later on, Facebook, Twitter, and a bunch of other companies to implement the ideas of these whitepapers in Java, which they found to be a more robust language than C++ because of its garbage collection and handling of memory leaks.
Okay. Both Cassandra and HBase are written in Java; they're columnar databases; they've been proven to scale to more than a thousand nodes in production; they have very low-latency reads and writes, meaning probably under 100-millisecond reads and writes; they use this idea called log-structured merge trees, which I'll talk about in a minute; and they're atomic at the row level. So you cannot lock two rows and shuffle data around; you can only lock one row. If you want to lock two rows...
...you'll have to do it on the application side and then release the lock there. There's no support for joins, transactions, or foreign keys in either of these databases. Now, where they differ: Cassandra is more of a peer-to-peer architecture; HBase is master-slave. All the Google stuff was done with a master-slave design where, if the master node goes down, then everything kind of goes down, but these days there is more high availability in the Hadoop world. Cassandra comes from Dynamo, and Amazon did a purely peer-to-peer architecture, with no masters or slaves. Cassandra has tunable consistency.
HBase is strict consistency; I'll talk about that in a bit. Cassandra has secondary indexes natively available; HBase does not, though there are tricks to be able to do it in HBase. Cassandra writes the incoming data directly to the Linux file system, like ext3 or ext4; HBase writes it down to HDFS. Conflict resolution in Cassandra is handled during reads, and conflict resolution in HBase is handled during writes, and we'll talk about other ideas in a minute. All right, let's just think for a second about why Amazon invented the Amazon Dynamo database.
If you read the whitepaper, their use case was that they wanted to make a shopping cart service that will always allow people to add items to the cart. They never wanted to reject an item that somebody added to a shopping cart on their website, no matter what: no matter whether there are hurricanes taking down data centers, or power outages, or disk failures, or network failures. They needed that write to be successful so they wouldn't lose revenue on that sale.
So if somebody logged into the website from the browser on their computer and added something to their shopping cart, and then they logged in from their iPad and added something else to the shopping cart, perhaps there are conflicting versions of their shopping cart because they're logged into both. Then, when you go to the checkout, Dynamo will read both conflicting versions of the items, merge them together, and give that back to you. You can do different things; you can also...
...if there are conflicts, you can look at them, or you can look only at the timestamps and basically throw away the data with the old timestamp. So you have options for what you want to do. Dynamo was designed for applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.
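The two conflict-resolution options just described for a Dynamo-style read can be sketched side by side: merge the conflicting shopping-cart versions, or keep only the version with the newest timestamp (last-write-wins). The `(timestamp, items)` version format here is an illustrative assumption.

```python
def merge_versions(versions):
    # Union the items from every conflicting replica -- Amazon's cart
    # policy: never drop something a customer added.
    merged = set()
    for _, items in versions:
        merged |= set(items)
    return merged

def last_write_wins(versions):
    # Alternative: trust timestamps and discard the older versions.
    return max(versions, key=lambda v: v[0])[1]

browser = (100, ["book"])          # cart as seen from the laptop browser
ipad = (105, ["headphones"])       # cart as seen from the iPad, written later

print(sorted(merge_versions([browser, ipad])))  # -> ['book', 'headphones']
print(last_write_wins([browser, ipad]))         # -> ['headphones']
```

The merge keeps both items; last-write-wins silently loses the book, which is exactly why Amazon chose merging for carts.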
Other things they used Dynamo for were best-seller lists, customer preferences, sales ranks, product catalogs, and service management. Now, Google designed Bigtable mainly for keeping the entire web crawl cache data. They were basically doing a distributed, parallel wget and downloading the HTML into a column in Bigtable; that was its own column family, with one column holding the HTML files. Then they were basically running a MapReduce job against that column...
...that would derive statistics and metadata out of each HTML file and then populate a different column family with metadata: this row is in the English language; on row number two, it's English, and the geography says this is UK English, and it was last updated at this time. All this information goes into other metadata column families, which are not compressed. But Google compresses that first column family of just HTML; it is highly compressible because it's all text data.
At one point in time Google used Bigtable for many different products: Gmail, YouTube, Google Earth, Google Finance. And even though HBase is almost a clone of Bigtable, in recent years the architectures have started to diverge. The Bigtable whitepaper came out in 2006; we're now six or seven years away from that, and even though the early prototypes were a clone, now it's becoming a little bit different. Google also designed Bigtable to be able to handle a variety of workloads. Sometimes they wanted to do high-throughput batch-processing jobs, and other times they wanted to do really latency-sensitive serving of data for a real website. So it's kind of a diverse database.
In Cassandra, each node gossips about who it's gossiped with in the past. So basically, if node 4 gossips with node 8, node 4 might tell 8 a bunch of stuff, like, "hey, by the way, a few seconds ago, I, node 4, talked to 3, and actually he didn't respond back." So 8 will note that. And then maybe node 5 will talk to 8 the next second, and 5 will gossip with 8 and say, "I was able to talk to 3, but 4 was not able to talk to 3."
So node 3 is suspicious. But see, this is a more advanced way of determining the health of nodes than a simple heartbeat, where, if a heartbeat doesn't come, the node is probably dead. In this case it's more interesting, because with the gossip, maybe just the network connection, or the switch, between 4 and 3 was having a problem, but 5 to 3 was fine. So node 3 is not down, and node 8 can trust that 3 is still up; it's just that 4 can't get to it. The way the actual failure detection happens...
...is with this Japanese protocol called phi accrual failure detection, which allows you to set a suspicion level on a continuous scale to judge whether nodes are down or not. HBase doesn't do this; HBase is a little bit different architecture.
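The continuous suspicion scale just mentioned can be sketched roughly as below. This is a simplification of the actual phi accrual protocol (which models the full distribution of heartbeat arrival times); here we assume exponentially distributed arrivals, so phi grows linearly with silence.

```python
import math

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []       # observed gaps between heartbeats, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        # With exponentially distributed arrivals of mean m, the chance of
        # hearing nothing for t seconds is exp(-t/m); phi = -log10 of that.
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return -math.log10(math.exp(-elapsed / mean))

d = PhiAccrualDetector()
for t in (0, 1, 2, 3):            # heartbeats arriving once per second
    d.heartbeat(t)

# Suspicion is continuous: the longer the silence, the higher phi climbs.
print(d.phi(4.0) < d.phi(13.0))   # -> True
```

A node is declared down only when phi crosses a configured threshold, rather than on a single missed heartbeat.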
So HBase has a lot of moving parts. There are master machines and slave machines. The green name node is the master of the file system, and you can see that the master of the file system has a slave component, called the data node, on each node.
The data node heartbeats to the name node every three seconds, and each data node is telling the name node, "I'm up and running, and these are the things happening from a storage, and perhaps a network, perspective on me." Then there's the HBase master; you have a standby HBase master as well, and there's a standby name node. The HBase master actually does not heartbeat with the region servers. The slave component on each slave machine here is called a region server.
A region server is basically storing maybe 20 to 100 regions. Remember, in the previous slide, rows 1-3 were a region, and that region was actually made of three column families with three files coming out. Well, you basically put something like 10, 20, maybe 100 regions on each region server.
All right, you have to have ZooKeeper as well, and ZooKeeper should be run on an odd number of machines so that leader election will be able to work. If you have four machines, two can say one thing and two can say another thing, and that can cause problems; so run an odd number of machines: three, five, or seven.
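The odd-ensemble rule just described comes down to simple arithmetic: leader election needs a strict majority, and the fourth machine in an even ensemble buys no extra fault tolerance.

```python
def majority(n_nodes: int) -> int:
    """Votes needed for a strict majority in an n-node ensemble."""
    return n_nodes // 2 + 1

for n in (3, 4, 5):
    print(f"{n} nodes -> need {majority(n)} votes; "
          f"tolerates {n - majority(n)} failure(s)")
# 3 nodes -> need 2 votes; tolerates 1 failure(s)
# 4 nodes -> need 3 votes; tolerates 1 failure(s)   <- extra node buys nothing
# 5 nodes -> need 3 votes; tolerates 2 failure(s)
```

With four machines, a 2-2 split reaches no majority at all, which is exactly the "two say one thing, two say another" problem above.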
You can start with three. Sometimes you'll also need MapReduce, if you want to do full table scans, so now you need a JobTracker as well. There's no high availability for the JobTracker right now, and now you have a TaskTracker running on each machine, which will launch daughter JVMs for maps and reducers to actually run the table scans. So you have a lot of moving parts here.
You have a whole bunch of master machines if you want to be able to run this in a highly available manner. On the slave machines, each one of these components is a JVM running with its own heap, and the TaskTracker is launching a bunch of daughter JVMs, the maps and reducers; and you've got ZooKeeper up there. So learning this architecture is going to take more time. It took me something like six months to a year to learn HBase, but maybe only a couple of months to get comfortable with Cassandra.
Heartbeats from the region servers go to ZooKeeper in HBase, and then ZooKeeper will tell the HBase master if there's a problem with one of the slave machines, and then the master will make the decision to actually do something to recover from a region server's failure. Or maybe just one disk crashed and some regions got lost; the HBase master will help take care of that problem.
So, I mentioned effort to deploy; that's just some more information on that. Really quickly, I'm going to talk about how a write happens in HBase. You have a client there. The client wants to write row 500, some column in row 500, but the client doesn't know where the region is being hosted on the slave machines. So the client contacts ZooKeeper and says, "where is the root table?" ZooKeeper says the root table is on this machine over here.
The client goes to that machine, pulls the root table out, and now the root table is on the client. That's what the root table looks like; it doesn't show where row 500 is. Basically, you go to the root table and ask, "where's row 500?" and it says, "well, for that you need to go to META2, this other table, which should know where the region actually is." So now the client queries the root table for the META table...
...that knows where row 500 is, and the root table says, "it's over there; that's the META2 table." Then the META2 table is basically pulled into the client as well, and the client queries the META table and says, "okay, where is the region that has row 500?" The META table tells it, and then the client goes to the specific region server that is hosting the specific region, out of hundreds of regions, for row 500 (and maybe also the rows above and below it), and it actually sends that write to that machine. Then, what happens underneath HBase is that HBase just sends it from the HBase JVM down to the data node JVM for HDFS, and then it's the HDFS file system's responsibility to do a synchronous replication from that first machine to another machine in your Hadoop cluster, and then another replication happens.
Well, no; it happens in a pipelined way. HBase just writes to one machine; it's HDFS's responsibility to take that write and replicate it, in a pipelined manner, to two other machines to achieve the replication factor of three.
So basically, after that happens, the HBase region server there will tell the client, "okay, I'm done with the write." So it's a synchronous write that just happened underneath HBase; there's no control over the replication for each write, or the consistency for each write.
You can just set that for the entire thing. And these root and META tables are cached, so the client is never going to have to go back to ZooKeeper for the root table, and eventually it will learn about all the META tables; in the future, when the client does a write, it's just going to go directly to a slave machine. It doesn't have to keep going back to ZooKeeper. Cassandra is different. You've probably seen this before: in Cassandra, when the client does a write...
...let's say it does a write with replication factor three and consistency level one. The client will randomly pick a machine, called the coordinator machine. It doesn't have to be node seven, but it is in this case, and the coordinator will say, "okay, you want to write with replication factor three; okay, I'll send it to three different machines."
Basically, the row key gets partitioned with a Murmur hash or MD5, and whatever long number comes out of the hash will determine which machines the write will go to; replicas one, two, and three are now sent to three machines. Consistency level one means that as soon as one machine responds back and says, "I got that write," the coordinator will continue on with the client and report the write successful, and the other two writes are still going to be applied.
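The tunable-consistency write path just described can be sketched as below. The hashing and the ring layout are illustrative stand-ins (Cassandra's actual partitioners and replica placement differ); the point is that all RF replicas receive the write, but the client is acknowledged after only CL of them confirm.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4", "node5"]

def replicas_for(row_key: str, replication_factor: int):
    # Hash the row key to a position on the ring, then walk clockwise.
    token = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    start = token % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

def write(row_key, value, replication_factor=3, consistency_level=1):
    targets = replicas_for(row_key, replication_factor)
    acks = 0
    for node in targets:                # every replica always receives the write
        acks += 1                       # pretend each node acks in turn
        if acks == consistency_level:   # ...but we only wait for CL of them
            return {"acked_by": targets[:acks], "pending": targets[acks:]}

result = write("row500", "x", replication_factor=3, consistency_level=2)
print(len(result["acked_by"]), len(result["pending"]))  # -> 2 1
```

With `consistency_level=1` the client is released after the first ack; raising it to two or three trades write latency for a stronger durability guarantee, which is the flexibility discussed next.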
It's just not holding up the write. But if you want to hold up the write a little bit more, you can. So now we're doing replication factor three, just like before, but we're saying we want to get the write committed to at least two machines, because of some, you know, government or Sarbanes-Oxley requirement, and then, after two machines respond back, we tell the client we're done.
This is flexible, so you can do, say, replication factor four: four writes get sent, the consistency level is two, and after two acknowledgments you respond back to the client. This is a really important thing to understand about the differences: Cassandra has tunable, or elastic, consistency.
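To make the tunable-consistency idea concrete, here's a minimal sketch, assuming a toy in-memory model. The function and replica names are invented for illustration, and a real coordinator sends these writes in parallel rather than in a loop:

```python
# Toy coordinator: send a write to all RF replicas, but acknowledge the
# client once `consistency_level` replicas have responded.

def coordinate_write(replicas, key, value, consistency_level):
    """Return the number of acks collected before replying to the client."""
    acks = 0
    for replica in replicas:          # in reality these go out in parallel
        replica[key] = value          # replica "stores" the write
        acks += 1
        if acks >= consistency_level:
            break                     # client gets "success" here
    # the remaining replicas still receive the write asynchronously
    for replica in replicas[acks:]:
        replica[key] = value
    return acks

replicas = [{}, {}, {}]               # replication factor 3
acked = coordinate_write(replicas, "row-42", "x", consistency_level=1)
print(acked)                          # 1: client was told "success" after one ack
print(all(r.get("row-42") == "x" for r in replicas))  # True: all 3 eventually have it
```

Raising `consistency_level` to two or three is exactly the "hold up the write" knob from the talk: the client waits longer, but more replicas are guaranteed durable before the ack.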
A
When you do the read in HBase, to guarantee strong consistency you only have to read from one machine: some region server is responsible for that region, and only that machine can respond and give you back that write. Those other two copies are dormant, kind of hidden in HDFS; unless that first machine crashes, you don't read from those other two places at all. You only ever go to one machine to do the read. So that's how you get strong consistency in HBase.
A
The write was already strongly consistent, and the write was maybe a little bit slower as well, because you had to wait for all three replicas to happen. And when all three happen in HBase, you can choose whether you really want all three to go to disk: the first one will definitely go to disk, and for the second and the third you can configure it so the write goes into RAM and eventually gets flushed later.
A
All right, we have about 10 minutes left; let's see how much more I can cover here. This is an idea called log-structured merge trees, which comes from Bigtable. We have a key, like before, and we want to write this value X. We've already found the actual region server in HBase, and we've already located the node in the ring in Cassandra. Okay, this is the node that we want to do the write on.
A
Okay, the write comes to this node now, from either the coordinator machine in Cassandra or from the client in HBase, and it comes into a JVM in both Cassandra and HBase. Blue is going to be HBase and green will be Cassandra. From the JVM, the write gets duplicated to two places. It goes into something called a commit log, or a write-ahead log: in Cassandra you call it the commit log, in HBase you call it the write-ahead log. So the write went in there, and this is a sequential file.
A
Yes, that's why it's the same color as HBase up there, and green is the commit log for Cassandra. All right, so that memstore is usually a 64-megabyte memory buffer. The write is persisted to disk, so you know it's safe, and it's in a memory buffer; as soon as those two things happen, you basically report the write successful back to the client. Then, what happens in the background: if you're using HBase, remember, this is only one node.
A
The write always comes to one node, so you've got to do that synchronous replication into HDFS. With Cassandra, remember, the write went to two other nodes already, so you don't have to do that kind of thing: the coordinator already sent it to two other nodes, and those two other nodes wrote it to their own commit logs.
A
All right. Then that memory buffer is eventually going to get more data coming in, because more writes will come in, and the 64-megabyte memory buffer will get full. When it gets full, it gets flushed to a file. The file is called an SSTable in Cassandra, or an HFile in HBase, and remember, in HBase you still have to then replicate that synchronously to three different places. And for every column family in these databases you'll have a different memstore.
A
So this is an example right now of a table with two column families, and you see the first column family is flushing to its own file, and the other column family is flushing to its own, different file. When these flushes happen, in the file you also maintain a bloom filter and a block index. In Cassandra you can only do a row-level bloom filter; in HBase you can do a row or a column bloom filter. By the way, in HBase,
A
that flushed file is what's called the HFile, and the HFile has the block index and the bloom filter inside it. In Cassandra you actually have, like, three files: one file with the data, one file for the block index, and one file for the bloom filter. And the reason you have that is because, imagine these flushes keep happening: the memory buffer keeps getting full, and now you have a bunch of SSTables, and the way you...
A
It goes into memory, and then, when you want to do a read, you first go check all those bloom filters in memory really quickly and say: hey, which SSTables have row 500? Maybe one or two of them will say yes, and then you go directly into that SSTable, check the index really quickly to see where in the file the row is, do a disk seek into that file, and pull it out.
A
All right. Now, in Cassandra you can flush per column family. Remember, I had two memstores there; they can individually get filled, and when one of them gets filled, it gets flushed. But in HBase, if one flush happens for a column family, all the other memstores will get flushed as well.
A
So maybe one column family has, like, pictures you're storing in it, and the other column families are just tiny metadata stuff, so they're not getting full. But when you fill up the memstore with the pictures, now you're going to have to flush the other ones as well, and this will probably put a lot of memory pressure on you, because you're flushing all these little tiny files and synchronizing them with HDFS over the network.
A
There is work being done in HBase, under JIRA HBASE-3149, to do the flush per column family. Secondary indexes are natively supported in Cassandra. You want to use a secondary index in Cassandra when your column has a certain value that shows up multiple times. So a secondary index is a good use case for, say, a column that's storing states: there are only 50 possible values and you have a billion rows.
A
It's a good choice there. But if you're storing email addresses in the column, don't put a secondary index on there to query it out; use different techniques. There's no native secondary index support in HBase, but there's an idea called triggers, where when you update a value on a cell, you can cause a trigger to be launched to go update another column family, so you're doing a poor man's secondary index with a second column family; you're rolling your own index.
A
There's a really good talk by Rick Branson, 13 minutes long, on YouTube, about SSD support in Cassandra, so check that out. One thing you can do with Cassandra, though, is you can put the SSTables, those flushed files, on SSD, but put the commit log on spinning disks if you want; you can choose. In HBase you can't really do that, because HBase is kind of dumb.
A
As far as the file system goes, HBase just hands the write to HDFS, and HDFS is responsible for putting it on disks, on its own machine and on other machines. In HBase you can't really separate things so that, say, the write-ahead log goes on SSD and the flushed files go on spinning disk; you don't get much control there. However, the MapR and Intel distributions of HBase are starting to make some effort in this field, and in HDFS, under those two JIRAs,
A
there are preliminary discussions happening about being able to decide where you're going to put a file: whether it's going to be on an SSD on a node, or on a spinning disk. Compactions: I'm not going to talk much about this; I'll just tell you that in Cassandra you can do tiered and leveled compactions, whereas in HBase
A
you can only do tiered. If you want to learn more about what a compaction is, check out that blog from Jonathan Ellis. Reads after disk failures: in Cassandra, if a specific disk fails, and that happened to be the disk that had the data you wanted to query, it's not a big deal, because when you do the read, you're not necessarily going to that machine. You don't know which machine it is; you just kind of randomly pick a machine.
A
That's now the coordinator machine, and that coordinator machine will do a hash of the key; then it will know which three machines to do the query from, and if one of those machines can't respond, it doesn't matter: one of the other two will respond. In HBase, there's only one machine that can do that read, because that's the region server responsible for that region.
A
So if a disk fails, that region server is basically going to have to get the data from one of the other two machines over the network and send it back to you. So perhaps, after disk failures, HBase will respond a little bit slower than Cassandra will. A few more, two more slides, I believe. Data partitioning: in Cassandra, when you are writing your table out with the row keys, you can use either the ordered partitioner or the random partitioner. Random partitioner means... oops.
A
So in HBase, you very commonly will actually take the row key you want to write and do an MD5 hash or something of it; then you take the MD5 hash and prepend it to the beginning of the key, and then you send that into HBase. So now the key is kind of going to a random machine, and you're not getting a hot spot on just one machine. But it depends on your use case.
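That key-salting trick can be sketched in a few lines. This is a minimal illustration, assuming an invented helper name and an arbitrary four-character MD5 prefix; it is not part of any HBase API:

```python
# Salt a row key by prepending a short hash prefix, so that sequentially
# generated keys no longer sort next to each other and writes spread out
# across region servers instead of hot-spotting one of them.

import hashlib

def salted_key(row_key, prefix_len=4):
    """Return the row key with a short, deterministic hash prefix."""
    prefix = hashlib.md5(row_key.encode()).hexdigest()[:prefix_len]
    return f"{prefix}-{row_key}"

# Sequential user IDs would normally all land in one region; with the
# salt, their prefixes (and therefore their sort order) scatter widely:
prefixes = {salted_key(f"user{i}").split("-")[0] for i in range(100)}
print(len(prefixes) > 10)   # True: many distinct prefixes
```

The trade-off, which is why "it depends on your use case" matters: once keys are salted you give up cheap range scans over the original key order, because adjacent keys no longer sit next to each other on disk.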
A
So sometimes the ordered partitioner does make sense, but you should know that for Cassandra, like 95 percent of use cases use the random partitioner, not the ordered one. We have three more minutes. Triggers and coprocessors: this is under development for Cassandra, but it already exists in HBase. What this lets you do is basically send client-supplied code into the address space of the server to run. So basically, you can have a special column family and you say: look, anytime anybody does a put in this column family,
A
I want you to also run this other logic to do something else, and that all happens on the CPU of that machine. So you can say: anytime anybody does a put in this column family, I want you to also do something like increment a counter in a different column family. Compare-and-set is under development for Cassandra but supported in HBase. This is basically an atomic read-modify-write; it's also called compare-and-swap, or check-and-set. And finally, multi-data-center support, or disaster recovery, is significantly stronger in Cassandra.
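The compare-and-set semantics just mentioned can be sketched as follows. The class is invented for illustration, with a lock standing in for the server-side atomicity; in real HBase this shows up on the client API as `checkAndPut`:

```python
# Toy compare-and-set: the update is applied only if the cell still holds
# the value the caller expected, making read-modify-write atomic.

import threading

class Cell:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()   # stands in for server-side atomicity

    def compare_and_set(self, expected, new):
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False                # value changed underneath us; caller retries

cell = Cell(10)
print(cell.compare_and_set(10, 11))   # True: value matched, update applied
print(cell.compare_and_set(10, 12))   # False: value is now 11, no update
print(cell.value)                     # 11
```

The `False` return is the whole point: two clients racing to update the same cell cannot silently overwrite each other, because the loser's expectation no longer matches and it has to re-read and retry.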
A
It's very mature and well tested. See, initially, when Cassandra was developed at Facebook by former Amazon Dynamo engineers, Facebook needed multi-data-center support.
A
It was a global company with data stored all over the world, and you don't want to have European users doing a query to a data center in, like, Virginia; the latency is too high. So they needed multi-data-center support, and that was kind of one of the core things built into Dynamo and Cassandra from day one. Rackspace is also a big contributor to Cassandra because of its multi-data-center support. So with Cassandra you can have a recovery point objective of zero, meaning you can rest assured.
A
It will be a little bit slower to do these types of writes, because you're sending the writes to both data centers, but you can do a write where you say: I need the write to go to both data centers, and go to disk in both data centers, and only then go and tell the client that the write is done. In HBase, the replication is only asynchronous, so the recovery point objective cannot be zero; you might lag a few milliseconds, seconds, or minutes behind, unlike in Cassandra.
A
I'm also an authorized DataStax training partner, so I can teach the DataStax curriculum. You can shoot me an email if you're interested in getting any training classes from me; I do custom consulting and training. I used to work at Hortonworks, Accenture R&D, and Symantec. You can find me on LinkedIn and Twitter, and there is a YouTube video I have as well.