Description
Speaker: Jonathan Ellis, Apache Cassandra Project Chair and CTO/Co-Founder of DataStax
SlideShare Presentation: http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
A: Welcome to this amazing event. Today we're here to celebrate and really recognize the achievements of everything the Cassandra community has done, especially on the East Coast. The amazing thing about this product, for all the technical achievements of the last few years, is that without the community, none of it would be possible. It's the people behind it that make this thing as great as it is. It's the 20 committers who work on this code every single day.
A: It's the hundreds of users who report bugs and submit lines of code as they go through the process of making their own technologies better, all the way to the thousands of businesses that depend on it to keep their companies up and running on a day-by-day basis. It all starts with the people, and without that community, none of this would be possible. Things like true 100% uptime wouldn't be possible.
A: Things like true multi-data-center capabilities would be a dream, and without those people behind it, companies would not be serving their customers better every single day. Today we have two amazing tracks. We've got many speakers here to give you the highs and lows of the technology, things they'd do better, and how they're using it. We've got speakers from large companies such as eBay; we've got companies such as SimpleReach here to tell you more about how they're using it, all the way through to new development advances on client libraries. We've got schedules all around, so check those out as you go and see what will really help you as you're building around this. And speaking of the community, I'd like to welcome Jonathan Ellis to the stage. Jonathan is my co-founder, and he's going to tell you a little bit about the state of Cassandra and where it's going.
B: Thank you, Matt. So, on the theme that Matt launched us off with: we have about twice as many people here as we did just over a year ago for our last New York conference. That's the kind of growth that's really awesome to see, and it's part of a general trend I've started to see of Cassandra really starting to go mainstream, with people becoming more aware that it's not just the database for social media that it started out as.
B: People are using Cassandra in the advertising space, in the energy sector, in government, in healthcare, in retail. It's really a general-purpose tool, and it solves the big data problem that more and more people are having better than anyone else as a system of record.
B: Cassandra, on this slide, is the line going up and to the right at the top, the solid black line, and then number two is HBase, about a third of the way down. This benchmark that the Toronto guys did is perfect for what I want to tell the world about a high-performance, scalable database.
B: Just because they're both kind of part of that same category, even though we really tackle different markets, DataStax actually hired a company called End Point to basically repeat the VLDB benchmark but add MongoDB to the mix. So on this slide I have Cassandra, HBase, and MongoDB. The Cassandra and HBase results match what the Toronto guys found pretty closely, which makes me feel good that this is a reproducible result, and MongoDB is the green line at the bottom. So what's interesting about this?
B
Is
that
you
know
this
is
this
is
a
logarithmic
scale,
both
on
the
x
and
y
axis?
So
on
the
bottom,
we
have,
you
know
the
number
of
machines
in
the
cluster.
How
is
it
scaling,
as
we
add
machines,
and
then
the
y
axis
is
operations
per
second?
So,
even
though
the
the
MongoDB
line
is
about
halfway
up
the
slide,
it's
it's
really
only
about
one-twentieth
the
throughput
of
Cassandra.
So
now
this
is.
B: Next, I need to show people how Cassandra scales beyond that relatively small number of machines. Fortunately, Netflix did a public study of this about a year ago, and you can see the results here: the scaling is just a nice straight line all the way out to 300 machines, where Netflix said, that's pretty good, that's giving us a million updates per second across these 300 large instances. The point is the shape of that line rather than the raw performance numbers, which obviously change; this was done against Cassandra 0.8.
B: Finally, some quotes from Cassandra users that I saw on Twitter and enjoyed. You can talk in theory about how Cassandra is designed for reliability, but really, the proof is in the pudding: how does it hold up in the real world? I'll call your attention to the lower right; Nathan Milford will be talking later this afternoon.
B: So that's what this is referring to. Let's get into a little more detail, starting with concurrent schema changes. This is a bit of a mulligan for me, because we actually tried to do this in 1.1, and we got almost all of the way there, except for creating new tables, which is arguably the most important part. So we had to go back to the drawing board and fix it right for 1.2.
B: For that second case, that's what this is talking about: it's now safe to let your application do that, with no risk of getting confused about what's going on in the rest of the cluster. A bigger feature, in terms of the impact on cluster management, I think, is virtual nodes. We've always had the paradigm that each Cassandra node was responsible for a single range of data; that's what's shown on the left here.
B
Is
what
we're
talking
about
in
the
you
know
the
11
and
earlier
days,
and
what
we're
doing
in
one
dot
2
is
we're
splitting
that
up,
so
that
we're
where
each
node
is
still
responsible
for
the
same
amount
of
data,
but
it's
split
up
into
smaller
ranges
and
what
that
does
for
us
is.
It
makes
it
so
that
the
machines
that
it
shares
data
with
that
it
has
pieces
of
data
replicated
to
it
also
spreads
out
across
the
cluster.
So
as
an
example
of
the
problems
of
this
solves.
B
If
I
were
rebuilding
node
5
in
this
six
node
cluster,
so
it's
failed.
I'm
brought
a
new
machine
in
and
I
need
to.
Re
replicate
the
data
to
it
that
the
old
node
5
used
to
have
so
I
have
ranges
of
data,
C,
D
and
E
replicated
to
it,
or
that
I
need
to
replicate
to
it,
and
I
can
grab
those
from
node,
3,
node,
1
and
node
4
and
there's
there's
other
choices
I
could
make.
But
fundamentally
I
can
pick
one
node
to
grab
each
of
those
ranges
from
so
in
this
six
node
cluster.
B
I
have
a
50-percent
participation
rate
in
that
rebuilt,
so
that's
not
terrible,
but
as
I
scale
that
cluster
up
the
same
number
of
machines
can
participate.
So
if
I
have
a
hundred
node
cluster,
that's
a
three
percent
participation
rate,
so
I,
my
rebuild
is
going
to
be
not
nearly
as
parallel,
not
nearly
as
quick
as
it
could
be.
So
since,
since
V
nodes
lets
us,
you
know
split
those
ranges
up
and
spread
them
across
the
cluster
in
terms
of
who
I'm
replicating
with
that
lets.
B
Everyone
participate
in
the
rebuild
so
in
in
one
dot
to
the
default.
If
you
enable
V
nodes,
it's
actually
disabled
by
default,
because
we
want,
we
want
to
be
a
little
bit
conservative
and
and
basically
not
surprised
people
who
don't
know
what
they're
signing
up
for,
because
some
of
the
management
strategies
are
a
little
bit
different,
but
by
default
we
use
256
V
nodes.
If,
if
you
have
more
machine,
if
you
have
hundreds
of
machines
in
your
cluster,
you
know
that
might
not
be
enough.
B
You
might
want
to
increase
that,
but
Cassandra
is
capable
of
increasing
that
after
you
deploy.
So
it's
not
something
you're
locked
into
the
other
thing
that
the
vino's
helps
a
lot
with
is
adding
new
machines
to
the
cluster.
So
if
you've
managed
Cassandra
clusters
before
you
know
that
either
you
need
to
double
the
number
of
machines
in
the
cluster
or
after
you
add
some
machines,
you
need
to
rebalance
the
cluster
and
have
it
basically
shift
everyone
around
the
token
ring
in
terms
of
what
they're
responsible
for
and
V
nodes
means.
B: If you're upgrading and you do want to enable vnodes, there's a line you uncomment in the configuration file. What that's going to do is split up my range into smaller virtual nodes, but they're all still going to be right next to each other. So the next step is to spread those across the cluster, and there's a command to do that called shuffle. Once you run it, Cassandra will start spreading those ranges around the cluster.
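For reference, a minimal sketch of what that looks like in 1.2; the num_tokens line and its 256 default are the ones discussed above, and the rest of cassandra.yaml stays as it was:

    # cassandra.yaml: uncommenting this line switches the node from a
    # single token to 256 virtual nodes (the 1.2 default mentioned above)
    num_tokens: 256

After restarting with that set, each node's old contiguous range has been split but not yet spread around, which is what the shuffle step then takes care of.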
B: The first of those is better support for just-a-bunch-of-disks (JBOD) deployments. Historically, the best practice has been to deploy Cassandra on a RAID 10 configuration, because in this scenario, if I have a bunch of hard disks under Cassandra and one of them fails, Cassandra hasn't known how to recognize that the disk is dead and not coming back. So we've encouraged people to deploy on RAID 10, which hides those single-disk failures from you.
B
The
downside,
of
course,
being
that
yeah
I've
I'm
already
letting
Cassandra
replicate
my
data
three
times
across
the
cluster
or
however
many
times
you
choose
and
so
giving
up
an
extra
fifty
percent
of
disk
space
to
have
that
replicated
locally
as
well.
It
feels
like
a
waste.
It's
it's.
A
trade
we'd
rather
not
make
so
one
dot
to
where
we're
emphasizing
support
for
that
jbug
configuration
just
like
Cassandra
managed
the
raw
disks,
and
so
we've
talked
to
sandra
to
recognize
that
you
know
when
it
disk
fails
and
what
to
do
about
it.
B
What
to
do
about.
It
is
a
little
bit.
It's
not
a
one-size-fits-all
answer,
actually
so
by
default.
What
we'll
do
is
if
we
recognize
that
a
disk
is
dead,
will
actually
shut
down
the
cassandra
process
on
that
machine
and
then
we'll,
let
you
you
know
either
replace
reboot,
strap
that
machine
or
you
know,
maybe
you
want
to
run
a
repair,
but
it's
it's
you
know
will
will
shut
it
down
by
default.
B
The
reason
is
that,
if,
if
we
instead
allow
that
machine
to
continue
running
knowing
that
it
has
a
missing
disc,
then
if
a
request
comes
to
me
for
data
that
I'm
supposed
to
be
managing
I'm
supposed
to
have
a
replica
of,
but
that
data
was
on
the
disk-
that's
gone,
you
know,
I,
don't
know
which
rose
I'm
missing.
All
I
know
is
that
I've
lost
a
disk,
but
I
don't
know
exactly
what's
missing.
B
So
that's
why
conservatively
we
stop
the
process
and
let
you
you
know:
if
you
reboot
strap
it,
then
you
won't
be
getting
any
out-of-date
information
served
up,
but
it's
up
to
you.
You
know,
if
you,
if
you,
if
you're
okay
with
with
running
it
with
Cassandra,
serving
up
those
that
obsolete
data,
then
that's
an
option
that
you
can
configure.
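A sketch of how that choice is expressed, assuming the 1.2 cassandra.yaml knob for this, disk_failure_policy:

    # cassandra.yaml
    # stop:        shut the node down when a disk dies (the conservative
    #              default described above)
    # best_effort: keep serving from the remaining disks, accepting that
    #              some obsolete data may be returned
    disk_failure_policy: stop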
B: As you know if you've done garbage collection tuning in anger (and it usually does make you angry), the JVM's garbage collection algorithms can deal with a heap up to about eight gigabytes or so before the pause times start to get worse and fragmentation gets worse.
B
Everything
gets
worse
above
that
you
might
be
able
to
push
it
up
to
12
or
16,
but
that's
really
the
outside
of
of
what
you
can
you
know,
grow
a
java
heat
to.
So
what
we
needed
to
do
was
we
needed
to
move
some
of
some
of
our
memory
usage.
We
needed
to
move
it
to
native
memory,
so
not
need
the
JVM
garbage
collector
to
deal
with
it.
So
that's
what
we
did
in
one
dot
to.
B
It's
good
to
have
that
garbage
collection
on
by
default
makes
a
lot
of
the
concurrent
algorithms
that
we
do
a
lot
more
sane.
But
in
some
of
these
cases
we
do
have
to
kind
of
go
behind
its
back
to
get
the
the
performance
that
we
need
so
switching
gears.
Now
to
what
have
we
added
on
the
client
development
side?
What
new
features
do
we
have
for
you
to
use
in
your
applications?
B
One
of
the
the
first
of
these
is
atomic
batches.
So
as
review
before
I
talk
about
the
atomic
part,
I
just
wanted
to
do
a
quick
review
of
what
regular
batches
are
and
and
what
the
problem
is
that
we're
trying
to
solve
with
atomic
batches.
So
a
batch
is
just
a
group
of
updates
to
different
rows
that
you
want
Cassandra
to
apply
as
a
unit.
So
in
this
slide,
I've
got
a
red,
yellow
and
blue
rose
that
that's
my
batch
and
I
and
those
live
on
different
replicas.
B
So
now
the
Red
Road,
you
know,
is
not
part
of
the
same
token
range
as
the
yellow
row
or
the
blue
row.
So
this
is
what
that
looks.
Like
you
know.
If
everything
goes
well,
that
the
client
says
here's
my
batch,
the
coordinator
says
okay
I'll
figure
out
where
each
of
those
rows
goes
and
send
them
out.
The
problem
is
what,
if
the
coordinator
actually
starts,
sending
out
those
rows
but
then
dies
partway
through
so
now,
I
had
this
group
of
rows
that
I
wanted
to
apply
as
a
unit,
and
you
know
I.
B
Now
there
it's
in
some
unknown
state.
You
know
some
of
the
rows
may
be
applied
and
not
others
I,
don't
know.
So
what
we
do
with
atomic
batches
is
we
actually
basically
create
a
backup
coordinator
by
using
this
concept,
called
a
batch
log,
which
is
basically
just
a
system
table
where
the
coordinator
will
pick
a
couple
other
machines
in
the
stir
and
say
here's
the
batch
that
I'm
about
to
apply.
B
If
you
don't
hear
back
from
me
soon
then
assume
that
I'm
dead
and
you
can
take
over
applying
that
batch.
So
in
that
scenario
the
batch
alot
will
actually
my
diagram
slightly
misleading
because
it
doesn't
know
which
rose
got
applied
either,
so
it
will
actually
replay
all
of
them,
but
that's
safe
in
the
Cassandra
world,
because
rights
are
idempotent,
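In CQL terms a batch is written like the sketch below; the table and values are made up for illustration. In 1.2, BEGIN BATCH gets the logged, atomic behavior just described by default, and a BEGIN UNLOGGED BATCH variant opts out for callers who don't want the extra batch-log write.

    BEGIN BATCH
      INSERT INTO users (id, name) VALUES (1, 'alice');
      UPDATE user_index SET user_id = 1 WHERE name = 'alice';
    APPLY BATCH;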
Maybe the biggest change in 1.2, though, is this thing we've been working on called CQL, the Cassandra Query Language.
B: What I want to do here is show you how Cassandra maps a data schema that was defined under the old Thrift rules to CQL. It's going to get a little bit hairy and brains will explode, but the take-home lesson is that you can express anything in CQL that you could have using the Thrift API, and if you decide you want to start writing new features for your application in CQL, that's a gentle upgrade path.
B
I
don't
have
to
dump
and
recreate
tables,
or
any
of
that
I
can
use
the
same
data
files
that
I've
been
using,
but
Cassandra
will
know
how
to
deal
with
that
with
the
with
the
cassandra
query,
language,
so
I'll
show
you
how
that
works,
I'm
going
to
be
talking
about
a
fairly
simple
data
model
where
I
have
songs,
and
I
have
playlists
that
have
groups
of
songs
and
and
show
how
that
maps
to
c
ql.
So
the
songs
definition
is
is
both
the
simplest
and
the
hairiest
begin.
B
Just
in
terms
of
how
long
it
is
because
the
you
know,
the
thrift
schema
definition
was
not
optimized
for
this,
but
but
what
I'm
doing
is
I've
been
basically
creating
what
we
would
call
a
static
column
family
where
I
have
four
columns
in
this
in
this
table,
and
they
all
know,
every
row
has
the
same
columns.
Every
row
has
a
title
and
all
album
and
artist,
and
then
some
song
data,
you
know
mp3
or
flac
or
whatever.
B
So
this
actually
maps
one-to-one
straightforwardly
with
with
the
cql
definition
where
I
have
create
table
Titus
title
artist,
album
data-
you
know
very
straightforward,
so
you
can
see
how
now
these
rows
here,
where
I
have
these
thrift
or
storage
engine
data
cells?
You
know
those
turn
into
one
to
one
to
the
the
cql
columns.
So
there's
very,
very
straightforward.
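A hedged reconstruction of that CQL definition (the column types are guesses, but the shape matches the four static columns just described):

    CREATE TABLE songs (
      id uuid PRIMARY KEY,
      title text,
      album text,
      artist text,
      data blob        -- the raw MP3/FLAC bytes
    );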
B
Now
things
start
to
get
more
interesting
when
we
want
to
model
what
we,
what
we
called
a
wide
row
column
family,
so
to
illustrate
that
I'm
going
to
use
a
song
tags,
I
wanted
AG
my
my
songs
with
different
categories
and
each
song
can
have
multiple
tags
and,
of
course,
each
tag
can
be
applied
to
multiple
songs.
So
the
way
the
straightforward
way
to
do
this
in
one
dot,
one
would
have
been
with
a
table
like
this.
B
I'll
have
a
song
tags
table
and
what
I'm
going
to
do
is
I'm,
going
to
take
that
column,
name
and
I'm
going
to
treat
it
as
a
piece
of
data.
And
so
what?
What
we're
going
to
see
here
is
that
this
isn't
going
to
map
one
to
one
with
cql
columns
anymore
and
so
we're
I'm
going
to
do
a
little
bit
of
a
retroactive
terminology.
B
Change
so
from
from
here
on
when
I
say,
column,
I'm
going
to
be
talking
about
the
cql
concept
and
instead,
when
I'm
talking
about
the
the
thrift
concept
of
you
know
a
name
and
a
value
tuple,
I'm
going
to
say,
sell
so
I'm
to
just
to
make
it
clear
which
one
I'm
talking
about
so
so
each
of
these
cells
has
a
name
that
you
know
is
determined.
That
is
basically
the
name.
Is
the
the
tag
data
that
I'm
interested
in
so
the
way
we
map
that
to
c
ql?
B
Is
we
introduced
the
concept
of
a
compound
primary
key,
so
I'm
going
to
have
my
partition
key
that
the
song
ID
here
and
then
I'm
going
to
have
the
the
tag
name
and
by
having
that
tag
name
as
part
of
the
primary
key?
That's
telling
Cassandra
that
that
tag
name,
that's
actually
contained
in
the
cell
name,
and
so
each
of
those
cell
names
I've
tried
to
illustrate
here
that
this
orange
this
row
at
the
top
here
that
I've
grouped
in
an
orange
box
that
turns
into
one
row
per
sale
in
the
c
ql
world.
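A sketch of that definition; the types are assumptions, but the key structure is the point:

    CREATE TABLE song_tags (
      id uuid,                       -- partition key: which song
      tag_name text,                 -- stored in the cell name on disk
      PRIMARY KEY (id, tag_name)
    );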
B: The other thing that's important to note here is that this CREATE TABLE up at the top is optional, in the sense that if I don't have it, if I had this definition in 1.1 and then I upgrade to 1.2 and say SELECT * FROM playlists, I'm going to get this result set back, except that Cassandra won't know what name to give each of those components.
B
So
what
it
will
give
me
back
is
instead
of
ID
title
artist.
It'll
give
me
key
column,
one
column
to
column
three
and
value
is
what
it
will
give
me.
So
the
the
sort
of
column
metadata
is
optional
and
you
can
give
it
to
Cassandra
without
recreating
anything.
All
you
need
to
do
is
you
say,
alter
table,
playlists,
rename,
column,
12
title
and
that's
what
you
would
do
yeah.
The
other
thing
that
we've
added
in
1
dot
2
is
kind
of
syntactic
sugar
for
certain
types
of
composite
columns,
and
we
expose
these
as
elections.
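As a sketch of the 1.2 collection types, with a hypothetical users table; under the hood each element is stored as one composite cell, which is why this is sugar rather than a new storage-engine feature:

    CREATE TABLE users (
      id uuid PRIMARY KEY,
      emails set<text>
    );

    UPDATE users SET emails = emails + {'alice@example.com'}
      WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;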
B
Dictionary
and
I
I
won't
go
into
that
in
in
detail,
except
to
say
that
this
replaces
all
the
kind
of
one-off
methods
we've
had
for
thrift
for
described,
schema
for
getting
the
the
token
ring
and
so
forth.
So
we've
got
you
know
all
in
all
of
these
are
in
the
system.
Key
space
you've
got
key
spaces,
you've
got
column,
families
you've
got
local,
which
is
data
that
I
know
about
myself.
B
Note
that
the
tokens
column
and
because
of
the
wrapping
it's
a
little
bit
hard
to
tell,
though
the
tokens
column
is
the
second
to
last
one
here.
So
you
can
see
that
that's
a
set
of
zero
is
what
that
is.
So
I
have
a
single
token,
and
that
and
my
token
is
0
on
this
machine,
and
so
as
if,
if
you
upgraded
to
V
nodes,
then
that
would
be
a
larger
set,
so
we're
already
using
these
these
data
types
internally.
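For example, the single-token layout he is describing can be read back with something like this; system.local is the 1.2 system table being shown:

    SELECT cluster_name, tokens FROM system.local;
    -- tokens comes back as a set, e.g. {'0'} on this machine,
    -- or a 256-element set once vnodes are enabled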
B
So
this
can
be
very
useful
for
figuring
out.
Why
is
Cassandra
not
as
fast
as
I
thought?
It
should
be
so
something
that
a
lot
of
people
like
to
do
when
they're,
starting
out
as
saying
that
hey
since
cassandra
is
going
to
order
everything
within
a
partition.
For
me,
I
can
really
I
can
use
that
to
making
a
persistent
q
really
easily.
So
my
q
definition
might
look
like
this,
where
I
have
a
queue,
ID
and
then
I'll
have
Q
entries
that
are
the
created
at
and
then
the
the
value.
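A sketch of that queue definition (names and types are illustrative):

    CREATE TABLE queues (
      id uuid,              -- which queue
      created_at timeuuid,  -- clusters entries in time order
      value blob,
      PRIMARY KEY (id, created_at)
    );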
B
So
because
so
Cassandra
will
give
me
these
q
entries
by
in
chronological
order.
So
as
I
pull
those
out,
you
know
I'll,
delete
them
and
then
always
get
the
the
next
most
recent
one
from
my
cue.
So
the
problem
is
so
I'll.
Here's
the
query,
I'll
do
at
the
top
here.
You
know
to
get
the
next
item
in
the
queue,
but
after
I've
done,
you
know
thousands
of
these
inserts
and
deletes.
Then
the
tombstones
start
to
be
a
problem,
and
so
this
is
this
doing
a
trace.
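The consume cycle would look something like the sketch below, and it's the SELECT that degrades: every dequeued entry leaves a tombstone that later SELECTs must scan past before finding a live cell.

    -- get the next item in the queue
    SELECT created_at, value FROM queues WHERE id = ? LIMIT 1;

    -- consume it, which writes a tombstone
    DELETE FROM queues WHERE id = ? AND created_at = ?;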
B: Doing a trace illustrates that. Everything is small numbers of microseconds until we get to this highlighted one in green, where all of a sudden this took 35,000 microseconds, or 35 milliseconds, and it says: I read one live cell, which is what you asked me for with LIMIT 1, and a hundred thousand tombstoned cells.
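That trace output comes from the request tracing added in 1.2; in cqlsh it's toggled per session, roughly like so:

    TRACING ON;
    SELECT created_at, value FROM queues WHERE id = ? LIMIT 1;
    -- cqlsh then prints each internal step with its elapsed
    -- microseconds, including the live-vs-tombstone cell counts above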
B
So
you
know
it
the
tombstones,
you
know,
aren't
free,
and
this
is
one
of
the
things
that
you
need
to
keep
in
mind
as
you're
designing
your
Cassandra
data
model
and
it's
always
been
a
little
bit
tough
to
to
see
those,
because
you
know
everything
in
the
client
API,
basically
months,
it's
a
tombstone.
It
doesn't
exist
anymore.
So
it's
very
easy
to
not
see
the
effect
that
they're
having
on
you
and
tracing
exposes
that.
So
briefly,
I
also
wanted
to
give
a
little
bit
of
a
heads
up
today
on
what
we're
working
on
for
20.
B
Like
I,
said
we're
targeting
this
for
july.
The
items
on
this
list
are
ordered
in
kind
of
how
far
along
we
are
in
implementing
them,
so
eager
retries
and
improved
compaction.
These
are
done
and
then
triggers
compare
and
set
and
a
more
efficient
repair.
Those
are
works
in
progress,
so
eager
retries
is
no
for
historical
reasons.
B
But
you
know
you
can
still.
You
can
still
have
hiccups
in
that.
You
know
either
because
I
routed
the
request
to
the
to
the
middle
one
here
and
then
he
died.
You
know
after
I
sent
him
the
request,
or
maybe
he
didn't
die,
but
maybe
maybe
he
had
a
garbage
collection
pause.
You
know
just
briefly,
so
what
what
eager
retries
does?
Is
it's
tunable
by
default?
It
will
use
75th
percentile.
B
So
that's
going
to
be
very
useful
for
your
smoothing
out
the
the
latency
volatility,
we're
doing
a
couple
things
for
improved
compaction,
we're
introducing
specialized
compaction
strategies
for
four
different
workloads,
a
one
that's
pretty
easy
to
take
care
of
is
I'm
only
inserting
new
data
I'm,
never
overriding
existing
rose
and
then
every
so
and
I
want
that
data
to
stay
around
for
30
days
or
or
three
months
or
whatever.
So
we
can.
We
don't
need
to
merge
those
data
files
with
existing
ones
to
take
care
of
your
row
over
rights.
B
All
we
need
to
do
is
expire
those
data
files,
all
at
once,
when
everything
in
that
row
or
in
that
table
in
that
data
file
has
expired.
The
interest
the
more
interesting
question
is:
can
we
do
anything
for
a
general
purpose
workload?
Can
we
do
better
and
I'm
going
to
refer
you
here
to
to
Jake
Lucy
yannis
talk
later
today,
he's
going
to
be
talking
about
his
new
compaction
strategy,
I
found
out
about
this
yesterday,
it's
it's
pretty
clever.
I'd
recommend
checking
that
out.
B: For triggers, we have a proof of concept. It's not finished, but we do have a proof of concept that shows it's possible, and it actually builds on the atomic batches from 1.2, which is why it's a lot more tractable now. The syntax is a little bit up in the air, but it's probably going to be something like this, where you tell Cassandra: I want you to execute this trigger, that's in this jar, against this table.
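The provisional syntax he's sketching looks roughly like this; the trigger name and class here are hypothetical:

    CREATE TRIGGER on_user_write ON users
      USING 'com.example.triggers.AuditTrigger';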
B
You'll
have
a
class
that
implements
the
trigger
interface.
That
basically
takes
a
row
and
you
can
add
or
remove,
updates
from
that
row,
and
since
it
is
your
raw
code,
you
can
also
do
things
like
send
an
email.
You
can
do
things
like
emit
a
storm
tuple
and
other
things
like
that.
So
we're
giving
you
kind
of
maximum
flexibility
as
well
as
maximum,
not
user
friendliness,
because
you
know
it
is,
it
is
raw
code.
B
So
the
classic
example
I
like
to
use
is
what,
if
I
have
you
I
want
to
support
user
registration
so
in
in
Cassandra
by
itself?
There's
no
really
good
way
to
do
that,
because
I
can
have
concurrent
users
as
I've
Illustrated
here
both
say:
does
this
user
exists
Cassandra
says?
No,
so
both
of
them
try
to
create
it
and
remember
that
insert
in
Cassandra
is
really
insert
or
update.
So
one
of
these
is
going
to
scribble
over
the
insert
that
the
other
guy
did
so
we
need
to
be
able
to
separate
those.
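For a concrete feel, here is what the registration example looks like with compare-and-set. The syntax below is the IF NOT EXISTS form this feature eventually shipped with in Cassandra 2.0; at the time of this talk it was still undecided, and the users table is hypothetical.

    INSERT INTO users (username, email)
    VALUES ('alice', 'alice@example.com')
    IF NOT EXISTS;
    -- returns applied = false to whichever concurrent
    -- registration loses the round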
B
And,
of
course,
paxos
is
the
distributed,
consistency
superhero
and
that's
what
that's
what
we
ended
up
with
after
after
a
lot
of
research,
so
Paxos
is
basically
two-phase
commit
on
steroids
that
it
lets
a
new
proposer
kind
of
take
over
and
and
finish
up
a
proposal
that
that
got
have
got
partway
through.
So
it
handles
the
kind
of
failure
conditions
that
two
phase
commit.
Doesn't
so
there's
a
couple
interesting
questions
when
we
apply
this
to
Cassandra,
one
of
which
is
what
do
we
call
it?
B
You
know
I
can
say,
compare
and
set
to
an
audience
like
this
to
a
more
general
audience.
I
think
I
need
something:
a
little
more
descriptive,
a
little
more
user
friendly,
so
kind
of
trying
to
figure
that
out.
Another
is
that
this
is
actually
a
rare
feature
that
it's
actually
easier
to
create
a
thrift
API
poor
than
it
is
a
cql
one,
because
there's
no
real
analog
in
the
relational
world.
The
relational
world
solves
this
problem
with
transactions
which
are
not
a
good
fit
for
Cassandra
for
a
bunch
of
reasons.
B
Finally
we're
looking
at
doing
more
efficient
repair.
So
repair,
of
course,
is
consist
of
two
phases
that
we
call
validation
where
it
builds
a
hash
tree
of
the
data
that
it
has
and
exchanges
it
with
the
other
replicas
and
then,
after
that,
the
the
replicas
dig
down
in
the
hash
tree
to
find
out
where
the
inconsistencies
are,
where
the
replicas
have
different
information,
and
then
they
stream
that
data
to
each
other.
So
most
of
the
most
of
the
time,
there's
a
relatively
small
amount
that
we
need
to
stream.
B
So
the
validation
part
is
the
painful
part.
So
what
I
have
to
do
is
I
have
to
go
over
all
the
data
by
that
I
have
for
the
range
that
I'm
repairing
turn
it
into
a
hash
tree.
If
I
then
add
more
data,
I
have
to
build
a
new
hash
street
from
that
I
have
to
go
through
that
same
process.
What
we'd
like
to
do
is
if
I've
added
new
data
I
want
to
build
a
hash
tree
just
for
that
new
data
and
and
resync
that
so
I
think
we
can
do
this.
B
This
does
mean
that
we're
going
to
need
to
have
kind
of
two
modes
of
repair,
because,
if
going
back
to
my
first
example
of
j
bata,
I've
lost
a
disk
and
I
need
to
rebuild
the
data
that
was
there.
So
in
that
case,
I
can't
just
repair
it.
New
data
I
need
to
rebuild
everything,
so
we'll
need
to
add
an
option
to
repair
that
says
you
know
include
previously
repaired.
You
know,
rebuild
everything
kind
of
mode,
so
I
I'll
take
two
questions.
2
questions.
B
The
question
was
since
we're
adding
a
cql
language
inspired
by
you,
know,
sequel,
have
you
considered
new
sequel
and
and
the
approaches
they're
taking
their
fundamentally
it's
those
systems
are
taking
a
different
approach
where
there
they
are
trying
to
provide
you
full
acid
transactions
and
and
there's
a
bunch
of
limitations
that
come
with
that
that
we're
not
prepared
to
accept
primarily
around
not
being
able
to
do
the
kind
of
multi
Dennis
data
center
replication
that
we
do
as
well
as
now.
There's
a
lot
of
overhead
with
that
and
and
Cassandra's
faster.
B
Are
you
going
to
decommission
the
existing
thrift
API
and
will
will
new
features
that
are
being
added
to
c
ql
become
available
to
existence?
Rift!
That's
a
good
question,
so
I
should
have
should
have
mentioned
that,
because
I
want
to
be
very
clear
that
thrift
isn't
going
anywhere
so
we're
not
going
to
break
working
code,
we're
very,
very
firm
on
that.
B
That's
that
said,
those
things
like
compare
and
set,
and
that's
straightforward
to
add
a
thrift
api,
for,
I
just
add
a
new
cass
thrift
method,
things
like
collections
that
doesn't
really
make
sense
in
the
thrift
world,
because
that's
kind
of
syntactic
sugar
that
I've
done
to
your
thrift
row
to
expose
that
to
c
ql.
So
on
the
thrift
side
you
would
still
be
dealing
with.
You
know
lists
of
byte
buffers
right
and
then
so
you
can
still
access
that
collection
data,
but
you
don't
really
have
anything
that
says
you
know
treat
it
differently.
B
You
just
have
to
go,
buy
that
composite
cell
so
yeah.
We
we
are
going
to
try
to
expose
things
to
thrift
where
that
makes
sense,
but
I
think
that
there's
going
to
be
cases
like
the
collections,
where
it
doesn't
really
make
sense
to
try
to
try
to
do
anything
more
than
thrift
already
does
with
it
so
I.
We
do
need
to
break
now
we're
going
to
the
the
two
tracks
now
and
at
has
one
more
word
of
housekeeping
for
us
just.
A: Just a few very quick things. First of all, at twelve-fifteen today, Edward Capriolo is going to be signing the book he wrote on Cassandra, the Cassandra High Performance Cookbook, right over yonder, so if you want to ask him any questions about the book, feel free to go visit him. As a reminder, there are three total rooms for the tracks; this is the first one.
A
The
second
is
on
the
fourth
floor
and
then
the
meet
the
experts
session,
which
starts
at
launch,
is
on
the
eighth
floor
and
the
elevator
and
stairs
are
right
back
here.
We've
got
free,
Wi-Fi
access
in
here.
The
SSID
is
metro
wireless
and
the
password
is
Metro
2013
all
lowercase
on
that.
Second,
one.
A
Two
Metro
wireless's
and
it's
the
one
without
the
space
we've
got
charging
stations
/
by
our
partner
pavilion,
as
well
as
on
the
fourth
floor,
so
for
everyone
who
needs
to
have
a
laptop
knock
yourself
out
and
speaking
of
the
partners,
a
lot
of
this
wouldn't
be
possible
without
them.
So
please
stop
by
and
say
hi
to
them
and
see
if
there's
anything
of
interest
in
there,
because
they've
been
very,
very
gracious
and
making
this
event
possible
or
than
that
have
a
great
session.
Everyone-
and
thank
you.