Description
Speaker: Jonathan Ellis, Apache Cassandra Project Chair and CTO/Co-Founder at DataStax
Slides: http://www.slideshare.net/planetcassandra/c-summit-eu-2013-keynote-by-jonathan-ellis
Keynote Presentation on Cassandra 2.0 & 2.1 by Jonathan Ellis at Cassandra Summit EU 2013
Hey, I'm Patrick McFadin. A lot of you know me already, and you probably hear me talk all the time anyway, so I apologize for that. Last time I talked to you I was not yet, but now I am, chief evangelist, which is really cool, because what I like to talk about all the time is Cassandra. The evangelist team is here today: it's me and Al Tobey. Some of you may know Al from his talks.
He's probably one of the best performance people when it comes to Cassandra. He's around today, so come find us if you want to talk to us; we're here to help you. It's all about community. This year we've seen so much community growth, and that is really exciting. We did our summit in San Francisco with over a thousand people there (I think a lot of you were there), and now we're doing this in Europe.
We're just going to keep the love going, so today we're going to be doing a lot of talks, but we're doing some announcements as well. One of the announcements is the DataStax Academy for Apache Cassandra. Yesterday I said I was going to put the link out here; I apologize for the wall of text. This for me is really exciting, because this is where we're going to really start seeing people use Cassandra effectively, I think. We have online training, and it's free.
We have a lot of content on there. I was personally involved in this, and there were several of us in the community who were involved in creating this great content. Please sign up for it; we really want to get a lot of people on there. A hundred thousand registrations by 2014 is pretty ambitious.
A couple of other things, a little housekeeping: tonight after the talks we're going to be doing a reception, but we're also going to be doing lightning talks. If you haven't seen the lightning talks, at the summit in San Francisco they were so much fun, and we have a great lineup for tonight too. So stick around; don't just take off after five o'clock. Plus, you don't want to get on the tube at this time anyway, so stick around and have a beer. The other thing we're going to be doing tonight is our MVP announcements.
We didn't do that earlier this year, so we're going to do it today, and it's really exciting, because we really like to recognize people in the community, and that's what the MVPs are about. I ran through my list and almost caught us up on time, so I think I'm done talking. You're here to listen to Jonathan talk; he's going to be talking about Cassandra 2.0. Of course, Jonathan is the project chair for Apache Cassandra, CTO at DataStax, and pretty much the man when it comes to Cassandra.
Now, the one problem with the study that the guys in Toronto did was that it didn't include MongoDB. MongoDB is kind of playing in a different part of NoSQL; they're more in the rapid-development part than the scalable-systems part. But a lot of people out there think: I've heard of NoSQL, I've heard of MongoDB, and I want to know how it compares to Cassandra performance-wise.
So we actually repeated it: DataStax hired a company called End Point, which works on these kinds of things, to repeat the Toronto study with HBase and with MongoDB, and again we see Cassandra about four times faster than HBase, which is the line kind of in the middle there. It's a reproducible result; that's exactly what the Toronto guys saw, so we felt good that we were doing it right. Both of these benchmarks were done with an open-source benchmarking
suite called YCSB, which is available on GitHub, so we felt good that we were getting a reproducible result. MongoDB, you can see, is roughly 20 times slower than Cassandra once you start throwing a lot of data at it, and that's actually an important caveat, because if you look at the MongoDB site, all the numbers that they give
are from data sets that fit a hundred percent in memory. When you get into the real world, people don't want to have to buy enough RAM to fit their entire data set in memory. You want to fit your hot data in memory, but there's no sense in spending extra money on keeping your cold data in RAM as well. So I actually told the End Point guys: make sure you benchmark
a data set where the hot data fits in memory, but with more cold data than will actually fit in memory. That's why this graph looks very different from the ones you might see where someone benchmarked a toy data set: hey, the database is fast! Well, it's fast until you get outside of memory, and then things really start going through the floor.
These are some experiences that people have had with Cassandra in the field, speaking to the reliability and availability story. In the bottom center here you have Nathan Milford from Outbrain, who lost a data center when Hurricane Sandy hit the east coast of the United States; Cassandra powered through. Cassandra also deals well with small outages, not just really big ones, so that's Jake Luciani in the lower right talking about Cassandra dealing with individual disk failures and managing that as well.
So what we've been focusing on this year has been a new, fourth core value: really making Cassandra easy to use, and bringing that power to people who are interested in using Cassandra and need a scalable database, but don't want to live on the bleeding edge. They don't necessarily want to have to dig into the codebase to figure out
what's going on. So ease of use and developer productivity are really what we want to focus on, and what we've been focusing on this year. That comes in two forms. One is on the right here: the Cassandra Query Language. Up until Cassandra 1.2, in January this year, the main way you accessed Cassandra was with an RPC protocol called Thrift, which was kind of clunky and kind of painful. So we introduced the Cassandra Query Language, which is basically a subset of SQL.
That's part of ease of use, but another part is making Cassandra more forgiving, making it harder to do the wrong thing, and I'll talk about some examples of how we've been working on that. As part of bringing out CQL, the Cassandra Query Language, we've also introduced a new native protocol, one that's native to CQL.
It gives you improved performance, and it gives you asynchronicity and push notifications out of the box. We work on that kind of in parallel with the server piece, so the Java and .NET drivers are both production-ready, the Python driver is in beta, and we're working on others. There's a talk later today on the Ruby driver for CQL, and the author of that driver will be talking about its low-level details. So if you're looking to get your hands dirty with some pretty low-level code, I would definitely recommend checking that out.
One of the important things we've introduced is tracing. We actually support it in the old Thrift protocol as well as CQL, but it's so easy to use with CQL that I want to call it out here. For those of you who have been using Cassandra for a while: this isn't new in 2.0, it was actually introduced in 1.2, but a lot of people don't know about it yet, so I want to keep mentioning it, because it's such a powerful tool when you're trying to figure out
what's going on under the hood with my Cassandra cluster. What this does is something like EXPLAIN ANALYZE from the PostgreSQL world: it shows you what's going on when Cassandra executes my query and how long each of those steps takes. Here I just have a simple insert statement, nothing fancy, but (I've color-coded this) you can see the blue at the top.
This is the node that the client talks to and gives the request to. Cassandra then has to send that insert off to some other machines; you can see that lower down. Those other nodes acknowledge the write back to the coordinator, which is what the client is talking to, in blue again, and then it hands that acknowledgement back to the client. In a distributed system, obviously, you can't just tail your log to see
what's going on; you need to assemble pieces from all the different machines that participate in the request, and that's what this does for you, in a very nicely packaged, easy-to-use format. Now, this is one way you can use tracing: just from the CQL shell you can turn tracing on and run your query. You can also trace programmatically, and you can say: I want to trace 0.1 percent of my requests and save those out. All the traces are saved to a system table.
Another feature that we introduced earlier this year that a lot of people don't know about yet is authentication and authorization. You can enable authentication, and Cassandra will make sure that you have the proper credentials to access the cluster. Out of the box in open-source Cassandra we have the password authenticator, which stores your users and your hashed passwords natively inside the Cassandra cluster. In DataStax Enterprise we also offer the Kerberos authenticator, which gives you access to enterprise single sign-on.
B
So,
as
John
mentioned
earlier,
datastax
enterprise
is
now
free
for
startups,
so
definitely
something
to
keep
in
mind
there,
and
so
this
is
what
that
looks
like
in
c
ql
that
I
can
create
users.
I
can.
I
can
change
their
passwords
as
the
superuser.
I
can
drop
them.
This
is
actually
something
you
want
to
do
something
like
this.
If
you
enable
authentication
by
default,
Cassandra
just
still
lets
everyone
just
connect
to
the
question:
do
whatever
they
want.
B
If
you
enable
authentication,
Cassandra
creates
a
default
super
user
whose
username
is
password
who's
in
user
name
is
Cassandra
and
who's
password
is
Cassandra.
So
so
you
know
this
is
this
is
kind
of
the
the
bootstrap
user?
So
what
you're
supposed
to
do,
then?
Is
you
either
create
a
new
super
user
and
drop
the
existing
one
or
you
change
the
password
on
the
existing
one?
That's
that's
an
important
thing
to
do.
B
We
also
support
authorization
and
once
I've
got
these
users
authenticated
splitting
out
what
data
and
they're
allowed
to
access
and
modify,
and
so
I
I
can
do
that
on
the
keyspace
level
on
the
table
level
and
read
and
modify
our
separate
permissions.
So
so
pretty
straightforward
paradigm
is
what's
what
we're
familiar
with
from
the
relational
world.
So, moving now into the Cassandra 2.0 world: one of the marquee features of Cassandra 2.0 is lightweight transactions. Here's what we've seen over the last five years as people deploy Cassandra applications. Cassandra is an eventually consistent system, or tunably consistent if you prefer, and what that means is there's no concept of locking; there's no way to say: while I'm doing this operation, nobody else is allowed to touch this data. That's great for ninety-nine percent of your application.
B
But
it
turns
out
that
that,
for
a
lot
of
op
applique
patient's,
there's
a
there's,
a
small
piece
that
will
need
some
kind
of
linearize
ability
a
way
to
separate
and
isolate
operations
from
other
things
going
on
in
the
cluster,
and
so
what
we
saw
is
that
that
people
needed
these
so
badly
that
they
were
willing
to
do
unnatural
things
like
integrate,
zookeeper,
locking
on
top
of
cassandra
to
get
that,
and
so
it
this
this
setup.
You
know
this
cuz
operational
problems
is
caused
development
complexity.
B
So
we
wanted
to
solve
this
problem
and
so
in
a
nutshell,
as
an
example
of
the
kind
of
race
condition
that
this
solves
is,
you
know,
consider
creating
user
accounts
where
I
allow
users
to
specify
their
own
username.
If
I
have
two
people,
both
asking
Cassandra
hey
does
this
does
username
JB
Ellis
exist
and
they
both
asked
that
at
the
same
time
and
cassandra
says
no,
it
doesn't
exist
yet
no,
it
doesn't
exist
yet
and
then
they
both
go
ahead
and
try
and
create
and
insert
that
row.
B
Then
one
of
them
is
going
to
stomp
on
the
other
and
so
I've
basically
introduced
application
level
corruption,
because
those
operations
were
not
isolated
and
and
Patricks
going
to
talk
in
more
detail
about
this
in
his
data.
Modeling
talk
later
on
today
in
this
auditorium,
but
you
know
so.
The
short
version
is
that
lightweight
transactions
solves
this
problem
for
you.
So
we
implement
this
using
a
consensus.
Protocol
called
paxos
apexis
has
a
reputation
for
being
difficult
to
use
difficult
to
implement.
It's
actually,
not
that
bad.
There's
a
paper
called
paxus
made
simple.
B
Most
of
the
complexity
in
paxos
comes
from
trying
to
elect
a
master
that
then
does
most
of
the
paxus
operations.
But
then,
if
the
master
fails,
you
need
to
do
a
reelection
and
we're
not
doing
that
we're
keeping
it
fully
distributed
system
with
with
no
master
and
doing
paxos
at
a
purely
peer
to
peer
level,
actually
isn't
so
bad.
So
this
gives
us
the
properties
that
that
were
used
to
with
Cassandra
that
you
know
if,
if
a
node
fails,
your
life
goes
on
as
long
as
image.
B
As
long
as
we
still
have
a
majority
of
replicas
alive,
we
can
still
make
progress.
So
that's
the
that's
the
world
we
want
to
live
and
we
don't
want
to
compromise
availability
to
deliver
this
now.
The
trade-off
is
that
we
do
compromise
on
performance,
so
in
particular,
we
need
for
round
trips
to
do
a
lightweight
transaction
versus
one
for
a
normal
update.
So
no,
roughly
speaking,
you're
going
to
have
a
quarter
of
the
operations
per
second,
then
then
you
would
with
normal
updates.
B
So
what
you
know
the
the
design
goal
here,
then,
is
to
support
that
one
percent
that
really
needs
it
and
you
know,
continue
to
build
the
rest
of
your
application,
using
eventual
consistency.
Now,
when,
when
people
hear
that
word
eventual
consistency,
you
know
they
start,
you
know
getting
scared
and
nervous,
and
what
does
eventual
mean?
Well,
eventual
means
that
you
know
it's
going
to
be
replicated
within
a
handful
of
milliseconds.
B
So
the
syntax
for
lightweight
transactions
as
you've
got.
We've
got
two
different
cases
here:
we've
got
on
insert
and
on
update
both
of
them
use
a
new
if
clause
in
c
ql
for
inserts.
It's
really
simple,
because
all
we
can
do
is
you
know,
insert
this
if
it
doesn't
already
exist
so
that
that's
super
straightforward.
So
it's
just
if
not
exists
on
the
bottom.
here we have an update, and that can actually be a little more complex, because I can specify an arbitrary number of columns in my IF clause. They all have to be in conjunction, with AND; there's no OR involved here. But I can have as many columns in my IF clause as I want, and then I can update. Here I'm updating the same column that I'm checking in the IF clause, but that's not required; I can update arbitrary columns as well.
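The check-then-insert race described above can be sketched without a cluster. Below is a toy in-memory model (the `Table` class and its method names are invented for illustration, not a driver API); the point is that the naive protocol does the existence check and the write as two separate steps, while `insert_if_not_exists` does both as one atomic step, which is the semantics `INSERT ... IF NOT EXISTS` gives you server-side.

```python
class Table:
    """Toy in-memory table standing in for a Cassandra partition."""
    def __init__(self):
        self.rows = {}

    # Naive two-step protocol: check, then write. Racy under concurrency.
    def exists(self, key):
        return key in self.rows

    def insert(self, key, value):
        self.rows[key] = value  # blindly stomps any concurrent writer

    # Lightweight-transaction style: decide and write in one atomic step.
    def insert_if_not_exists(self, key, value):
        if key in self.rows:
            return False  # "[applied]: False" in CQL terms
        self.rows[key] = value
        return True

# Interleave two clients using the naive protocol: both check first,
# then both write, and the second silently overwrites the first.
users = Table()
a_sees_free = not users.exists("jbellis")
b_sees_free = not users.exists("jbellis")
if a_sees_free:
    users.insert("jbellis", "client A")
if b_sees_free:
    users.insert("jbellis", "client B")   # stomps client A's row

# With the atomic version, only one client can win the race.
users2 = Table()
a_won = users2.insert_if_not_exists("jbellis", "client A")
b_won = users2.insert_if_not_exists("jbellis", "client B")
```

Under the same interleaving, the naive table ends up with client B's row and client A never finds out, while the atomic version rejects the second insert.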
Another new feature in 2.0 is triggers, and this is straightforward until it's not. What I mean by that is that on this slide I have CREATE TRIGGER, very straightforward, but what you see here is that I'm giving it a class name. That's the part that's less straightforward: you have to write Java code, or something that compiles down to a Java class, to implement the trigger. The reason is that
B
We
have
people
are,
and
companies
now
that
have
Cassandra
clusters
of
over
a
thousand
nodes.
So
so,
while
I
said
that
we've
been
focusing
on
ease
of
use
this
year,
we
haven't
forgotten
about
the
people
who
are
really
pushing
the
envelope
in
terms
of
scalability
and
performance,
and
when
you
have
a
thousand
no
database,
it's
actually
worth
a
bit
of
complexity
to
get
that
extra
performance
from
pushing
some
some
logic
closer
to
your
data.
So
the
trigger
allows
you
to
know
based
on
Rose
being
modified.
B
So
so
at
that
scale,
performance
does
matter
and
and
people
are
willing
to
write
some
java
code.
If
that's
what
it
takes
to
to
get
that
extra
performance.
That
said,
for
4
99
percent
of
the
people
in
this
room,
you
know
we're
putting
big
scary
quotes
up
for
triggers
and
saying
you
probably
shouldn't
use
this.
Yet
we
are
we.
We
know
for
a
fact
that
we're
making
changes
here
in
two
dot,
one,
it's
not
going
to
be
backwards
compatible.
Okay, now give me the next hundred. You can see that, since I have a compound primary key here, the logic for "give me the next hundred" starts to get slightly painful, because I have to say: give me the tweets from the next partition, or give me the tweets from the current partition that I haven't seen yet. The more components I have in my compound primary key, the more clustering columns I have, the more clauses I need in this paging syntax here.
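That manual "next hundred" logic from before cursors can be sketched with a toy model where the timeline is a sorted list of (partition key, clustering column) tuples; `next_page` is a hypothetical helper, not a driver call. Tuple comparison stands in for the "later in this partition, or in a later partition" clauses you would otherwise have to spell out in CQL.

```python
from bisect import bisect_right

def next_page(rows, last_seen, page_size):
    """Manual paging over a compound primary key.

    rows: list of (partition_key, clustering_key) tuples, sorted the way
    Cassandra would return them. last_seen: the last tuple the client
    consumed, or None for the first page. The bisect mirrors the
    two-clause query: rows later in the current partition, OR rows in
    any later partition.
    """
    if last_seen is None:
        return rows[:page_size]
    start = bisect_right(rows, last_seen)
    return rows[start:start + page_size]

# A tiny timeline: two partitions ("day1", "day2"), five tweets each.
timeline = [(p, c) for p in ("day1", "day2") for c in range(5)]
page1 = next_page(timeline, None, 3)
page2 = next_page(timeline, page1[-1], 3)  # crosses a partition boundary
```

Note how the second page has to straddle the partition boundary; with server-side cursors all of this bookkeeping disappears into the driver.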
So cursors make that simple: you just say SELECT * FROM timeline, and Cassandra handles grabbing more rows for you after you've consumed the first hundred or so. This actually preserves your state across servers, so if the Cassandra node that you're talking to goes down and you have to fail over to a different one, the cursor still lets you resume where you left off in your result set. Very cool in that respect.
We have some minor improvements to CQL as well. We added a special case for SELECT DISTINCT. We still don't support SELECT DISTINCT in the general case, but as a special case we allow you to say SELECT DISTINCT partition key. That's something we can do very efficiently: we don't need to do a sort and merge, we can just grab the partition keys from the data files on disk and give those back to you. It's a very performant operation.
So we were okay with giving that out as kind of a baby step there. CREATE TABLE IF NOT EXISTS is just syntactic sugar, so you don't need to do the dance of: hey, check if this table exists; if it doesn't exist, go ahead and create it, otherwise don't bother. Then there's SELECT ... AS: we've been introducing some functions into CQL, so it's become more important to allow you to create aliases for the results.
It left your actual data sitting on disk, so for 1.2 we removed DROP COLUMN entirely. For 2.0 we added it back the right way, which is that when I drop a column, not only does it drop the metadata, but as my data files get compacted, Cassandra will omit the column that you dropped. It's not going to go through and start scrubbing your data files and cause a big impact to your cluster, but it will evict that data lazily, as compaction happens naturally.
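That lazy eviction can be sketched as a toy merge, assuming data files are simple dicts of rows keyed by partition key; `compact` is invented for illustration, not Cassandra's actual compaction code.

```python
def compact(sstables, dropped_columns):
    """Toy compaction: merge rows by key (later files win) and lazily
    omit columns that have been dropped from the schema. Mirrors how
    2.0 evicts dropped-column data as files are rewritten, rather than
    scrubbing every file eagerly at DROP COLUMN time."""
    merged = {}
    for sstable in sstables:            # oldest first; newer values win
        for key, columns in sstable.items():
            merged.setdefault(key, {}).update(columns)
    return {
        key: {c: v for c, v in row.items() if c not in dropped_columns}
        for key, row in merged.items()
    }

# Two generations of a row; the "fax" column was dropped from the schema.
old = {"alice": {"name": "Alice", "fax": "555-1234"}}
new = {"alice": {"name": "Alice A."}}
result = compact([old, new], dropped_columns={"fax"})
```

Until this compaction runs, the dropped column's bytes are still on disk; afterwards the rewritten file simply no longer contains them.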
I mentioned that our goals for ease of use were partly on the developer productivity side, but also partly on the operational side, making Cassandra more forgiving operationally. One of the things we concentrated on for 2.0 was moving our storage engine internals off of the JVM heap. The reason that's important, and I've tried to diagram it here, is that the amount of memory you can allocate to the JVM is roughly about 8 gigabytes before you start getting really onerous garbage collection pauses.
But we take care of that, so you don't have to worry about the heap issues so much. I'm going to give a quick overview of what happens when you read some rows from Cassandra, to explain what we've done here. On a per-SSTable basis, per data file, when you ask for rows from your users table, first we're going to check a Bloom filter, which is a probabilistic set.
B
That
tells
me
with
high
confidence
if
I
don't
have
any
rose
for
the
partition
you're
asking
for.
So
if,
if
the
bloom
filter
says
there
are
no
rows
here
that
I'm
done.
If
the
bloom
filter
says,
there's
probably
some
rows
here,
then
I'll
continue
on
to
the
to
the
partition,
key
cash,
and
so
that's
going
to
say
if
I
did
have
a
if
I
do
have
a
partition
here.
This
is
going
to
tell
me
where
that
partition
begins
in
the
data
file.
If I hit in the cache, then I follow this dotted line here around to the left side and skip the next part, because I had a cache hit. If I had a cache miss, then I go to the next step, the partition summary. The partition summary is a sample of the partition keys in the data file, and what it does is let me do a binary search against it to see where on disk I should start looking for this partition. Once I've done that, I go to the first place on disk. I've got this dotted line down the middle: above the line is in memory, and we've stayed in memory so far. My first seek to disk is to check the partition index, which has all of the partition keys indexed in it, and I do a sequential scan starting at the place the summary pointed me to, to find the exact position of the partition.
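The in-memory half of that read path (Bloom filter first, then a binary search over the sampled partition summary) can be sketched like this; the bit count, hash count, and sampling interval are toy values, not Cassandra's actual defaults.

```python
import hashlib
from bisect import bisect_right

class BloomFilter:
    """Tiny Bloom filter: may say "maybe present" for an absent key,
    but never says "absent" for a key that was added."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        return all(self.field >> p & 1 for p in self._positions(key))

def summary_offset(summary, key):
    """Binary-search the sampled (key, offset) pairs to find where on
    disk to start scanning the partition index for this key."""
    keys = [k for k, _ in summary]
    i = bisect_right(keys, key) - 1
    return summary[max(i, 0)][1]

# One SSTable's worth of in-memory state:
bf = BloomFilter()
for k in ("alice", "bob", "carol"):
    bf.add(k)

# Every Nth partition key is sampled, with its disk offset.
summary = [("alice", 0), ("carol", 4096)]
```

A read first asks `bf.might_contain(key)`; only on "maybe" does it fall through to `summary_offset`, and only then does it touch disk.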
What we find is that what you give up in terms of CPU, you make up for in terms of having that much more data cached in RAM. So even in terms of performance you come out ahead: you get better performance as well as reduced disk space. That's why compression is on by default; you can disable it, but it's on by default. Once we've got the compression offsets, we can finally go and read that data from disk.
The three things that we moved into native memory are the Bloom filter, the compression offsets, and the partition summary. What's critical about these is that all of these structures grow linearly with the amount of data you have in Cassandra, so you can see how this can catch you by surprise: everything's working fine, everything's working fine, until now I have two terabytes of data instead of one, and I start hitting large garbage collection pauses as the JVM valiantly struggles to find enough space to carry on, because these structures have grown so large. Moving those into native memory basically means I don't need to worry about that so much.
As long as I have enough RAM proportional to the amount of disk I have, I'll be in good shape, and that's something most people are in good shape about. Now, the last part, which I didn't circle here, the partition key cache, is still on heap, but that's a fixed amount of space. You can tell Cassandra: I want to give the key cache one gigabyte, or half a gigabyte, and it will limit itself to that and use an LRU policy to evict the oldest entries as you pull in new data. That works very well with the Cassandra goal of keeping my hot data in memory and my cold data on disk.
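A minimal sketch of that bounded LRU behavior, using a capacity measured in entries rather than bytes for simplicity; `KeyCache` is invented for illustration, not the real implementation.

```python
from collections import OrderedDict

class KeyCache:
    """Fixed-budget partition key cache with LRU eviction: hot keys
    stay resident, cold keys fall out as new ones come in."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # partition key -> disk offset

    def get(self, key):
        if key not in self.entries:
            return None                # miss: fall back to the summary
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key, offset):
        self.entries[key] = offset
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the coldest entry

cache = KeyCache(capacity=2)
cache.put("alice", 0)
cache.put("bob", 512)
cache.get("alice")          # touch alice, so bob is now the coldest
cache.put("carol", 1024)    # over budget: bob is evicted
```

The access pattern decides who survives: touching a key before the cache fills keeps it in, exactly the hot-in-memory, cold-on-disk behavior described above.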
One other thing we improved on the operations side is a couple of improvements to compaction. One of those is that compaction is now always single-pass. In 1.2 and earlier, if you had a large enough partition, we would have to do two passes to compact it, because of some details of our data storage format. We actually changed the format to be able to do single-pass compaction and save you that performance hit when your partitions get large, which they sometimes do. The other improvement is to leveled compaction.
For those of you not super into the nitty-gritty of Cassandra operations: LCS stands for leveled compaction strategy, and STCS stands for size-tiered compaction strategy. Size-tiered compaction strategy is the default. It's more I/O-efficient, but it does a less good job of evicting old, obsolete data from your data files, and it will tend to give you less good read performance than leveled compaction.
So leveled compaction is a popular option, but it had some poor behavior under some corner-case scenarios, and that's what we're addressing here. What leveled compaction does is guarantee that in any level above level 0, only one data file will contain data for a given partition.
So if I want to read some rows from a partition, I will have at most one data file I need to hit per level, plus any data files containing that partition in level 0, because level 0 is where newly flushed data files appear before they've been organized into levels; those could potentially contain multiple entries for my partition. The problem is that it's actually relatively easy to have a write spike
B
That
Cassandra
can
accept
and
write
to
disk
faster
than
compaction
can
push
this
out
into
the
levels.
So
what
I
end
up
with
is
a
scenario
like
this,
where
now
I'm
having
to
check
hundreds
of
SS
tables
for
this
partition.
So
what
because
compaction
gets
behind
my
Reed
performance
gets
worse,
so
now
I'm
doing
more.
I
owe
on
Reed's,
which
means
there's
even
less
I
owe
to
do
compaction,
so
compaction
gets
more
behind
and
you
can
get
into
kind
of
a
vicious
cycle.
B
There,
so
what
we're
doing
in
20
now
is
when,
when
the
level
0
compaction
falls
behind
is
will
actually
do
sighs,
tiered
compaction,
which
is
which
is
significantly
faster
and
so
that
it's
not
going
to
give
me
it's
not
as
good
as
if
I
had
all
that
data
merged
into
the
levels,
because
I'm
still
doing
those
three
extra
reads
or,
however
many
extra
reads
in
my
size
tiered
component,
but
it's
a
lot
better
than
doing
hundreds
of
extra
reads.
So
this
is
not
a.
B
This
is
not
a
silver
bullet
that
makes
level
compaction
a
ton
faster
that
you
can,
just
you
know,
beat
the
hell
out
of
it
with
you
know,
without
caring
about
how
level
compaction
is
doing,
but
what
this
does
do
is.
Is
it
let's
level
compaction,
tolerate
spikes
in
the
right
load
and
say:
okay,
I'll,
just
sighs
tier
these
up
when
the
when
the
right
load
dies
down
again,
then
I'll
be
able
to
get
back
to
pushing
those
out
into
the
levels
and
catch
up.
B
So
if
you're,
you
know
a
hundred
percent
writing
as
fast
as
your
cluster
can
tolerate
it,
then
sighs
tiered
compaction
is
still
really
the
the
only
option.
But
if
you
have
the
workload
where
your
level
compaction
can
mostly
keep
up,
but
you
do
have
those
brief
spikes
in
extra
right
activity
than
this
helps
out
a
lot.
The last operational improvement in 2.0 is what we're calling rapid read protection. When you tell Cassandra to do a read, it will send out as many requests to replicas as it needs to satisfy the consistency level you're asking for. If you have three replicas and you do a quorum read, it will send out requests to two replicas; if you are reading at consistency level ONE, it will send out a single request.
What that implies is that to successfully finish that read, it needs all the replicas it sent requests to to actually respond. So if a node dies before the failure detector notices and stops sending it requests, you get this blue line here that goes down to the bottom: those are requests that are timing out. They were sent to a replica, the replica died, and the coordinator waits for those requests while nothing happens.
So what we're doing in 2.0 is introducing rapid read protection, so that the coordinator doesn't just wait until the request times out. What it will do is wait until the request is taking longer than most requests do. By default, starting in 2.0.2, which will be released probably next week, this is enabled at the 99th percentile level.
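The coordinator's behavior can be sketched as a small deterministic model; the millisecond values and the `read_latency` helper are invented for illustration, and a latency of `None` stands for a replica that died and never responds.

```python
def read_latency(replica_latencies, p99, client_timeout):
    """Model one read with rapid read protection at consistency ONE.

    The coordinator sends to replica 0; if no response has arrived by
    the p99 threshold, it sends a redundant request to replica 1 and
    takes whichever response arrives first. Returns the completion
    time, or None if the read times out at the client."""
    first, second = replica_latencies
    if first is not None and first <= p99:
        return first                    # common case: no retry needed
    candidates = []
    if first is not None:
        candidates.append(first)        # slow, but it does respond
    if second is not None:
        candidates.append(p99 + second) # retry starts at the threshold
    done = min(candidates, default=float("inf"))
    return done if done <= client_timeout else None

# Replica 0 died mid-request: without the retry this read would simply
# time out; with it, it completes shortly after the p99 threshold.
rescued = read_latency((None, 4), p99=10, client_timeout=1000)

# Replica 0 is merely slow: the redundant request still wins the race.
raced = read_latency((50, 2), p99=10, client_timeout=1000)
```

In the first call the read finishes at 14 ms instead of never; in the second, the redundant request beats the straggler.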
What we care about is finishing the request, so I'm going to make a duplicate request, a redundant request, to another replica, so I can get the request done before the client timeout expires. What you see at the top there are a bunch of different options, depending on how aggressive it is. One of the lines at the top is the 99th percentile; that's right next to no read protection at all. The line at the bottom there, the orange line, is always doing a redundant request.
You can see that that has an impact on my throughput, when I'm always doing a redundant request; the next line up from it is doing one seventy-five percent of the time. So there is an impact on throughput, but at the 99th percentile, where I'm only doing it one percent of the time, the impact on throughput is negligible, and it still saves me from having to fail some requests when I lose a node. So it seems like a good default to have.
Another value that might make sense would be taking that down to the 90th percentile, so doing extra requests ten percent of the time. The reason you might want to do that is that it's going to help my 99th percentile latency. I didn't have room here for another slide showing the impact on latency, but this also smooths out the latency curve, the latency distribution, as well. If I'm only doing redundant reads one percent of the time, that's not going to help my 99th percentile latency; it will help my 99.9th percentile latency, but not my 99th. If I bring the read protection down to be more aggressive, at the 90th percentile, then that will flow into a better latency distribution at the 99th level, which is something that a lot of people doing monitoring care about.
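Why a 90th percentile trigger helps the 99th percentile can be seen in a toy latency model: if every straggler gets a redundant request started at the p90 threshold, the tail collapses toward the threshold plus a typical retry latency. The numbers below are made up for illustration, not taken from the benchmark on the slide.

```python
def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[min(int(len(ordered) * p), len(ordered) - 1)]

def with_speculation(latencies, retry_latencies, threshold):
    """If a request exceeds the threshold, a redundant request (with its
    own latency) starts at the threshold; take whichever finishes first."""
    return [
        l if l <= threshold else min(l, threshold + r)
        for l, r in zip(latencies, retry_latencies)
    ]

# 100 requests: 95 take 5 ms, 5 stragglers take 500 ms.
base = [5] * 95 + [500] * 5
retries = [5] * 100          # a retry behaves like a typical request
p90 = percentile(base, 0.90)
improved = with_speculation(base, retries, p90)
```

With the p90 threshold (5 ms here), every straggler is cut off at threshold plus a normal retry, so the p99 drops from the straggler latency to roughly twice the median.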
Moving on to 2.1: we're looking at delivering 2.1 in January of 2014, and we've got a pretty good idea of what we're looking to deliver there. One of those things is user-defined types. We kind of started giving you access to nested data with collections in CQL already, and user-defined types take that to the next level, because I can create a type, like the address type here.
Here we have a simple example, with just one level of nesting, where I've got the addresses inside a map, and then I can pull those out with a fairly simple extension to CQL, using dot notation to access fields within the type. The main reason I've done it this way is so that it fits on my slide here; I could just say SELECT * and still get everything back, but then the wrapping gets ugly.
Another thing that we're tackling, more on the low-level operational side, is making our Bloom filters more memory-efficient, in particular when I'm doing compaction. This is something a lot of people don't know about, actually: when I'm doing compaction, I've got this data file here, and for it I have this Bloom filter.
B
So
what
we
do
now
is
we
allocate
the
bloom
filter,
pessimistically
and
say
you
know
this.
This
is
how
large
it
will
need
to
be
if
there's
no
overlap
at
all,
and
so,
if
it
turns
out
that
there
is
a
lot
of
overlap,
then
we
end
up
with
this.
This.
This
bloom
filters
that's
larger
than
its
need
than
it
needs
to
be.
So
what
we
want
to
do
is
we
want
to
size
it
appropriately.
So
what
we're
using
in
20
a
two
dot?
B
sorry, is a cardinality-estimation technique called HyperLogLog. What that does is track a sampling of the distribution of partition keys in each SSTable, and that lets us calculate, to within five or so percent accuracy, how much overlap there actually is in the partitions of those SSTables before I actually go ahead and merge them.
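A minimal sketch of the HyperLogLog idea in Python (illustrative only, not Cassandra's actual implementation): each hashed key lands in one of m registers, which remembers the longest run of leading zero bits it has seen, and rare long runs imply many distinct keys.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p              # 2**p registers
        self.m = 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                    # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)        # remaining bits
        rho = (64 - self.p) - w.bit_length() + 1  # position of the first 1 bit
        self.reg[j] = max(self.reg[j], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        e = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if e <= 2.5 * self.m and zeros:           # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e
```

Two sketches can also be merged register-by-register with max, which is what makes this cheap to keep per SSTable and combine at compaction time.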
The interesting thing about this is that we can apply this cardinality estimation to more than just bloom filters. So: I talked briefly about size-tiered compaction.
B
So, what we do right now (it's called size-tiered because of this) is we look for data files that are of about the same size and say: OK, let's merge you guys together into a larger one. And what cardinality estimation will allow us to do is say: OK, we can actually do better than just saying SSTables of about the same size probably overlap; we can actually go look at which SSTables actually do overlap, and do a smarter compaction based on that.
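A sketch of that bucketing step, with made-up sizes; real size-tiered compaction has more knobs, and the overlap check here is the 2.1-era idea being described, not shipped behavior:

```python
def size_tiered_buckets(sstables, low=0.5, high=1.5, min_threshold=4):
    """Group (size, name) pairs into buckets of roughly equal size,
    the way size-tiered compaction picks its merge candidates."""
    buckets = []
    for size, name in sorted(sstables):
        for bucket in buckets:
            avg = sum(s for s, _ in bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append((size, name))
                break
        else:
            buckets.append([(size, name)])
    return [b for b in buckets if len(b) >= min_threshold]

def estimated_overlap(card_a, card_b, card_union):
    """Keys shared by two SSTables, by inclusion-exclusion on
    (HyperLogLog-style) cardinality estimates."""
    return card_a + card_b - card_union

tables = [(100, "a"), (110, "b"), (95, "c"), (105, "d"), (1000, "e")]
print(size_tiered_buckets(tables))
```

With per-SSTable cardinality estimates, a strategy could rank candidates by `estimated_overlap` instead of trusting that similar size implies shared keys.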
B
This is the most hand-wavy of my slides here, because we've got code written for everything else that I've been talking about, but for this part about the compaction, we don't have any code written yet; we just have big ideas in our bug-tracking system. So we're planning on doing this for 2.1,
B
but don't send me hate mail if we have to push that out farther; it looks really good on paper, though. Another thing that we're working on, also kind of at the data-file level, is increasing repair efficiency. This is another place where we have an operation that grows linearly with the amount of data in the system. What I mean there is that our repair is highly network-efficient, because I'll go through your data files:
B
per replica, I'll build a hash tree (called a Merkle tree) that represents the data in that replica, and I'll compare that Merkle tree with the Merkle trees from my peers in the cluster that also replicate that range of data. So what I do is I'll compare the root of the tree, and if the roots of the trees are identical, then I know that all the data matches and everything's good. If the roots of the trees are not identical, then I'll look at the left
B
half of the tree, look at the hash at the root of that half, compare the right half of the tree, and wherever those halves don't match, I'll drill down further into that half. So at each step I cut the tree I'm evaluating in half, until I get to a leaf node that says: this is the partition, or the set of partitions,
B
that is missing on one of these replicas, and then I can send that over. So it's very network-efficient. But building this tree is not efficient: building this tree requires a sequential scan across all that data, so as I start to get into the terabyte range of data, that starts to get painful. So, you know, this is my illustration of: as that data set grows, I'm still having to sequentially scan that whole set to build my tree.
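The build-compare-drill-down loop described above can be sketched like this; the hash choice, the leaf granularity, and the power-of-two leaf count are simplifying assumptions, not Cassandra's actual repair code:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Hash each leaf (a chunk of partition data), then hash pairs
    upward; returns the levels from leaves (index 0) to root (last)."""
    level = [h(x) for x in leaves]       # leaf count assumed a power of two
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(mine, theirs):
    """Compare roots; on a mismatch, recurse into the left and right
    halves, returning the leaf indices that need to be streamed."""
    mismatched = []

    def walk(depth, idx):
        if mine[depth][idx] == theirs[depth][idx]:
            return                        # identical subtree: nothing to do
        if depth == 0:
            mismatched.append(idx)        # a leaf range that disagrees
            return
        walk(depth - 1, 2 * idx)          # left half
        walk(depth - 1, 2 * idx + 1)      # right half

    walk(len(mine) - 1, 0)
    return mismatched
```

If only one leaf differs, `diff` touches O(log n) hashes instead of re-streaming everything, which is the network efficiency being described; the expensive part is the sequential scan that `build_tree` stands in for.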
B
What I want to do is, as I've added that extra data, I want to be able to build the tree out of just that new data. I should be able to recognize that, hey, I've already repaired and made sure that all the replicas have the data on the left here; I shouldn't need to repair that again ad infinitum. So I'm just going to build the tree now for this new data.
B
So that's going to be a lot more efficient. So, our roadmap for 2.1, as I mentioned, is January, and we're looking to continue delivering ease of use, both on the improved-CQL-features side as well as on the operational side. So, how close am I to out of time? I've got two minutes. So we have time for two questions if we're fast.
B
You know, I'm in a different position than a lot of people in this room: I would be comfortable with 2.0.2 in production. For those of you who don't, you know, dig into the Cassandra code on a regular basis, I would definitely evaluate that in a staging environment. We're looking at getting it into DataStax Enterprise hopefully by the end of the year, so, yeah, we're going to be evaluating that internally as well.
B
The question was about aggregate functions in CQL and what the plan is there. I have mixed feelings about aggregate functions, because, I mean, you can special-case things and precompute them, the way Acunu does for their analytics, but for the ad hoc aggregation side of things, I'm not a hundred percent sure that's a good fit for what Cassandra is trying to do, because by its nature you're giving it a query that may involve a lot of sequential scans, a lot of sorting, a lot of merging.
B
So it's different from the problem we're solving now for online applications that need to do millions of similar requests per second, so, I have mixed feelings about it. That said, there are some Cassandra committers who are pretty gung-ho about doing this; Jake at Blue Mountain Capital is one. Their use case could use this and could get a lot of benefit out of it. So my guess is it will probably happen at some point; I don't know if it's going to happen in the 2.