Description
Link to blog referenced in video: http://www.planetcassandra.org/blog/this-week-in-cassandra-3-0-storage-engine-deep-dive-3112016/
Jon: Alright, here we are with another This Week in Cassandra at Planet Cassandra. I'm Jon Haddad. Today we have Tyler Hobbs of DataStax, a committer to open source Cassandra. We also have Aaron Morton of The Last Pickle, also a committer, I believe. Yeah, an old school committer, pretty exciting. You know, that's great. We have some good stuff here today: we're going to be talking about the new Cassandra releases, and we're also going to be taking a really big look at the new storage engine in 3.0.
Jon: So first, let's talk a little bit about what's happened this week. The big news is we're looking at two Cassandra releases, 3.0.4 and 3.4. 3.0.4 is a bug-fix release, so it's going to have a bunch of stuff fixed in the 3.0 line: good stuff if you're already running 3.0 in production. And in 3.4 we've got our new features on our tick-tock release cycle: if the last number is odd, then it's bug fixes; if it's even, then we're looking at new features. So 3.4 is a feature release.
Tyler: Yeah, so I think by far the biggest new feature in 3.4 is the SASI indexes. Those were contributed by, well, one of the main people was Pavel, a developer at Apple and a Cassandra PMC member. These are, you know, a huge upgrade from what we have in Cassandra, in older versions of Cassandra, for secondary indexes.
Tyler: So basically, you know, in previous versions of Cassandra, a secondary index works as kind of a hidden second table that, for each indexed value, stores a partition containing the primary keys of every row in the indexed table that matches that index value. So it's basically just, you know, a primary key that we then use to go do another lookup on the base table, and that comes with, you know, a lot of inefficiencies, and it's very limited in terms of what operations it can support.
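(To make that structure concrete, here is a rough CQL sketch of the hidden table Tyler is describing. The table and column names are hypothetical, and the real index table is internal rather than user-visible.)

    -- Base table
    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        country text
    );

    -- Conceptual shape of the hidden index table for an index on country:
    -- one partition per indexed value, holding the base table's primary keys
    CREATE TABLE users_country_idx (
        country text,      -- the indexed value
        user_id uuid,      -- primary key of a matching base row
        PRIMARY KEY (country, user_id)
    );

A query on the index reads the partition for the looked-up value, then does a second lookup on the base table for each matching primary key.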
Jon: Operators, mmhmm. Yeah, so the interesting contrast to the existing secondary index implementation is that with SASI, effectively, you have an index written per SSTable. Whenever an SSTable is written to disk, you also get a B+ tree written, which is a really efficient storage format for doing database lookups, for range lookups. It's optimized for disk seeks. It normally can get problematic if you have a lot of updates and deletes; that's why you see issues with relational databases under insert-, delete-, and update-heavy workloads. But in our case, because Cassandra has immutable data files, you can actually have perfect B+ trees written to disk, and then they're never touched again. So from a performance standpoint, they don't slow down over time; you just generate a new one during the compaction process, whenever you write a new SSTable. And they support prefixes: there's a LIKE clause that you can add to your queries.
Jon: So, you know, the things that people are kind of used to in relational databases, where they want to do range queries and they're not a hundred percent sure of the queries they're going to do ahead of time: this will support a lot more flexibility. And it's, you know, it's really cool. I've played with this a bunch and I wrote a blog post about it; it's on rustyrazorblade.com. It's very, very cool stuff. It was really fun to actually use a LIKE clause and see results come back.
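(A minimal CQL sketch of the kind of query being described, assuming a hypothetical users table; the index class is the SASI implementation that ships in Cassandra 3.4.)

    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        name text
    );

    -- SASI is created as a custom index
    CREATE CUSTOM INDEX users_name_idx ON users (name)
    USING 'org.apache.cassandra.index.sasi.SASIIndex';

    -- Prefix matching with LIKE, which the older secondary indexes can't do
    SELECT * FROM users WHERE name LIKE 'jon%';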
Jon: The only thing that you have to keep in mind is that because they're B+ trees, they're memory-mapped files. You're definitely going to want to run this on systems where you have more free RAM available. It's not so much of an issue if you're somewhere like Amazon, where you can fire up a machine with, let's say, 60 gigs of RAM; you know, it's okay if 20 gigs of that is indexes. Totally fine. Yeah.
Tyler: I think the other thing to keep in mind is they still have some of the same caveats as the existing secondary indexes, right? They still don't make sense for indexing all types of data. You don't want to index, you know, an email address or some other unique value with these, because the fan-out when you query still has to happen: it still has to touch essentially every node in the cluster to build a query response.
Jon: Yep, yeah, the scatter-gather aspect is definitely always going to be problematic, and I think in this case, for your example, like emails, you would probably want to use the materialized view feature that got introduced in 3.0. And so, yeah, the interesting thing here is that materialized views, while they may be a little bit slower at write time, are going to give you a huge performance boost at read time.
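(A sketch of the materialized view approach for the email example, using hypothetical table and column names; this is the CREATE MATERIALIZED VIEW syntax introduced in 3.0.)

    -- Denormalize users by email so the lookup is a single-partition read
    CREATE MATERIALIZED VIEW users_by_email AS
        SELECT * FROM users
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, user_id);

    SELECT * FROM users_by_email WHERE email = 'jon@example.com';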
Jon: If it's something like email, where you can get a super fast lookup, it's going to be better than secondary indexes. So it's cool. I think we've got multiple tools that can solve different problems in different ways. Each one has certain trade-offs, but I think, as we get used to them, we're going to see some really good recommendations and advice come out, you know.
Aaron: This is critical to ever explaining Cassandra to anyone. Like, every time I've explained the write path and the immutable files and how the read path merges those things together, the ability to say, hey, let's insert a row, flush to disk, insert into the row again, flush to disk, look, I've got two copies there... You can see now that this data is on disk in multiple places, and you can see what the read path does. That's always been really important.
Aaron: I used this a couple of weeks ago, the new sstabledump, when I was looking into what happens when we drop a column in CQL, and it just works. It does a really good job of outputting things, and it's really useful if you're trying to understand what's happening. I wouldn't use it as a way to back up or export your data.
Aaron: I think there are much better ways to do that. But as a "what's going on here" tool... I've been doing this for a few years now, and a couple of times we've had to get someone's SSTable, pull it out, convert it to JSON, take out that tiny little piece that somehow made things crash, back in the day, and then put the SSTable back. So every now and again they're useful for that. I don't think that's so much the case now; it's mostly a learning tool, and for that it's invaluable, yeah.
Tyler: And I guess it's good to point out that sstabledump doesn't have a sort of inverse operation for loading it back in, right? We don't have a json2sstable equivalent anymore. But yeah, I agree with Aaron, it can be really instructive just for looking at how the data is stored on disk, you know, from a teaching perspective, but also for doing support and operations.
Tyler: That really can help you understand things. So somebody recently on the mailing list wondered why a partition was so large even though it had relatively little data in it, and just by dumping the SSTable they were able to see that it had, whatever, 10,000 tombstones that they didn't know existed. So being able to see that sort of information can be really helpful in debugging different types of problems. Yep.
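(For reference, sstabledump is run against an SSTable data file and writes the JSON representation to stdout; the path below is just an illustrative example of the 3.x on-disk layout, not a fixed location.)

    sstabledump /var/lib/cassandra/data/my_ks/my_table-*/ma-1-big-Data.db > partition.json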
Jon: Very good stuff. So we were talking a little bit about, you know, we've got 3.4 out. Looking forward, since I don't really get the opportunity to have committers, you know, in the room very often, or in the virtual room: Tyler, would you tell me a little bit about some of the stuff that you're working on for future versions of Cassandra? What do you got for me?
Tyler: The future is bright for Cassandra. So, off the top of my head, you know, one of the things that is in progress right now is non-frozen UDTs. We're looking at being able to store those split across multiple cells. Right now we essentially force them to be serialized into a single cell with the frozen keyword, so we're making that optional.
C
Now
so
that
they'll
be
stored
across
multiple
cells,
and
that
allows
you
to
update
each
field
in
aedt
separately
or
individually,
and
it
can
also
allow
you
to
optimize
the
read
path.
If
you're
only
selecting
a
single
field
from
the
UDP,
you
don't
have
to
deserialize
the
entire
thing
mm-hm,
so
it's
kind
of
a
fun
one
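(A sketch of what that looks like from CQL once non-frozen UDTs land; the type and table here are hypothetical. With frozen<address>, the whole value is one cell and has to be rewritten as a unit; without frozen, individual fields become addressable.)

    CREATE TYPE address (
        street text,
        city text
    );

    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        home address           -- non-frozen: one cell per field
    );

    -- Update a single field without rewriting the whole UDT value
    UPDATE users SET home.city = 'Austin'
    WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;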
Tyler: I'd say by far the biggest change that I'm working on right now is part of the work to switch Cassandra from the SEDA model it uses right now, the staged event-driven architecture, to a thread-per-core model.
Tyler: So this will have really big performance implications for Cassandra. Really, what we're looking to do is eliminate a lot of the overhead that we see from context switches. Right now Cassandra uses tons of different threads split across a lot of thread pools, so it's constantly doing context switches, which is expensive in terms of CPU caches.
Tyler: Things like that can really help out the throughput performance of Cassandra. So it's a massive undertaking, and we're looking to do it bit by bit. Right now we're focusing just on the read and write paths, but I think by 3.6 or 3.8 we might start seeing some of the first parts of this be released into the wild, and we'll see how it works in real life. Nice, yeah.
Jon: Yeah, that's a big project, yeah. It sounds like a lot of rewriting, but yeah, you definitely see the overhead of context switches and locking everywhere. If you ever, like, follow a program with strace, you can just see it's mutexes all over the place, and getting rid of that, making it more optimized, will definitely improve performance overall. So I'm actually very excited for that.
Jon: So we've got 3.4, a pretty cool release, and we've got some really interesting stuff coming in the future. Let's back it up a little bit and talk about the 3.0 storage engine. So, Aaron, you just wrote a blog post on the 3.0 storage engine; you can find the link in the This Week in Cassandra blog post.
Jon: It's very detailed. This is something... Aaron, I love your attention to detail in stuff like this. I saw a talk that you gave maybe three years ago, when I was first learning Cassandra, on the write path and the read path and just how Cassandra works, and it's good that you didn't get lazy and give me something less than that, so I appreciate that. So, yeah, I don't know, what can you tell us about the new format? Like, what are some reasons why this thing exists?
Aaron: Well, the extensibility that's been added into the platform is huge, and it had all been done with some nasty hacks on the existing storage engine. The biggest one, I think, was that the existing storage engine had no concept of a CQL row. Rows were something that was kind of hacked on top of the internal storage engine row, which came to be known as a partition. So just that, as a basic, fundamental thing, to say: hey, we now know what the data model is.
Aaron: Let's efficiently store that in the storage engine. It's led to a bunch of improvements that, again, we're probably not going to see all of the impact of for a while. There's a really great post that Sylvain did when this first came out that explained the impact on how it can reduce the on-disk size, and there's a lot of stuff in that blog post I'd point to around understanding things like: hey, every cell that we put on disk has a timestamp.
Aaron: What about if we record that relative to an epoch for that SSTable, rather than the UNIX epoch, for every cell in it? So if you've got a time series data model, for example, when we flush to disk we're going to record something that says, right, the lowest timestamp we have here is twelve o'clock.
Aaron: So if I want to record the timestamp that's 12:01, all I need to record is that it's 60 seconds, or 60,000 milliseconds, whatever, higher than that twelve o'clock timestamp. All that type of stuff, combined with variable-length int encoding, means that on disk it's a lot more efficient. And if you have a look at that post, and there are plenty of links in there to go to the code, you can really get a feel for the idea of it.
Aaron: Now we know what's actually going onto disk. And if you look around at some of the old examples of how to explain CQL 3, it used to be: all right, here's CQL 3, now I'd better explain how it's stored in the storage engine, and that's really complicated, because, look, there's this cell here in the storage engine that doesn't have a name and doesn't have a value, but that's important, like, just trust us on that, yeah? And we were talking earlier, Tyler, and we were saying: we don't repeat all of your clustering keys anymore. The values of those used to be repeated for every non-primary-key cell in that row, and that doesn't happen anymore. All that information is stored once, it's so much more efficient on disk, and it really sets the groundwork for the next couple of years. It's really exciting, yeah.
Tyler: One of the cool things is that, you know, because we had so much redundant information in older Cassandra versions, compression made a really big difference; it would take care of a lot of those issues, or at least, you know, mitigate them. But yeah, if you look at that blog post by Sylvain that Aaron mentioned, you can see the new storage format is so efficient that in a lot of cases it's smaller than the compressed SSTables from the previous version, even without compression.
Jon: Well, one of the things is it's nice to just have the schema encoded separately, right? Like, that used to be just a separate ticket, right? It was like, no, no, no! The fact that your column, like, the name of your field, can dramatically increase the size of your SSTables is just totally ridiculous, absolutely nuts, and I mean that alone is a huge win. And then you talk about not repeating certain values, like TTLs or timestamps, or encoding them as a delta from the first timestamp that you saw for that particular row. I mean, the savings are huge. I think I remember seeing, like, for certain-size tables, you can see like a tenfold reduction, right? Like, that's absolutely crazy, to be able to see an optimization like that. Like, when do you ever get something that gets ten times better? Never. So...
Tyler: I was going to say, I mean, you know, it depends on the workload, but especially if you're storing a lot of small values, especially, like, you know, a single row per partition, or even a wide-row kind of format: if you don't have large values that you're storing, you'll see a massive reduction in size on disk with the 3.0 format, yep.
Aaron: Yeah, I think all the way down to the individual cell storage. You know, whereas previously we'd have the timestamp in every cell, if all the cells in the row have the same timestamp, we just have it at the row level, and that's that simple understanding of knowing that these things are all collated, that all these things are together.
Aaron: It saves space and saves reading off disk. Down to things like, I can't remember if this was already in the 2.0 engine, but we know when different data types are fixed-width and when they're variable-width, and booleans are just encoded as a byte, and then if there are three booleans together, it's just three bytes in a row. And we know, when we go to read that off disk, what actual columns are in that row and what order they're in, and so fixed-width things can be read very efficiently.
Aaron: Obviously. And then the complex cells, which are things like UDTs and collection types: previously these were frozen, as Tyler was talking about earlier on, and now they're non-frozen and they're so much more extensible, and they'll support the type of feature that Tyler was talking about with non-freezing UDTs. So you take your idea of: here's my column that's defined in my table, and it's really a list or a UDT or whatever it is. When it gets to the storage engine, it now explodes into a bill-of-materials type approach, where it's like: okay, here's my column, it's a cell, yet my cell is made up of other cells. Each of those cells is then encoded as a cell inside of that one cell, and it is then individually addressable. That type of feature wasn't around previously. Tyler, you were talking earlier about a really interesting point, which was dealing with dense versus sparse tables and optimizations for those, yeah.
Tyler: One of the kind of cool things that Sylvain designed into the new 3.0 storage format is that it will switch to using a different storage format based on whether the set of columns that are actually used is sparse or dense. So, for example, if you, you know, only have ten columns in your table and you pretty much always write to all of them, we're going to use this dense format that's optimized for that.
Tyler: On the other hand, if you have, say, a thousand columns defined for your table and each row normally only has two or three columns actually set, we switch to a different format that's specifically optimized for that as well. So, sort of on both ends of the use cases, we've got a more efficient format than we had in 2.0. So it's kind of nice to just optimize for these different ways that people use Cassandra.
Aaron: Yeah, that was really interesting as I was reading through the code on that: it's in how the cells are encoded. If they're mostly there, then we encode it one way; if they're mostly missing, then we'll go a different approach. And similarly, the surprising bit there: if your number of clustering keys is above, I believe it's 32, then they're encoded in a different way than if it's less than 32; they get put into chunks of 32 clustering keys and encoded. So, really, a lot of attention to detail, dealing with things that nowadays we might laugh at, but later on we might be like: wow, that's really great, it handles the 128 keys in my clustering index. That's great.
Tyler: You know, there are a lot of features in there that we don't really utilize now, but they set a nice foundation for a lot of features that we're interested in building soon. So hopefully this format will, you know, help us to kind of grow and support new features efficiently, without resorting so much to the old hacks that we kind of had to build on top of things like the 2.0 storage engine.
Aaron: There's a really, like, macro-level thing in here, which is that the entire on-disk format is now pluggable. So previously there was one way we wrote to disk, and that was it. Now there's an interface, well, a factory and an interface, a whole mechanism in there, to say: hey, let's try a different on-disk format. Like, whoop-dee-doo, let's go and encode this stuff as Parquet or whatever, you know, as an experiment. And that's a nice, big, top-level extensibility point that probably didn't exist before. And I think all of the extra knowledge in there will lead us to be able to improve performance over time. You know, we know what cells are encoded into each SSTable, right? That's some useful information we could go and do things with. And the extensibility around the encoding of cells on the disk will mean that, I think, for the UDTs and all those types of things, we could probably see some more activity in that field, and more complex predicates pushed down to the disk, absolutely.
Tyler: That's a good point. Like, you know, the non-frozen UDTs, those are so much easier to do with the new storage engine. There was a ticket that got talked about on here a couple weeks ago about optimizing the number of disk seeks that we do based on SSTable metadata, if you have, like, a LIMIT 1 on your query; that's something that would have been really insane to do with the old storage engine but is easier now. Doing per-partition limits is something else.
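(For illustration, a sketch of the per-partition limit idea against a hypothetical time series table, using the PER PARTITION LIMIT syntax that later shipped in the 3.x line.)

    CREATE TABLE sensor_readings (
        sensor_id uuid,
        reading_time timestamp,
        value double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC);

    -- The latest reading for every sensor, one row per partition
    SELECT * FROM sensor_readings PER PARTITION LIMIT 1;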
Jon: Yes! All right, guys, well, this has been pretty fun. I'm pretty sure that people are going to want to hear more about the SSTable format and learn more about it. I definitely recommend, as I said before, reading Aaron's blog; it's awesome. It's linked in the blog post that accompanied this video on Planet Cassandra, so definitely check that out and get into it. There's lots of really good stuff coming up, and stuff that has just been released, so it's been a pretty fun week.
Aaron: I was going to say, I'm in San Francisco, and The Last Pickle is going on the road next week. So Nate is going to be talking at the Cassandra Day in Atlanta on Thursday next week, and then I'm going to be talking in San Francisco at the San Francisco meetup on Monday week about CQL 3 and the 3.0 storage engine, and then talking down at the South Bay meetup about how to back up Cassandra. Great.