From YouTube: This Week In Cassandra: 3.0 in the Wild 5/13/2016
A: You know, it would be really sweet if I had a guitar solo to start out every one of these things. All right, here we are: This Week in Cassandra. What's today? It's May thirteenth, coming at you with the Cassandra news. So, what do we got here?
A: I'm Jon Haddad, technical evangelist for DataStax. We've got Luke Tillman, also technical evangelist for DataStax, lovely to see you, Luke. And Julien, he's the VP of engineering at [inaudible] cloud. I'm sorry, I can't actually pronounce your last name correctly. I can't pronounce anybody's last name. Thanks for coming on, Julien.

C: Thank you, guys.
A: So today we've got some Cassandra news, and we're going to be talking about Julien's experiences with putting Cassandra 3.0 into production, so a new, or new-ish, kind of release.
A: Getting ahead of the ball, having a good time in the open source world. But first, let's talk about materialized views. Jonathan Ellis, lead for Apache Cassandra and all-around good guy, just posted a blog post on the DataStax blog about materialized views and their performance. Luke, you had an opportunity to look this over. What did you think?
B: It's cool to actually see some performance numbers, because I know a lot of people are interested in materialized views. Just from a usability standpoint, not having to do all the manual denormalization and manage that in your code anymore is pretty attractive for a lot of developers, I think. But it's also interesting to see the performance numbers and kind of gauge...
B: ...not only what kind of impact this has on your writes; I was actually kind of shocked, or surprised, by some of the read performance numbers that he showed in the blog. So it's cool to actually have some numbers, and a tool that's open source, so you can go and test for yourself if you want to see what your own performance is like.
A: I think the number one thing that people get hung up on is secondary indexes, and the reality is that most of the time you're actually going to want to use materialized views, especially if you're selecting from a single partition. And it's cool to see that performance differentiation: as the cluster gets bigger, you can see materialized view performance get better, while secondary indexes kind of level off.
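The denormalization trade-off being discussed can be sketched in CQL, run here through cqlsh. The keyspace, table, and column names are illustrative assumptions, not taken from the blog post:

```shell
cqlsh -e "
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.users (
    id uuid PRIMARY KEY,
    username text,
    email text
);

-- Cassandra maintains this view on every write to demo.users, so the
-- application no longer has to write a users_by_username table by hand:
CREATE MATERIALIZED VIEW IF NOT EXISTS demo.users_by_username AS
    SELECT * FROM demo.users
    WHERE username IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (username, id);
"
```

The WHERE ... IS NOT NULL clauses on every primary key column are required by the materialized view syntax in 3.0.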
C: So basically, I think, as you said, we're interested in the usability part of materialized views, to have a better, cleaner data model, and, as you said, we were waiting for performance benchmarks before we basically put that in place. We actually have planned internally, with some of our table layouts, to migrate and do some testing. So, as you said, it's awesome that we're starting to have numbers showing up, and the tooling to actually measure the real performance for ourselves.
B: So Jon, you wrote a blog post this week, surprise: "Working Relationally". You're actually on this list twice, this list of blog posts, if you're following along on Planet Cassandra. Jon and Danny Traphagen wrote a post together, and then Jon wrote one himself about working relationally with Cassandra. This was kind of about using Spark and Cassandra... well, Spark SQL. So why did you even write it?
A: Let's touch on one thing first. The previous post, the first one with Danny, is kind of a preview of the stuff we're going to be talking about at OSCON, which is next week. So if you're watching this and you're going to be at OSCON and you're interested in learning about Cassandra and Spark, we still have a few spaces left in our three-hour tutorial. It'll be hands-on: you get a VM, you can play with stuff, we'll hang out, which will be amazing.

The second part, with Spark SQL: the reason I wrote this post is that there's a lot of really good documentation in the Spark world, and then there's some stuff that's poorly documented. Unfortunately, using Spark SQL, the Hive-like side of things, is not documented well at all. DataStax has done a lot of work to make this stuff work well and in a straightforward manner, but outside of that there's basically no documentation. So I took a look at what was going on, poked around, and made it so that you can take the connector, fire up the Spark SQL shell, and basically create tables in Spark SQL that map to your Cassandra tables, and then you can do joins and aggregations and some of the more Hive-style operations.
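A minimal sketch of that workflow, assuming the open-source Spark Cassandra Connector is available. The package coordinates, connection host, and MovieLens keyspace/table names are illustrative assumptions:

```shell
# Start the Spark SQL shell with the connector on the classpath:
spark-sql --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
          --conf spark.cassandra.connection.host=127.0.0.1 <<'SQL'
-- Register Spark SQL tables that map onto existing Cassandra tables:
CREATE TABLE movies  USING org.apache.spark.sql.cassandra
  OPTIONS (keyspace "movielens", table "movies");
CREATE TABLE ratings USING org.apache.spark.sql.cassandra
  OPTIONS (keyspace "movielens", table "ratings");

-- A join plus aggregation that plain CQL cannot express:
SELECT m.title, avg(r.rating) AS avg_rating
FROM   movies m JOIN ratings r ON m.movie_id = r.movie_id
GROUP  BY m.title;
SQL
```

The queries run in Spark against data pulled from Cassandra, so they are not bound by CQL's single-table, partition-oriented query model.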
B: A lot of people probably don't realize you can run SQL against your Cassandra cluster. It's even easier if you're using DataStax Enterprise, because then you don't have to do all the setup, but yeah, you can absolutely use Spark with Cassandra. So, one thing I noticed in the blog post that I definitely wanted to ask you about: you're talking about exploring this MovieLens data set, and as setup you do this step where you do a pip install and then a cdm install.
A: So yeah, I sort of glossed over that, and it's a pretty big point. I had a previous blog post which showed it off a little bit. The Cassandra Dataset Manager: if you're looking to learn Cassandra right now, what normally happens is you download Cassandra, and then... what we need is a way to install sample data sets for the purposes of learning. So I built this tool, Cassandra Dataset Manager, which effectively treats Cassandra data sets kind of like how you would treat packages in Debian or Red Hat. You can just do a cdm install, give it the name of the data set, and it will just load up your Cassandra cluster for you. It's really nice to be able to use a tool like this in blog posts, so I don't have to write: okay, we're going to install this data model, and here's some sample data, and blah blah blah. It's nicer to just say, install this, and you get hundreds of thousands of records, as opposed to like four, which is what normally happens when you're walking through a data set.
B: So how are packages... like, you did cdm install movielens-small. How are packages actually managed? Do you have a registry set up, or what are you doing under the covers there?
A: Yeah, it's kind of a little hairy. Well, I don't know about hairy, let's go with fun. So what I do is: each data set is really a git repo, and CDM maintains its own list of data sets. Just like with an apt repo, where you'd run apt update, you can run cdm update and it will fetch the latest list of repos, and those basically have descriptions and text information about each data set, along with a reference to a git repo that you can download.
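The flow described here looks roughly like this. This is a sketch: the data set name comes from the conversation, but the package name on PyPI and the list subcommand are assumptions on my part:

```shell
pip install cassandra-dataset-manager   # installs the cdm command-line tool
cdm update                   # fetch the latest list of known data sets
cdm list                     # browse descriptions of the available data sets
cdm install movielens-small  # clone the data set's git repo and load the cluster
```

Each install clones the referenced git repo, creates the schema, and loads the sample data into the local Cassandra cluster.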
A: These are curated, small-batch data sets. The idea here is that if we can get enough data sets in, like, for instance, MovieLens, we can start to build machine learning tutorials, right? And they're nice because they're reproducible: you can install the data set, and then maybe you download a Jupyter notebook or some sample code, or maybe a .NET application which does it one way, and you can kind of follow along. When it comes to learning, having 20 tutorials that all use the same set of data models gives you a lot of flexibility. So I know that Danny, who is co-speaking with me at OSCON, is going to be writing some data sets too. She does a lot of medical research; there's an open diabetes-related data set she's going to pull in, and things like cancer rates, and various economic data sets. We can have some really, really cool things in there. And then Patrick is working on getting KillrWeather in (we were messing around with KillrWeather), and we've kind of talked about getting KillrVideo support in there too.
C: We do the analytics directly off the customer database, and we basically have a [inaudible] cluster on top to perform [inaudible] with APIs. We've been doing research at the moment to integrate Spark, and one reason is that we're a cloud provider, okay? We have six data centers, and we would like to start analyzing and streaming the actual network traffic that's coming in and out of our data centers.
C: So, as you can imagine, this data is pretty heavy and large, constantly coming in and out, and for that we will be using Kafka and Spark. We've been working on it and doing some R&D; it's not yet in production, and it is difficult, but we are looking forward to starting to deploy it for some initial use cases around actual network analysis by the end of the year.

A: Cool.
A: So that's kind of a good background on what you're doing, and I think this is where we shift gears. You were talking before about your migration to 3.0. How did that go? What did that look like for you guys?
C: All right. So, I mean, we've been running Cassandra in production since 2.0: we did 2.0 in 2013, 2.1 a year after, and this year we did 2.2 and then 3.0. And let me tell you why we did 2.2 first. If you want one good reason to actually go to 2.2 before 3.0.0, that would be the drivers.
C
That's
how
we
did
it,
so
it's
been
pretty
easy
going
to
21
to
22
on
Miss
Lee
super
super
easy
with
the
driver.
Everything
was
working,
fine
and
a
new
position
to
the
20
to
21
migration
that
require
some.
You
know
different
tuning
and
we're
a
bit
of
hardware
bump
when
the
happen
at
the
time
he's
been
mostly
a
flawless
now
to
do
hard
and
play
with
some
stuff
as
well.
C: I mean, for me, just the ability to run the new driver against 2.2 is, in my opinion, a good step, and a pretty easy one, to make sure that everything's okay. And you get bootstrap resume, which is honestly super handy for us and worked really well. Yes, that was easy: almost no changes as far as configuration, same hardware. I'd been monitoring it for two, three, four weeks at a time, and everything was okay.
C: We have between one and two terabytes on each node, so you can imagine, it can take a while, right? Yep. But usually we don't need a lot of elasticity. We have pools in every data center, no shortage of resources, and with monitoring we can basically anticipate the load as it comes, because we know the number of VMs getting bootstrapped on our cloud at that point.
C: So after that, once the application was migrated and running on the new Java driver against 2.2, we started migrating to 3.0, and it's been mostly okay, but we've hit a couple of roadblocks. Let's talk about the good things first about 3.0.0. One thing that we've seen is the disk space used, compared between 2.2 and 3.0, with the new storage engine.
C
For
us,
it's
been
quite
interesting
because
we
could
see
up
to,
I
would
say,
twenty
percent
just
saving
on
disc
after
the
nerds
were
basically
am
I
grouchy
with
the
new
SS
tables.
Yes,
so
yeah.
That
was
really
nice.
We
were
actually
quite
surprised
about
it,
and
and,
as
ben
has
been
really
good
cool.
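A sketch of how those savings are typically realized after an upgrade: the 3.0 binaries can read old SSTables as-is, but files only shrink once they are rewritten in the new format. The keyspace, table, and data path below are illustrative assumptions:

```shell
# On each node, after the whole cluster is running 3.0 binaries:
nodetool upgradesstables                 # rewrite all SSTables in the current format
nodetool upgradesstables my_ks my_table  # or restrict it to one keyspace/table
du -sh /var/lib/cassandra/data           # compare on-disk size before and after
```

Normal compaction eventually rewrites SSTables anyway, but running upgradesstables forces the conversion so the space savings show up immediately.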
C: What else can I say? So, yes, we had some minor issues, maybe a little bit more on the actual SSTables themselves, a couple of corruptions while migrating, but honestly, considering the size of the refactoring of the storage layer, I was anticipating way more problems than that. So overall a great upgrade. And one huge benefit that you get as well with 3.0 is the hints. There are two things about the hints.
C: First, they're not in SSTables anymore, which never really worked right, and I guess this is one of the reasons why they've been put back on the file system. Before, you had to watch the JMX MBeans; you had hints that were just stuck there, you had compaction going on on the hints, and so on. And straight away, we don't have that anymore. It's mostly perfect now; you just see a couple of hints.
C: Well, yes, the hints problem was never really a performance problem as such; it was just that I had to monitor them and make sure to take care of them. I was actually using [inaudible] at the time, and they had a bunch of tuning for that, which was pretty handy. But now I don't have to do that; I've actually removed it. I'm still monitoring the hints, of course, making sure that everything is okay, but I can tell you, after several weeks of having 3.0.0 in production...
C
That's
really
is
a
huge
huge
improvement.
You're
seeing
as
well
is
that
you
can
now
a
disable
or
enable
on
a
pair
that
essential
basis,
the
hinge
delivery
in
streator
0,
which
is
like
really
awesome,
because
you
know,
depending
on
the
latin
see
you
want
to
know
tune.
Actually
you
know
we
have
data
centers
in
singapore
and
we
have
them
as
well.
In
a
you
know,
Central
America
and
East
Side
in
the
US,
so
it's
been
really
handy
for
us
as
well.
Awesome
yeah,.
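The per-data-center control mentioned here is exposed through nodetool in 3.0. A sketch, with the data center name as an illustrative assumption:

```shell
nodetool disablehintsfordc singapore  # stop storing hints for a high-latency DC
nodetool statushandoff                # check whether hinted handoff is enabled
nodetool enablehintsfordc singapore   # resume hint delivery for that DC
```

This lets you keep hinted handoff on for nearby data centers while shedding the hint backlog for a remote one.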
A: Those are definitely a big upgrade. I was very happy to see that. I mean, there are a lot of people who talk about the problems with hints and how they can actually result in a poorly performing cluster, performing even worse, like the compaction problems that result from them.
C: I think that was definitely it. Look, if you monitor the hints on a 2.1 or 2.2 cluster, you'll definitely see the compaction threads being busy, stuck on compacting hints; that's basically what's happening. You can remediate it once you know this is the case, you can take care of it very easily, but the fact that we don't have to do that anymore with 3.0.0 is super cool. It does it for us.

A: Nice.
C: And yeah, I want to mention something as well: you guys have been doing an awesome job with the documentation. You should check the upgrade documentation section that you guys write (I don't remember what you're calling it nowadays), but it's really complete. I mean, everything's in there: the tips, the good practices, the basics of what you have to do, and not do, when you do a migration. So people, basically: don't be afraid of upgrading. It's mostly straightforward, right?

A: Awesome.
A: All right, well, I think we're about out of time. So, Luke, is there anything you want to say before we sign off?

B: No.