From YouTube: This Week in Cassandra: Analytics without ETL 3/25/2016
Description
Link to blog referenced in video: http://www.planetcassandra.org/blog/this-week-in-cassandra-analytics-without-etl-3252016/
Jon: Hello, human beings, how are you? I'm Jon Haddad. This is This Week in Cassandra on Planet Cassandra, with me: Luke Tillman, also a technical evangelist for DataStax. And of course we have Evan Chan from Tuplejump. He's pretty much one of the most intense analytics people in the Cassandra community that I know of, so this is pretty exciting, I think.
Jon: We've got a couple of blog posts that we're going to be looking over, and we're going to be talking to Evan about analytics on Cassandra without ETL, which is kind of a niche thing that not a lot of people are aware of, I think, though over the last few years it's definitely been getting more popular. So let's take a look at our blog posts. The first thing that we're going to dig into is this Sysdig post about analyzing what's going on in production. This one I thought was awesome; having spent a lot of time in the ops world, I love this Sysdig post. What do you think, Luke?
Luke: You know, I'm not an ops guy, so any time there's an ops post... I also happen to be the token Windows guy on the team, so my Linux foo is pretty weak when it comes to stuff like this. So I'd ask you, because you were really excited about this. I was definitely familiar with some of the tools that they were talking about, from what it looks like.
Jon: I was going to try not to rant about this for the next ten minutes, but since the ball's in my court, I don't mind doing it. OK, let's just dig in. Reading this article was really fun for me. There are a couple of things I really appreciated about it. One was the attention to detail: the detail in this post, to me, is absolutely amazing. I love this thing.
Luke: You know, another good thing about this article was that he walks you through the steps of how to debug a Cassandra problem using this tool, too. So from a practical perspective, it's not just talking about how great the tool is; it's very nice to see the thought process that went into it. Yeah.
Jon: Huge fan of this one. The next thing we had on our list was kind of cool: an anti-patterns post, and I like these. I love a little bit of sarcasm: how not to start with Cassandra. It's really great, because I think it's how a lot of people try to start with Cassandra, and unfortunately they get it wrong and shoot themselves in the foot. Yeah.
Luke: I was going to say, I kind of feel like... you know, we go around and do these Cassandra Days events, and you and I are often in the beginner track, presenting to people that are brand-new, that have relational database experience (which is almost a hundred percent of the rooms we present to) but don't really have any Cassandra experience. And I felt like this blog post hits a lot of the things that we tend to say during those presentations.
Jon: So one of the points that this blog post had in here was about iterating over your data model, right? Like, things that you have to solve up front. One of the things that's interesting about this is that we always tell people: you want to know your queries up front, so you can optimize to query against a single partition.
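A minimal sketch of what that query-first modeling looks like in CQL; the table and column names here are hypothetical, not from the blog post. If the known query is "all readings for one sensor on one day, newest first," the table is laid out so that query reads exactly one partition:

```sql
-- Hypothetical schema: the table is shaped around one known query,
-- "give me the readings for a sensor on a given day, newest first."
CREATE TABLE readings_by_sensor_day (
    sensor_id  uuid,
    day        date,
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, day), reading_ts)  -- one partition per sensor per day
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- The query the table was designed for: a single-partition read.
SELECT reading_ts, value
FROM readings_by_sensor_day
WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000
  AND day = '2016-03-25';
```

Any query this table was not designed for (say, "all sensors above a threshold") would need a different table, which is exactly the up-front commitment being discussed.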
Jon: I have not encountered a magical database where you just throw random stuff at it and then all of a sudden it's like, hey, I've got your data, and it's fast, and I didn't need to know anything. I don't know if you guys have touched this mythical unicorn database that just works for you automatically. No.
Luke: And I guess I'd just make the point, and I like to make this at Cassandra Days too: you touched on third normal form, and it's so ingrained in those of us with relational backgrounds. Yeah, you do data modeling up front, but a lot of it is on autopilot, because there is this sort of prescribed way that you start your data model. So you're probably not thinking about it as much, and you're definitely not thinking about it in the same way that you do with Cassandra data modeling, where, at least on the transactional side, you're starting with your queries up front to try to make them fast. And that's because we don't have joins, we don't have secondary indexes in the same way; we can't just, you know...
B
Do-
and
we
have
better
ones
now,
yeah
but
that's
a
whole
other,
but
the
whole
their
topic,
but
we
don't
have
those
kind
of
things
to
just.
You
know
the
queries.
We
don't
think
about
up
front.
We
don't
have
those
things
to
kind
of
fix
the
problem
for
us
later.
You
know
in
it
like
that.
We
do
in
a
relational
database,
so
yeah.
Jon: Well, maybe... wait, maybe we actually do have those tools, which is kind of an interesting segue. The other thing that we wanted to talk about was this whole analytics world, right, and fixing your data model. Luke, I completely agree with you that, you know, if you have a totally broken data model and you roll into production like, hey, I'm going to use Cassandra, and, let's say, you're using a bunch of secondary indexes...
Jon: How many times have you looked at the lambda architecture and seen all this chaos and this ETL you have to do? It's just a lot of cognitive overhead. So I think one of the reasons why I'm really excited to have you talk with us today, Evan, is that you're one of the first people I've heard of that had been working with Spark and Cassandra together. I actually saw your talk at the Fort Mason summit, where you presented on this, right?
Evan: I think we're definitely seeing more and more people that want to try out using Spark and Cassandra to do queries, especially those coming from the relational world, or people having to work with traditional BI stacks. And I think the two things you bring up are really interesting. One is the need for data modeling, but two is that, coming into the Cassandra world, things are, you know, a bit different, right?
Evan: For one, you don't have joins. But I think that when you look at the power of Spark, one thing it gives you is that you can do these more complex things. You can do joins. That doesn't make them fast, necessarily, but the fact that you can do them, and can do other things, opens up a lot of possibilities. For example, you see more people trying to marry Cassandra with machine learning by using Spark: they can store the raw data, the time series data...
Evan: Then they can pull it out and build models with it, which is pretty powerful.
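As a sketch of the kind of query this opens up (all table and column names here are hypothetical, and the two tables are assumed to be exposed to Spark SQL through the Spark Cassandra connector), here is a join plus aggregation that plain CQL cannot express:

```sql
-- Hypothetical Spark SQL: join raw time-series events against a
-- device-metadata table and aggregate, neither of which CQL supports.
SELECT d.model,
       count(*)     AS events,
       avg(e.value) AS avg_value
FROM events e
JOIN devices d
  ON e.device_id = d.device_id
WHERE e.day = '2016-03-25'
GROUP BY d.model;
```

As Evan notes, Spark makes this possible, not necessarily fast: the join may scan whole tables across the cluster.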
Evan: And like what you mentioned about the lambda architecture, that's something that I've spoken about, and my colleague Helena Edelson has talked about; we're doing a joint talk at Strata next week. The idea is that, I think, for a lot of people, they want the benefits of Cassandra.
Evan: They want to be able to write to a solid database, one with idempotent writes, for their IoT or time series stuff, but at the same time they need to run a lot of analytics on it. So what a lot of people do is ingest into Cassandra, but at the same time do ETL into, say, HDFS files, and then suddenly you've got these two systems. You need to maintain two systems, and you need to figure out how to merge results.
Evan: So this is pretty complex, right? If you can avoid it and do everything in one system, why not do that, right?
Evan: I think that's a really good question. I think a challenge that people have is that Cassandra is designed for you to read and write small amounts of data with massive concurrency, and it works extremely well for that.
Evan: But if you want to use it to read a huge amount of data for bulk analytics, you need to be a bit more creative. If you just use normal CQL tables, you might find that the data sizes and query speeds are not what you might be used to from the HDFS or Hadoop worlds. So you need to be a bit more creative, and there are a couple of different strategies you can take. I think the traditional strategy would be: let me model many different tables, one per query.
Evan: So let me have a job that can summarize. One thing we used to do at Ooyala was to have massive Hadoop jobs: we would crunch every conceivable slice-and-dice kind of query you could run, and write the results into Cassandra. That way you can read them out as very small reads.
Evan: But that has some limits and becomes inflexible, because any time you need to change something, I would need to edit my massive Hadoop job, and that would not be easy. And for certain kinds of queries, you don't have enough space to write out every single kind of query, right?
Evan: It becomes impractical. The other thing is that right now I'm working with an enterprise that is actually looking to move a data warehouse into Spark and Cassandra. So for them it's kind of a trade-off, right? There are a lot of reports they need to run, and there's some flexibility that's required at the extreme.
Evan: At one extreme, we can try to aggregate everything into tables; then you can do small reads, but that increases the ETL complexity. And in, say, a traditional star schema, certain dimension tables might be slowly updated, and when they do get updated, having to update all of your query tables becomes more complex. So it's kind of a trade-off, like a scale: do you want to go all the way to that end, where you write out everything and try to update everything?
Evan: Or do you try to do something more in between? I think Spark gives you that flexibility, where you don't have to write out everything; you don't have to carry data modeling to the extreme. Data modeling is still very important: you still have to have partition keys that facilitate really fast lookups. But you don't have to take it quite to that extreme, and you can do more things in Spark. Yeah.
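A sketch of that middle ground: keep one sensibly partitioned raw table and let Spark SQL compute a rollup on demand, instead of maintaining a precomputed Cassandra table per report (all names here are hypothetical):

```sql
-- Hypothetical Spark SQL rollup computed at query time, replacing
-- one of the many pre-aggregated query tables a Hadoop job would maintain.
SELECT sensor_id,
       hour(reading_ts) AS hr,
       min(value) AS lo,
       max(value) AS hi,
       avg(value) AS mean
FROM readings_by_sensor_day
WHERE day >= '2016-03-01'
GROUP BY sensor_id, hour(reading_ts);
```

The partition key still matters, as Evan says: it keeps the scan bounded, while Spark absorbs the aggregation work that would otherwise be baked into the schema.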
Jon: No, but if you have a big analytics job that takes, say, ten minutes to run, and then you go, cool, I made an optimization to this key and now it takes nine minutes, you didn't really solve a major problem, right? Unless I happen to have an SLA of, like, nine minutes 45 seconds, then you're fine, I guess. But for the most part it's not going to make that big of a difference. Yeah.
Jon: I really, really like the additional flexibility that you get with Spark, not having to remodel all your data. I completely agree with you, and that's really what the whole data science world is all about, right? We don't know what we want to get, and that's what people struggle with a lot of the time. They look at Cassandra and they're like, what do I do? What happens when I don't know all the queries that are coming up front? Right.
Jon: I love that you get that with Spark. I also wanted to talk to you a little bit about the project that you've been working on for a while now: FiloDB. We had a meetup together in Amsterdam, and it was pretty cool to talk to you for a while about that. Can you talk a little bit about it?
Evan: About FiloDB? Yeah, definitely, I would love to dive into that. So sometimes, as Jon mentioned, your jobs are machine learning jobs that could take a long time, half an hour to an hour, and you don't care. But sometimes you do actually care about response time; for example, for BI, your enterprise may be used to sub-second response times, right? So you don't want something that takes half an hour.
Evan: Typically you only care about a few columns out of the many columns in your fact table, or whatever it is. So FiloDB stores this data as regular Cassandra tables, and the benefit of this is that you can manage FiloDB tables just like regular Cassandra tables from an operations point of view: you can back them up, and we store them exactly the same way as regular Cassandra data. But you get the benefit of very fast query speeds from Spark. Because the data gets stored very compactly, I think for some fact tables we've seen up to a 40x reduction in size, and the speed gains are also quite big. So, basically, we did a blog post.
Evan: I think there's definitely an interest there. In fact, the current customers we're working with, I think, looked at some of the work we've done before; that's one reason why they're interested in using Cassandra and Spark, and they chose DataStax Enterprise to replace their data warehouse, because they saw this possibility.
Evan: I mean, I think for some folks it's definitely that. Some folks are like: well, we don't really want to run a giant Hadoop system; we like what Cassandra promises. So yeah, I think that's definitely a big part of it. And I've seen one company that was considering stuff like Redshift, but for them, between the economics and the simplicity of running a Cassandra stack in the cloud with something like FiloDB, it actually makes more economic sense.
Evan: Yeah, a couple of things to highlight. Helena Edelson and I have a talk at Strata titled "NoLambda," on simplifying your analytics with Spark Streaming and Cassandra, so come and check it out. I think it's on Wednesday the 30th of March, in the San Jose Convention Center annex, next week. I will also be doing a developer showcase there on FiloDB, where I'll be showing some demos of working with it.
Evan: I'll demo it with Spark and some interesting data sets, so come by and check it out. And finally, about Tuplejump: today we provide development services for enterprises wishing to integrate the latest and best in open source big data, including Spark and Cassandra. So if you're interested in partnering with us, also let me know. Cool. And I think we're also hiring.