Description
Speaker: Chris Fregly, Data Solutions Engineer
The audience will participate in a live, interactive demo that generates high-quality recommendations using the latest Spark-Cassandra integration for real time, approximate, and advanced analytics including machine learning, graph processing, and text processing.
My name is Chris Fregly. I started off at Playboy. I keep the Playboy thing up here because it actually comes into play a little bit later. It's been a few years; that's me at the mansion with my boss and his wife, back in my younger, thinner days, and that's me and my mom at Netflix. I moved out here from Chicago to join Netflix, joined Databricks shortly after, and now I'm at the IBM Spark Technology Center. Well, not shortly after: this is five or six years later.
Of course, IBM has been backing Spark, and it's pretty much a dream job, right? My job, my boss, my company fully support this, which I didn't believe. I asked everyone before I joined, like, is this for real? Can I not wear khaki pants, pleated khaki pants, can I just wear my shorts and flip-flops? And they said yeah, that's fine. So that's why I joined.
So, speaking of the meetup: I actually planned it for Monday, right before the Cassandra Summit, because I knew a lot of DataStax people would be in town from all over, and DataStax was kind enough to host it right down the street at their office. If you Google it, there are two slide presentations. Russell, who I think is actually talking right now, possibly about the Spark Cassandra connector, has about 90 slides that go into more depth than he's covering here at the summit.
So you might want to check that out, and then I have my slides, which are about half of those slides, so you won't be missing too much. And then the upcoming meetups: Project Tungsten, which you've probably heard about in Spark, then one focusing on the Elasticsearch connector, since there are a lot of people asking about the internals, and of course a Catalyst deep dive.
B
So
we'll
try
to
blaze
through
these
recommendations
live
demo
talk
about
data
frames
at
sort
of
a
high
level,
then
like
dig
into
what
the
catalyst
optimizer
does
and
other
various
query
plans.
We've
got
data
sources,
API,
that's
the
main
API.
So
when
I
like
Russell
and
like
pioter
and
Alex,
and
those
guys
went
to
write
the
spark
Cassandra
connector,
that's
what
they
used
right,
that
data
sources,
API
from
spark
and
data
bricks,
has
changed
quite
a
bit
over
the
last
year,
so
yeah
they
did
a
really
good
job.
Keeping
up
with
that
wow.
We'll talk about how to create your own custom data source, we'll talk about the native ones that are part of Spark, we'll talk about third-party ones that people have built, and then a few tips on Spark SQL performance tuning from my work at Databricks and, currently, at IBM.
So, just real quick, you guys probably know this, but there are a couple of types of recommendations: personalized and non-personalized. We saw this quite a bit at Netflix, and I'm sure there are a lot of Netflix people here. And there's the whole cold start problem, right? What do you do?
You can also do things like PageRank; you can just kind of look and see which movies people are liking. This is a dating data set that we'll be showing, so it's really user-to-item, but the item is actually another user, which is kind of funny. And then for personalized recommendations you can do things like collaborative filtering and matrix factorization. We'll have a demo of that in a bit.
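Since the transcript doesn't include the demo code itself, here is a minimal, self-contained sketch of the matrix factorization idea behind collaborative filtering: learn a small latent-factor vector per user and per item so that their dot product approximates the observed ratings. This is plain Python using stochastic gradient descent, not the MLlib ALS code used in the actual demo; all names and parameters are illustrative.

```python
import random

def factorize(ratings, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Factor a sparse ratings dict {(user, item): rating} into
    user/item latent-factor vectors via stochastic gradient descent."""
    rng = random.Random(seed)
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(k)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(k)] for i in items}
    for _ in range(steps):
        for (u, i), r in ratings.items():
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, user, item):
    """Predicted rating is the dot product of the latent vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))
```

In the real pipeline this step is what MLlib's ALS does at scale; a user's recommendations are then just the items with the highest predicted score.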
So this is interesting. These are some terms I picked up over probably the past couple of years about different types of user feedback. Back in the early days of Netflix, they relied on explicit feedback: ratings, or people liking each other's pictures, like Tinder, where you're just swiping.
B
Those
are
just
likes:
they're,
not
1,
to
5
1
to
10
ratings,
it's
just
zeros
and
ones,
but
you
can't
really
count
on
that
right,
so
yeah,
there's
these
other
types
of
feedback
call
the
implicit
feedback
which
is
like
hover,
easing
searches
and
clicks,
and
right
like
how
long
you
spend
watching
the
movie.
How
long
you
write
like
view
the
person,
if
you
click
in,
to
see
multiple
pictures
of
the
person.
If
you
spend
time
reading
the
air
like
movie,
the
like
summary
and
all
that
kind
of
stuff.
All of that gets fed into these models, obviously, for future recommendations. So that's the big data, right? Just a quick note on similarity, because this is what it all comes down to: we're trying to find user-user similarity, user-item similarity, and item-item similarity, which was made pretty popular by the guys up at Amazon. And it basically all comes down to math.
There's this concept of log likelihood, sort of associated with Jaccard similarity, where you factor out popularity. At Netflix, for example, in your first week there you had to say what your favorite movie and TV show were, and I guess something like eighty-five percent of people said The Shawshank Redemption, which was kind of funny. But that's an example of something that's not really a high-value recommendation, so you want to factor it out, and those are techniques to do it.
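The log likelihood idea mentioned here is usually computed as Dunning's log-likelihood ratio over a 2x2 co-occurrence table, the scheme Mahout popularized for item-item similarity. A minimal sketch in plain Python (illustrative only, not code from the talk):

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = both events together, k12/k21 = one without the other,
    k22 = neither. Higher means more surprising co-occurrence."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A statistically independent table like (5, 5, 5, 5) scores zero, while strong co-occurrence like (10, 0, 0, 10) scores high. That is how an extremely popular title like Shawshank gets damped: it co-occurs with everything, so its table looks close to independent.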
And then comparing similarity is the whole challenge here: you basically have to compare everything to everything else, and of course that's the scariest thing in the world. That's a Cartesian product, and that's tons of shuffle, a huge cost, network cost. So there are clever ways to do it. You can approximate; that's the most obvious. You try to reduce m, which is the number of rows, by bucketing.
There's this algorithm called locality-sensitive hashing that probably comes up about once a week; people message me asking about it, so that's one to keep an eye on. And then to reduce n, you can use sparse matrices and pull out common values like zero. Something interesting: zero sometimes isn't actually the most frequent value, so keep an eye on that.
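As a rough sketch of what locality-sensitive hashing buys you: hash each vector into a short signature so that similar vectors usually land in the same bucket, then only compare candidates inside a bucket instead of doing the full Cartesian comparison. This is a toy random-hyperplane (cosine) variant in plain Python, not Spark code; names and parameters are illustrative.

```python
import random

def signature(vec, planes):
    """One bit per random hyperplane: which side of it the vector falls on."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)

def lsh_buckets(vectors, num_planes=8, seed=42):
    """Group named vectors by signature; only vectors sharing a bucket
    need a pairwise comparison, instead of every possible pair."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(num_planes)]
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(name)
    return buckets
```

With b signature bits, two vectors at angle theta share a bucket with probability (1 - theta/pi)^b, so you trade a little recall for dramatically fewer comparisons.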
All right, so this talk used to be called Spark After Dark, which is still kind of out there on the web. It was, in essence, from my Playboy days. When I first joined I was too young to know about Playboy After Dark; I think it was in the '60s or something. I got ahold of all the footage and started watching, and it was super progressive: Marvin Gaye, old-school Jerry Garcia and, of course, Hef in his heyday with his short haircut.
So that's kind of where all this came from. This is what I need you guys to do, and then I'll do it with you. There's this project called Flux Capacitor. If you guys have seen me talk before, I tend to use "flux capacitor" for various projects I've done over the years. It started off as a Netflix project, and now it's moved over to big data.
B
Over
to
Big
Data,
so
we've
got
Kafka.
We
got
spark
streaming.
Cassandra
we've
got
ml
lib,
generating
the
models
and
making
recommendations,
putting
recommendations
into
elasticsearch
right
like
really
for
no
good
reason.
I
just
wanted
to
like
demo
the
elastic
search
connector
and
see
how
it
worked,
and
then
the
user
yeah,
so
the
user
talks,
so
the
user
being.
If
you
go
to
spark
after
dark
and
select
three
like
actors
and
three
actresses,
that's
that's
now
coming
into
Kafka,
which
is
listening.
I have a Docker image sitting out on SoftLayer. I was using Amazon up until like two weeks ago, and IBM started kind of pushing me toward SoftLayer. But it's a bare metal piece of hardware, so it's actually pretty fast. So if you guys just want to do that for a little bit, for like a minute.
I'm going to do it too. This is totally anonymous; I don't know who you are, and the data is going to get blown away when I get rid of the Docker image anyway. So just pick a few of your favorite people, both men and women; it doesn't matter, we're all adults here, we're all in the Bay Area.
I'm going to make sure I zoom in here. I usually don't go about 10 seconds without someone yelling at me to make the screen bigger, so I've gotten better about that. Have you guys used notebooks? I've used IPython Notebook; raise your hands. Yeah, so this is very similar to that; it's going to look a lot like this. Databricks as a product, which used to be called Databricks Cloud and is now just rebranded Databricks, is also a notebook-type thing. They support R.
B
They
support
right,
like
a
lot
of
like
the
advanced
commercial
features,
but
yeah
Zeppelin
is
open
source.
So
here
I'm,
just
importing
libraries
I've
got
the
standard.
Connector
I've
got
Kafka
and
let's
see
some
CSV
stuff
here
so
yeah
so
like
when
you're
clicking
on
the
actual
page,
that's
calling
a
REST
API
into
Kafka.
That's
a
right
like
confluence
the
company
that
spun
out
of
like
LinkedIn
those
guys
been
building
a
rest
proxy,
which
is
super
valuable
I,
don't
know
if
anyone's
actually
gonna
use
it
in
production.
All right, so just to show you guys some of the code: I'm just setting up reference data right here. I basically just pulled it off IMDb: actors' and actresses' profiles, their bios, pictures, things like that, which I can join in just to make some of these charts a little bit more interesting, showing names instead of IDs.
This is just going to display. So the number that is on your guys' web page is the one on the left; that's the "from". And then the "to" is going to be one of the profiles; the 9000s, I think, are the females and the 1000s are the males. By the way, your recommendations are going to have both. In fact, I had one time where all five of my recommendations were guys, so I'm not quite sure what was going on there.
It got wedged earlier today, but here's some PageRank action, pretty typical. If you think about it, you have all these people on the left who are liking all these people on the right; picture Tinder, kind of swiping along. So you've got this big bipartite graph going on here, and you're contributing your likes to them and they're contributing back to the rest of the graph. This is like my buddy right here.
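The bipartite "likes" graph described here is exactly what PageRank-style scoring runs over: every like passes some of your score along to the profile you liked. A compact, self-contained power-iteration sketch in plain Python (illustrative, not the code from the demo):

```python
def pagerank(links, damping=0.85, iters=50):
    """links maps a node to the list of nodes it 'likes' (points to).
    Returns a rank per node; ranks sum to 1."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: spread its rank evenly over everyone
                for other in nodes:
                    new[other] += damping * rank[node] / n
        rank = new
    return rank
```

On a likes graph, a profile liked by many well-liked users ends up ranked higher than one liked only by a single indiscriminate user, since each like's contribution is split across everything that user liked.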
B
He
just
likes
everyone
just
because
he
thinks
that
that's
how
he's
going
to
you
know
you
just
play
the
numbers.
I
guess
all
right.
So
let's
try
to
do
so.
Let's
get
to
the
interesting
stuff
here
which
is
going
to
be
the
matrix
factorization
we
can
get
back
to
this
later,
but
basically
what
it
would
do
is
populate
elasticsearch
and
then,
when
you
click
those
three
links
at
the
bottom,
those
are
our
personalized
to
you.
It
knows
your
ID
and
it
would
grab
it
out
of
that.
It.
All right, so if you guys want to see some more of this in action: github.com/fluxcapacitor. There's a Docker image that has all of this built in, so it's got Kibana, Logstash, Ganglia, all the metrics and all that kind of stuff. It's pretty cool. All right, so DataFrames, inspired by R and pandas. I'm going to just kind of blaze through here. You should always be writing to DataFrames; there was a big shift to this as of Spark 1.3.
So think of a DataFrame as a logical plan container, like Pig or SQL, that kind of thing, where you're just building up transformations, building up a DAG, but then Catalyst is the one that actually comes in. Catalyst was a rewrite. Initially there was Shark, and Shark was basically Hive on Spark; it used the Hive optimizer beneath the covers, and it really didn't fit into the Spark execution model. It was more like MapReduce.
B
It
was
limited
by
that,
so
it
was
kind
of
a
right
like
sit
down
we're
like
a
big
meeting
where
they
decided
yeah.
Let's
just
rewrite
this
thing.
This
is
it's
going
to
be
a
long
road
ahead,
but
yeah
they
have
smart
guys,
guys
guys
have
PhDs
and
this
kind
of
stuff
so
yeah
they
set
out
to
do
it
and
keeping
it
clean,
keeping
it
it
open
too.
So
that's
the
data
sources
API
that
we'll
talk
about
in
a
second
you
can
plug
things
in
you
can
plug
in
custom
rules.
You can basically hook into any plan that you want; you can manipulate it at all the levels: logical, physical, optimized, that kind of thing. You can write custom UDFs. The one drawback is that Catalyst doesn't understand what's going on beneath that UDF, so you might actually have some problems if you start doing crazy things in there; Catalyst has to optimize around it and can't include it in the fun. And as of Spark 1.5 there's new UDAF support for custom aggregations.
B
You
d
is
right
now,
the
only
like
UD
AF,
so
you
can
use
our
the
hive
you
da
FS,
so
spark
sequel
is
essentially
a
closed
subset
to
hive
ql
right
like
closed
meaning
there's
a
few
obscure
things
that
sparks
equal
doesn't
implement.
Right,
like
one
of
the
things
I
BM
and
data
bricks
have
been
talking
about
together,
is
trying
to
build
that
out
to
get
right
like
to
become
more
like
an
c
sequel,
compliant
get
like
t,
pcds
and
all
that
stuff
running
smoothly.
The big thing here, which we'll talk about quite a bit in a second, is predicate pushdowns: basically the ability to push filters as deep down into the source itself as possible. You'll see this when we talk about the Data Sources API; there are ways to actually push filters down so that you aren't returning that data at all.
I'll show examples of that where you can actually check, and I think I actually found a bug in the Spark Cassandra connector this morning, so I have to track down Russell and those guys after this, because I can't seem to get it to push down. It's probably something I'm doing, but it's not clear. So, yeah, there are hooks for custom rules, so you can implement your own rule.
B
Alright,
so
he
at
the
top
is
a
yeah
there's
this
this
sort
of
concept
of
a
data
frame,
dsl
right.
So
it's
all
the
familiar
things
selects
and
filters
and
if
you
run
explain
true
on
it
or
from
a
sequel
standpoint,
just
really
put
explained
at
the
beginning
of
your
sequel,
it's
going
to
be
the
exact
same
thing:
it's
going
to
dump
out
the
parse
logical,
the
analyze,
logical,
yeah,
the
optimized,
logical
and
then
physical
yeah,
so
keep
an
eye
on
these.
So what you'll see here are the early phases. There are two filters at the top there. There are really three types of genders in this data set: female, male, and then unknown or undetermined or something, so here I'm specifying them separately, and they show up on two separate lines. This is kind of simple stuff, but the optimizer collapses them, because if we stuck with the parsed logical or the analyzed logical plan, that would require two passes through the data.
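To make the "two passes versus one pass" point concrete, here is a toy version of that optimizer rule in plain Python (illustrative only, not Catalyst's actual implementation): a Filter sitting on top of another Filter gets rewritten into a single Filter with a combined predicate, so the rows are traversed once.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Scan:
    rows: List[dict]

@dataclass
class Filter:
    predicate: Callable[[dict], bool]
    child: Any

def collapse_filters(plan):
    """Rewrite Filter(p, Filter(q, child)) into Filter(p AND q, child)."""
    if isinstance(plan, Filter):
        child = collapse_filters(plan.child)
        if isinstance(child, Filter):
            p, q = plan.predicate, child.predicate
            return Filter(lambda row: p(row) and q(row), child.child)
        return Filter(plan.predicate, child)
    return plan

def execute(plan):
    """Naive interpreter: every Filter node is one pass over its input."""
    if isinstance(plan, Scan):
        return plan.rows
    return [row for row in execute(plan.child) if plan.predicate(row)]
```

Catalyst applies rewrite rules in this spirit to the logical plan before a physical plan is ever chosen.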
B
So,
of
course
we
can
collapse
those
filters
and
just
do
one
pass,
and
then
you
see
that
in
the
physical
and
then
yeah,
so
up
top
is
kind
of
a
like
generic.
This
is
on
the
spark
right
like
documentation,
site
the
sparks
equal
programming
guide,
but
that
kind
of
shows
there's
there's
some
cost
analysis
going
on.
So
after
physical
plans
are
chosen,
there's
a
cost
step,
we're
basically
there's
there's
one
optimization:
that's
there
right
now
and
that's
the
like
broadcast
join
right
like
broadcast
hash
join,
which
takes
it's
basically
maps.
B
B
Yeah, so here's the data set. When I started on this Spark After Dark thing about a year ago, this was the only really public dating data set that I could find, and it suits my needs: it's just simple joins, and I could do partitions based on the rating, partitions based on gender, things like that. So let me actually show that.
Okay, hopefully Zeppelin has settled down here a little bit. I just kind of threw this together this morning, a performance comparison. With these notebooks you can put in markdown, and you can actually run shell commands. Please don't hop on the server; there's no protection, so please don't delete everything. I mean, it's a Docker image, so you're not going to get that far, but yeah.
B
So
we're
going
to
compare
we're
basically
going
to
join
ratings
and
genders
right,
so
ratings
have
user,
ID
have
both
user
IDs
and
then
rating
and
then
I
want
to
get
their
gender.
So,
like
pretty
simple,
but
the
thing
to
note
here
is
that
CSVs
JSON
you,
you
cannot
partition
by
those
right
now.
Yeah
they're,
not
supported
I,
think
there's
work
going
on
specifically
for
like
JSON,
but
still
actually
I
think
it
might
be
experimental
in
15,
but
now
with
parquet.
Parquet is similar to ORC, and by the way, there is ORC support as of Spark 1.4, I think, so if you guys are still using that, it's supported now. So we're just kind of comparing. The whole point here is: where are our pushdowns happening, and how effective are they? So CSV is just kind of pulling things in here. I want to point out one thing: if you see a Filter in your physical plan, that's bad. So this is CSV, where there are no pushdowns.
This means I'm pulling in all the data, and then Spark has to filter it. My query here is trying to find what I call the medium hotties, which would be ratings four through six. I did it that way because I just wanted to have two filters there. where and filter are just aliases, so they're the same thing. And then I'm printing out all the plans, and you see that the filter gets collapsed, which is kind of nice, but it's still there in the physical plan.
Parquet is good at column skipping, because it has all the columns together: if it knows that you're only selecting one out of hundreds of columns, it'll physically skip over the others right on disk. But it's not partitioned by rating, so let's look at the bottom here: Filter. That's not good. Now, if we do...
So now let's see how things evolve here. We have these two filters, they get collapsed, which is pretty cool, and then boom, right there, this changes to a UnionRDD. So this is taking the bare minimum and then sticking the pieces together. If you see MapPartitions, that means it has to do a full scan, and that's not good. So let's get to Cassandra here.
This is the smaller table that is benefiting from the actual partitioning; I did that on purpose. What is this? Oh yeah, I tried to switch over to 1.5. Basically, what I found with Cassandra, and I was trying the 1.5 connector, the new one, to see if it fixed the issue, is that you'll still see Filter on top and then MapPartitions.
The one to keep an eye on is PrunedFilteredScan; that's basically the Holy Grail: column pruning and predicate pushdown. I'm working on DynamoDB, just because, I don't know, I signed up for it a while ago and I'm still working on it. But if you want to create your own data source, this is what you do: just basically look at the existing ones.
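For a feel of what a Data Sources API implementation has to provide, here is a toy, in-memory analogue of the PrunedFilteredScan contract in plain Python. The real trait is Scala, and its buildScan receives the required columns plus the pushed-down filters; everything here is an illustrative stand-in.

```python
class ToyPrunedFilteredSource:
    """In-memory stand-in for a data source supporting column pruning
    and predicate pushdown: the engine hands down the columns it needs
    and simple filters, and the source returns only matching data."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row

    def build_scan(self, required_columns, filters):
        # filters: list of (column, op, value) triples
        ops = {"=": lambda a, b: a == b,
               ">": lambda a, b: a > b,
               "<": lambda a, b: a < b}
        return [
            {col: row[col] for col in required_columns}   # column pruning
            for row in self.rows
            if all(ops[op](row[col], val)                 # predicate pushdown
                   for col, op, val in filters)
        ]
```

If a source doesn't implement this, Spark pulls every row and column back and applies its own Filter step, which is exactly the "Filter in the physical plan" smell from earlier.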
So far the most complicated ones that I've seen are the Cassandra and the Elasticsearch ones, and then Parquet, of course. Keep an eye out, if you guys don't know Spark Packages, and do contribute your own data source. Right now IBM is obviously working on DB2 and their Big SQL and some other integrations. As of about a year ago, maybe a year and a half ago, Databricks doesn't accept anything into the core that is vendor-specific, so that kind of slowed things down.
Back to pretty basic stuff: pushdowns. This is showing how Catalyst is pushing that filter deep into the source. I did this on purpose; it's kind of ridiculous. These are all the rules for Cassandra: obviously, anything that's going to get pushed down has to be part of the partition key, and there are some weird combinations of things and certain rules that still apply to the partition key, like whether the partition key column is the first or the last, whatever. So: native JDBC, JSON...
And there's CSV, which is actually built by Databricks; they update it with every big Spark release. Same thing with Redshift. Databricks Cloud, the product, is built on Amazon, and of course a lot of our customers have Redshift. Redshift is pretty interesting, because the Redshift master is a single bottleneck, so any time you're returning a super large number of results, that's obviously not going to be good.
B
So
what
we
do
is
we
actually
write
out
to
s3
and
then
we
can
parallel
pull
from
that.
So
yeah
within,
like
redshift,
there's
I
think
it's
called
unload
and
you
give
it
a
the
yeah
that
s3n
that,
like
temporary
bucket
to
store
it
in
and
then
we
can
parallel
eyes
from
there
and
then
pull
the
data
in
faster
there's.
An
upcoming
meet
up
I
think
it's
februari.
It's
right
before
the
elastic
search
conference
coming
up,
februari
2016.
We'll be doing the same thing I did on Monday with Russell and the guys from DataStax: we'll be tearing that thing open. Let's see, Cassandra, of course. This is kind of a cool one: there's a REST data source, by Michael Armbrust. If you guys have any questions about Spark SQL, he's not that hard to find.
B
Obviously,
when
you're
a
startup,
so
I
fill
in
the
blanks
there,
but
yeah
he's
super
active
on
the
spark
user
list
too.
So
yeah
posting
there
is
the
best
thing
to
do,
but
it's
kind
of
cool
like
you
can
give
it
a
rest
endpoint.
You
can
write
and
read
in
a
restful
way,
yeah,
I'm
working
on
dynamo,
yeah
me
and
ehrlich.
I think this is one of the last slides here: performance tuning. I'm not going to go over these specifically; I'll post this on SlideShare right after. But some of the various things: you definitely want to turn on Tungsten; bump up the shuffle partitions value, because people always forget about that one and it's a huge bottleneck; and turn on partition discovery. That's for when you're reading: you can point to a parent directory and it'll infer all the partitions.
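For reference, the Spark 1.5-era settings being alluded to would look something like this in spark-defaults.conf; the values are illustrative, and shuffle partitions should be sized to your own cluster:

```properties
# Tungsten execution (on by default in 1.5, but worth pinning)
spark.sql.tungsten.enabled        true
# Default is 200; a common bottleneck when too low for large shuffles
spark.sql.shuffle.partitions      400
# Push filters down into Parquet scans
spark.sql.parquet.filterPushdown  true
```

Partition discovery needs no flag here: pointing a read at a parent directory of key=value subdirectories (for example gender=F/) triggers it automatically.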
So if you have partitions by date or gender or whatever, it'll figure that out. Yeah, I'm heading to Spark Summit Amsterdam; anyone going to that Spark Summit, please bail me out of jail. That's what the little police guy on the slide is. And then the 13th is my birthday, in Scotland, so that's not going to be pretty.