Apache Cassandra Cassandra Summit 2014, 10 Oct 2014

From YouTube: Nexgate: Social Media Security Company Nexgate Relies on Cassandra for Fraud Detection

Description

Speaker: Harold Nguyen, Senior Data Scientist at Nexgate

In this talk, we focus on a use case showing how Cassandra can detect spam and spammers on social media. We also show how we use Cassandra to train our 100+ social-media-security classifiers. The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we've ever scanned, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This talk is about how DataStax and Cassandra make it easy.

A

So I'm really excited to be here. I'm a data scientist at Nexgate, and my background is in particle physics. In the Valley, people sometimes call that a "YAPP," I've heard, which is "yet another particle physicist." I'm going to be talking about Cassandra at Nexgate, and it's going to be kind of a story about how we moved to a NoSQL solution.

A

So what is Nexgate? I just want to give you a bit of background first. Suppose you were a Fortune 100 company: how many social media accounts do you think represent your brand? Let's just focus on Facebook pages first. Does anyone want to guess how many Facebook pages your brand might be represented by? Any quick guess on a number? Maybe two, ten?

A

Five? Well, it turns out to be around 300 accounts, and that's just Facebook. When you throw in Twitter, LinkedIn, Google+, and YouTube, that's a lot more. So imagine trying to sift through all this content and classify it manually. There's a lot of stuff in there that doesn't belong. You have bad actors on Facebook pages who are trying to harm other users.

A

These are your audience members on your Facebook page, so the bad actors are diluting the message that your brand is trying to present to its audience. Nexgate is a technological solution that helps you discover the accounts you have and helps you protect them. You might have seen some hacked accounts in the news lately, like the White House one: a Twitter account was hacked a couple of years ago, and there was a bomb threat. Celebrity Twitter accounts get hacked all the time. So we help protect against that, and we also monitor the accounts.

A

If you were a school, for instance, you might not want hate speech or cyberbullying on there.

A

If you were a financial institution, you don't want content posted on Facebook that violates FINRA or FFIEC rules, and if you were a pharmaceutical company, you don't want HIPAA violations. So we also offer automatic classification of your content, in over 100 categories. A little bit about the team behind Nexgate: we're a startup, we launched about a year and a half ago, and we have around 18 employees, so it's a small company.

A

So we're going to talk about how we use Cassandra as a small company. Collectively, though, the co-founders and employees have dozens of years of experience in security. And over a year and a half, these are the types of customers that are taking social media very seriously. That's pretty good, I think, for being around a year and a half. Talking about the scale of the data a little bit: there are over 350 million pieces of social media content spread across Facebook, YouTube, LinkedIn, Twitter, etc.

A

And if you had asked me four months ago what that number would be, it would have been 250 million, so we're growing. The data coming in is growing exponentially. The rate today is 11 and a half million new pieces of content per day, all classified in real time as it comes in. There are 65 million total social media authors, meaning people who are posting on social media, and about a quarter of a million new authors a day. So that's the kind of data we have.

A

That gives you an idea. All right, so the machine learning experts and statisticians in the room know that in order to have a good classification system, you need to have a lot of data. And in order to have a lot of data, you also need a strong infrastructure. I like to give a quote from Rich, our CTO: the completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.

A

Think of email spam and ham: it takes a lot of data to be able to do that correctly. Now imagine you have a hundred categories. Not only do you need breadth, you also need depth for those categories. So you need to collect a lot of data, and you need a strong and capable infrastructure in order to do so.

A

So we'll talk about how we got to that infrastructure. In the very beginning we threw everything into MySQL, and why not? As a startup you want to launch quickly. You want a tool that gets the job done and that many people already know. You can hire almost anyone off the street and they'll be able to use MySQL. It's very easy to use and also very secure. Banks are using SQL, and it's been around for a while.

A

A lot of the security issues are already known. It's also inexpensive: you can't do much better than free with MySQL, unless people pay you to use it. It manages memory very well, and up to 50 million rows you can have pretty fast queries, so starting out it's a great solution. It also supports several development interfaces.

A

So even though the trending framework of the day could be Rails, you know it's probably going to have a SQL connector, and in fact it does. This is not to say that MySQL is a dumpster, despite the picture; we really rely on it for our success.

A

So there's this: one does not simply ALTER TABLE in MySQL. After several months we realized that our data model wasn't handling all the scenarios, and we needed something else: we needed to move to a NoSQL solution. Before we talk about NoSQL, let's talk about the kind of data we have. Social media data is on average about 1 KB in size, including content and metadata. You're familiar with the content on Facebook and Twitter.

A

It's just a simple phrase or two, or a sentence, maybe some links. That's the content, and the metadata is the stuff around it: the timestamps, who it was posted by, what account it's on, and things like that. The metadata can also vary depending on the platform. Take engagement activity, for instance: for Facebook you have likes, for Twitter you might have followers, and for YouTube subscribers. So social media data is pretty rough and jagged, and you want to store some of it in a flexible database.

A

You actually want to store it in both SQL and NoSQL. You don't want to take all your data out of SQL and store it in NoSQL; you still want to use the right tool for the job. There are some cases where you want to store your data in SQL: things that are fixed length, non-null, and heavily indexed, like the timestamp or the author ID. Those are going to be there for every piece of content.

A

That holds no matter what platform you're on, so it's good to store those things in a relational database. For other things that are more variable in length, commonly null, or accessed only once, where you don't have to worry about joining against another table, you want to store them in a NoSQL solution. Different authors might post a different number of times, and each account might have a different number of authors, so these are all variable-length things that you can store in a NoSQL solution.
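The split described here can be sketched in a few lines of Python. This is only an illustration: the field names (`post_id`, `author_id`, `timestamp`, `likes`, `followers`) and the `split_post` helper are hypothetical and not taken from the talk. The idea is that fields present on every post go to the relational side, and whatever else a platform happens to provide goes to the flexible side.

```python
# Hypothetical field names; the talk only names the timestamp and author ID.
RELATIONAL_FIELDS = ("post_id", "author_id", "timestamp")


def split_post(post):
    """Split a post into always-present fields and platform-specific extras."""
    missing = [name for name in RELATIONAL_FIELDS if post.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Fixed-length, non-null, heavily indexed fields: relational side.
    fixed = {name: post[name] for name in RELATIONAL_FIELDS}
    # Variable, commonly null fields: flexible (NoSQL) side; nulls are dropped.
    flexible = {
        key: value
        for key, value in post.items()
        if key not in RELATIONAL_FIELDS and value is not None
    }
    return fixed, flexible


facebook_post = {"post_id": 1, "author_id": 42, "timestamp": 1412900000,
                 "likes": 17, "link": None}
twitter_post = {"post_id": 2, "author_id": 43, "timestamp": 1412900300,
                "followers": 250}

fb_fixed, fb_flexible = split_post(facebook_post)
tw_fixed, tw_flexible = split_post(twitter_post)
```

Both posts yield the same three relational fields, while the flexible part differs per platform.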

A

When we looked around for a NoSQL solution, we had a couple of requirements in mind. The punch line is that we're at a Cassandra summit, so you kind of know what we chose in the end: Cassandra. As we go along each of these bullet points, I'll say why it fit that use case. So, it's easy to use.

A

It's actually very easy to use. In my case, coming from an academic background, on my very first day, before I even knew MySQL or relational databases, I was put on a task to make a web app using Cassandra as a back end. In one day it was very easy to learn how to put data into it, create a data model, and get data out of it. So it's very easy to use, and on the second day I was making composite columns. We also wanted something that would scale horizontally.

A

If you had a cluster and you fired up a new server, the cluster would magically know that the server is there, and then you have a new node in your cluster. We wanted something that simple, and Cassandra is just that. We also wanted some integrated tools for search, because we rely on our classifiers, so we wanted people to be able to do search and analysis easily; Solr provided that. We wanted operational simplicity, so that all nodes are the same.

A

There's no knock on having a master node, but we didn't want to worry, if a node went down, about whether it was the master node or the gateway. I just wanted everything to handle itself, so the gossip protocol in Cassandra was great. We wanted fantastic enterprise support, and obviously DataStax provided that. I'll talk a little bit about how they helped us in a bit.

A

It's simple to deploy and maintain. A couple of weeks ago on AWS I was able to make five clicks and enter a command line, and my Cassandra cluster went up in two minutes. It took longer for it to fire up than for me to get it to fire up. So it's very easy to deploy. In terms of maintaining it, there have been a few support issues, but not very big and not very long, and few and far between. So that was good. Next, integration with other big data tools.

A

At the time we noticed that there was an integration with Hadoop. CFS, the Cassandra File System, was being actively developed; it's kind of like the HDFS of the Hadoop ecosystem. That was great: you didn't have to worry about doing your batch processing elsewhere and transferring the data over to Cassandra; you could just do it straight on Cassandra. And these days the shiny new tool is Spark, and we're really excited about that as well. All right.

A

So this is a picture from OpsCenter, just to give you an idea of how it looks for us. The previous speaker had a tweet about how, if you're a startup, you have three nodes. That's kind of what we have here: three nodes.

A

We have one node in the east, and the reason we have a multi-region cluster is that if there's an earthquake in Napa on the west coast, you still have your data available in the east. We use m1.large instances and we're about to scale again. We have a separate cluster for dev/test and for production, so that we can throw data at it and see how it works. So yes, DataStax has been extremely helpful in supporting us.

A

Obviously, this is OpsCenter, created by DataStax, and they've been extremely responsive. Just to look at some of the monitoring tools from DataStax: we have about 70 reads a second and about 25 writes a second. You might have read that Cassandra is really good on writes, and in fact it is.

A

We just have a high number of reads because we do real-time analysis on the data that comes in, so our classification requires a bit of reading as the data arrives. That's why the number is high on the reads. Okay. I mentioned before that we have over 100 categories, and one of them is spam. I want to go into a use case of how we detect spam using Cassandra, with a little bit of data modeling. Spam has been around for a long time.

A

You all know what spam is. The first spam was actually sent over the telegraph, so even before email there was spam on the telegraph. Everyone is extremely familiar with spam now. You know not to open an email from someone you don't know, never to download executable files, and if there are words like "Viagra" or "Nigerian prince," you're not going to believe what the email says. The point is that there's a lot of great infrastructure around email spam. Gmail does a great job at that.

A

But social media is the new medium, and attackers and hackers are taking advantage of it, so it's worth talking about what kind of spam there is in social media. You might get something like this: just a simple link, and the link says, visits to my comment. That might be a little obvious because you can read it. The most common types of spam that we see are ways to make money easily, work-from-home schemes, weight loss, and also apps.

A

We also see a lot of spammy apps that promise to do something to your profile that you can't normally do through Facebook, for instance change the color theme on your profile or see who's visited your profile. These are very common spammy apps. But the catch is that sometimes the link is not straightforward. You can't see what's behind the link; there are a lot of link shorteners.

A

For instance, Twitter has a character limit, so you might get a shortened link and not know where it goes. You might click it and land on a phishing site or a malware site. So there's a lot of danger there that people aren't yet as aware of as they are with email. Besides links, you might also get personal-message spam. The attackers can be very clever. They might send you a message.

A

It can be a very personal note to which you reply. Here is an example of the same message being sent to two different accounts. They can send you a message, you can start a conversation, and then it might entice you into falling for one of their traps. So we want to be able to catch these messages using Cassandra, and we'll talk about that in a little bit.

A

But we did release a social media spam report. I encourage you to take a look at it if you want to get educated about social media spam. That was back in 2013, and since then spam has grown about seven times, so it's becoming a real problem. You can create spam signatures to catch this type of content by looking for things like "work from home."

A

But you can only do that after the fact, so creating signatures would take too long and be too slow to catch spam in real time. So, Cassandra to the rescue. How do we do this? I want to walk you through a data model of how we catch spam in real time. Even though Cassandra is a NoSQL solution, you can't just throw data at it and hope that your queries are going to work out, as you've been hearing over and over again.

A

You have to define the data model based on how you're going to query it. For us, we want to determine the number of times a certain piece of content has been posted, because spammers tend to post the same content repeatedly. So how do we do that? A typical table in Cassandra could look like the following.

A

You might have a row key and then the column names, or you might have composite columns. The row key could be, say, the hash of the content that comes in: the MD5 of the social media post could be used as the row key. The column name could then be the unique ID of the post. Each social media platform has an ID associated with its content, and for comments that come later, the ID increases.

A

So if you store the IDs this way, you also have the comments ordered by time. In the column value you could have more information, such as an item ID, which is a variable that we use internally, and the time of the post.
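As a rough illustration of this layout, here is a minimal in-memory Python sketch. It is not Nexgate's schema and does not talk to Cassandra: the `table` dictionary stands in for a wide-row table, and the names `row_key`, `store_post`, and `columns_in_order`, along with the sample values, are made up for the example.

```python
import hashlib
from collections import defaultdict

# In-memory stand-in for a wide-row table: row key -> {column name -> value}.
table = defaultdict(dict)


def row_key(content):
    """Row key: MD5 hex digest of the post content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def store_post(content, post_id, item_id, posted_at):
    """Column name is the platform's post ID; value is (item ID, post time)."""
    table[row_key(content)][post_id] = (item_id, posted_at)


def columns_in_order(content):
    """Columns sorted by post ID, which gives posting order when IDs increase."""
    return sorted(table[row_key(content)].items())


store_post("Work from home and earn $$$", 1002, "item-b", 1412900060)
store_post("Work from home and earn $$$", 1001, "item-a", 1412900000)
store_post("Great product, thanks!", 1003, "item-c", 1412900100)

ordered = columns_in_order("Work from home and earn $$$")
```

Identical content always hashes to the same row, so both copies of the first message land in one row, and sorting by post ID recovers the order in which they were posted.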

A

So this is how we defined our data model in Cassandra, and we'll see how it can help us catch multiplicity posts. Again, spammers typically post the same content over and over, as you saw before. With this data model it's very easy to determine how many times a piece of content has been posted: all you have to do is count the number of columns. Because you've hashed your content into the row key, every time you see a new post with the same content it adds another column. And you'll never double count, because if the ID is the same it will simply overwrite the column.
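The counting and overwrite behavior can be sketched with a plain dictionary standing in for the Cassandra row. This is an assumption-laden illustration, not the production code: `rows`, `record`, and `times_posted` are hypothetical names, and a dict assignment plays the role of a column write.

```python
import hashlib

# row key (MD5 of content) -> {post ID -> (item ID, post time)}
rows = {}


def _key(content):
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def record(content, post_id, item_id, posted_at):
    """Write one column; writing the same post ID again overwrites it."""
    rows.setdefault(_key(content), {})[post_id] = (item_id, posted_at)


def times_posted(content):
    """Number of columns in the row, i.e. distinct posts with this content."""
    return len(rows.get(_key(content), {}))


record("Lose weight fast!", 501, "item-1", 100)
record("Lose weight fast!", 502, "item-2", 160)
record("Lose weight fast!", 502, "item-2", 160)  # same post seen again

count = times_posted("Lose weight fast!")
unseen = times_posted("Never posted")
```

The third write reuses post ID 502, so it replaces the existing column instead of adding a new one, and the count stays at two.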

A

It's pretty fast because you're indexing by the row key, and you can also extract valuable time-series information from this. Remember that we store the item ID and the time of the post. You can look at the time of the post to see if there's a burst of activity from the spammer, or if there are regular intervals of posting. These can all be really great indicators of a spammer.
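One way to turn the stored post times into such indicators is sketched below. The talk does not say how Nexgate measures regularity or activity spikes, so the heuristics and every threshold here (`max_relative_spread`, `window_seconds`, `min_posts`) are assumptions chosen only for illustration.

```python
from statistics import mean, pstdev


def posting_gaps(timestamps):
    """Seconds between consecutive posts, after sorting by time."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_regular(timestamps, max_relative_spread=0.1):
    """True if the gaps between posts are nearly constant (assumed heuristic)."""
    gaps = posting_gaps(timestamps)
    if len(gaps) < 2:
        return False
    average = mean(gaps)
    return average > 0 and pstdev(gaps) / average <= max_relative_spread


def looks_bursty(timestamps, window_seconds=60, min_posts=5):
    """True if at least min_posts posts fall inside one time window."""
    ordered = sorted(timestamps)
    for start in range(len(ordered) - min_posts + 1):
        if ordered[start + min_posts - 1] - ordered[start] <= window_seconds:
            return True
    return False


every_ten_minutes = [0, 600, 1200, 1800, 2400]
rapid_fire = [0, 5, 10, 15, 20]
irregular = [0, 10, 500, 505, 4000]
```

With these sample inputs, `every_ten_minutes` is flagged as regular but not bursty, `rapid_fire` is flagged as bursty, and `irregular` triggers neither check.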

A

The item ID can also tell you if it's the same author posting the content, or maybe different authors, which is more interesting because maybe that's the same person in real life. So you can get a lot of information from just this very simple data model in Cassandra. With that said, Cassandra is not a magic bullet and it won't solve everything.

A

You still need a relational database to glue all the pieces of data together, such as who's the parent of a post if it's a reply to a certain comment. And you don't have batch processing on Cassandra, so you might need to look into other tools like Hadoop. After we implemented this data model, we sat back and watched what happened. So brace yourself, spam is coming.

A

We actually saw a post come in 38 times the day after this was implemented: the same post made over and over, 38 times in a day. If spam is defined as something that's sent over and over again, this is definitely spam. It's something the brand would want to remove from their wall; it's not adding value for anyone. In another case, we saw a customer receive 25,000 of these types of inappropriate messages, and this also helped remove them.

A

So with this simple data model, a lot of value was added, and it's really important to keep all the data. Another way that Cassandra has helped us is by tracking down spammy users, those who post spammy content. To identify spammy users, we have to know all the posts that a person has ever made. We look at each post to see if it's spammy, and if a person has spammed a certain number of times, then they're a spammy user. From that point forward, every time they make a post it's spam.
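The rule for spammy users can be sketched as a small counter. The talk only says "a certain number of times," so the threshold below is a made-up value, and `SpammyUserTracker` with its methods is a hypothetical name; in the setup described, the per-author history would come from the stored posts, not from an in-process counter.

```python
from collections import Counter

SPAM_THRESHOLD = 3  # hypothetical; the talk does not give the real value


class SpammyUserTracker:
    """Counts spammy posts per author and flags authors past a threshold."""

    def __init__(self, threshold=SPAM_THRESHOLD):
        self.threshold = threshold
        self.spam_counts = Counter()

    def is_spammy_user(self, author_id):
        return self.spam_counts[author_id] >= self.threshold

    def classify(self, author_id, content_is_spam):
        """Return True if this post should be treated as spam."""
        if self.is_spammy_user(author_id):
            # Once flagged, every later post from this author counts as spam.
            return True
        if content_is_spam:
            self.spam_counts[author_id] += 1
        return content_is_spam


tracker = SpammyUserTracker(threshold=2)
first = tracker.classify("author-1", content_is_spam=True)
second = tracker.classify("author-1", content_is_spam=True)
after_flag = tracker.classify("author-1", content_is_spam=False)
other_user = tracker.classify("author-2", content_is_spam=False)
```

After two spammy posts, `author-1` crosses the threshold, so the third post is treated as spam even though its content looked clean.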

A

Cassandra is nice because you can throw data at it easily, it's readily accessible, and you can make these kinds of queries on it in real time. Besides the spammy users and the spammy content itself, it's important to keep all of the data to train our 100-plus classifiers. As for tuning Cassandra, it's actually been humming along quite nicely.

A

We've barely had to tweak anything from the default values, and we don't have a lot of deletes; that's just the nature of our data set. Whenever you do deletes, that's when you have to do repairs, because the apparent inconsistencies across the replicas of data are due to deletes. Yeah?

A

Yeah, so it's actually a web app, but you can install it through Facebook, so it would look like a Facebook app. Yeah. Okay, so again, not a lot of tuning needed, and not a lot of deletes for us, which is great, so there's not a lot of intensive disk I/O. There were only a few times that we observed performance issues: when the rates of our reads and writes reached a certain threshold, when the size of the data being inserted was too large, or a heap memory issue with Cassandra 1.1.

A

But in all cases, when we asked DataStax, they jumped on it within a couple of hours and resolved it quickly, so that's been great. The Cassandra community is wonderful. Obviously you guys are all wonderful. It's easy to jump on the IRC channel and talk to fellow users. In one story, we wanted a feature in a certain Ruby wrapper, or driver, that we were using, and I was able to jump on IRC.

A

We had a conversation with the author himself, and he asked me to put in a pull request on GitHub, which I did. A week later, the feature was released in the next version. So it's very easy to talk to people actively developing Cassandra, and you could quickly become a developer as well.

A

So just to end with a couple of things: OpsCenter has been extremely useful in helping debug performance issues, Solr has been useful as we look to train our new categories as they come in, and we're looking forward to using Spark to train on our labeled data with MapReduce. I encourage you to take a look at us if you need to protect your social media accounts. Thanks very much.