From YouTube: Nexgate: Social Media Security Company Nexgate Relies on Cassandra for Fraud Detection
Description
Speaker: Harold Nguyen, Senior Data Scientist at Nexgate
In this talk, we focus on a use case by showing how Cassandra can detect spam and spammers on social media. We also show how we use Cassandra to train our 100+ social-media-security classifiers. The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we've ever scanned, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This talk is about how Datastax and Cassandra make it easy.
A
So, really excited to be here. I'm a data scientist at Nexgate, and my background's in particle physics. In the Valley, people sometimes call that YAPP, I've heard that before, which is "yet another particle physicist". I'm going to be talking about Cassandra here at Nexgate, and it's going to be kind of a story about how we moved from SQL to a NoSQL solution.
A
So what is Nexgate? I just want to give you some background first. Does anyone know: suppose you were a Fortune 100 company, how many social media accounts do you think represent your brand? Let's just focus on Facebook pages first. Does anyone want to have a guess at how many Facebook pages your brand might be represented by? Just a quick guess at a number. Maybe two to ten?
A
Five? Yeah. Well, it turns out to be around 300 accounts, and that's just Facebook! When you throw in Twitter, LinkedIn, Google+, and YouTube, that's a lot more. So imagine trying to sift through all this content and classify it manually. There's a lot of stuff on there that doesn't belong there. You have bad actors on Facebook pages who are trying to harm other users.
A
That means your audience members on your Facebook page, and so on. So they're diluting the message that your brand's trying to present to the audience. Nexgate is a technological solution that helps you discover the accounts that you have and helps you protect them. You might have seen some hacked accounts in the news lately, like the White House: a Twitter account was hacked a couple of years ago, and there was a bomb threat. Celebrity Twitter accounts get hacked all the time. So we help protect against that, and we also monitor them.
A
So we're going to talk about how we use Cassandra. We're a small company, but collectively the co-founders and employees have dozens of years of experience in security. And over a year and a half, these are the types of customers that are taking social media very seriously. So that's pretty good, I think, for being around for a year and a half. Talking about the scale of the data a little bit: there are over 350 million pieces of social media content spread across Facebook, YouTube, LinkedIn, Twitter, etc.
A
And if you had asked me four months ago what that number would be, it'd be 250 million, so we're growing. The data that's coming in is growing exponentially. The rate today is 11 and a half million new pieces of content per day, and it's all classified in real time as it comes in. There are 65 million total social media authors, meaning people who are posting on social media, and it's about a quarter of a million new authors a day. So that's the kind of data.
A
Just to give you an idea. All right, so the machine learning experts and statisticians in the room know that in order to have a good classification system, you need to have a lot of data. And in order to have a lot of data, you also need a strong infrastructure. I like to give a quote by Rich, our CTO, which is that the completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.
A
So that means, you know, think of email: spam and ham. It takes a lot of data to be able to do that correctly. So imagine if you have a hundred categories: not only do you need breadth, but you also need depth for those categories as well. So you need to collect a lot of data, and you need to have a strong and capable infrastructure in order to do so.
A
So we'll talk about how we got to that infrastructure. In the very beginning we threw everything into MySQL, and, you know, why not? It's easy to use. As a start-up, you need to launch quickly. You want to use a tool that gets the job done, and many people already know it, so it's easy to hire anyone off the street and they'll be able to use MySQL. It's very, very easy to use, and also very secure. You know, the banks are using SQL. It's been around for a while.
A
A lot of the security issues are known. It's also inexpensive. You can't do much better than free with MySQL, unless people pay you to use it. It manages memory very well, and up to 50 million rows you can have pretty fast queries, so starting out it's a great solution. It also supports several development interfaces.
A
So there's this: one does not simply ALTER TABLE in MySQL. After several months we realized that our data model wasn't handling all the scenarios and we needed something else; we needed to move to a NoSQL solution. And to talk about the kind of data we have before we talk about NoSQL: a piece of social media data is, on average, about 1 KB in size. That includes content and metadata. You're familiar with the content on Facebook and Twitter.
A
It's just a simple phrase or two, or a sentence, maybe some links. So that's the content, and the metadata is the stuff around that: the timestamps, who it's posted by, what account it's on, and things like that. The metadata can also vary depending on what platform you're on. For instance, take engagement activity: for Facebook you have likes, for Twitter you might have followers, and for YouTube subscribers. So social media data is pretty rough and jagged, and you want to store some of it in a flexible database.
A
So you actually want to store data in both SQL and NoSQL. You don't want to take all your data out of SQL and store it in NoSQL. You still want to use the right tool for the job. There are some cases where you want to store your data in SQL. These are things that are fixed-length, non-null, and heavily indexed, like the timestamp or the author ID. It's going to be there for every piece of content
A
you have, no matter what platform, so it's good to store that stuff in a relational database. Other things that are more variable-length, commonly null, and accessed only once, where you don't have to worry about joining against another table, you want to store in a NoSQL solution. Different authors might be posting a different number of times. Each account might have a different number of authors. So these are all variable-length values that you can store in a NoSQL solution.
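As a rough illustration of the split described above, the following Python sketch separates a post into its fixed, always-present fields (the kind suited to a relational table) and its variable, often-missing fields (the kind suited to a NoSQL store). The field names and the `split_post` helper are hypothetical and are not taken from Nexgate's schema.

```python
# Fields assumed (hypothetically) to be present on every post, on every platform.
RELATIONAL_FIELDS = ("post_id", "author_id", "timestamp")


def split_post(post):
    """Split a raw post dict into a fixed relational part and a flexible part."""
    missing = [name for name in RELATIONAL_FIELDS if name not in post]
    if missing:
        raise ValueError(f"post is missing required fields: {missing}")
    relational = {name: post[name] for name in RELATIONAL_FIELDS}
    # Everything else is platform-specific and may be absent (likes, followers, ...).
    flexible = {
        key: value
        for key, value in post.items()
        if key not in RELATIONAL_FIELDS and value is not None
    }
    return relational, flexible


facebook_post = {
    "post_id": "fb-1", "author_id": "a-7", "timestamp": 1410000000,
    "text": "Nice page!", "likes": 12, "followers": None,
}
relational_row, flexible_doc = split_post(facebook_post)
```

Here `relational_row` keeps only the three fixed fields, while `flexible_doc` keeps the text and the like count and drops the null `followers` value, so a sparse store never has to hold empty columns.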
A
So when we looked around for a NoSQL solution, we had a couple of requirements in mind. I mean, the punch line is that we're at a Cassandra summit, so you know what we chose in the end: it's Cassandra. As we go through each of these bullet points, I'll say why it fit that use case. So: it's easy to use.
A
It's actually very easy to use because, in my case, coming from an academic background, on my very first day, before I even knew MySQL or relational databases, I was put on a task to make a web app using Cassandra as a back end. In one day it was very easy to learn how to put data into it, create a data model, and get data out of it. So very, very easy to use. And on the second day I was making composite columns. We wanted something that would scale horizontally, so, like,
A
if you had a cluster and you had a server and you fired it up, the cluster would magically know that server's there, and then you have a new node in your cluster. So we wanted something where scaling was that simple, and Cassandra is just that. We also wanted some integrated tools for research, because we rely on our classifiers, so we wanted people to be able to do search and analysis easily, and Solr provided that. We wanted operational simplicity, so that all nodes are the same.
A
It's simple to deploy and maintain. A couple of weeks ago on AWS I was able to make five clicks and put in a command line, and my Cassandra cluster went up in two minutes. So it took longer for it to fire up than for me to get it to fire up. So it's very easy to deploy. In terms of maintaining it, there have been a few support issues, but not very big and not very long, and few and far between, so that was good. Integration with other big data tools:
A
At the time we noticed that there was an integration with Hadoop. CFS, the Cassandra File System, was being actively developed; it's kind of like the version of HDFS in the Hadoop ecosystem. So that was great. You didn't have to worry about doing your batch processing and transferring the data over to Cassandra; you could just do it straight on Cassandra. And these days the shiny new tool is Spark, and we're really excited about that as well. All right.
A
We have one node in the east, and the reason why we have multi-region clusters is that, you know, if there's an earthquake in Napa on the West Coast, you still have your data available in the east. We use m1.large instances and we're about to scale again, and we have separate clusters for dev, test, and production so that we can throw data at it and see how it works. So yeah, DataStax has been extremely helpful in supporting us.
A
Obviously, this is OpsCenter, created by DataStax, and they've been extremely responsive. Just to look at some of the monitoring tools from DataStax: we have about 70 reads a second and about 25 writes a second. You might have been reading that Cassandra is really good on writes, and in fact it is.
A
We just have a high number of reads because we do real-time analysis on the data that comes in, so we require a bit of reading as the data comes in and in our classification, and that's why the number is high on the reads. Okay. So I mentioned before that we have over 100 categories, and one of them is spam. I want to go into a use case of how we detect spam using Cassandra, and do a little bit of data modeling here. So spam's been around for a long time.
A
You all know what spam is. The first time was over the telegraph, actually, so even before email there was spam on the telegraph. And everyone is extremely familiar with spam now. You know not to open an email from someone you don't know, never to download any executable files, and if there are words like "Viagra" or "Nigerian prince", you know you're not going to believe what the email says. So the point is that there's a lot of great infrastructure around it. Gmail does a great job at that.
A
But social media is the new medium, and attackers and hackers are taking advantage of social media, so it might be worth talking about what kind of spam there is in social media. You might get something like this: just a simple link, and the link says "visits to my comment". That might be a little obvious because you can read that. The most common types of spam that we see are actually ways to make money easily, work-from-home schemes, weight loss, and also apps.
A
We also see a lot of spammy apps, where the app promises that it can do something to your profile that you can't normally do through Facebook: for instance, change the color theme on your profile or see who's visited your profile. So there are a lot of common spammy apps. But the catch is that sometimes the link is not straightforward. You can't see what's behind the link. There are a lot of link shorteners.
A
For instance, Twitter has the character limit, so you might get a shortened link and you don't know where it's going. You might click it and go to a phishing site or a malware site. So there's a lot of danger there. People aren't yet as aware as they are with email. Besides links, you might also get some personal message spam. The attackers can be very clever. They might send you
A
a very personal note to which you reply. Here is an example of a message that has been sent to two different accounts, even though it's the same message. They can send you a message and you can start a conversation, and then it might entice you into falling for one of their traps. So we want to be able to catch these messages using Cassandra, and we'll talk about that in a little bit.
A
But we did release a social media spam report. I encourage you to take a look at it if you want to get educated about social media spam. That was back in 2013, and since then it's grown about seven times, so spam is becoming a real problem. You can create spam signatures to catch this type of content by looking at things like "work from home", or things like that.
A
But you can only do that after the fact, so creating signatures would take too long, and it'd be too slow to catch spam in real time. So, Cassandra to the rescue. How do we do this? I want to walk you through a data model of how we catch spam in real time. Even though Cassandra is a NoSQL solution, you can't just throw data in and hope that your queries are going to work out, as you've been hearing over and over again.
A
You have to define the data model based on how you're going to query it. For us, we want to determine the number of times a certain piece of content has been posted, because spammers, as written in the paper, tend to post the same content messages. So how do we do that? A typical table in Cassandra could look like the following.
A
You might have a row key and then the column names after it, or you might have composite columns. The row key could be, say, the hash of the content that comes in, so the social media post: that could be MD5-hashed and put in as the row key. Then the column name could be the unique ID of the post. Each social media platform has an ID associated with its content. The comments come with one as well; for comments that come later, the ID increases.
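A minimal in-memory sketch of the layout just described, assuming a plain Python dictionary as a stand-in for the Cassandra table: the row key is the MD5 of the post content, each column name is a post ID, and the column value is the time of the post. The helper names and the whitespace and case normalization are assumptions for illustration, not details from the talk.

```python
import hashlib
from collections import defaultdict

# Stand-in for a wide row: row key -> {column name: column value}.
# Here: MD5 of the content -> {post ID: time of the post}.
content_rows = defaultdict(dict)


def content_key(text):
    """Row key: MD5 hex digest of the post text (normalization is an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def record_post(text, post_id, posted_at):
    """Store the post under its content hash; return how often that content was seen."""
    row = content_rows[content_key(text)]
    row[post_id] = posted_at
    return len(row)


record_post("Work from home and earn $$$", "1001", 1410000000)
record_post("Nice photo!", "1002", 1410000050)
times_seen = record_post("work from home  and earn $$$", "1003", 1410000100)
```

After these three calls `times_seen` is 2: the first and third posts hash to the same row key, so counting the columns in that row gives the number of times the content has been posted.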
A
So it's pretty fast because you're indexing by the row key, and you can also extract valuable time series information from this too. Remember that we store the item ID and the time of the post. You can look at the time of the post to see if there are bursts of activity from the spammer, or if there are regular intervals of posting. These can all be really great indicators of spam. The item ID
A
can also tell you if it's the same author that's posting the content, or maybe it's different authors, which is more interesting, because maybe that's the same person in real life. So you can get a lot of information from just this very simple data model in Cassandra.
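The timing signals mentioned above (bursts of activity and regular posting intervals) could be checked along these lines. This is only a sketch: the `posting_signals` function and its thresholds (`burst_window`, `burst_size`, `jitter`) are hypothetical choices, not values from the talk.

```python
from statistics import mean, pstdev


def posting_signals(timestamps, burst_window=60, burst_size=5, jitter=0.1):
    """Return (has_burst, is_regular) for the post times of one piece of content."""
    times = sorted(timestamps)
    # Burst: at least `burst_size` posts inside any `burst_window`-second span.
    has_burst = any(
        times[i + burst_size - 1] - times[i] <= burst_window
        for i in range(len(times) - burst_size + 1)
    )
    # Regular: gaps between consecutive posts barely vary relative to their mean.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    is_regular = (
        len(gaps) >= 3
        and mean(gaps) > 0
        and pstdev(gaps) / mean(gaps) <= jitter
    )
    return has_burst, is_regular


hourly = [3600 * n for n in range(6)]  # one post every hour
flood = [0, 5, 9, 20, 31, 4000]        # five posts within 31 seconds
hourly_signals = posting_signals(hourly)
flood_signals = posting_signals(flood)
```

With these inputs, the hourly series is flagged as regular but not as a burst, and the flood series is flagged as a burst but not as regular. In practice the timestamps would be the column values read back from one content row.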
With that said, Cassandra is not a magic bullet, and it won't solve everything.
A
So you still need a relational database to glue all the pieces of data together, such as who's the parent of that post, if it's a reply to a certain comment, or something like that. You also don't have batch processing on Cassandra, so you might need to look into other tools like Hadoop. After we implemented this data model, we sat back and saw what happened. So brace yourself, spam is coming.
A
We actually saw a post that came in 38 times the day after this was implemented. So this was a post that they made over and over, 38 times in the day. If spam is defined as something that's sent a lot, over and over again, then this is definitely spam. It's something that the brand would want to remove from their wall. It's not adding any value to anyone. In another case, we also saw a customer receive 25,000 types of inappropriate messages, and it also helped remove them.
A
So, with this simple data model, a lot of value was added, so it's really important to keep all the data. Another way that Cassandra has helped us is by tracking down spammy users, meaning those that post spammy content. To identify spammy users, we have to know all the posts that a person has ever made. What we do is look at each post to see if it's spammy. If a person has spammed a certain number of times, then they're a spammy user. From that point forward, every time they make a post, it's spam.
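The rule just described can be sketched as a small counter in Python. The threshold value and the names here are hypothetical (the talk only says "a certain number of times"), and the `looks_spammy` flag stands in for the output of a spam classifier.

```python
from collections import Counter

SPAM_POST_THRESHOLD = 3  # hypothetical cutoff; the talk does not give the number

spam_counts = Counter()
spammy_users = set()


def classify_post(author_id, looks_spammy):
    """Return True if the post should be treated as spam."""
    if author_id in spammy_users:
        return True  # known spammy user: every later post counts as spam
    if looks_spammy:
        spam_counts[author_id] += 1
        if spam_counts[author_id] >= SPAM_POST_THRESHOLD:
            spammy_users.add(author_id)
        return True
    return False


results = [classify_post("author-1", flag) for flag in (True, True, True, False)]
clean = classify_post("author-2", False)
```

Here the fourth post from `author-1` is treated as spam even though it does not look spammy on its own, because the author crossed the threshold on the third post. Keeping every post an author has ever made is what makes this per-author count possible.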
A
So Cassandra is nice because you can throw data at it easily, it's readily accessible, and you can make these kinds of queries on it in real time. Additionally, besides the spammy users and the spammy content itself, it's important to keep all of the data to train our 100-plus classifiers. So, tuning Cassandra: it's actually been humming along quite nicely.
A
Yeah, so it's actually a web app, but you can install it through Facebook, so it would look like a Facebook app. Yeah. OK, so again, not a lot of tuning needed, and also not a lot of deletes for us, so that's great; there's not a lot of intensive disk I/O. There were only a few times that we observed performance issues: when the rates of our reads and writes reached a certain threshold, when the size of the data being inserted was too large, or with a heap memory issue in Cassandra 1.1.
A
But in all cases, when we asked DataStax, they jumped on within a couple of hours and resolved it quickly. So that's been great. The Cassandra community is wonderful. Obviously you guys are all wonderful. It's easy to jump on the IRC channel and talk to fellow users. For one story: we wanted a feature in a certain Ruby wrapper, or driver, that we were using, and I was able to jump on IRC, and the author came on.
A
We had a conversation with the author himself, and he asked me to put in a pull request on GitHub, which I did, and then a week later the feature was released in the next version. So it's very easy to talk to people actively developing Cassandra, and you could quickly become a developer as well.
A
So just to end with a couple of things: OpsCenter has been extremely useful in helping debug performance issues, Solr's been useful in looking for data to train our new categories as they come in, and we're looking forward to using Spark to train on our labeled data with MapReduce. So I encourage you to take a look at us if you need to protect your social media accounts. Thanks very much.