Description
Speaker: Evan Chan, Ooyala
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-realtime-analytics-using-cassandra-spark-and-shark-by-evan-chan
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over the standard solutions of today.
So, who am I? I'm a staff engineer at Ooyala; I'll tell you more about Ooyala in just a second. For the last couple of years we've been building real-time, web-scale systems on top of Cassandra, Kafka, Storm, and other big distributed systems like Hadoop, and we're very excited to share with you today some open source projects that we believe are game-changing. My Twitter handle is up there, in case you're interested.
You can all see it there: we're a fairly large Cassandra user. We have 11 clusters ranging in size from 32 to 36 nodes, with about 28 terabytes of data under management. We write over 2 billion events into our Cassandra datastore every day; it's probably closer to three billion now. I think that averages out to about fifteen to twenty thousand events per second. Cassandra powers all of our analytics infrastructure, from recommendations to raw event storage, trend detection, and aggregated analytics, and we have a much, much bigger cluster coming soon. So we're pretty committed to Cassandra.
We have video metrics that are aggregated along several high-cardinality dimensions. For example, we have the URL, the city someone's watching the video in, the user agent (your browser, device type, that sort of stuff). These lookups are very fast because everything is precomputed along all these different dimensions, but the problem is that it's a little bit inflexible.
It would take us a long time to change the pattern of writing this data, and the other problem is that most of these aggregates are actually never read, because it turns out most customers aren't really interested in, say, video metrics sorted by every single city; there are a few key things that users look for. So what happens if we need more dynamic queries? For example, we want the top content for mobile users in France. Right there, that's two different dimensions.
So now you start having cross products. If this weren't France but some city, think about how many different types of browsers there are and how many cities there are; you get this giant cross product. And then let's say we want to do data mining, trend detection, machine learning.
If you look at it, there's really a continuum. On one end, you have one hundred percent of your data precomputed, which is where we are today. It's pretty fast, but it's pretty inflexible; it's good for maybe the top eighty percent of queries. On the other end of the spectrum, you have one hundred percent dynamic queries: you can query for exactly what you want, so it's very flexible, but it's usually pretty slow.
What I think we really want is something in the middle: something that is partly dynamic, where you pre-aggregate the most common queries but keep a system that's flexible and fast enough to run your interesting queries, and where you can easily generate the materialized views that you need. So let's have a quick look at industry trends.
So, Spark is an in-memory distributed computing framework; you can also spill to disk. It was created at UC Berkeley three years ago, and it initially targeted problems that Hadoop MapReduce is bad at: for example, iterative algorithms, where you need to recompute over the same set of data multiple times, or interactive data mining.
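A minimal sketch of that iterative case in Spark's Scala API (not from the talk; the numbers and update rule are invented, and it assumes a current Spark build). The point is that the input is read and cached once, then scanned in memory on every pass, instead of being re-read from disk per iteration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("iter").setMaster("local[*]"))
  val points = sc.parallelize(1 to 100000).map(_.toDouble).cache() // read once, keep in RAM

  var w = 0.0
  for (_ <- 1 to 10) {                     // ten passes over the same cached data
    val grad = points.map(p => p - w).mean()
    w += 0.5 * grad
  }
  println(s"w = $w")
  sc.stop()
}
```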
Let's compare the different paradigms. Look at Hadoop: you have HDFS, you read data out, you map it, it gets written to temporary files on disk and shuffled to the reduce phase, and then you write data back to HDFS. Now, usually a single stage is not enough for your job, so you usually need two or three, which means writing data back to HDFS, reading it back out, and doing this over and over again. Now let's look at Spark.
What happens in Spark is this: let's say you have two different data sources. You read them out; that still happens, that's the I/O. But after that, things happen in memory. I do a map, and then I can use higher-level constructs like join; I can join the two different data sources, and this happens in memory. The really neat thing is that I get this intermediate dataset, and I can cache it; after I cache it, I can keep doing transformations.
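A minimal sketch of that flow in Spark's Scala API (not from the talk; the file names and field layout are invented, assuming CSV lines of the form "userId,value"):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("join").setMaster("local[*]"))

  // Read the two sources; this is the only disk I/O in the pipeline.
  val plays = sc.textFile("plays.csv").map(_.split(",")).map(a => (a(0), a(1))) // (user, video)
  val geo   = sc.textFile("geo.csv").map(_.split(",")).map(a => (a(0), a(1)))   // (user, city)

  val joined = plays.join(geo)  // happens in memory: (user, (video, city))
  joined.cache()                // the intermediate dataset we keep around

  // Further transformations run against the cached intermediate, not the source.
  val byCity = joined.map { case (_, (_, city)) => (city, 1L) }.reduceByKey(_ + _)
  byCity.collect().foreach(println)
  sc.stop()
}
```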
The red lines indicate data flow that happens in memory, and what we find is that for throughput, memory is really king. We did a quick study reading data from Cassandra: with a cold cache it goes about this fast, and with a warmed-up cache we got a speedup of about 7x to 10x. Having a warm cache does wonders for you, but it's really nothing compared to in-memory processing with Spark. The other reason we knew we were headed in the right direction is that developers just love it.
You'll see why I wrote this in a minute. You get the full power of a regular programming language, and you have an interactive shell, which makes it very easy to do ad hoc stuff and figure out what you need out of your data. And testing is extremely easy. Has anybody tried to do real testing with Hadoop jobs? It's not very easy.
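A sketch of why testing is easy: a SparkContext can run in local mode, so a job's logic can be exercised in-process with no cluster at all. Everything below (object names, the word-count logic) is invented for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountLogic {
  // The logic under test: an ordinary function over RDDs.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
}

object WordCountTest extends App {
  // Local mode: the whole "cluster" runs inside this JVM, so the test needs
  // no HDFS, no job jar, and no waiting on a real cluster.
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
  try {
    val counts = WordCountLogic.countWords(sc.parallelize(Seq("a b a"))).collect().toMap
    assert(counts("a") == 2 && counts("b") == 1)
    println("ok")
  } finally sc.stop()
}
```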
You have Spark, which is the distributed framework. Underneath it is an exciting project called Tachyon. Tachyon is like an in-memory acceleration layer for HDFS; you can actually use it without Spark, and it gives you memory-speed access to your HDFS files. And then, on top of Spark, they've managed to build some pretty interesting projects, which kind of goes to show the flexibility of Spark.
One is Bagel; sorry for the pixelated image. Has anybody heard of Google Pregel? Pregel is Google's gigantic distributed graph processing engine, and they've built a system like that on top of Spark. There's Shark, which I will talk about and demo in a minute; this is HiveQL on Spark. And finally, there's Spark Streaming, which does mini-batch stream processing.
So, more on Shark. Shark is built around Hive. How many people here use Hive? OK, a fair number. It's 100 percent HiveQL compatible, but it's about 10x to 100x faster; you can get answers in seconds, and there's no need to compile an MR job. You can reuse all of your existing UDFs, storage handlers, and SerDes, and if you use DataStax Enterprise, you can use CassandraFS for your Hive metastore, or you can just use HDFS.
There's also a really neat feature: Scala and Java integration. I mean, HiveQL is good for a lot of things, but for more complex creation of materialized views you need to create UDFs, package them, and ship them. I personally would prefer to use a real programming language to create some of these jobs, so having this integration is actually really, really powerful. But anyway, enough about Spark and Shark; what we really want to get to is how we're using them.
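Roughly what that integration looked like in the Shark of that era, reconstructed from memory; treat the entry points (SharkEnv, SharkContext, runSql, sql2rdd) as approximate rather than authoritative, and the table name as invented:

```scala
import shark.{SharkContext, SharkEnv}

object SharkSketch extends App {
  // Entry point reconstructed from the Shark era; names are approximate.
  val sc: SharkContext = SharkEnv.initWithSharkContext("shark-sketch")

  // Plain HiveQL, executed by Spark instead of compiled into an MR job.
  sc.runSql("SELECT COUNT(*) FROM raw_events")

  // The powerful part: get a query result back as an RDD, so the rest of
  // the job is ordinary Scala instead of packaged-and-shipped UDFs.
  val rows = sc.sql2rdd("SELECT provider FROM raw_events")
  println(rows.map(_.toString).distinct().count())
}
```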
So, we store raw events. What we call raw events are, let's say you're watching a video on your mobile phone or desktop: we have a player, in Flash and in JS, that's constantly sending events to our servers. Those are what we call raw events; they're raw video pings. From there, we have a real-time ingestion engine that writes these events into Cassandra within a matter of seconds, and then everything we want to do works from Cassandra.
This is our stack. You have Cassandra nodes, and on top of the Cassandra nodes we've written a custom input format that reads from our schema. On top of the input format there's something called a SerDe; if you don't know what a SerDe is, it's basically Hive's way of mapping Hadoop data into a Hive table, and we'll talk about that in a sec. And on top of that sits Spark, controlled by a master and a job server.
Just a little bit about our schema. We have a time-series schema, and this might look familiar to a lot of you: you have a user, the user has events coming in, and every column is a different point in time. Pretty standard. We have, however, split the attributes out into a separate column family.
The reason we do that is that a lot of the data, like your user agent and your IP address, remains pretty much static during the whole day, or whatever the period is, and if you were to write that again and again into every event, you'd be wasting a lot of storage. So we separate it out, and this also gives us an extremely nice way of doing indexing and filtering.
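Not real Cassandra DDL, just the shape of the two column families as plain Scala data, to make the layout concrete (all names and values invented):

```scala
// Wide rows keyed by user, with event columns keyed by timestamp...
val events: Map[String, Map[Long, String]] = Map(
  "user1" -> Map(1370000000L -> "play", 1370000005L -> "pause")
)
// ...and the mostly-static attributes stored once, in a separate column
// family, instead of being repeated on every single event.
val attributes: Map[String, Map[String, String]] = Map(
  "user1" -> Map("user_agent" -> "Mozilla/5.0", "ip" -> "10.0.0.1")
)
```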
What we get out of this is all the data for one user stored consecutively, which opens up a lot of interesting data processing for us. Really quickly: if you are interested in writing your own input format, it's important to know which API to target. There's a new input format API and an old API; Hive only supports the old API, and Cascading may need something else. You should also be prepared to spend time tuning your split computation.
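For orientation, here is a bare-bones skeleton of the old (org.apache.hadoop.mapred) API that Hive and Shark consume; the class name is invented and all the Cassandra specifics are elided to comments:

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred._

class EventInputFormat extends InputFormat[Text, Text] {

  // Split computation is the part worth tuning: e.g. one split per token
  // range, or finer, depending on how evenly you want work spread out.
  override def getSplits(conf: JobConf, numSplits: Int): Array[InputSplit] = {
    // ... ask the cluster for token ranges, wrap each one as an InputSplit
    Array.empty
  }

  override def getRecordReader(split: InputSplit, conf: JobConf,
                               reporter: Reporter): RecordReader[Text, Text] =
    new RecordReader[Text, Text] {
      override def next(key: Text, value: Text): Boolean = {
        // ... page rows from Cassandra for this split; false when exhausted
        false
      }
      override def createKey(): Text = new Text()
      override def createValue(): Text = new Text()
      override def getPos(): Long = 0L
      override def getProgress(): Float = 0f
      override def close(): Unit = ()
    }
}
```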
Finally, Hive has an interesting feature called predicate pushdown, where Hive can actually push WHERE-clause information down into the input format. The key thing in these jobs is that you want to reduce the amount of data you're reading: you don't want Hive, or Shark in this case, to be doing the filtering in memory; you want to minimize the number of rows you read from Cassandra. So that's pretty important too.
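Illustrative only: Hive's real pushdown plumbing goes through its storage-handler interfaces, which hand the input format a serialized filter expression. The property key and the tiny "parser" below are made up for the sketch; the idea is simply that the reader narrows its Cassandra scan instead of letting Shark filter whole rows in memory:

```scala
import org.apache.hadoop.mapred.JobConf

object PredicatePushdown {
  val FilterKey = "example.pushed.filter" // invented property name

  // Read the pushed-down predicate from the job conf and turn it into a
  // tighter slice range for the Cassandra reads.
  def timeRange(conf: JobConf): (Long, Long) =
    Option(conf.get(FilterKey)) match {
      case Some(f) if f.startsWith("ts>=") =>
        (f.stripPrefix("ts>=").toLong, Long.MaxValue)
      case _ =>
        (Long.MinValue, Long.MaxValue) // nothing pushed down: full scan
    }
}
```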
So here's an example of OLAP processing. You start with the raw events in Cassandra, we create the OLAP aggregates, which are the materialized views, and then we do a union. Say we want a certain time range; these represent different time ranges, so maybe this one is an hour ago, this one is half an hour ago, this one is now. We do a union of the data, and then we can run different queries against that unioned data.
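A sketch of that union step in Spark (not from the talk; the Agg fields are invented): each pre-built aggregate covers one time window, and a query over a longer range is just the union of the cached window RDDs.

```scala
import org.apache.spark.rdd.RDD

case class Agg(provider: String, plays: Long)

// windows might be [an hour ago, half an hour ago, now]
def queryRange(windows: Seq[RDD[Agg]]): RDD[(String, Long)] = {
  val all = windows.reduce(_ union _)                // stitch the time ranges together
  all.map(a => (a.provider, a.plays)).reduceByKey(_ + _) // then query across them
}
```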
For example, reading the data out of Cassandra with a cold cache takes, let's say, about two minutes; with a warm cache, about 20 or 30 seconds; and then we generate the aggregates in memory. Reading the aggregates in memory takes way less than a second, so you can imagine this opens up a huge number of possibilities.
So I'm going to go into the workflow a little bit. What we do is we have a REST job server that sits in front of Spark. This is our Spark-as-a-service layer, and it also gives us a nice boundary between teams, because teams can use a simple REST API to submit jobs. We start up an aggregation job that pulls data from Cassandra and writes it into a dataset.
The dataset is spread across nodes, so each of these boxes represents one worker node. Then a query comes in, which starts a query job: a distributed job that reads from the dataset, does a bunch of number crunching, and gives the result back. And you can do this repeatedly.
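A hypothetical sketch of what a job submitted to such a Spark-as-a-service layer might look like; this trait and the class below are invented for illustration, not Ooyala's actual API. The key design point is that the server owns a long-lived SparkContext (and the datasets cached inside it), and a REST call just names the job class and its parameters:

```scala
import org.apache.spark.SparkContext

trait SubmittedJob {
  def run(sc: SparkContext, params: Map[String, String]): Any
}

class TopProvidersJob extends SubmittedJob {
  def run(sc: SparkContext, params: Map[String, String]): Any = {
    val k = params.getOrElse("k", "10").toInt
    // In a real job this would read the dataset a previous aggregation job
    // left cached in this shared context, crunch it, and return top-k rows.
    k
  }
}
```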
Spark has something built in called lineage: it remembers the steps it took to transform the data, so when data goes missing, it can always go back to the source, which in our case is Cassandra, and redo the computation for you. But this is fairly slow; if the recomputation takes two minutes, or even 30 seconds, that's not really an acceptable delay.
Yep, so I just want to show you guys a really quick demo. I'm sorry, I really wanted to do this demo live, but it was really difficult to get the equipment cooperating, so instead I made a little QuickTime video and some screenshots. So, if you'll excuse me, this is actually just running on my laptop; I didn't really want any network problems. We're going to demonstrate how to create a table from Cassandra with our input format and query data out of it really quickly.
To create a table from Cassandra, you basically do a CREATE EXTERNAL TABLE, which creates basically a table view over the Cassandra data. The key points are that you need to set the storage handler that we use, which wraps the custom input format I told you about, and then in the table properties you point it at the cluster.
OK, so these are the Cassandra nodes that we want, the port, the keyspace, the start time, and the end time. It's actually very similar to the standard Cassandra input format, if anybody's used that. Now, that doesn't actually read data out of Cassandra; it just defines the table. And actually, if I go back, the most important part is that there are fields here: the fields I provided earlier are actually fields in our JSON (well, it's actually not JSON anymore), fields in the event map.
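Roughly what that DDL looked like, issued through a Shark context like the `sc` in the earlier sketch; the storage-handler class, column names, and property keys are invented stand-ins for Ooyala's internal ones:

```scala
sc.runSql("""
  CREATE EXTERNAL TABLE raw_events (userid STRING, video STRING, city STRING)
  STORED BY 'com.example.CassandraEventStorageHandler'
  TBLPROPERTIES (
    "cassandra.nodes"    = "node1,node2",
    "cassandra.port"     = "9160",
    "cassandra.keyspace" = "events",
    "start.time"         = "2013-06-10T00:00",
    "end.time"           = "2013-06-11T00:00"
  )
""")
```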
So every time you create a table, you're saying: I want these fields, these views onto the fields. And then here's the magic: I create a cached table, and this will actually pull data out of Cassandra. In this case it's a little bit of a trivial example, since I'm creating a table exactly like the one in Cassandra, except that it's cached in memory; in reality, you would probably be doing some processing on it. And in this video, I'm going to count the rows in my cached table.
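A sketch of that step, continuing the invented names above. In the Shark of that era, a "_cached" suffix on the table name (or a shark.cache table property) marked a table as in-memory; treat that detail as recalled, not quoted from the talk:

```scala
// Materialize the Cassandra-backed table in cluster memory.
sc.runSql("CREATE TABLE raw_events_cached AS SELECT * FROM raw_events")

// This count now runs against memory, not Cassandra.
sc.runSql("SELECT COUNT(*) FROM raw_events_cached")
```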
You wouldn't want to try doing that directly in Cassandra, right? And let's say I want to group my events by provider, which is, you know, a customer. This comes back very fast, and now I have counts by provider. You can see how fast and interactive this is, which is amazing. But let's say I want to do a top-K, because I don't know offhand which providers are the most interesting ones.
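A guess at the demo's follow-up queries, in standard HiveQL via the same invented table name: counts per provider, then top K by ordering and limiting.

```scala
// Group-by: counts per provider, answered from the in-memory table.
sc.runSql("SELECT provider, COUNT(*) AS cnt FROM raw_events_cached GROUP BY provider")

// Top-K: same aggregate, ordered and limited.
sc.runSql("""
  SELECT provider, COUNT(*) AS cnt
  FROM raw_events_cached
  GROUP BY provider
  ORDER BY cnt DESC
  LIMIT 10
""")
```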
[Audience question] Why did you choose Spark and Shark over Cloudera Impala? Did you all hear the question? The question was: why did we choose Spark and Shark over Impala? That's a good question. One reason is that we have a big investment in Cassandra, and we want to stick with our Cassandra investment and keep the data there. The other reason is that we believe Spark and Shark are a lot more general-purpose: we can very easily write Scala, Java, or Python programs to crunch our data.
We don't think that SQL gives you anywhere near the same kind of productivity. Maybe one more question, and I'll be happy to take the rest later. [Audience question] Do you use Bagel for any of your applications? Sorry, sir, I can't hear you. Do you use Bagel for any of your applications? Did you say Pregel? Oh, Bagel! No, we haven't used that yet; that would be pretty interesting though. Yeah, thanks. I'd be happy to answer more questions afterward.