ABOUT DATA COUNCIL:
Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
FOLLOW DATA COUNCIL:
Twitter: https://twitter.com/DataCouncilAI
LinkedIn: https://www.linkedin.com/company/datacouncil-ai
Facebook: https://www.facebook.com/datacouncilai
Eventbrite: https://www.eventbrite.com/o/data-council-30357384520
And a little demo. By the way, I presented this talk at the Cassandra Summit, and that's available online as well if you're interested in the slides. There are a few updates here, but it's mostly the same. So who are we? Our motto is: we power personalized video experiences across all screens. If you watch a video on ESPN.com, Victoria's Secret, or REI, you've probably used our technology without realizing it.
We're a fairly large Cassandra user. We have 12 clusters, ranging from 3 nodes to 107 nodes, something like 28 terabytes under management, and easily over 2 billion Cassandra column writes per day. It powers most of our analytics infrastructure, from the standard slice-and-dice aggregates that customers use, to discovery, recommendations, and trend analysis.
What we have today is pre-computed aggregates. We have video metrics that are computed along several dimensions, such as geography (country, city), device type, browser, platform, and so on. So this is fine. If I'm ESPN and I want to look up, say, what my top videos in the US are, or what my top videos by device are, whether it's mobile or a tablet, whatever...
That's very easy: we just pull a key out of Cassandra; it's pre-computed. So this is nice, and it's very fast. The downside is that it's very inflexible, it's very difficult to change, and most of the aggregates and indexes are in fact never read. For example, we store a sorted list of, say, videos by city.
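To make that pattern concrete, here is a minimal sketch of a precomputed-aggregate lookup; the key layout and the `fetchRow` client function are hypothetical stand-ins, not our actual code:

```scala
// Hypothetical key scheme for a precomputed aggregate: one Cassandra row
// per (account, metric, dimension-combination), with columns pre-sorted.
case class AggregateKey(account: String, metric: String, dims: Map[String, String]) {
  def rowKey: String =
    s"$account/$metric/" +
      dims.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString("/")
}

// Reading is a single row fetch: O(1) and fast, but only combinations that
// were pre-built at write time can ever be served.
def topVideos(fetchRow: String => Seq[(String, Long)],
              key: AggregateKey): Seq[(String, Long)] =
  fetchRow(key.rowKey)
```

So "top videos in the US on mobile for ESPN" resolves to one key, something like espn/top_videos/country=US/device=mobile, and one read.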
Most people are only interested in the top few cities; very few people are interested in the tail end of things. And what if we need more dynamic queries, where we mix several dimensions together, like top content for mobile users in France? I want to do data mining, a little bit more interesting stuff. When I start to move to the more advanced use cases... excuse me for a minute.
There we go, okay. So in the more advanced use cases, this model of pre-computing these aggregates starts to break down. For example, let's say you have a hundred thousand cities, and let's say you have a thousand browser combinations. Just doing the cross product of those two dimensions alone gets you a hundred million combinations, which is way, way too big.
A
So,
basically,
imagine
that
there's
a
continuum
a
scale
between
static
between
pre-computation
and
dynamic
queries
right,
so
you
can
think
of
it
as
on
one
end,
you
can
pre-compute
everything
and
and
that's
what
your
customers
who
look
up.
So
this
is
nice,
but
it's
inflexible.
On
the
other
hand,
you
can
do
dynamic
queries
over
your
data.
This
might
be
like
you
know
slow,
but
this
is
very
flexible
but
might
be
very
slow,
and
what
we
really
want
is
to
kind
of
like
have
a
a
medium
where
you
can
pre-compute
your
common
queries.
You can have flexible and fast dynamic queries, and easily generate a lot of materialized views to suit your different needs. So what you really want is something in between. Now, if we look at some industry trends, there are new technologies coming out for fast query execution: Cloudera Impala, Drill as the new incubating project, and you've probably all heard about Presto. You have in-memory databases, you have a lot of streaming and real-time frameworks coming out, and even for traditional Hadoop you're starting to see faster execution engines.
So let's recap what the Hadoop paradigm is. You have a distributed file system, HDFS, or any number of replacement DFSs. When you have a MapReduce phase, you read data from HDFS, then after you shuffle the data you do a reduce, and you write the data back to HDFS. Then you read it again. Now, for simple jobs that might be okay.
The more complex a job is, the more stages you have, and so you find that you're reading and writing from HDFS a lot, and that quickly starts to dominate your I/O. The Spark paradigm is: let's say I have two data sources. The first thing you notice is that you're not limited to just map and reduce; you have higher-level operators, such as join, so I can take...
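As an illustration of those higher-level operators, here is a minimal Spark join in Scala; the file paths and record layouts are made up for the example:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in pair-RDD operators such as join

val sc = new SparkContext("local[4]", "join-demo")

// Two sources: "videoId,userId" view records and "videoId,title" metadata
val views  = sc.textFile("hdfs://nn/path/views").map(_.split(',')).map(a => (a(0), a(1)))
val titles = sc.textFile("hdfs://nn/path/titles").map(_.split(',')).map(a => (a(0), a(1)))

// One high-level operator instead of a hand-built MapReduce stage; the
// shuffled intermediates never round-trip through HDFS.
val joined = views.join(titles)   // RDD[(videoId, (userId, title))]
joined.take(5).foreach(println)
```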
This is a dramatic increase, maybe 8x or 10x at least, so using your cache effectively helps you a lot. Cassandra also has a lot of caching options, as you probably know; there's the row cache and different kinds of things. But if I were to put this data in Spark, in an in-memory data set, that is another one or two orders of magnitude faster.
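A sketch of that difference using Spark's cache(); the exact timings obviously depend on your data and cluster:

```scala
// Pull a data set into Spark's in-memory cache: the first action pays the
// cost of reading from the source, later ones are served from memory.
val events = sc.textFile("hdfs://nn/path/events").cache()

events.count()   // cold: reads from the source and populates the cache
events.count()   // warm: memory-speed, typically orders of magnitude faster
```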
A
We
found
that
developers
with
lexpark
one
of
our
engineers
looked
at
it,
for
30
minutes
was
able
to
like
get
up
and
running
and
write
jobs
right
away.
The
collections
API
is
very
high,
very
high
level,
so
it
allows
you
to
think
in
terms
of
high-level
transformations
and
you
have
three
development:
three
languages
for
you
to
implement
SPARC
jobs.
You have an interactive shell, which basically lets you play with your data: you can just type, let me map my data and transform it in a certain way, and get results out pretty quickly for testing and a little sanity checking.
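For example, a quick session in the Spark shell might look like this (illustrative only):

```scala
scala> val nums = sc.parallelize(1 to 1000000)
scala> val squares = nums.map(n => n.toLong * n)
scala> squares.filter(_ % 2 == 0).count()   // squares of the 500,000 even inputs
res0: Long = 500000
```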
That leads to quick development cycles, and these days, oftentimes the only advantage that a lot of startups or companies have is developer velocity. So that's something we find very important.
So Shark is 100% HiveQL compatible, because it actually embeds Hive, but it is much faster. It can actually reuse all of your Hive storage handlers, SerDes, and things like that; I'll touch on this in a minute. And if you're running DSE, you can still use CassandraFS for your metastore.
One thing is that Shark has integration with Spark. For the things that SQL is not good at expressing, you can run a query in HiveQL and then use Scala or Java code to transform your data in ways that aren't possible with HiveQL. So that's an alternative to writing UDFs.
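A sketch of that handoff, based on the Shark-era sql2rdd API as I recall it; the table, columns, and row accessors are illustrative:

```scala
// Run HiveQL, but get the result back as an RDD instead of printed output.
val rows = sc.sql2rdd("SELECT user_id, video_id FROM events WHERE dt = '2013-06-01'")

// From here it is ordinary Scala: logic that would be painful as a Hive UDF,
// such as distinct videos per user, is a couple of RDD transformations.
val videosPerUser = rows
  .map(r => (r.getString(0), r.getString(1)))   // accessor names illustrative
  .groupByKey()
  .mapValues(_.toSet.size)
```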
We'll talk about our new architecture a little bit, and how we integrate Cassandra and Spark and Shark. On the left are what we call raw events. Basically, every time you watch a video on one of our customers' websites, like ESPN, there's a little JavaScript or ActionScript module which is sending data: when a video displays, when you play, or when you skip forward, it sends events to our servers.
So that's what I mean by raw events. From here we have an ingestion pipeline that writes them into Cassandra, into what we call an event store. From there, we use Spark to create various materialized views, and the way we then utilize those materialized views is that we can use Spark to do quick, predefined queries, or we can run ad hoc HiveQL queries in Shark. Both are available.
A little bit about how we physically organize our clusters. We start with Cassandra on the nodes; on top of Cassandra we put an InputFormat. We basically wrote our own InputFormat, and I'll go into detail in just a second about why we did that. Sitting on top of the InputFormat is the Spark worker, co-located with Cassandra.
Our schema is mostly a variation on a standard time series. Basically, you have an ID, or a user, and you have events at various times. We use the time as the column key. However, we actually have a separate column family called the attributes column family.
The reason we do this is that there are a lot of attributes, such as a user's IP address or user agent, which repeat again and again, so we keep them in a separate column family. One, it saves a huge amount of space and I/O; and two, it allows us to easily generate indexes and to annotate attributes after the fact if we want to.
And the way that we unpack this: in Cassandra you have rows and N columns in a time series, and what we want to do is extract what you can think of as JSON blobs and expand them into an SQL-like table for data warehousing. Each cell maps to one row. This is actually very similar to how CQL reads out data: your user ID, which is essentially part of the row key, gets one column.
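A hypothetical sketch of that expansion in Spark; the types and the JSON parsing are stand-ins:

```scala
import org.apache.spark.rdd.RDD

case class WideRow(userId: String, cells: Seq[(Long, String)])   // (timestamp, JSON blob)
case class FlatRow(userId: String, ts: Long, eventType: String)  // one SQL-style row per cell

def eventTypeOf(json: String): String = ???   // real code would parse the JSON blob

// Each wide Cassandra row fans out into one flat row per cell, CQL-style.
def unpack(rows: RDD[WideRow]): RDD[FlatRow] =
  rows.flatMap { row =>
    row.cells.map { case (ts, json) => FlatRow(row.userId, ts, eventTypeOf(json)) }
  }
```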
Alright, this slide, by the way, is not in the Cassandra Summit presentation: different options for integrating Spark with Cassandra. The first is that you can use a Hadoop InputFormat; Spark will use any existing Hadoop InputFormat out of the box. So the choices are: you have the ColumnFamilyInputFormat built into Cassandra. This was really the only built-in option before, I think, 1.2; I'm not quite sure.
In more recent versions, there is now something called the CqlPagingInputFormat, which is pretty nice. You can actually give it a CQL query and it will read out the relevant data for you, and it pages, so it will also handle wide rows pretty well, and it supports secondary indexes.
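Here is roughly what wiring that InputFormat into Spark looks like; the class and config-helper names are from the Cassandra 1.2-era API, quoted from memory, so treat this as a sketch:

```scala
import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat

val conf = new Configuration()
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1")
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
ConfigHelper.setInputColumnFamily(conf, "analytics", "events")   // keyspace, column family

// Spark accepts any Hadoop InputFormat out of the box.
val rows = sc.newAPIHadoopRDD(
  conf,
  classOf[CqlPagingInputFormat],
  classOf[java.util.Map[String, ByteBuffer]],   // key columns
  classOf[java.util.Map[String, ByteBuffer]])   // regular columns
```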
You can also read data from any source in Spark without an InputFormat. The big disadvantage is that without an InputFormat you can't write standard Hadoop jobs, like standard MapReduce jobs. However, it is actually much easier to develop reading from Cassandra without an InputFormat: you can essentially do it in one line, that line right there that says sc.parallelize. What I'm doing is: Spark has a parallelize operator that will take any list.
Let's say I have a list of a thousand row keys and I want to read those rows from Cassandra into Spark. What parallelize does is take the thousand list elements and distribute them across the cluster. Then what the flatMap does is take each row key, and you can give it something that reads all the columns for that row key, and so you end up, in one line, with something that can read a whole bunch of rows from Cassandra in a distributed manner, as sketched below.
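Spelled out, the one-liner looks like this; readColumns is a hypothetical stand-in for whatever client call you use against Cassandra:

```scala
// A stand-in for a Thrift/CQL client read; implementation elided.
def readColumns(rowKey: String): Seq[(String, Array[Byte])] = ???

val rowKeys = (1 to 1000).map(i => s"user-$i")

// Distribute the keys across the cluster, then have each worker read the
// columns for its share of keys: a distributed Cassandra read in one line.
val columns = sc.parallelize(rowKeys).flatMap(key => readColumns(key))
```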
So if you don't care about Hadoop compatibility, or rather MapReduce compatibility, you can actually save yourself probably, I don't know, two weeks or a month of developing an InputFormat. There's also a JDBC RDD. Sorry, RDD is a Spark term that stands for Resilient Distributed Dataset; this is sort of the base...
...base abstraction in Spark. It's basically a data set that is stored in memory and can be transformed. You can also use the Cassandra JDBC driver with the JdbcRDD support and do things that way. So there's actually a fair number of options. Lastly, there are companies that are starting to open-source some Spark and Cassandra integration, so I have one link there that you might want to check out.
There's a new Hadoop API and an old API. Plan carefully, because they are incompatible, but similar enough to be very confusing, and some things build on top of one or the other. For example, Cascading... I'm not sure about Cascading, but Hive, for example, up until very recent versions would only work with the old Hadoop API.
You need to be prepared to spend time tuning your split computation. I'll give you an example: if you have, say, indexes in Cassandra, those can be very long rows. You don't want to spend a lot of time reading one of those rows out in your InputFormat, because when you're computing splits, that's all done on one node; it's not distributed.
So this is sort of like the earlier graph I showed you, for reading from Cassandra. With a cold cache it takes, I don't know, maybe two minutes to read one and a half billion events... oops, wait a minute. When we read it again and again, the cache warms up and it drops by a factor of three or four. When we process the data in memory, we can get it out in well under a second.
Here's the workflow for Spark and how we do our processing. First, we start an aggregation job in Spark through our Java server. This basically reads the raw events from Cassandra, processes them, and creates some aggregates. The aggregates are stored as a data set in memory, distributed amongst all the different Spark worker nodes.
So Spark has something called lineage, which is that it remembers, for every computation, the exact transformations that you ran. It's designed so that, in the case of failure, it can go back and rerun your source computation, so that you end up with the same aggregates. Now, obviously this is rather expensive: in the case of Cassandra, you'd need to read from Cassandra again and redo the aggregates, which might not be what you want, it turns out.
It also allows us to easily inspect our materialized views in Cassandra. Let me see... There's another option now: Spark's Project Tachyon basically allows you to use it as a write-through cache to HDFS, and if you use TachyonFS, or Cassandra paths for example, it will cache your data in memory so that you don't have to read it back again; essentially, the data is already stored in memory.
Okay, I'll take this one; it's about Storm versus Spark. I would say Storm and Spark are designed to address very different problems. Storm is for streaming: if you want to stream from a never-ending data source, Storm is very good for that. I guess we look at this a little differently, in that we are reading chunks of data, so we think of this use case as more like Spark.
As for Spark itself: we still have Storm systems in production, so this doesn't replace Storm. I would say that if you want to use something like Spark Streaming, that might replace Storm, but this use of Spark is more like a replacement for Hadoop, enabling extremely fast batch jobs.
Once you have created a table, that actually doesn't read the data right away; that's just a mapping. Now, when I do CREATE TABLE events_cached, that actually caches those events in memory as a cached table, and that is what actually reads data from Cassandra. And now we can query the cached table.
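In Shark this is just a CTAS statement; by Shark's convention, a table whose name ends in _cached is materialized in cluster memory (the table and column names here are illustrative):

```sql
-- The mapping alone reads nothing; this CTAS is what actually scans
-- Cassandra and pins the result in memory.
CREATE TABLE events_cached AS SELECT * FROM events;
```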
For example, let's start with something really simple: I'm just going to SELECT COUNT(1) FROM the cached table. You know, 1.4 million rows, and it comes back fast, in half a second. But the real power comes when we do something a little bit more complicated. So I'm going to get a count across various providers, or videos, and you get this long list again. This is, you know...
...extremely fast; it's a fast GROUP BY operation. Now let's say that what we really want is a top-K, or top-10, list of videos by count, so we're going to group, and order by the count descending, LIMIT 20. This again returns pretty fast. So this allows you to do the kinds of queries that would be pretty slow if done from disk.
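Illustrative versions of those queries against the cached table (the schema is assumed):

```sql
-- Simple full count over ~1.4 million in-memory rows
SELECT COUNT(1) FROM events_cached;

-- Top 20 videos by event count: group, order descending, limit
SELECT video_id, COUNT(1) AS cnt
FROM events_cached
GROUP BY video_id
ORDER BY cnt DESC
LIMIT 20;
```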