There is the Spark Cassandra Connector, which is an open source product available for everyone to download, and I'd like to take my slot here to talk a little bit about data locality and how we, in the Spark Cassandra Connector, achieve data locality with Cassandra. I'm going to start out with a little story, and it's one you might have heard before. It involves this man here, Lex Luthor, who happened to own a large amount of property in an undesirable location. And that's the lesson we need to learn from this.
Luckily for us, there is a different kind of paradigm we can use with Spark. Spark is really our hero that lets us get away from this old pattern of trying to move everything all at once and then do work on it. It's a lot more like having superheroes, and we have superheroes in different cities, and each of these superheroes can do their work locally. The thing is, when we have Superman doing work in Metropolis and Batman doing work in Gotham City, we save all that time of the two going back and forth and doing all this wasted effort. The same thing happens with Spark, because with Spark we can end up saying the only data you're going to work on is the local data on the machine that the executor is running on. So it ends up looking a little like this, where our Spark executors are our superheroes and our nodes are our cities.
This ends up saving us a lot of time, because when we're talking to a database like Cassandra, talking to the local machine is going to cut down a lot of network traffic on that initial load, which is so important for getting data into Spark for processing.
So now that I've gone over this little story, let me just say that this is all automatically implemented inside the Spark Cassandra Connector, which is now compatible with Spark 1.3. And actually, between the time we submitted these slides and today, we've just had our milestone release for Spark 1.4 as well.
So if you'd like to try that out, it's ready right now. We support all of the Cassandra data types, including user-defined types; we can save to Cassandra; we have intelligent write batching; we support collections and all kinds of other great features; and, of course, it's open source, free for anyone to use.
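
To give a feel for the save path, here's a minimal sketch, assuming a hypothetical keyspace "ks" and table "users" with matching columns:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal save sketch; keyspace, table, and column names are hypothetical.
    val sc = new SparkContext(new SparkConf()
      .setAppName("connector-sketches")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    val rows = sc.parallelize(Seq(("Yosik", 830), ("Bruce", 412)))
    rows.saveToCassandra("ks", "users", SomeColumns("name", "score"))

The connector batches those writes intelligently behind the scenes, which is the write-batching feature mentioned above.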
Now let's go into a little bit about how the Spark Cassandra Connector actually reads data node-local. To talk about that, I first have to go over a quick lesson on how Cassandra actually stores data. Cassandra basically stores all data using this thing called the token range, and the token range can be thought of as a long series of numbers. These numbers are then divvied up amongst the nodes in the Cassandra cluster. So say I had three nodes; to keep with my theme here, we have Metropolis, Gotham, and Coast City.
What this ends up meaning is that Metropolis is responsible for all of the tokens between, in this particular example, 750 and 99 (wrapping around the ring), Gotham is responsible for 350 to 750, and Coast City is responsible for the remaining part. Of course, in reality our partitioner is going to go from a very, very small number to a very, very large number, but this just makes it a little bit easier to illustrate the point. Once we have this assignment of primary responsibility for each of these tokens, we can start saying where data actually belongs in the cluster.
So when we apply that, for example, to this row, if we say that the name part here is actually the partition key, we get out a single value, and that single value will tell us, based on the ownership of that range, where that piece of data belongs. In this case we have "Yosik" hashing to 830, which means that this particular row belongs on the Metropolis node. So based on the content of the row, we know exactly what node it should live on.
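
To make that concrete, here is a toy sketch of the hash-then-route idea. This is illustrative only: real Cassandra hashes with the Murmur3Partitioner over a vastly larger range, and the 0-999 ring and city nodes just mirror the example numbers:

    // Toy model of token routing, not the connector's real code.
    object TokenRingSketch {
      val ringSize = 1000 // hypothetical 0-999 ring from the slides

      // Stand-in hash; real Cassandra uses Murmur3 over the serialized key.
      def token(partitionKey: String): Int =
        (partitionKey.hashCode & Int.MaxValue) % ringSize

      // Ownership from the example; Metropolis wraps around from 750 to 99.
      def owner(t: Int): String =
        if (t >= 350 && t < 750) "Gotham"
        else if (t >= 99 && t < 350) "Coast City"
        else "Metropolis" // covers 750..999 and 0..98

      def main(args: Array[String]): Unit =
        println(owner(830)) // prints Metropolis, matching the example row
    }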
Now, this changes slightly if you have vnodes enabled in Cassandra. There, the picture looks a little bit like this, where instead of the ranges being one large contiguous piece each, we actually have them broken up into smaller chunks. And what I'd like to illustrate here is that just because we change who owns which ranges doesn't mean we change what the hashed value of this particular row is, but it does change which node it might belong on.
So with that, let's talk a little bit about loading huge amounts of data. To the Cassandra connector, that's basically a full table scan, and a full table scan means we need that entire token range turned into a Spark RDD.
The way the Spark Cassandra Connector does that is, it starts when you call one of these two functions: either through the DataFrames API or through the connector's cassandraTable method. It will make a listing of all of the token ranges in the cluster and know where those particular ranges are hosted.
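
As a sketch, those two entry points look like this, reusing the SparkContext sc and the hypothetical keyspace and table from the earlier snippet (the DataFrame form is the Spark 1.3/1.4-era syntax):

    import com.datastax.spark.connector._ // enables sc.cassandraTable
    import org.apache.spark.sql.SQLContext

    // Entry point 1: the RDD API, a full table scan as a CassandraRDD.
    val rdd = sc.cassandraTable("ks", "users")

    // Entry point 2: the DataFrames API.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "users"))
      .load()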
So it knows where all of the replicas are for each particular section of the entire token range. Then it takes these token ranges and starts building them into Spark partitions, where the metadata for the partition says that this particular partition is for this particular piece of this particular token range. So, for example, we might break up these purple ranges into two Spark partitions.
Now, the way we break up these ranges into Spark partitions is based on spark.cassandra.input.split.size. This basically says how many live tokens we would like there to be within a single Spark partition, and I'd like to note that this is an estimate: we aren't actually able to do the true count without actually reading all the data, but this is basically as close as we can get.
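
That knob is an ordinary configuration property; here's a sketch, with a placeholder value you'd want to benchmark rather than copy:

    import org.apache.spark.SparkConf

    // Approximate number of Cassandra partition keys ("live tokens")
    // the connector targets per Spark partition (1.x-era property name).
    val conf = new SparkConf()
      .set("spark.cassandra.input.split.size", "100000")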
So we'll continue doing this for each node, and since we know each token range's replicas, we're able to consistently make Spark partitions that map to only a single Cassandra node. At this point we can also assign a piece of metadata: since we know which node we built each Spark partition for, we can assign a preferred location for that particular Spark partition, which is a Cassandra node.
So when we go about actually executing queries, if we take a look at one of our executors, oh sorry, if we take a look at the Spark driver, we'll see that the Spark driver is going to see the Spark partition, know that all the token ranges in that Spark partition belong on a certain machine, and use that to assign the task to that particular node. Note that this is controlled by the spark.locality.wait parameter.
So if you do go past spark.locality.wait seconds, the task will end up getting sent to a different node. If you really want to ensure data locality when getting data from a Cassandra node, make sure this is either very large or set to infinite.
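
In configuration terms that looks something like this sketch; the value is an assumption, anything longer than your worst scheduling delay works:

    import org.apache.spark.SparkConf

    // Give the scheduler a long grace period before it gives up on the
    // node-local preference and sends the task to a different node.
    val conf = new SparkConf()
      .set("spark.locality.wait", "3600s")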
So once the task has been assigned, we know that since this is a Metropolis partition, it needs to be done by Superman.
What will happen is our token range gets split into this query, and the query gets executed. As it's executed, spark.cassandra.input.page.row.size decides how many rows will be paged out of Cassandra at a time. So this is basically how you control how quickly data is pulled from Cassandra while you're running these queries.
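
That's again a single property; a sketch with an assumed value:

    import org.apache.spark.SparkConf

    // How many rows each fetch pages out of Cassandra during a scan.
    val conf = new SparkConf()
      .set("spark.cassandra.input.page.row.size", "1000")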
So again, another cool thing about having CQL here is that any pushdown you can do in a CQL query, any queries on secondary indexes or clustering keys, can be done here as well. We can actually use the interface that the connector provides to push down clauses to Cassandra, to minimize the amount of data we're actually pulling out of our Cassandra tables.
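
Here's a sketch of that pushdown on the RDD API; the table and column names are hypothetical, with time as a clustering key:

    import com.datastax.spark.connector._

    // Both the projection and the where clause are sent to Cassandra as CQL,
    // so only the matching slice of each partition crosses the network.
    val recent = sc.cassandraTable("ks", "events")
      .select("name", "time", "value")
      .where("time > ?", "2015-06-01 00:00:00")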
So that's basically how we would go about loading a large amount of data. But let's say instead we have just a selection of partition keys. In the case that you have an RDD full of partition keys that you want to extract out of a Cassandra table, we also have a different method. So this is about loading sizable but well-defined amounts of data.
The one thing to notice in this is that the RDD you might give us to pull out of Cassandra may not have all the data arranged in such a way that the partition keys are all mapped to the same node. That gets us to a little problem, because we're back on that unfortunate side of things where we no longer have data locality in our actual execution. But we've also provided a way to get around this, with a function called repartitionByCassandraReplica. This lets us take that RDD
that is partitioned in some way that is not congruent with the underlying partitioning of the Cassandra table we want to use, and basically it will perform a shuffle that places the data so that we get these node-local partitions again. So when we have node-local Spark partitions with an assigned idea of which node they belong to, we can then do our joinWithCassandraTable and have each Spark partition mapped to a single node that is also running Cassandra.
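
Put together, that flow looks roughly like this sketch; the keyspace, table, and key case class are hypothetical stand-ins:

    import com.datastax.spark.connector._

    // Hypothetical type matching the table's partition key column.
    case class UserKey(name: String)

    val keys = sc.parallelize(Seq(UserKey("Yosik"), UserKey("Bruce")))

    val rows = keys
      .repartitionByCassandraReplica("ks", "users") // shuffle each key to an executor on a replica node
      .joinWithCassandraTable("ks", "users")        // then do node-local lookups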
The way that this actually executes under the hood is pretty similar. We end up building CQL statements that look like "select something from your table where the primary key equals" the primary key in the data that's being pulled out of the RDD. The same spark.cassandra.input.page.row.size controls how much data is pulled at a time, and the same pushdowns and all of that apply as well.
So I want to take a quick second to talk about how this works in DataStax Enterprise. That's our commercial offering, and one of the cool things about it is that it has Apache Solr and Apache Spark together. The thing is, if you run both at the same time, one of the cool things you can do is combine your Spark queries with a Solr lookup.
So what we can end up doing is pushing down a Solr query into that same syntax, and what we'll end up doing is hitting a Lucene index locally on that same machine before hitting the same machine for the Cassandra data. So we can end up doing this cool combination of Lucene indexing and Cassandra lookup and Spark, all at the same time, while maintaining data locality.
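
A sketch of what that can look like in DSE; the table and query string are hypothetical, and it assumes DSE Search is enabled on those nodes:

    import com.datastax.spark.connector._

    // DSE exposes the search index through the special solr_query column,
    // so the Lucene lookup runs on the same node as the Cassandra read.
    val hits = sc.cassandraTable("ks", "articles")
      .where("solr_query = ?", "body:locality")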
So I know that was a lot of information to go over, and I don't have a very long slot, so I just wanted to quickly point out that you can learn a lot more about this online at academy.datastax.com. We have tons of videos about how the connector works and how our integration works. And of course, Cassandra Summit is coming up in September, and if you enjoyed being here, I'm sure you will enjoy being there as well. I think I have time for one or two questions.
So, for example, if you have a whole lot of partitions based on name, and you also have events that are clustered based on time, then if you give us a whole lot of names and a specific time that you want to be greater than, we can efficiently grab all of those slices. But it's not as efficient for taking something that isn't on a clustering key.
So, last question. Yeah, so you can set the consistency level, and basically we apply it to the entire RDD action. So you can choose whatever you want.
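
For reference, the consistency level is also just a pair of connector properties; the levels below are illustrative assumptions:

    import org.apache.spark.SparkConf

    // Applied across the whole read or write action, as described above.
    val conf = new SparkConf()
      .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")
      .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")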