From YouTube: Webinar | Big Data Analytics with Cassandra and Spark
Description
Apache Cassandra is the leading distributed database in use at thousands of sites with the world’s most demanding scalability and availability requirements. Apache Spark is a distributed data analytics computing framework that has gained a lot of traction in processing large amounts of data in an efficient and user-friendly manner. The joining of both provides a powerful combination of real-time data collection with analytics. After a brief overview of Cassandra and Spark, this class will dive into various aspects of the integration.
A
Willie Sutton was a bank robber in the 1930s, '40s, and actually into the '50s, and within the first month of the creation of the FBI Most Wanted list, Willie managed to get up on that list. He was actually captured a number of times; he managed to escape a few times, actually, and then was finally captured in 1952.
A
The thing that's interesting about Willie is they asked him this question after his career of robbing banks: why do you do that? Why are you robbing the banks? And his answer was, "That's where the money is." So we're going to keep Willie in the back of our mind for a little while as we talk through some of these other pieces of the puzzle, and we'll come back to him towards the end.
A
We've been talking about doing that for, like, 30 years. I still never understood what the toasters would have to say, but nonetheless it's the idea that we're going to have all these connected devices, across the enterprise and out in the world, connecting back to some location to deliver data. So we see this in a whole bunch of places. You get things like your thermostats, like Nest and some of the others who are getting in that space. You see it with connected cars.
A
We see it on manufacturing floors giving telemetry data, smart metering, etc. In all of them, the idea is: we've got our sensors or devices out there on the internet, and they are sending data back to some central system. That central system clearly can be distributed around the world.
A
But from a logical standpoint, that data is coming into this one system, and now we have to think about what that system should do. It needs to be able to receive from these various places, and it also needs to be able to answer some relatively straightforward questions. Those could be questions like "what did that silver toaster say last?", or it could be something a little bit more fancy, like "what was the average number of pieces of toast per, I don't know, toaster type?", or something along those lines.
A
If we are at all successful in what we're doing, we're going to see more toasters coming on the scene, and our system is going to need to handle the fact that we are successful here. The growth of the number of sensors out there is something our system is going to need to tolerate. And then the other part is that we are certainly going to be in the realistic world, where faults will happen.
A
Nodes
will
go
down
because
of
Hardware
reasons
or
something.
You
know.
We
had
this
situation
almost
a
year
ago,
where
Amazon
had
to
reboot
a
whole
bunch
of
servers
for
some
regular
maintenance,
and
we
hear
about
some.
You
know,
folks
that
were
hosting
things
on
Amazon.
Their
systems
became
offline,
while
other
folks,
like
Netflix,
were
rolling
just
fine
and
we
we
now
talk
about
the
Amazon
outage
and
not
the
Netflix
outage.
A
Apache Cassandra is a distributed NoSQL database. It is sort of the love child of the BigTable paper by Google and the Dynamo paper by Amazon. BigTable is really the data structure: we're talking about tabular data with these sort of column families, and you're able to have sparse storage of it.
A
In other words, we don't store the nulls that don't happen for certain things, and it's a very flexible data structure. And then the Dynamo paper was all about resilience and fault tolerance, and having this concept of no master. That's an interesting element that comes into play with Cassandra, and one we make a lot of use of: the fact that all the nodes are equal. As a result, in that sort of democratized world, we don't need to worry about a couple of things.
A
Similarly,
since
all
nodes
are
the
same,
that
means
that
all
nodes
are
answering
questions
and
queries
to
client
applications,
as
well
as
serving
up
the
data
and
owning
some
part
of
the
data.
To
answer
those
questions
in
the
system
means
that
if
we
double
the
number
of
nodes
in
the
system,
we
really
are
not
only
doubling
the
amount
of
storage,
but
we're
also
doubling
the
number
of
questions
that
we
can
answer
at
the
same
time.
So
we
usually
talk
in
Cassandra.
We
usually
talk
about
transactions.
A
We
merge
the
idea
of
reads
and
writes
into
sort
of
one
bucket.
There
certainly
are
ways
in
which
the
read
path
and
the
right
path
are
a
little
bit
different,
but
we
usually
talk
about
transactions
per
second.
So
when
we
scale
out,
we
can
get
more
transactions
and
more
data.
One
of
the
things
baked
into
Cassandra
won't
talk
too
much
about
that
on.
This
talk
is
really
this
multi
data
center
idea,
and
so
we
can
have
our
data
centers.
A
Can
the
data
centers
can
either
be
geographically
disparate
because
of
of
either
two
things?
One
is
fault
tolerance,
and
so
we
can
say
that
if
there
was
a
flood
in
in
in
say,
New
York,
like
a
hurricane
sandy,
then
a
data
center
they're
going
out
the
application
can
can
go
to
another
data
center,
say
on
a
you
know,
high
up
on
a
mountaintop,
far
away
from
a
flood
plain,
and
so
that's
one
reason.
A
Another
reason
for
geographic
distribution
could
be
the
idea
of
wanting
locality,
so
I'm
going
to
do
the
guys
in
New
York,
we'll
talk
to
the
New
York
server
and
the
guys
in
Los
Angeles,
we'll
talk
to
the
Los
Angeles
server.
But
for
this
talk
we
actually
might
be
interested
in
the
third
reason
for
having
this
and
that's
that's
having
workload,
isolation
and
so
those
those
data
centers
can
be
virtual
data
centers.
In
other
words,
they
could
actually
be
in
the
same
location
and
we've
just
isolated
them.
A
So
that,
if
you're
doing
a
slightly
different
workload,
you've
got
different,
Hardware
dedicated
to
it
and
then
last
a
little
bit
on
the
Cassandra
query.
Language
Cassandra,
when
it
first
came
out,
was
admittedly
not
the
easiest
thing
to
use
a
lot
of
the
easy
use
cases.
A
very
straightforward
use
cases
still
required
you
to
get
down
into
the
guts.
A
Admittedly,
the
the
harder
more
exotic
ones
also
required
you
to
get
down
in
the
guts,
but
you
were
probably
willing
to
go
that
far
with
something
that
that
custom,
and
so
we
came
up
with
this
thing-
called
the
Cassandra
query,
language
which
looks
a
lot
like
sequel,
but
is
a
little
bit
deceiving.
Basically,
it
covers
all
the
operations
that
you
would
do
in
Cassandra
in
these
normal
sort
of
use.
Cases
makes
them
very
simple
and
they
look
like
the
SQL
analog
now.
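For instance, with a hypothetical table and column names invented for the toaster example, a simple CQL statement is nearly indistinguishable from its SQL analog:

```sql
-- Hypothetical schema; the table and column names are made up.
CREATE TABLE readings (
    sensor_id int,
    reading_time timestamp,
    temp_f int,
    PRIMARY KEY (sensor_id, reading_time)
);

INSERT INTO readings (sensor_id, reading_time, temp_f)
VALUES (1, '2015-07-01 12:00:00', 73);

SELECT temp_f FROM readings WHERE sensor_id = 1;
```

The deceiving part is what's underneath: the PRIMARY KEY here determines data placement, so the WHERE clause is efficient only when it restricts the partition key.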
A
So
if
we
think
just
briefly
on
sort
of
how
Cassandra
works
on
writes,
we've
got
our
little
nest.
Thermostat
here
so
NASA
actually
is
using
Cassandra
and
he's
going
to
connect
to
the
cluster.
Now
when
he
connects
to
the
cluster,
he
actually
sees
the
whole
set
of
nodes
here
and
he
can
make
intelligent
decisions.
A
There's
all
sorts
of
load,
balancing
decisions
that
he
can
make
we'll
do
something
simple,
let's
say
he's
doing
around
robin
and
say
he's
going
to
share
his
workload
among
these
five
as
he
goes
around
and
so
he's
going
to
just
tell
one
of
them:
hey
if
73
degrees,
and
so
that
node
may
actually
not
own
the
data
and
if
he
doesn't
that's
okay,
he's
happy
to
be
the
coordinator
who
we
called
the
coordinator,
I.
Think
of
him
more
as
like
the
maitre
d,
and
he
says
I'll,
take
care
of
you.
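A toy sketch of that client-side round robin (node names are invented): each write goes to the next node in rotation, and whichever node receives it plays coordinator.

```python
from itertools import cycle

# Hypothetical five-node cluster; the names are made up for illustration.
nodes = ["node1", "node2", "node3", "node4", "node5"]
next_node = cycle(nodes)

def send_reading(reading):
    """Round-robin load balancing: pick the next node in turn.
    The node that receives the write acts as the coordinator."""
    coordinator = next(next_node)
    return coordinator, reading

# Seven writes wrap around the five-node rotation.
assignments = [send_reading({"temp_f": 73})[0] for _ in range(7)]
```

Real drivers offer smarter policies (e.g. token-aware routing straight to an owning replica), but round robin is the simplest to picture.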
A
I
can
wrap
that
question
to
who
who
needs
to
know
it
or
who
needs
to
have
that
information,
and
so
internally.
He
will
pass
that
on
to
the
node
that
owns
the
data
that
person
or
that
node
will
respond
when
he's
done
and
the
coordinator
or
the
maitre
d
is
able
to
reply
back
to
the
to
the
thermostat
or
to
the
client
so
few
things
going
on
there
that
we
can
actually
start
controlling
here
and
that
comes
under
this
large
umbrella
of
tunable
consistency.
A
So
consistency
is
one
of
the
parts
of
acid,
but
it's
frequently
one
of
the
things
that
that
we
take
for.
We
don't
really
take
advantage
of
in
in
the
sequel
space,
but
we
end
up
having
to
pay
for,
and
so
the
one
of
the
things
that's
been
done
in
Cassandra
is
relaxing
consistency
and
allowing
people
to
tune
it
to
the
needs
that
they
have,
and
so
by
doing
that,
we're
in
a
position
where
we
can,
where
we
can
get
different
scale
out
characteristics,
fault,
tolerance,
characteristics,
etc.
A
So
the
first
thing
to
note
is
the
data
in
Cassandra
is
replicated.
You
set
that
when
you
set
up
your
table
and
we're
kind
of
distributed
by
token
range,
so
each
node
is
responsible
for
some
portion
of
the
of
the
range
of
tokens
for
rows,
and
so
all
rows
have
a
primary
key,
which
maps
into
that
token
and
based
on
the
ranges,
tells
you
where
that
data
should
be
located.
So
if
we
keep
it
simple,
and
we
said
that
we
had
a
hundred
tokens,
we
don't
we
have
a.
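Sticking with that simplified hundred-token picture, here is a toy sketch of token ownership (this is not Cassandra's real partitioner; the node names and hash are invented): the primary key hashes to a token, and the token range tells you which node owns the row.

```python
import hashlib

# Toy model: 100 tokens split evenly across five hypothetical nodes,
# so node1 owns tokens 0-19, node2 owns 20-39, and so on.
NUM_TOKENS = 100
NODES = ["node1", "node2", "node3", "node4", "node5"]

def token_for(primary_key: str) -> int:
    """Hash the primary key into the 0..99 token space."""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % NUM_TOKENS

def owner_of(primary_key: str) -> str:
    """Each node owns a contiguous slice of the token range."""
    return NODES[token_for(primary_key) // (NUM_TOKENS // len(NODES))]
```

Replication then just means the row also lands on the next few nodes after the owner, up to the replication factor.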
A
Okay, now, when we do writes, we have options as to how many of those replicas need to acknowledge that the data was delivered before we tell the client that it's all done. By combining reads and writes and these consistency levels, we're in a position to ensure some things about how the data is distributed. This becomes a core part of the application and the design decisions that we make, and it's one of the things that's somewhat interesting about Cassandra. So the first one, the upper left, is probably the most relaxed.
A
Basically,
what
you
tell
the
maitre
d
is
hey
as
long
as
you
have
it,
even
if
you
don't
own
it
as
long
as
you
know
that
that
you
acknowledge
that
you've
received
this
data,
then
you
just
tell
me
the
clients
and
I'll
be
happy.
Now
what
the
coordinator
is
going
to
do
in
the
background
is
he's
going
to
make
sure
that
let's
say
we
have
our
F
of
a
replication
factor
of
three
it's
pretty
common.
A
The
coordinator
is
always
going
to
make
sure
that
all
three
replicas
get
the
data,
it's
just
a
matter
of
when
he
tells
the
client
that
things
have
been
done
and
you
can
some
assurances
around
it.
So
the
upper
left
is
the
most
relaxed.
Okay,
just
as
long
as
the
coordinator
has
it
it's
fine.
The
lower
left
is
just
just
make
sure
one
of
the
replicas
has
it,
but
if
one
of
the
replicas
has
it
I
know
that
it's
durably,
you
know
in
the
data
in
the
system
somewhere
and
I'm.
A
Happy
in
the
upper
right
is
actually
probably
the
most
common,
which
is
quorum,
ensure
that
a
majority
of
the
the
nodes
are
actually
seeing
the
data,
and
then
we
can
do
some
interesting
a
little
bit
of
interesting
things
with
respect
to
guaranteeing
certain
data,
consistency,
things
and
then
in
the
lower
right.
It's
actually
the
most
stringent
we
say:
hey
I,
need
to
make
sure
that
every
node
has
actually
written
the
right
that
I
just
got
now
the
lower
right.
A
One
has
actually
got
some
challenges
to
it,
because
if
any
node
is
down,
then
that
right
won't
succeed.
So
we
lose
some
of
our
availability,
which
is
one
of
the
things
that
we're
trying
to
keep
very
high
here
is
the
high
availability
and
so
quorum
becomes
a
really
popular
choice
for
really
two
main
reasons.
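A quick sketch of why quorum is the popular middle ground (simplified; real Cassandra has more levels, such as LOCAL_QUORUM): with a replication factor of three, QUORUM needs only two acknowledgements, so one node can be down and you still succeed, and a quorum read is guaranteed to overlap a quorum write because R + W > RF.

```python
def acks_required(level: str, rf: int) -> int:
    """Replica acknowledgements needed before the coordinator replies."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

RF = 3  # replication factor of three, as in the talk
w = acks_required("QUORUM", RF)  # 2
r = acks_required("QUORUM", RF)  # 2

up_replicas = RF - 1                               # one replica is down
quorum_ok = up_replicas >= w                       # QUORUM still succeeds
all_ok = up_replicas >= acks_required("ALL", RF)   # ALL fails

# Any read quorum must intersect any write quorum whenever R + W > RF,
# which is what makes quorum reads see the latest quorum write.
overlap = r + w > RF
```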
A
Now,
that's
the
that's
the
right
side
of
things
now
on
the
read
side
of
things.
So
now
we've
got
our
client
here
on.
The
left
is
doing
this
query.
This
is
looks
like
SQL.
It's
actually
cql,
it's
very
similar
in
that
intersection
of
space
between
them
and
again
he's
going
to
contact
the
coordinator.
It
could
be
anyone
and
he's
going
to
ask
this
question.
A
Give
me
the
user
ID
for
the
for
the
user,
whose
name
is
PB.
Cup
fan
all
right,
so
the
coordinator
says
no
problem.
Let
me
go
get
that
and
consult
with
some
number
of
the
replicas
based
on
the
consistency
level.
So
let's
say
that
he
did
it
on
quorum,
and
so
the
coordinator
is
asking
two
of
the
nodes
for
what
they
have
and
they
each
give
him
the
value
that
they've
got
now.
They
may
be
different
right,
we
didn't
say
we
said
before
very
active
system.
A
We
said
that
you
could
end
up
with
only
needing
to
ensure
that
one
or
two
or
a
subset
of
the
nodes
got
the
rights.
So
when
you
ask
the
question,
you
may
get
different
answers
and
the
way
we
do
that
in
Cassandra
is
that
when
we
do
the
writes,
we
put
a
timestamp.
So
what
the
coordinator
has
to
do
in
this
case
is
he's
not
trying
to
see
if
the
values
are
the
same.
This
is
not
a
majority
votes
majority
wins
system.
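A minimal sketch of that reconciliation (the values and timestamps are made up): the coordinator simply returns the reply carrying the newest write timestamp, rather than taking a vote.

```python
# Each replica replies with (value, write_timestamp); numbers are invented.
replies = [
    ("72F", 1000),  # stale replica that missed the latest write
    ("73F", 1005),  # replica that saw the latest write
]

def reconcile(replies):
    """Last-write-wins: the newest timestamp decides, not a majority."""
    value, _ts = max(replies, key=lambda reply: reply[1])
    return value

latest = reconcile(replies)
```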
A
We've got this scalability, so that if we get successful, then we're able to scale without having to massively re-architect things, and that makes it a really good choice for Internet of Things. Okay, so now Cassandra is really positioned as being the place where the data is, but like I alluded to, Cassandra doesn't do everything. It's great, it's awesome, we love it, but it does come up short, by design, on doing certain things.
A
They're
primarily
lookups,
like
I
mentioned,
so
we
talked
about
needing
a
partition
key
and
sometimes
the
predicates
aren't
based
on
the
partition
key
and
that's
another
challenge
that
we
hit
with
Cassandra
and
and
essentially
Cassandra
is
not
designed
to
be
this
full
table
scan
kind
of
place.
So
we
need
to
have
something
else,
helping
us
out
here
all
right.
So
we
talked
about
the
peanutbutter.
Let's
talk
about
the
chocolate
so
SPARC
distributed
computing
frameworks,
then
let's
say
wengie
a
version
1.1
GA
last
spring:
it's
got
this
generalized
execution
model
can
make
reuse
of
data.
A
That
is
already
pre-calculated.
People
talk
about
SPARC
being
an
in-memory
system,
it's
actually
more
than
that.
It
actually
does
work
off
of
disk,
but
I
think
the
thing
that's
really
important
to
keep
in
mind
is
that
spark
is,
are
really
good
at
reusing
memory
and
very
efficient
there,
and
so
that
pays
a
lot
of
dividends
for
folks
and
then
there's
this
easy
abstraction.
This
thing
called
data
frames
and
a
whole
bunch
of
other
tools
that
are
built
on
top
of
this.
A
So
if
we
look
at
the
SPARC
eco
sister
or
the
SPARC
sorry
the
product,
it's
got
a
number
of
elements
that
are
all
integrated
here
in
one
place,
we've
got
a
basic
core
and
then
we've
got
a
sequel
engine,
a
streaming
engine
and
then
there's
a
few
other
pieces
that
I
won't
talk
as
much
around
machine
learning,
graph
analytics
and
in
our
and
statistics.
So
this
is
the
general
high-level
view
of
the
stack
of
spark.
A
Now
spark
is
a
relatively
simple
architecture
to
understand:
there's
only
two
roles
of
the
nodes
in
the
cluster:
there
is
a
master
which
we've
talked
about
a
little
bit
before
and
and
some
workers
we've
got
this
like
I
mentioned
efficient
memory.
Caching
and
this
generalized
model,
which
is
really
nice,
and
this
abstraction
called
data
frames.
Now,
if
I,
given
this
talk
about
six
months
ago,
we
would
be
talking
about
resilient,
distributed
data
sets
or
rdd's.
A
lot
of
similar
things
can
be
said
about
them.
A
Data
frames
are
the
sort
of
Noori
envisioned
version
that
actually
gives
a
lot
more
capability,
but
conceptually
speaking,
they
share
a
lot
with
the
concept
of
an
RDD,
so
the
SPARC
master
he's
the
one
who
used
submit
jobs
to,
and
he
assigns
resources
to
a
particular
job.
The
SPARC
worker
is
the
one
who
now
has
tasks
that
he
needs
to
do,
and
so
he
assigns
work
to
the
local
executors
running
on
this
machine,
and
then
the
executor
is
actually
the
little
bit
of
code.
A
Now we'll talk about sort of how the two pieces fit together. The way that Cassandra fits in with Spark is on the bottom: we can put a Cassandra source, or a Cassandra database, underneath the Spark core engine, and we've got this piece called the DataStax Spark Cassandra Connector. That is basically the interface code between the Cassandra database and the Spark engine. That's an open-source technology, or package, that we've got out there.
A
We
contributed
out
there's
others
who
are
contributing,
but
data
facts
is
doing
the
lion's
share
of
that
work.
And
what
that
does.
Is
it
surfaces,
Kassandra
tables
up
to
the
spark
engine
as
a
data
frame
and
by
spite
by
integrating
it
at
the
bottom,
by
integrating
it
at
that
lower
level?
We're
able
to
take
all
the
elements
above
it,
the
sequel
engine,
the
streaming
engine,
the
machine
learning
engine
and
we're
able
to
just
reap
the
benefits
of
that
from
this
data
from
this
concept
and
this
this
data
structure
of
the
data
frame.
A
If
we
dig
in
a
little
bit
and
take
a
look
at
how?
What's
really
going
on
under
the
covers,
I
mentioned
this
spark
executor,
so
the
struct
executor
is
the
workhorse
he's
doing
the
processing
of
the
data
that
is
sitting
in
the
in
the
data
set
or
the
data
frame
he's
taking
it
piece
by
piece,
he's
working
it
through
the
the
dag
of
execution
and
and
processing
it.
A
So
the
way
that
it
is
inside
of
that
executor,
we
need
to
take
a
look
at
how
he's
grabbing
data
from
Cassandra,
so
each
one
of
those
executors
has
a
connection
to
the
to
the
Cassandra
database,
but
he's
making
through
the
through
the
java
driver.
So
data
stacks
and
other
thing,
data
stacks
does
is
builds
a
whole
bunch
of
the
open
source
drivers
for
Cassandra.
This
one
happens
to
use
the
Java
driver,
but
there's
a
number
of
other
drivers
available,
and
he
makes
a
connection
to
the
cluster
makes
it.
A
Each
of
these
executors
will
then
pull
across
a
part
of
the
data
frame
or
part
of
the
full
data
set
and
bring
it
across,
and
so
what
we
do
is
we
split
these
up,
rdd's
and
and
and
data
frames
each.
They
split
up
the
full
token
range
into
pieces
and
a
spark
executor
will
operate
on
all
of
the
data
for
a
subset
of
the
tokens.
So
the
just
in
this
picture,
the
first,
the
first
slice
might
be
tokens
one
to
a
thousand.
A
The
second
one
could
be
a
thousand
one
to
two
thousand
etc,
and
we
work
our
way
through
the
entire.
The
entire
data
set.
That
way
we
can
work
on
these
in
parallel.
So,
conceptually
speaking,
we
we
sort
of
have
this
situation
where
each
spark
node
these
orange
nodes
on
the
large
marbles
on
the
left
are
each
making
a
connection
to
a
sister,
Cassandra
node
on
the
right.
This
is
a
simplified
version.
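A toy sketch of that partitioning scheme (the range bounds and slice size are made up): split the full token range into contiguous slices and hand each slice to an executor to scan in parallel.

```python
def split_token_range(start, end, slice_size):
    """Yield contiguous (lo, hi) slices covering [start, end] inclusive."""
    lo = start
    while lo <= end:
        hi = min(lo + slice_size - 1, end)
        yield (lo, hi)
        lo = hi + 1

# As in the talk's picture: the first slice is tokens 1..1000,
# the second 1001..2000, and so on; each goes to one executor.
slices = list(split_token_range(1, 5000, 1000))
```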
A
There's
really
no
reason
why
the
spark
cluster
and
the
Cassandra
cluster
have
to
be
the
same
size,
but
for
simplicity,
we'll
just
display
them
this
way,
and
so
they
buddy
up
and
so
the
top
spark
node
there
might
work
with
the
top
Cassandra
node
and
they're
going
to
be
reading
the
data
that
is
owned
owned
in
that
space.
Now,
once
you
see
this
picture,
you
sort
of
rapidly
come
to.
Why
are
there
two
different
clusters?
A
Why
don't
we
co-locate
the
spark
process
and
the
Cassandra
process
on
the
same
node
and
get
these
sort
of
hybrid
nodes
will
run
the
Cassandra
and
the
spark
on
the
same
thing
that
allows
for
us
to
do
local
reads
and
writes,
and
we
can
skip
some
of
the
network
performance,
Network
costs
of
doing
spark
here
and
and
save
that.
So
we
end
up
with
this
sort
of
spark,
Cassandra
hybrid.
A
So
now
that
we've
done
that,
let's
take
a
look
at
things
that
we
can
do
now
with
spark
that
we
couldn't
do
before
with
Cassandra.
So
the
first
one
is
joined.
So
this
is
a
silly
little
example
where
maybe
I
have
some
metadata
about
each
of
my
sensors,
that
is
in
this
table
called
metadata
and
I
have
another
set
of
data
that
actually
has
the
temperature
readings
and
I
might
want
to
know,
say,
tell
me
what
the
temperatures
are
or
last.
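The talk's actual example uses Spark dataframes; as a conceptual stand-in (the table and column names are invented), the join pairs each reading with its sensor's metadata by key:

```python
# Conceptual stand-in for a Spark join between two Cassandra tables.
metadata = [
    {"sensor_id": 1, "location": "kitchen"},
    {"sensor_id": 2, "location": "garage"},
]
readings = [
    {"sensor_id": 1, "temp_f": 73},
    {"sensor_id": 2, "temp_f": 68},
    {"sensor_id": 1, "temp_f": 75},
]

def join_on_sensor(metadata, readings):
    """Inner join on sensor_id, like DataFrame.join in Spark."""
    by_id = {row["sensor_id"]: row for row in metadata}
    return [{**by_id[r["sensor_id"]], **r}
            for r in readings if r["sensor_id"] in by_id]

joined = join_on_sensor(metadata, readings)
# Each reading row now also carries its sensor's location.
```

The point is that this is exactly the operation Cassandra alone refuses to do; Spark supplies the join on top of the Cassandra tables.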
A
The second example we've got is aggregates. Now I might be interested in the yearly and monthly average, or, I'm sorry, maximum temperature for each of my sensors. So this is actually going to do that kind of full table scan, grab the data, process it, and give us this summary table, sort of an OLAP-style query.
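A conceptual sketch of that max-per-sensor-per-month aggregate (the field names and values are invented), standing in for a Spark groupBy with a max:

```python
from collections import defaultdict

# Stand-in for a Spark groupBy().max() over temperature readings.
readings = [
    {"sensor_id": 1, "year": 2015, "month": 7, "temp_f": 73},
    {"sensor_id": 1, "year": 2015, "month": 7, "temp_f": 81},
    {"sensor_id": 2, "year": 2015, "month": 7, "temp_f": 68},
]

def monthly_max(readings):
    """Maximum temperature per (sensor, year, month) group."""
    groups = defaultdict(list)
    for r in readings:
        groups[(r["sensor_id"], r["year"], r["month"])].append(r["temp_f"])
    return {key: max(temps) for key, temps in groups.items()}

summary = monthly_max(readings)
```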
Now, one of the other pieces in the Spark ecosystem: there's a lot of use of distributed file systems, like HDFS, and they may be external to the cluster.
A
That
we've
got
well
they're,
certainly
external
to
Cassandra,
which
does
not
have
an
HDFS
in
it,
and
so
we
might
want
to
do
a
join
between
the
two.
So
we
have
our
HDFS
data
is
say,
the
data
of
temperatures
from
2014,
maybe
they're
stored
in
CSV
ES,
or
something
along
those
lines
and
we'd
like
to
join
it
with
the
data
that
we
have
the
hot
data
that
we've
got
in
Cassandra
the
operational
data-
and
maybe
this
is
a
query.
This
example
is
trying
to
say
in
this
first
line.
A
In the first line, you define the HDFS dataset; in the second line, we are defining the Cassandra dataset; and in the third line here, we're joining the two together and then doing a little filter to find out which sensors are hotter this year in this month than last year in this month. Simple, but I just wanted to give an example of sort of a year-over-year kind of query. And then the last example is super simple: this one's just saying, hey, what if I wanted to restrict my query based on something that's not the partition key?
A
In
other
words,
I'd
like
to
know
every
time
there
was
a
temperature
reading
above
a
hundred
degrees
and
and
and
that's
just
over
the
entire
data
set
and
again
SPARC
is
well-suited
to
scrub
through
all
of
that
data,
because
we're
integrating
in
with
SPARC,
we
can
latch
into
the
entire
SPARC
ecosystem
of
tools.
One
of
the
pieces
there
is
through
ODBC
and
JDBC,
because
SPARC
sequel
has
that
interface,
and
so
we
can
tap
then
into
things
like
tableau
and
Pentaho
and
I
put
R
here.
A
There
is
a
spark
our
capability,
but
there
is
also
our
ODBC
in
our
JDBC
and
you
could
leverage
those
just
for
getting
an
extract
of
data,
and
then
there
are
some
notebook
style
tools
like
Apache
Zeppelin.
That's
that's
an
incubating
project
in
Apache
that
we
could
also
tap
into
so.
The
spark
ecosystem
enables
a
number
of
tools
that
are.
In
addition
to
the
tools
that
we've
got
in
the
in
the
Cassandra
ecosystem,
so
I'll
do
a
quick
word
on
spark
streaming
I'm.
A
Conscious
of
the
fact
we
got
a
late
start
here,
so
this
will
be
relatively
quick.
Spark
streaming
is
one
of
the
big
things
that
we're
seeing
a
lot
of
folks
use
with
Cassandra.
The
idea
here
is
that
you've
got
a
number
of
data
coming
in
from
the
ether
in
and
they're
going
to
come
into
some
custom
into
a
receiver,
and
so
we
could
think
of
this.
A
...Cassandra. I'll skip going into the details here, but there are some relatively basic commands that you would need to run to set this up, and it's relatively straightforward: setting up what the receiver is, setting up what you're going to do on each window, and then telling it to keep running until you decide to stop.
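A toy sketch of that streaming pattern (the window size and readings are invented): a receiver feeds readings in, and you process them one fixed-size micro-batch, or window, at a time.

```python
def windows(stream, window_size):
    """Group an incoming stream into fixed-size micro-batches,
    like the per-window processing in Spark Streaming."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == window_size:
            yield batch
            batch = []
    if batch:        # flush the final partial window
        yield batch

incoming = [71, 72, 75, 74, 73]        # readings arriving from the ether
per_window_max = [max(w) for w in windows(incoming, 2)]
```

In real Spark Streaming the windows are time-based rather than count-based, and each windowed result would typically be written straight back into a Cassandra table.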
Just a quick plug on DataStax Enterprise: one of the things DataStax Enterprise delivers is Cassandra at its core, but we have these other capabilities.
A
We've
got
a
this
SPARC
integration
built
into
the
into
that
platform,
as
well
as
our
integration,
with
Apache
Solr
on
the
on
the
search
side
of
things.
So
now,
if
we
come
back
to
our
Internet
of
Things
use
case,
we
can
see
we're
SPARC
and
Cassandra,
really
feuer,
where
Cassandra
fits
us
in
this
system
of
being
able
to
receive
all
this
data
and
scale
for
all
the
toasters,
we're
going
to
add
to
our
system
and
how
SPARC
really
helps
us
address.
C
Again, as Brian mentioned, I'll keep this very brief; we got started a little bit late, so we'll go through this quickly. We're going to cover the DataStax implementation methodology. KPI is a partner of DataStax, and as a partner will also be at the Cassandra Summit; hopefully all of you can join us there, September 22nd through 24th. KPI has done quite a few of the DataStax implementations for both retail and financial services customers throughout the US.
C
So
here's
an
overview
of
the
methodology
we're
going
to
cover
we'll
keep
it
at
a
high
level,
not
get
too
technical
just
to
cover
these
things.
To
ensure
the
success
of
implementations
that
we've
done
in
the
past
so
initially
to
get
started.
We
have
we
referred
to
as
a
requirements
phase
and
here
a
high-level
we're
going
to
do
our
use
case
requirements
for
data
modeling,
which
is
very
important.
C
Security
and
encryption
requirements,
service
level
agreements,
SL
A's
operational
requirements
to
allow
us
to
monitor
and
manage
a
system,
and
then
the
searching
and
analytics
requirements
as
well.
Some
of
the
key
points
here
is
and
I
think
Brian
said
this
earlier.
Get
the
data
model
right
is
going
to
be
key
to
our
success
here
and
leverage
whatever
you
can.
C
So,
if
there's
an
existing
database
where
you
can
get
access
to
the
query
logs
go
and
do
that
that's
going
to
help
you
define
your
data
model,
you
want
to
be
able
to
define
specific,
create,
read,
update
and
delete
requirements.
Those
are
essential
for
the
requirements
phase.
Then
also
security
is
important
both
from
an
authentication
perspective
and
an
authorization
perspective,
and
you
can
see
the
areas
that
we
cover
their
encryption.
C
SLAs are a must-have and are highly recommended, even if you just define them for your own project. The lack of SLAs has led to a lot of project failures, and this always comes up during our implementation process; we work with the customer to help define those. We have to understand we're building a mission-critical system.
C
So
we
have
to
make
sure
we
define
the
operational
monitoring
and
the
management
of
the
system
early
on
in
the
process
and
to
make
sure
we
build
those
those
requirements
and
then,
as
we
talked
earlier,
data
stacks
search,
we're
going
to
be
defining
our
requirements
here.
But
ultimately
we
need
to
determine
the
fields
that
will
be
searched
on
and
returned
in
the
searching
process,
and
you
can
see
some
examples.
We
give
you
as
well.
C
Data
analytics
has
requirements
as
well
they're
important
to
capture
at
this
time.
The
key
ones
that
we
see
out
there
that
need
to
be
incorporated
are
statistical,
algorithms
required
data
sources,
the
data
movement
and
modifications,
security
and
access,
and
then
there's
the
other
analytical
requirements,
and
we
just
have
to
make
sure
we
have
enough
detail
to
perform
a
good
design.
C
Then there's our replication strategy, our table design, and you can see the components within that that are necessary for us to put this design together. And then, again, any relationship between tables needs to be noted here; joining is not technically feasible within Cassandra itself, but, as Brian mentioned in his demonstration, that can be overcome and accomplished. So again, we want to identify this stuff early in the design process and incorporate it here.
C
Next is application development, where the projects leverage frameworks to encapsulate, abstract, and represent database components as application objects; we're doing that a little bit differently here, and designing the application against DSE as much as possible up front will help in the overall application development and functionality component. And then the last piece is the data movement design: the batch and real-time data integration between the systems, ETL, change data capture, data pipelines. These are all the essential things in the design phase for data movement in our operational design.
C
We
want
to
do
as
much
tooling
and
techniques
as
possible,
so
we
want
to
deploy
new
nodes
configure
and
upgrade
nodes
in
the
cluster.
We
will
be
able
to
back
up
and
restore
operations.
We
want
to
know
how
to
do
cluster
monitoring,
ops,
Center,
use,
repairs,
alerting,
disaster
management
processes.
We
want
to
have
all
those
things
in
place
and
KP.
I
highly
recommend
putting
a
playbook
in
place
to
manage
the
operational
design
process.
C
Here's
some
of
the
search
design
stuff
we
talked
about
earlier
search,
searchable
terms,
returned
items,
tokenizer,
x'
filters,
the
multi
document
search
terms.
These
are
all
things
we
identify
and
incorporate
into
our
design
phase,
and
then
you
can
see
the
analytics
component
as
well.
We
do
do
a
design
phase
for
analytics
this
early
in
the
process.
C
When you're doing application development, based on your organization, you can use an agile or waterfall methodology; all work well with this process. Here we want to cover the deployment and configuration of the management mechanisms. It's key in a distributed system like this to automate as much as possible, so try to leverage OpsCenter, Docker, Vagrant, Chef, Puppet, all those types of components. And then, obviously, unit testing is more complex with a distributed system compared to a single-node system; be prepared for that.
C
We're
going
to
be
looking
for
specific
defects
such
as
race
conditions,
that's
those
can
only
be
observed
when
we
at
production
scale
or
somewhat
a
scale
of
the
actual
system.
So
we
always
recommend
unit
testing
should
be
executed
over
a
small
cluster,
but
it
contains
more
than
a
single
node
and
then
there's
other
things
that
you
can
use
to
automate
some
of
your
testing
and
launching
things.
Those
are
always
recommended
as
well.
C
It's critical to enable the project team to identify actual issues prior to going to production at scale. In financial services it's very common that the testing environment being used is an exact clone of the production environment, for that reason alone: they want to identify as much up front as possible.
C
Here's
a
good
example
of
an
operational
readiness
checklist
and
you
can
kind
of
bullet
through
these.
Most
people
need
to
be
expecting
this
type
of
stuff.
These
are
things
you
need
to
be
ready
to
do
and
cute
prior
to
going
to
production
that
way
when
you're
in
production,
you
feel
comfortable
doing
it
and
everything's
in
place
to
to
execute
on
our
last
up.
Here's
the
scale
and
enhancements-
and
these
are
to
highlight
the
normal
operational
mode
of
an
application
built
on
data
stacks
up
right.
C
So
we're
always
looking
for
how
do
we
scale
this,
and
how
can
we
enhance
this?
Whether
it's
tuning
performance,
more
features
more
functions?
These
are
the
things
they're
always
looking
for
in
this
face
here
and
then
you'll
always
have
to
prepare
for
all
the
eventualities.
Things
are
going
to
happen
and
you
can
address
this
by
adding
nodes
to
expand
capacity
to
the
system
when
it's
needed.
C
These
are
all
options
you
have
with
the
DAC
product
and
in
capabilities
that
you
have
to
plan
to
take
advantage
of
as
you
grow
and
also
the
final
thing
is
to
scale
with
data
stacks.
The
Enterprise
Solution
is
a
nice
to
have
and
it's
what
the
products
for
that's
what
it's
known
for
and
its
dominance
out
there
in
the
field
today.
These
are
just
put
some
a
reference
architecture.
Examples
we
put
in
place.
Brian
gave
some
great
examples
either
some
other
ones
to
look
at
that
are
commonly
used
in
the
field
today.
C
Quickly,
our
prepare
it.
The
key
things
we
want
to
push
today
is
one
of
the
things
we've
been
very
successful
with
a
lot
of
customers
appreciate
is
scheduling
a
Lunch
and
Learn
reach
out
to
us.
Let
us
know
you're
interested
in
what
we
do
is
a
KPI,
we'll
team
up,
Athena
stacks
will
come
out
to
your
location
and
it's
clearly
an
educational
two
hour
session,
based
on
your
knowledge
and
what
experience
you
have
a
Cassandra,
it
can
be.
You
know,
introduction
all
we
up
to
some
more
advanced
data,
modeling
or
performance
tuning,
if
necessary.
C
My
contact
information
is
included
here,
feel
free
to
reach
out
to
me
at
any
time
and
we'll
get
back
to
promptly.
The
other
thing
I
want
to
remind
everybody
of
is
the
Cassandra
summit.
It's
in
September
kpi
will
be
in
booth
one
one
one
should
be
easy
to
find
and
there's
a
piece
right
there.
So
at
this
point,
I
am
going
to
change
back,
are
actually
Devon.
If
you
could
change
back
Brian
to
the
presenter,
and
we
can
take
questions
at
this
time.
I
would
appreciate
that.
Thank
you.
A
Sure, so the question was about OpsCenter. DataStax produces an OpsCenter tool; that's our visual monitoring tool. You can use it on an open-source Cassandra cluster, but with some limited capability there; with DSE, DataStax Enterprise itself, it allows visibility into various metrics, going from the sort of Cassandra level, understanding things like read requests and write requests and seeing them spiking.
A
You can get down to things like the Java virtual machine and understand how it's doing, and get all the way down into lower-level things like disk access and, you know, operating-system sorts of metrics. You can do some charting for that to see the trends as things are going on. Again, this is a distributed system, so sometimes it's good to see an aggregated view, and we provide that, or you can do an individual view.
A
You know, if you have 100 nodes in the cluster, the individual view could be a little bit busy, but it certainly can help to dive in at that level. That's on the monitoring side. On the management side, there's some alerting that you can do; that's a relatively standard kind of thing to have in a monitoring tool. We can also do some of the common housekeeping kinds of things, to deal with backups and restores.
A
We can do things with some of the anti-entropy, so there are some maintenance pieces that we would do in the database in the regular care and feeding, upgrades, etc. And then, lastly, there are a number of what we call services that run through the OpsCenter tool, that do things like evaluate whether or not you are subscribing to some of the best practices in your configurations, and sort of just alert you.
A
Sometimes there are plenty of good reasons not to follow a best practice, and so we just want to make you aware that you are doing something non-standard. That's great if that works for you, and it's not great if you didn't know that you were doing it by accident. And there are a number of other services, a performance service, etc., where we can kind of dive a level down. So OpsCenter is our visual monitoring tool. I did sort of skip back, or skip relatively quickly, through the DataStax slides, this DataStax slide here.
A
I think the answer there, unfortunately, kind of fits into that "how long is a piece of string" kind of question. It really depends on what you're doing in your Spark job. With Cassandra we can actually talk a little bit more concretely about the sizes and the sort of prerequisites, or the recommended configurations. Spark jobs can go from relatively simple to extremely complicated, and so I think it really matters.
A
The basic sort of configuration we talked about for Cassandra is, I believe, four cores and 32 gigs of RAM. I would prefer to double that, but it's certainly not a requirement, and there are a number of cloud instances that we talk about and recommend on Amazon and Google and Azure and all the rest of them; you certainly can go out there and do that. When you bring Spark into the mix, a few more cores and some more RAM really does start to help the picture.
A
If you're doing analytics that's going to end up building big models and doing a lot of number crunching, that can put a real stress on both the computation and the RAM, and so it sort of floats. The general rule of thumb would probably be more like six to eight cores, and probably drifting up to 48 or 64 gigs of RAM, is sort of where you want to be.
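The rule of thumb above can be captured in a tiny sketch. This is purely illustrative, not an official sizing guide: the numbers are the figures mentioned in the talk, and `suggest_node_spec` is a hypothetical helper, not part of any DataStax tooling.

```python
# A rough rule-of-thumb sketch (illustrative only, not an official guide):
# the talk suggests ~4 cores / 32 GB RAM as a baseline for Cassandra alone,
# drifting toward 6-8 cores / 48-64 GB when Spark analytics run alongside.

def suggest_node_spec(runs_spark: bool) -> dict:
    """Return an illustrative per-node hardware starting point."""
    if runs_spark:
        # Spark executors compete with Cassandra for CPU and heap,
        # so budget extra cores and RAM for the analytics workload.
        return {"cores": 8, "ram_gb": 64}
    return {"cores": 4, "ram_gb": 32}

print(suggest_node_spec(runs_spark=False))  # baseline Cassandra node
print(suggest_node_spec(runs_spark=True))   # Cassandra + Spark node
```

As the speaker says next, treat any such number as a starting guess to validate against the real workload.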
A
But, you know, I think the best sort of approach to this is really to make a good guess and then actually just try it out and see if the workload works. One of the things in terms of sizing, and I'll take this opportunity with Cassandra to talk a little bit about cluster sizing for just a quick second, is that we don't always do it by data size. A lot of times we do the sizing of the cluster by your query SLAs.
So, are you trying to get high concurrency? Say you want, you know, 50,000 clients all asking questions at the same time; that's going to change the cluster compared to if you only need 5,000. And similarly, are you trying to get very low latencies? You know, "I need to get under 100 milliseconds to do my write," versus "I'm okay having a one-second SLA." (I'm sorry, what? Yeah.) Those are the other sorts of things that we take into account when we're coming up with the cluster.
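The SLA-driven sizing idea above can be sketched numerically. Everything here is an assumption for illustration: the per-node throughput figure is a made-up placeholder, not a Cassandra benchmark, and `nodes_for_concurrency` is a hypothetical helper; real sizing comes from load-testing against your actual queries.

```python
# Illustrative sketch only: sizing a cluster from a concurrency SLA rather
# than from data volume. The per-node QPS figure is an assumed placeholder.
import math

def nodes_for_concurrency(target_qps: int,
                          per_node_qps: int = 5_000,
                          replication_factor: int = 3) -> int:
    """Estimate node count to sustain a target query rate."""
    # Never shrink below the replication factor: we need at least
    # that many nodes to hold RF distinct copies of each row.
    return max(replication_factor, math.ceil(target_qps / per_node_qps))

print(nodes_for_concurrency(50_000))  # high-concurrency SLA -> 10 nodes
print(nodes_for_concurrency(5_000))   # modest SLA -> floor of RF = 3 nodes
```

The same shape of calculation applies to latency SLAs: tighter latency targets usually mean more headroom per node, i.e. a lower assumed `per_node_qps`.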
B
A
In terms of how long it can be disconnected: there are sort of a couple of things mixed into that. The architecture is designed to tolerate the fact that that node is not there relatively indefinitely. Now, while it's not there, if we go back to our replication-factor story of data being replicated around, a couple of things are happening. First of all, I'm only storing some of the data in only two places, and so I'll start to worry about things.
A
If I lose another node, now I only have it in one place, and my fault tolerance is starting to degrade a bit. So I do want to address this. If I know, for instance, that it's going to be a while, I may need to rebalance; in other words, give that token range that's now missing to one of the active nodes. So if I had, say, ten nodes and one went down, I could give that range to one of the other nine. I don't need additional hardware necessarily, but I do need to plan for it.
A
You know that it's going to be taking a little bit more strain on each of the remaining nine. I could also spin up another node to bring it back to ten and do it that way after a little while.
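The arithmetic behind the "ten nodes, one down" example is simple, and a quick sketch makes the extra strain concrete (this assumes an even token distribution; `share_per_node` is a hypothetical helper, not a Cassandra API):

```python
# Quick arithmetic for the "ten nodes, one down" example: when the dead
# node's token range is handed to the survivors, each remaining node
# owns a proportionally larger share of the ring.

def share_per_node(total_nodes: int, failed_nodes: int = 0) -> float:
    """Fraction of the token ring each surviving node owns (even distribution)."""
    survivors = total_nodes - failed_nodes
    return 1.0 / survivors

before = share_per_node(10)                  # 10% of the ring each
after = share_per_node(10, failed_nodes=1)   # ~11.1% each
print(f"extra load per survivor: {after / before - 1:.1%}")  # → 11.1%
```

So redistributing one node out of ten costs each survivor roughly an extra ninth of its previous load, which is why no new hardware is strictly required.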
So, we get network hiccups all the time; that's part of the world we live in, and so the architecture is fine with losing a node for, like, a second or a couple of seconds, which is, you know, a normal network-hiccup thing, and what we do is the coordinator will just store a little hint.
A
That's what we call them: hints. It stores basically what it would have told that node had it been up, and we'll hold onto that for a little while. There's actually a configuration for how long that will be; it's usually a few minutes, or maybe tens of minutes, and then at that point it'll say, look, I'm just not going to bother, because it's just piling up and it's going to be too much of a drain, and we'll deal with that node when it does come back online.
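For reference, the window the speaker describes is a setting in cassandra.yaml. A sketch of the relevant line is below; treat it as an illustration, since the exact name and default vary by Cassandra version (many releases default to three hours rather than the minutes mentioned here):

```yaml
# cassandra.yaml (illustrative excerpt): how long a coordinator keeps
# accumulating hints for a down node before giving up. Past this window,
# the returning node must be caught up with a repair instead of hints.
max_hint_window_in_ms: 10800000   # 3 hours, a common default
```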
A
Now, after, let's say, there was a machine where the power supply blew up, and so the node died for two days: when you bring it back, there's an operation that we call repair. It's not quite that it was broken; it's more of what we call anti-entropy. I need to get this node to have the latest data from the other replicas.
A
The other replicas can stream the data across to this new node, or the node that's rejoining, and then it comes up to par; and once it's up to sort of the current state, then it's able to join and take care of things. And so that operation of bringing a node back in, if it's been offline for a while, actually looks a lot like bringing a node in cold and having it join entirely fresh. So in terms of tolerating it, how it goes is you can handle it for a long time.
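The anti-entropy idea can be sketched as a digest comparison. Real Cassandra repair builds Merkle trees over token ranges; this toy version, with hypothetical helpers `digest` and `ranges_to_stream`, just hashes whole ranges to show why replicas can agree cheaply and stream only the data that differs.

```python
# Toy sketch of anti-entropy: replicas compare cheap digests per range and
# stream only the ranges that disagree. (Cassandra itself uses Merkle trees
# per token range; this simplification hashes each range wholesale.)
import hashlib

def digest(rows: dict) -> str:
    """Hash a range's rows so replicas can compare them without sending data."""
    payload = repr(sorted(rows.items())).encode()
    return hashlib.sha256(payload).hexdigest()

def ranges_to_stream(healthy: dict, returning: dict) -> list:
    """Ranges whose digests disagree must be streamed to the returning node."""
    return [r for r in healthy
            if digest(healthy[r]) != digest(returning.get(r, {}))]

healthy = {"range-1": {"k1": "v1"}, "range-2": {"k2": "v2"}}
returning = {"range-1": {"k1": "v1"}, "range-2": {}}  # missed writes while down
print(ranges_to_stream(healthy, returning))  # → ['range-2']
```

Only `range-2` needs to move, which is what makes repair after a long outage tractable: matching ranges cost one digest exchange, not a data transfer.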
A
So, sure, there are a couple of things you can do with the commit log. First of all, you can actually set up how much commit log space you allow. We didn't cover this in the talk, so just to bring other people up to speed: when each replica goes to do a write, what it does first is write to a little space on the disk called the commit log, which is how it knows that the write has been recorded somewhere.
A
Then it will go through the rest of the path: it stores it in memory for a while and has an in-memory table, so it's actually a much cheaper operation to query data that has most recently been active. And then, after a while, that will grow to a size where it needs to be flushed to disk, and that's when we start getting these things called SSTables; those are the on-disk version.
A
And so, when you do a read, you kind of combine what's on disk with what's in memory. If the machine were to somehow just go down, say you kicked the plug out, then when it starts back up it can go to the commit log and replay everything that's in the commit log to get itself back up to the current state, and then it sort of brings itself online and says, now I'm ready to talk to people. So the commit log is this safety mechanism, and it can grow. One of the things you can do is set a limit on it.
A
In other words, there was that in-memory table I was talking about, and eventually it will overflow and go on to disk as SSTables. We don't want to remove the only disk copy we have until we make sure that we actually have the SSTable disk copy. Once that's done, we kind of mark those commit log segments as: go ahead, get rid of it.
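The write path just described, commit log first, then memtable, then an SSTable flush that frees the commit log, can be mirrored in a miniature sketch. `ToyNode` is a hypothetical illustration of the ordering guarantees only; real Cassandra is far more involved (segmented logs, compaction, bloom filters, and so on).

```python
# Miniature sketch of the write path: append to a commit log first, buffer
# in a memtable, flush to an "SSTable" when the memtable fills, and only
# then discard the covered commit-log entries. Illustrative only.

class ToyNode:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # durable record of every write
        self.memtable = {}     # recent writes, cheap to query
        self.sstables = []     # immutable "on-disk" tables
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(dict(self.memtable))  # new SSTable on disk
        self.memtable.clear()
        self.commit_log.clear()  # safe: the data now lives in an SSTable

    def read(self, key):
        # A read combines the memtable with the SSTables, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

node = ToyNode()
node.write("a", 1)
node.write("b", 2)   # fills the memtable and triggers a flush
node.write("a", 3)   # newer value lives in the memtable
print(node.read("a"), node.read("b"))  # → 3 2
```

Notice that after the flush the commit log is empty again: replaying it on restart only ever needs to cover writes that have not yet reached an SSTable, which is exactly the recovery story described above.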
A
But it's clearly a question from somebody who's done some operational work with Cassandra. It's a good question; it's just relatively targeted to the operational use. It's really good.
B
Right, perfect. All right, any of the questions that we didn't get to, we'll try to get to in the blog post. So thanks again, everyone, for joining. We will be sending out a recording of this video, along with the slide decks, within the next 48 hours. Sorry for the technical difficulties, and thanks again for coming. Thank you.