Description
Speaker: Robert Strickland, Director of Software Engineering
We hear a lot about lambda architectures and how Cassandra and Spark can help us crunch our data both in batch and real-time. After a year in the trenches, I'll share how we at The Weather Company built a general purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain learning how to get it right, this talk is for you.
Alright, it's eleven after, so I'll go ahead and get started. I appreciate you all joining me today; there have been a lot of great talks today, and I hope I can live up to the epic label for this track. I have my doubts, we'll see. As you can see, this is the one and only impressive-looking slide in my deck. I'm going to talk about lambda architecture, specifically our experience at The Weather Company over the last year as we struggled our way through building and scaling our analytics platform built on Cassandra and Spark.
If you've heard me talk in the past, you know that I typically take a deep dive into some specific technology or try to simplify a complex subject. This talk is a little bit different; I could subtitle it "How to Avoid All of the Land Mines We Stepped On."
So, a little background for those who don't know me: I lead the analytics team at The Weather Company, based in the beautiful city of Atlanta. Yes, more Atlanta people; any other Atlanta people? A few. The few, the proud. I'm responsible for our data warehouse and our analytics platform, as well as the team of engineers who get to work on cool analytics projects on massive and varied data sets. I'll talk more on that in a minute.
So why am I qualified to talk about this? I've been around the community for quite a while, since 2010 and Cassandra 0.5 to be exact, and I've worked on a variety of Cassandra-related open source projects. Basically, if there's a way to screw things up with Cassandra, I have done it. If you're interested in learning more about that, you can pick up a copy of my book, Cassandra High Availability, where there's some of that. And if you're ever in the Atlanta area, I'd love for you to come join us at the Atlanta Cassandra Users Group.
So, as I said, I work for The Weather Company, which is the caretaker of a number of well-known brands such as The Weather Channel, weather.com, Weather Underground, and Intellicast, among others. At heart we are a data company, and in fact we serve around 30 billion API requests daily, providing weather data to the likes of Google, Apple, Facebook, Yahoo, Samsung, and other large customers. By way of comparison, that 30 billion number is about 60 times the number of tweets tweeted daily.
We have 120 million active mobile users, plus many more on weather.com and our other digital properties, and that gives us the third most active mobile user base. In fact, we have the number one weather app on all platforms, and in some cases the number two as well, all of which adds up to a lot of data: 360 petabytes of traffic runs through our systems daily. That's a very, very large number; it's actually 10 and three-quarters exabytes a month. Try writing that number on the board, if you can even figure out what that number is. At the end of the day, most of the weather data you consume on a daily basis actually originates from us.
For better or for worse, sometimes. So now to our use case. As you can imagine, all those API calls, page views, mobile sessions, and weather observations add up to a lot of secondary data. My team's objective is to extract meaning from that data using repeatable processes and tools that allow collaboration and shorten the time from discovery to insight.
Since we have many data scientists and analysts at Weather who are embedded in the various business units, we had to be able to support self-serve data science and exploration. This, frankly, was one of the more challenging aspects of our mandate. And lastly, we needed to support enterprise reporting and business-user self-service using off-the-shelf BI tools.
In our first architecture, events came in through a RESTful service layer and onto Kafka. These would be dequeued either by Spark Streaming or by a custom ingestion pipeline that we built, and events would land largely unchanged in Cassandra, where we would do batch analysis via Spark. We would use Spark SQL's ODBC support to serve data to various end customers, and for batch sources we would repurpose our existing Informatica team to load the data into Cassandra.
So let's take a closer look at the data model we used for the system. As I mentioned, all events landed in the same table. It's a typical time-series model, and you'll notice that most of the fields are just metadata. Can you guys see that? Okay, all right. The data we really care about is stored as schemaless JSON in the event data field, and our partitioning is also typical for a Cassandra time-series model: we store the data in one-hour time buckets, further partitioned by event type for even distribution across the cluster.
We also have timestamp as a clustering column to get events written in time order. And lastly, we decided to use the new date-tiered compaction strategy, since it seemed to be the right choice for time series. Since it was new at the time, we had little operational experience with it, which turned out to be a significant issue later.
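To make the shape of that take-one model concrete, here is a minimal sketch of what such a single events table looks like in CQL, issued through the DataStax Python driver. The keyspace, table, and column names are illustrative, not the exact production schema.

```python
# Minimal sketch of the "take one" single-events-table model.
# Keyspace, table, and column names are illustrative, not the production schema.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        time_bucket timestamp,   -- one-hour bucket
        event_type  text,        -- in the partition key for even distribution
        event_ts    timestamp,   -- clustering column: events stored in time order
        event_data  text,        -- schemaless JSON payload (the part we later regretted)
        PRIMARY KEY ((time_bucket, event_type), event_ts)
    ) WITH CLUSTERING ORDER BY (event_ts ASC)
      AND compaction = {'class': 'DateTieredCompactionStrategy'}
""")
```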
So what went wrong?
First of all, batch loading many terabytes of data into Cassandra turned out to be pretty silly and quite expensive for such large data sets. We were going to need huge numbers of nodes to handle the load, and the need would just keep on growing over time, unbounded, which is not a knock on Cassandra at all. It would have handled it perfectly fine, but we would have broken the bank in the process. We also learned that bulk loading Cassandra using Informatica is currently a no-go.
We gave it, as we say, the old college try, including a custom process that used the Cassandra bulk loader under the hood. I will say that Informatica is working on native Cassandra support, and we've been talking to them about that, but for the moment this is really an exercise in futility, for reasons I'll explain shortly.
It also became apparent that we didn't need to maintain a RESTful service layer and Kafka setup, which added significant cost and complexity to our system.
One of the most difficult issues we encountered was the lack of a modern, supported, open-source Hive driver for Cassandra. We needed the Hive driver to support Spark SQL queries over ODBC, and there's a lot there; if you're not familiar with that you can go read up on it, but suffice it to say it was necessary. We considered taking over the "cash" project, which is an old driver that was later integrated into DSE, but ultimately decided to abandon that idea in favor of an alternate architecture.
It's worth noting that DataStax Enterprise users have access to the proprietary Hive driver bundled with the distribution, so you won't run into that issue. Unfortunately, we found that date-tiered compaction is effectively broken. It falls down for all but the narrowest of time-series use cases and should be used with extreme caution.
Since Jeff Jirsa already covered this in detail in his earlier talk on the subject, which I hope you saw, I'll leave it to you to check that out online later, or to review the ticket; there's a lot of information there.
If there's any point in this list that seems obvious after the fact, it's that storing data as schemaless JSON was a terrible idea. First, because our analysis jobs had to parse JSON to extract the data we cared about, every job became much more expensive, and we had to read more data than we cared about on every job. That's partly because all event types were in the same table, making it expensive to analyze by event type.
Another side effect of all events being in one table was that we couldn't tune by event type. We later learned that some events require special tuning due to their volume, and the one-size-fits-all model prevented this.
So now, take two. You'll notice several key differences here. First, Kafka has been replaced by Amazon SQS, and second, we're using S3 for long-term storage.
We're still using Spark for processing and Cassandra for short-term storage, otherwise I probably wouldn't be here, and Informatica becomes the mechanism for loading batch data into S3, both in raw form and in Parquet format. I'll elaborate more on why we made these choices in a minute.
First, let's talk about the data model. Instead of dumping everything into a single table, every event gets its own table. This allows us to individually tune each table based on the workload, and we extract the payload and apply schema at the point of ingestion. There are a number of reasons for this. First, because we have to read the data on ingest anyway, this is a logical time to go ahead and map the data to real Cassandra fields, which of course makes subsequent analysis a lot easier, because we only have to deal with parsing and mapping one time.
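As a hedged sketch of what "schema at ingest" means in practice, here is a hypothetical per-event-type table plus an ingest function that parses the JSON payload once and maps it onto real columns. The event type, field names, and keyspace are assumptions for illustration, not our actual schema.

```python
# "Take two" sketch: a hypothetical per-event-type table with real columns
# instead of a JSON blob; the payload is parsed once, at ingest time.
import json
from datetime import datetime

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS pageview_events (
        time_bucket timestamp,
        platform    text,
        event_ts    timestamp,
        user_id     text,
        page        text,
        duration_ms int,
        PRIMARY KEY ((time_bucket, platform), event_ts, user_id)
    )
""")

def hour_bucket(ts):
    """Truncate a timestamp to its one-hour bucket."""
    return ts.replace(minute=0, second=0, microsecond=0)

def ingest(raw_json):
    """Parse the raw event once and write it to real Cassandra columns."""
    evt = json.loads(raw_json)
    ts = datetime.utcfromtimestamp(evt["ts"])  # assumes epoch seconds in the payload
    session.execute(
        "INSERT INTO pageview_events "
        "(time_bucket, platform, event_ts, user_id, page, duration_ms) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (hour_bucket(ts), evt["platform"], ts, evt["user"], evt["page"], evt["duration"]),
    )
```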
This gives us real-time access to events while keeping the overall cluster size to a reasonable number, and we get data locality for this data and, of course, Cassandra's quick access times, which makes Spark jobs much easier and much faster. For most other use cases (and there are exceptions, of course) we store the data in S3.
Fortunately, it's also easily accessible from Spark, and it's super simple to share both internally and externally. At Weather we have a bunch of different AWS accounts, for better or for worse, as well as a number of our own physical data centers, and S3 lets us easily share data across all those environments. And lastly, Hive has open-source support for S3 via the HDFS layer, which means that we can leverage ODBC to connect to the S3 data through Spark SQL's Thrift server.
But what really made the decision was when one of our ops guys suggested that perhaps there was no need for Kafka, since SQS already has a RESTful interface, and he was right. Initially it seemed crazy, and it actually took me a couple of days of marinating on it to decide that it was worth doing, but it has turned out to be one of our better decisions.
We set up the queues such that there's a queue per event type and platform, which allows us to use Amazon's built-in monitoring to get a granular view of the events coming into our pipeline. That's a lot of things we would have had to build ourselves that we didn't have to build.
so
also
so
because
of
the
issues
you
ran
into
a
date
tiered
compaction.
We
decided
to
switch
to
size
tiered
compaction,
mainly
because
it's
the
most
mature
compaction
strategy
and
it's
effectively
bulletproof
been
using
it
since
point
five.
Five years now; that's pretty bulletproof. The two primary issues with size-tiered compaction are its spatial complexity and the potential for single partitions to span a number of SSTables. We can manage the spatial complexity, and partitions spanning SSTables isn't really a big deal for analytics use cases, where you're doing lots of large token-range scans.
I also want to take a minute to give another shout-out to Jeff for his time window compaction strategy. The implementation is exactly what most people want and expect from date-tiered compaction: SSTables are grouped by timestamp into configurable buckets, then simply dropped when all the data expires. Simple, straightforward, and low overhead for time-series data. We were this close to building it ourselves, and then I saw his implementation.
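For reference, changing a table's compaction strategy is a single ALTER statement. The snippet below shows both the size-tiered switch we made and, for comparison, what a time window configuration looks like; the table name and option values are illustrative, not our production tuning, and at the time of this talk TWCS was still an external strategy (it was later merged upstream).

```python
# Compaction strategy examples (table name and option values are illustrative).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

# What we switched to: mature, effectively bulletproof size-tiered compaction.
session.execute("""
    ALTER TABLE pageview_events
    WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
""")

# For comparison, a time window configuration: SSTables are grouped into
# one-day buckets and dropped whole once all of their data has expired.
# (With the external jar of that era you would reference its fully qualified
# class name instead of the short built-in name used here.)
session.execute("""
    ALTER TABLE pageview_events
    WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                       'compaction_window_unit': 'DAYS',
                       'compaction_window_size': 1}
""")
```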
Overall, this second architecture has been a great success for us. Of course, there are numerous details to all this, and I can't cover all of them in a 35-minute talk, but let's talk about a few of the more critical ones. First, make sure to use at least version 2.1.8, and at this point precisely version 2.1.8, if you're planning to run Spark.
I've listed several critical issues here, which you should look up if you plan on using anything other than 2.1.8, all of which cost us a lot of time and frustration as we worked through them.
One thing I didn't show on the diagram is that we actually run two separate Spark clusters. The first is co-located with Cassandra for our heavy analysis jobs. We keep it tightly controlled, so we can predict the load, and it gives us efficient access to Cassandra data. The second cluster handles our self-service clients. It's not co-located with Cassandra, and in fact we generally prefer using it against our S3 data rather than Cassandra. Obviously, its load is much less predictable, so the second cluster allows us to isolate it from production jobs.
So, just as with traditional Cassandra use cases, proper data modeling is critical, but there are key differences when you're designing for large-scale parallel processing compared to efficient single-query patterns. Let's examine some of those differences. First, the partitioning strategy you use could be considered the opposite of the normal Cassandra modeling ideas. For those still learning Cassandra, when I refer to a partition, I'm speaking about the first part of the primary key declaration, that is, the first field or set of fields in parentheses, which determines the distribution of data across the cluster.
When it comes to choosing your partition key, you want to model primarily for good parallelism. This means that your most common queries should produce results evenly distributed across the nodes, rather than single-partition queries, which is what's commonly suggested for non-analytics workloads. You'll also want to avoid shuffling if possible. As you may know from prior experience with either Hadoop or Spark, the shuffle phase, where all similar output keys are sent across the cluster to be grouped together, is typically the most expensive part of any distributed processing job.
Any query that groups data by something other than the partition key will end up shuffling data, so the key advice here is to partition based on your most common grouping. For example, if you know most of your aggregations will be by user, then user should be your partition key.
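As a quick illustration of partitioning for your most common grouping, here is a hypothetical sessions table keyed by user, with a month bucket added so that individual partitions stay bounded; all names are assumptions for illustration.

```python
# Hypothetical table partitioned by the most common grouping (user), with a
# month bucket so individual partitions stay a reasonable size. Per-user
# aggregations then line up with how the data is laid out across the cluster.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS sessions_by_user (
        user_id     text,
        month       text,        -- e.g. '2015-09', bounds the partition size
        session_ts  timestamp,
        duration_ms int,
        PRIMARY KEY ((user_id, month), session_ts)
    )
""")
```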
So let's talk about indexes. Secondary indexes have been a controversial and often misused feature since their introduction in 0.7. The conventional wisdom says to avoid them under normal circumstances, because they require the entire cluster to participate in answering your query. But in a model where we want our queries distributed, this becomes a desirable feature. It allows Spark to push down a query predicate to Cassandra, so the database is actually doing the filtering of the unwanted results.
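Here is a minimal sketch of what that push-down looks like from PySpark with the open-source Spark Cassandra connector; the keyspace, table, and column names are assumptions, and it presumes a secondary index already exists on the low-cardinality column being filtered.

```python
# Predicate push-down sketch (names are illustrative). Assumes the
# spark-cassandra-connector package is on the Spark classpath and that a
# secondary index exists on the filtered low-cardinality column, e.g.
#   CREATE INDEX ON analytics.pageview_events (platform);
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="pushdown-sketch")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="pageview_events")
      .load())

# The equality predicate on the indexed column can be pushed down to Cassandra,
# so the unwanted rows are filtered out before they ever reach Spark.
ios_views = df.filter(df.platform == "ios")
print(ios_views.count())
```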
This can significantly reduce your processing overhead in Spark, because it never has to look at unwanted data. It's worth noting, however, that low cardinality is still the rule: if you're trying to index on a high-cardinality value, you still need to build your own index. For now, that is; keep an eye out for global indexes in 3.0. And I will cite an unnamed source from The Last Pickle who says that 50,000 is the cardinality you're looking for; that's the sweet spot. A picture is worth a thousand words in this case.
So let's look at the difference between the behavior of a secondary index query using a traditional client access model versus a Spark query. You can see here that the client request goes to the coordinator, which must query the index on all the other nodes and assemble the final result set.
Since it's implemented using the secondary index API and stores its Lucene indexes in the same manner, it uses the same access pattern I described earlier, so creating an index in this manner is straightforward and will look familiar to anyone who knows Lucene. Once you have the index (did I skip that too quickly? there you go), you can execute queries like these as simple CQL statements. Since the queries run through the server-side CQL parser, any CQL-based client can use them.
Okay, so I lied: I have one more impressive slide. Seriously, this is a really big issue, so it deserves a really big and memorable image to remind you how important it is, and that's because it takes only one really wide row to ruin your day. In our case, we had a bug in a mobile client that produced the same key for hundreds of millions of records every day. A bug like this, unchecked for even a relatively short period of time, can literally bring down your cluster.
The lesson there is that you can end up with a wide row even when you have a model with theoretically good distribution characteristics. The key to avoiding serious problems like this is early detection. While there are many important stats to monitor, wide rows can be detected by keeping an eye on max partition bytes for each table. If this number gets significantly higher than the mean partition bytes, that's a sign of trouble.
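One rough way to watch for this (a sketch, not our actual monitoring stack) is to compare the maximum and mean compacted partition sizes that nodetool cfstats reports for each table; the table name and the alert ratio below are arbitrary examples.

```python
# Wide-partition detection sketch: compare max vs. mean partition size from
# nodetool cfstats output (2.1-era field names); the 100x ratio is arbitrary.
import re
import subprocess

out = subprocess.check_output(
    ["nodetool", "cfstats", "analytics.pageview_events"]).decode()

max_bytes = int(re.search(r"Compacted partition maximum bytes:\s+(\d+)", out).group(1))
mean_bytes = int(re.search(r"Compacted partition mean bytes:\s+(\d+)", out).group(1))

if mean_bytes and max_bytes > 100 * mean_bytes:
    print("WARNING: possible wide partition in pageview_events "
          "(max %d bytes vs mean %d bytes)" % (max_bytes, mean_bytes))
```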
Another dilemma you'll run into with an event pipeline is that you're likely to have many missing fields in your event payloads, but you don't want to construct a dynamic CQL statement for every single query. It's easy to be tempted to just write null values, but the problem is that nulls are actually deletes. That's important, of course, because deletes in Cassandra create tombstones, marker columns that cover any previously written data. It takes work for Cassandra to deal with a bunch of unneeded tombstones, so in a nutshell, it's very important to avoid writing nulls.
So how do you leverage prepared statements, and you will want to, without having to create a separate prepared statement for every permutation of populated fields, which is what we originally did? Our answer was to create a library that dynamically builds and binds prepared statements on the fly, based on actual observed combinations of populated fields. We guessed that the number of practical combinations would be much smaller than the number of theoretical combinations, and this has actually proven correct. Your mileage may vary. It has also increased performance and dramatically reduced our code complexity.
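Our library is internal, but the core idea fits in a few lines: cache one prepared statement per observed combination of populated fields and never bind a null. A minimal sketch, with hypothetical names:

```python
# Sketch of dynamic prepared statements keyed by the set of populated fields,
# so missing fields are simply omitted rather than written as nulls/tombstones.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")
_stmt_cache = {}

def insert_event(table, event):
    # Keep only fields that actually have values: no nulls, no tombstones.
    populated = {k: v for k, v in event.items() if v is not None}
    cols = tuple(sorted(populated))
    key = (table, cols)
    if key not in _stmt_cache:
        cql = "INSERT INTO %s (%s) VALUES (%s)" % (
            table, ", ".join(cols), ", ".join("?" * len(cols)))
        _stmt_cache[key] = session.prepare(cql)
    session.execute(_stmt_cache[key], [populated[c] for c in cols])
```

Because only the combinations that actually show up in the event stream ever get prepared, the cache stays small in practice, which is exactly the bet described above.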
So now I want to shift gears a bit and focus on an often overlooked aspect of data analysis: exploration. My definition of this goes beyond Pig scripts or SQL engines; it's really about collaboration between data scientists, engineers, and business users. For too long our tools have catered to one group or another, but one of the primary objectives on my team has been to break down those walls. So let's take a look for a moment at what many may recognize as the old data warehouse paradigm.
It's a linear process that requires a number of steps, and often weeks or months, before we can begin looking at our data in a way a human brain can understand it. In this model there's so much work involved in getting data into palatable form that very little of it actually makes it that far; we lose so much in the process.
So a major part of what we're trying to achieve is a reduction in the amount of time it takes to start visualizing the data, and by that I mean from weeks to minutes. We want instant gratification, which effectively results in our data analysis becoming more collaborative and agile, rather than a waterfall process like the old model.
So along comes Zeppelin. Zeppelin is an open-source Spark notebook; I've seen a couple of other people talk about it today. It's similar in many ways to the Databricks Cloud interface, and I can say without hesitation that it has completely transformed the way my team functions. It has also become an extremely popular way for other teams to interact with the data.
Once you arrive at something you like, you can then run it on a schedule right there in the same UI. So here's an example notebook, and I'm not brave enough to do it live, because I've seen how that goes. Here's an example notebook where I'm using Spark SQL to query some data received from mobile devices. This data is stored in Cassandra inside a map structure, yet I'm able to query it in place, complete with aggregations, and visualize it in a bar chart.
Here's another example where I'm retrieving some historical weather data for a specific time bucket and location, then extracting mapped values the same way I did before into a data frame and registering it as a temp table. I'm then able to plot multiple values to see the relationship between them.
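The actual notebook paragraphs are on the slides, but a hedged PySpark equivalent of that pattern (with illustrative keyspace, table, column, and map key names) looks something like this: register the Cassandra table as a temp table, pull values out of the map column, and aggregate in place.

```python
# Sketch of querying a Cassandra map<text, double> column in place with
# Spark SQL (keyspace, table, column, and key names are illustrative).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="notebook-sketch")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="weather_obs")
      .load())
df.registerTempTable("obs")

sqlContext.sql("""
    SELECT location, avg(data['temp_f']) AS avg_temp
    FROM obs
    WHERE time_bucket = '2015-09-22 00:00:00'
    GROUP BY location
""").show()
```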
Lastly, there's a CQL parser that lets you run direct queries against Cassandra and perform the same sorts of visualizations. This has largely replaced our usage of cqlsh for anything other than administrative tasks. Obviously, there are other tools that can do this sort of work, but most are very costly and almost none support native Cassandra and Spark queries.
So as we wrap up, I wanted to spend a couple of minutes addressing a common question I hear from people who are looking to use Cassandra as part of an analytics project. The question is whether you should use Apache Cassandra and put all the open-source parts together yourself, or use DataStax Enterprise, where everything is bundled into a nice package.
I'll attempt to answer this by asking a series of questions, but before I do that I want to make clear that I am personally an open-source fan, and I don't care whether you buy DSE or use the open-source version; I am not paid by DataStax. With that disclaimer, the first question is: do you have an open-source culture? Is your shop more likely to buy an enterprise license so you can get support, or to use an open-source project?
Second, do you have on-staff Cassandra experts, and by this I mean people who really know what they're doing, not just someone who's used it before? This is critical if you're planning to go it alone at scale, so you either need them in-house or you need to go find some people. Is your team willing and able to actively contribute to the community? Success with a complex open-source system, especially multiple complex open-source systems tied together, depends on community engagement and a willingness to dig in and really understand how things work.
If you just want to install and go, and get help from experts when you have questions, an open-source analytics stack is going to be extremely frustrating. Are you willing to accept a moderate degree of risk? Obviously there's risk involved in any software project, but the combination of well-tested software, tighter integrations, and first-class support certainly mitigates that risk to some degree. Next, do you have a strong need or desire to be able to use the latest features?
DSE is always a bit behind, because they try to make sure you have stable software that's well tested. The trade-off is that you'll often have to wait for the latest features. Waiting for features is not actually a terrible idea in practice anyway, as we've certainly been burned by being pioneers more than once, but it does mean that you can't just open the hood and patch something any time you want. The value of that capability depends on you and the culture of your team.
The next one is critical for us, and that is the need to control tool versions. We have a complex environment with lots of dependencies across teams and many Cassandra clusters; moving to DSE would remove a tremendous amount of our flexibility in tool choice. But again, this may or may not be important to you.
Lastly, you may not have the budget for licensing, or it may be a drop in the bucket. Perhaps your business model requires scale, but you don't have the corresponding funds to pay for the software.
If that's the case, my suggestion is to make sure you fully embrace the community. For most businesses, if you can't answer most of these questions with a yes, then DSE probably is the right choice. So that's all I have. Thank you for your time today. If all of this sounds like something you'd like to be involved in, and you'd like to live in a city with a great tech industry, a high standard of living, and low income tax, we're hiring, so please come talk to me.
Versus Kafka? Well, I would compare our custom ingestion pipeline to Spark Streaming, as opposed to Kafka, or to something like Apache Mule or one of the other projects, whereas I would compare Kafka to SQS. Kafka is just a queue feeding the system, so I sort of enumerated the reasons for Kafka versus SQS.
As far as the custom ingestion pipeline, it's honestly a source of great debate internally: should we have done that, or should we have used Mule or something else? That's still a source of debate, and we may throw it away; we may use Spark Streaming there. There are advantages and disadvantages to owning that piece, and for right now it works really well, so we do that. The prepared statement thing is one of the reasons.
[Audience question about the event data model.]
So I gave you that model just to be very clear: that first model I showed you was bad. That is not the way we do our data modeling for the custom event types; every event gets its own model based on what we believe to be the right access pattern. Now, we do get that wrong sometimes.
Almost there... there we go. So if you look, that's actually not the flow. The flow is: it goes into our event pipeline immediately from the queue, then from the queue into Cassandra, and then the process to get it from Cassandra back out as Parquet into S3 is asynchronous. That's a batch job, a Spark job, that runs overnight right now.
It's a time window, so all the data is written with a TTL, and the TTL we defaulted to is about 30 days that we keep the data, but it depends on the event types. For some data we may keep it longer; for others we may only care about it for a couple of days and then we're done.
Right, so the bottom line is that we put data into Cassandra because it's an append-only, log-structured storage engine, so we can pipe data really, really fast into Cassandra, and we can read it very efficiently from Spark because of the natural partitioning, the collocation, and just the speed at which Cassandra can read the data, whereas with Parquet it's cheap, but it's also slower.
So S3 is slower, Parquet is slower, and so the idea (and we get it right sometimes, and sometimes we don't and have to fix it) is for the bulk of our really heavy-hitting analysis work to take place while the data is in Cassandra, and then only move it to Parquet when it's more of a long-term mining kind of thing. One other thing we do is put aggregate data into Cassandra and keep it forever.
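Putting those pieces together, a hedged sketch of the write-with-TTL plus nightly archive-to-Parquet flow looks like the following; the 30-day TTL, keyspace, table, bucket, and paths are illustrative, and the actual job is scheduled rather than run inline like this.

```python
# Sketch of the short-term / long-term split: events written to Cassandra with
# a TTL, and a nightly Spark job that copies a day of data to Parquet on S3.
# TTL value, keyspace, table, bucket, and paths are illustrative.
from datetime import datetime

from cassandra.cluster import Cluster
from pyspark import SparkContext
from pyspark.sql import SQLContext

session = Cluster(["127.0.0.1"]).connect("analytics")

# Hot path: write with a roughly 30-day TTL so Cassandra only holds recent data.
ts = datetime(2015, 9, 21, 14, 30)
bucket = ts.replace(minute=0, second=0, microsecond=0)
session.execute(
    "INSERT INTO pageview_events "
    "(time_bucket, platform, event_ts, user_id, page, duration_ms) "
    "VALUES (%s, %s, %s, %s, %s, %s) USING TTL 2592000",
    (bucket, "ios", ts, "user-123", "/forecast", 1200))

# Nightly batch: read a day of data from Cassandra and archive it as Parquet.
sc = SparkContext(appName="nightly-archive")
sqlContext = SQLContext(sc)
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="pageview_events")
      .load())
(df.filter((df.time_bucket >= "2015-09-21") & (df.time_bucket < "2015-09-22"))
   .write
   .parquet("s3n://weather-analytics-archive/pageview_events/2015-09-21/"))
```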
[Audience question about event volume and cluster size.]
As I said, it's billions a day, but the actual number is sort of constantly changing and growing. Like every week, somebody calls me up and says, "I want to send you a hundred and fifty million records a day," so if I told you now, it would be different next week, and I'm not even really sure what the answer to the question is at the moment. But as for the size of the cluster...
We use fat nodes for our analytics cluster, so we have 20 fat nodes right now, very, very large i2-series nodes in AWS, and that handles the bulk of our initial processing. That's not all the Cassandra we have at Weather; we have lots of other Cassandra clusters, but for our analytics we're using 20 fat nodes.
Amazon Kinesis? Yeah, so we have not looked at Amazon Kinesis, primarily because we're trying to be as platform-independent as possible. Now you might say, well, you've got a lot of Amazon up there, how come you can't use Kinesis? But the reality is, these things are very, very easy to replace with similar technologies on other platforms.
[Audience question about graph processing and Titan.]
Our graphs are actually micrographs; they are individual graphs for each user, and so Titan is not really useful in that case, because Titan is for large graphs. We don't have a large-graph use case yet, but it's coming, and so we don't have the answer for that. I'm hoping Titan will mature enough to the point where it will get there before we do. Thanks. Thank you.