Description
Speakers: Brian O'Neill and Taylor Goetz
A successful Big Data platform combines distributed processing and polyglot persistence into a single cohesive infrastructure. Over the past few years, Health Market Science has transitioned from traditional relational databases and enterprise systems to a massively scalable Big Data platform that combines Cassandra and Storm to ingest thousands of feeds of data from the health market industry to produce a single high-quality masterfile. Hear how we applied event processing and NoSQL to deliver real-time analytics, while accommodating structural change over time, and fuzzy/geospatial search.
Brian O'Neill: Welcome to "The Big Data Quadfecta." I'm Brian O'Neill; I work for Health Market Science, where I'm the lead architect, and this is Taylor Goetz, the development lead. We're going to walk through our use case, why we chose the technologies we did and how we're applying them, and then we're actually going to do a deep dive on how we integrated Storm and Cassandra. So hopefully everybody will enjoy it.
Okay, so "quadfecta": I actually had to look that up. What had happened is (we lose the video there) I'd done a blog post on a "trifecta," where we put Kafka, Elasticsearch, and Cassandra together. That seemed to be working well, and then we added Storm to that. So I didn't know what to call the new thing, and I said "quadfecta." Then, when I looked it up, it turned out to be an Urban Dictionary term.
Here we go, okay, so there's the beer pong; we'll pick up from there. Alright. Continuing on that theme: sometimes I think big data these days is sort of in vogue, and there's a lot of hype around it, and it took us a lot of time to figure out what big data actually was. We figured, and we had heard, that if we get big data, it'll solve all of our problems.
It'll make coffee for us in the morning and everything. But if you do some googling and try to look for a definition, you'll eventually stumble upon the three V's (sometimes there's four, but we focus on the three): the volume of data, the variety of data, and the velocity. All three of those are in play in our solution to service our customer demands. So now I'm going to go through our use case.
We work for Health Market Science, and what we do is take thousands of these data feeds and put them together into a master file. That's called master data management. In our master file we have every practitioner in the United States and every healthcare organization, and then we affiliate all those different entities together over time. That's important to a lot of different industries, one of which is pharmacies.
They figure out whether that prescriber was actually eligible to do that, and the goal there is to eliminate a lot of the fraud, waste, and abuse in the system. Plus, in general, our customers use our data to gain a lot of insights into the space out there. We do analytics on top of the data to figure out influence networks: if a physician is a critical person whose opinion counts, he may influence other physicians.
So here are the details of our numbers. On the physician and practice side, we have about six million unique practitioners in our database. That's not too bad as a number, but think about the 2,000 sources that we bring in that comprise the data: we maintain those two thousand sources over all time. For compliance reasons, it's important for a pharmacy to be able to say: "I allowed this prescription six months ago."
"Was that prescriber eligible six months ago to prescribe that?" So we're not just keeping the current time but any point in time, so that you can go and search this and find out what the characteristics of an entity were, and what data contributed to that entity, at that point in time. On the claims side, we also have, I think, the largest claims database out there. Anytime you submitted a claim for insurance, it's probably in our database, de-identified. So that's a lot of data.
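The "any point in time" idea above can be sketched as an append-only log of timestamped facts per entity. This is a minimal illustration of the concept, not HMS's actual data model; all names (`PointInTimeStore`, `add_fact`, `as_of`) are made up for the sketch.

```python
class PointInTimeStore:
    """Append-only store of timestamped assertions per entity."""
    def __init__(self):
        self.facts = {}  # entity_id -> list of (timestamp, field, value)

    def add_fact(self, entity_id, timestamp, field, value):
        # Never update in place; just record another assertion of fact.
        self.facts.setdefault(entity_id, []).append((timestamp, field, value))
        self.facts[entity_id].sort()

    def as_of(self, entity_id, timestamp, field):
        # Latest value of `field` asserted at or before `timestamp`.
        latest = None
        for ts, f, v in self.facts.get(entity_id, []):
            if f == field and ts <= timestamp:
                latest = v
        return latest

store = PointInTimeStore()
store.add_fact("dr-42", 20130101, "eligible", True)
store.add_fact("dr-42", 20130601, "eligible", False)
# Was the prescriber eligible back in March, before the June change?
print(store.as_of("dr-42", 20130301, "eligible"))  # True
print(store.as_of("dr-42", 20130701, "eligible"))  # False
```

Because nothing is ever overwritten, the same store answers both "what is true now" and "what was true six months ago."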
On the claims side, we have a billion claims annually, and we maintain five years of that, so that's five billion. The numbers get big; I think it's nine or ten terabytes on the claims data side, and then our physician data is growing, so I forget what the numbers are there.
Some of the challenges we deal with: some of the data has schemas associated with it, and some is unstructured. We actually harvest data off the internet, so some of those two thousand feeds come off the internet. In addition to that, those schemas change over time. You might get an input from a government agency that's maintaining state license information for practitioners, and one month it comes in with six columns; the next month it comes in with ten columns. Our database actually has to deal with that.
So, if you take a look, what I'm going to be talking about is the platform that we've built, and here's where I think, when you put this technology stack together, you get a lot of powerful capabilities. When you throw Storm, Elasticsearch, Cassandra, and Kafka all together, you get a lot of these great capabilities, which allows you to deliver multiple products. If you take our business needs at the top, we actually deliver all these different solutions
with a single platform that can ingest the data, put it into a standard format, make it searchable like Google, and then answer all sorts of different questions. One of the things that we do as well is expense management. A lot of the large pharmaceutical companies have to report on how much they spend on each doctor, so that the government can verify that they're not bribing doctors. They'll hand us all of their expenses, and it just goes right into the engine.
We run it like anything else, match it up against our master file, and then output answers for them. I think when you're architecting a system like that, the flexibility in the system is critical. We could have built one-off apps for each one of these business needs, but by building it as a platform, we can basically take these four components, stitch them together, and now we have a new product, a new offering. That's really powerful.
Okay, so just on the data center: I want to put this one out there because this is more of a crowdsourcing question. We have about three-quarters of a petabyte of raw storage, and we are entirely virtualized; we use VMware, and we have a SAN. All three of those characteristics were there before we chose Cassandra, so we ended up rolling out Cassandra onto this infrastructure, and we're looking at it now and asking: should we have gone physical? Should we be going physical?
Okay. So if you look at the platform that we're about to describe, on how we stitched everything together, this is what it is. Is everybody familiar with master data management? Has anybody heard the term MDM? Hands? Okay, not a few. Alright, so this is our version of MDM. On the left we have data sources coming in. As I mentioned, some come in from the web; we go out and harvest information, so we have something called the harvester that brings it in. We also get data coming in from government sources.
Sometimes this is actually a CD coming in the mail; we have to pull it in, load it, and grab a flat file. And then we also have customer information, like the expenses I mentioned. When that comes into our system, it could be in any format, and it could change over time: we could put that CD in the drive and suddenly they're using an Excel sheet.
The government organization that maintains these oncologists just decided to use Excel instead of the text file they were using, so it has to be really flexible. What we do is process that in the first step here: we standardize and validate. We apply a layout to that data to add a bit of structure, but really, at this layer, all we want to do is tag fields. The data comes in in whatever format they want, and then we tag which columns are what. We don't want to apply too much structure; we don't want to make it so that we can't flex to accommodate whatever kind of data is coming in. So at that point we're just standardizing and validating, and then we push it into NoSQL, which is Cassandra. After that, we have to take these two thousand streams of data and find out which ones are talking about the same thing, the same practitioner. One might have an address;
another one might have state license information, so we need to match it up and then consolidate it. If a hundred streams all have data about the same doctor, they might have conflicting addresses. So we take that data in and run it through pluggable rules, some logic that our data scientists figure out, to decide which of those addresses we should choose and associate. Then we rank the addresses and put them on what we call the profile.
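The match-and-consolidate step described above can be sketched as: group fragments on a match key, then pick the winning value with a pluggable ranking rule. This is a toy illustration under invented assumptions (the match key, field names, and the year-based rule are all made up; the real rules are the data scientists' logic):

```python
def consolidate(fragments, rank_address):
    """Group fragment records by a match key and pick the best address
    using a pluggable ranking rule."""
    profiles = {}
    for frag in fragments:
        key = (frag["last_name"].lower(), frag["npi"])  # simplistic match key
        profiles.setdefault(key, []).append(frag)
    result = {}
    for key, frags in profiles.items():
        addresses = [f["address"] for f in frags if "address" in f]
        result[key] = {
            "sources": [f["source"] for f in frags],
            "address": max(addresses, key=rank_address) if addresses else None,
        }
    return result

# Example rule: prefer the address seen most recently (higher year wins).
fragments = [
    {"source": "state-license", "last_name": "Smith", "npi": "123",
     "address": {"line": "1 Main St", "year": 2011}},
    {"source": "claims", "last_name": "Smith", "npi": "123",
     "address": {"line": "9 Oak Ave", "year": 2013}},
]
profile = consolidate(fragments, rank_address=lambda a: a["year"])
print(profile[("smith", "123")]["address"]["line"])  # 9 Oak Ave
```

Swapping the `rank_address` callable is the "pluggable rules" part: the pipeline stays fixed while the selection logic changes.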
Once we've done that, we can push it into our index and our relational storage. We've added quite a bit of structure at this point: we've consolidated all these data sources, we've picked what data we should use, and we've pushed it along. Now we can move on to analytics. Once we have that structure, we can run the analytics on top of the data, and we can present that data through dashboards and reports.
We then allow our customers to interact with that through both the user interface and web services. So that is master data management to us. And here's the crux of the problem, which is what drove us to choose Cassandra; I mentioned this a little bit. We have multiple streams: picture a couple thousand of these arrows, all data streams coming in, and the F's represent an entity in what I call fragments.
So we get fragments of the entity that are split across streams; they come in at different times, and then we have to associate them together through matching. Meanwhile, the schemas underneath us might change. So that was the use case, quickly; hopefully everybody captured that, and we're going to move on to the design. We had a relational database at the base of our technology stack: you probably saw in the other slide, we have Oracle RAC, a massive Oracle system. What it meant for us to make this shift to big data was to swap out the system of record, which was Oracle and relational, and instead plug in Cassandra. You saw from the design slide that the flow is different now: Cassandra is the first place we write to, and that is the single source of truth. Everything else is then derived from the data that's in Cassandra. The relational database to us is really just another means of presenting; exactly what it is, is one point in time.
As I mentioned, we store all points in time of all the data, and that's in Cassandra. We then materialize a single point in time into Oracle so that people can view it. So this is an important slide. One of the things that drove us to use Cassandra, in addition to the flexibility and the scalability, was that I think Cassandra nailed the primitives that you need. I was asked earlier today what the hardest thing was about getting to use Cassandra coming from a relational world. I come from the relational world: 180 relational databases, massive modeling, and lots of tables to deal with. Then I had to come to Cassandra, and you're tempted to apply the same kind of modeling techniques that you use in the relational world. When you come over to Cassandra, it actually isn't that big a shift if you just don't apply those relational concepts; just ignore them, and focus on this thing in the middle, which is a map, a map of maps basically, when you take a look at a column family.
The row key is your first key into it, and that goes to another map, which is the columns themselves: a key, which could be a composite, that goes to a value. If you just think of that as your data structure and use it to the best of your ability, that thing can distribute across lots of different nodes. That's a really, really powerful concept.
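The "map of maps" view of a column family can be written down directly. This is a toy model of the shape, not Cassandra's storage engine: row key into the outer map, column name (possibly a composite) into the inner map, resolving to a value.

```python
from collections import defaultdict

# A column family as a map of maps: row key -> map of column name -> value.
column_family = defaultdict(dict)

# Row key is the first key in; a composite column name is the second.
column_family["dr-42"][("address", 0)] = "9 Oak Ave"
column_family["dr-42"][("address", 1)] = "1 Main St"
column_family["dr-42"][("license", "PA")] = "MD-7781"

# A wide-row "slice": all columns whose composite name starts with "address".
addresses = {k: v for k, v in sorted(column_family["dr-42"].items())
             if k[0] == "address"}
print(addresses)  # {('address', 0): '9 Oak Ave', ('address', 1): '1 Main St'}
```

Rows distribute across nodes by the outer key; columns within a row stay sorted by the inner key, which is what makes slice queries over wide rows cheap.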
Once we had nailed the storage primitive, which I thought Cassandra did really well, what other primitives did we need to add to the system? Well, the one we needed was queueing. As we have all these streams coming in, some of them are dropping millions of records on our door in one day, and some of them are trickling in. We needed a queueing primitive: the push and the pop of a value.
In addition to that, we also needed to be able to process these changes in the data and do something with them: that's the emit and process of a tuple. I think Storm nailed those primitives. And then, finally, since Cassandra is mostly opaque and has no idea what kind of bytes you're putting into the system, it cannot apply any kind of smarts to that data.
So at some point you need to apply some smarts so that you can query this data quickly, and that's the index: indexing fields by type. We do stuff like this: our customers want to search for Dr. Bob Smith; meanwhile, in our system, it's Robert Smith. So at what point did we know that Bob was really Robert?
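One way to make "Bob" find "Robert" is to key the index on a canonical form of the name. A minimal sketch, assuming a nickname lookup table (the two entries here are made-up samples, not the actual normalization logic):

```python
# Normalize first names through a nickname table, then index records
# by the canonical (first, last) pair.
NICKNAMES = {"bob": "robert", "bill": "william"}

def canonical(name):
    first, last = name.lower().split()
    return (NICKNAMES.get(first, first), last)

index = {}

def index_record(record_id, name):
    index.setdefault(canonical(name), set()).add(record_id)

index_record("rec-1", "Robert Smith")
# A search for "Bob Smith" normalizes to the same key as "Robert Smith".
print(index[canonical("Bob Smith")])  # {'rec-1'}
```

The point is that the normalization happens at index-build time and again at query time, so both sides meet at the same key.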
Okay, so along the way, if you look at the themes building across some earlier presentations today: idempotent operations. How many times have you heard that term today? Thirty? Fifty? It's really important, and it's as simple as this: I shipped an operation to one node, I lost communication with that node, and I have no idea whether it did it or not.
So I need to put it over here and run it again, but my system, my architecture, my design had better be able to handle that case. In addition to that, Nathan Marz of Storm gave a great little presentation at Strata on immutable data, and we adopted that model, where your database is really just an assertion of fact at a time. Don't go change anything; don't do updates.
Don't do deletes. It's just assertions of facts at a point in time, and that makes idempotent operations a lot simpler. The anti-patterns to that are transactions and locking: just don't do it. One more thing, and I think there was an earlier presentation on Storm today, but this is important.
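The combination of "assertions of fact" and idempotence can be sketched like this: every write carries a unique assertion ID, so a replay after a lost acknowledgment rewrites the same fact instead of double-applying it. The class and method names are invented for the illustration.

```python
class FactLog:
    """Append-only store of facts; writes are idempotent because each
    assertion carries a unique id, so a replay overwrites itself."""
    def __init__(self):
        self.facts = {}

    def assert_fact(self, assertion_id, entity, field, value, ts):
        # No updates, no deletes: the same id always writes the same
        # fact, so replaying after a lost ack cannot double-apply.
        self.facts[assertion_id] = (entity, field, value, ts)

    def history(self, entity, field):
        return sorted(
            (ts, v) for (e, f, v, ts) in self.facts.values()
            if e == entity and f == field)

log = FactLog()
log.assert_fact("a1", "dr-42", "address", "1 Main St", 1)
log.assert_fact("a1", "dr-42", "address", "1 Main St", 1)  # replay: harmless
log.assert_fact("a2", "dr-42", "address", "9 Oak Ave", 2)
print(log.history("dr-42", "address"))  # [(1, '1 Main St'), (2, '9 Oak Ave')]
```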
You just told me that I have to be entirely idempotent, and there are certain things, certain anti-patterns, that are hard to achieve with that kind of design. One of those is counting. As things are happening, I'm trying to count, and if I replay something, I'm going to count it twice. That's bad.
So Nathan came up with an elegant hack, I'll call it, to tolerate replays but also allow counting and state transformation. Somebody in the earlier presentation mentioned Trident as well; if you take a look at Storm you'll get more into this, so I'll go quickly. What he did is this: the stream of data that's coming in (and this doesn't just apply to Storm; you can implement this design in anything), you can batch it up.
You can parallelize all the processing of that data, but then, when you go to update your state, you sequence it. So here, if we have three batches of data that we want to process, and let's say we just want to compute a sum for each one of those batches: the first batch has a total of four of something I'm counting. Batch number three has 13 of those things in it, and batch two has six. If I look at the state transformation as I go, provided I've sequenced those batches: when I process batch one, I'll have a total of four. When I process batch three, I'm going to wait until I've processed batch two; right here, you've brought everything back together. So that one waits, batch two comes in and adds its six to get to ten, and then batch three can apply its 13.
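The sequencing trick above (process batches in parallel, but apply state updates in batch order and skip already-committed batches) can be sketched in a few lines. This is a toy illustration of the Trident transactional-state idea, not Trident's actual API:

```python
class SequencedCounter:
    """Batches may arrive out of order or be replayed, but commits are
    applied strictly in txid order, so replays are never double-counted."""
    def __init__(self):
        self.total = 0
        self.last_txid = 0
        self.pending = {}  # txid -> partial sum waiting for its turn

    def commit(self, txid, partial_sum):
        if txid <= self.last_txid:
            return  # replay of an already-committed batch: ignore
        self.pending[txid] = partial_sum
        # Batch 3 waits here until batch 2 has been applied.
        while self.last_txid + 1 in self.pending:
            self.last_txid += 1
            self.total += self.pending.pop(self.last_txid)

counter = SequencedCounter()
counter.commit(1, 4)    # total: 4
counter.commit(3, 13)   # arrives early: held until txid 2 lands
counter.commit(2, 6)    # total: 4 + 6 = 10, then batch 3 applies: 23
counter.commit(2, 6)    # replay: no effect
print(counter.total)    # 23
```

The partial sums for each batch can be computed in parallel anywhere in the cluster; only this final, cheap commit step is serialized.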
Okay, here's what we did wrong; we got two of these slides. And this is not to knock Hadoop, I know there's a lot of Hadoop fans here, but as I said, we switched to using Cassandra as our system of record. We load all the raw data into Cassandra; we do not load it into HDFS. So we were running our Hadoop jobs against Cassandra, and we found that to be incredibly inefficient.
I don't know if anybody here is doing that; we've been trying to talk to as many people as possible to find out if anybody's doing that successfully with Cassandra as input. I know a lot of people are writing to Cassandra, using it as the output of a Hadoop job, but as input it wasn't working for us. The kinds of times that we were expecting to see just weren't there, so I'd be interested if anybody wants to talk after, or just yell at me right now. All right.
In addition to that, we also needed to track what changed, in order to know what went into that next Hadoop job. Tracking what changed added a layer of complexity that we didn't want, and we'll see how Storm fixed that for us. Okay, and this is what we did wrong, sort of; I'll put a "sort of" on this one.
One of the reasons that we chose Cassandra in the first place is the cleanliness of the code. If you guys have been in there, you know Jonathan does an amazing job; that is a great codebase. There's some legacy stuff, like super columns are still in there and things like that, but it's a really great code base. We were looking at HBase versus Cassandra, and we went down into the code and asked: what could we do with this?
Can we extend the functionality here? And the answer is yes, absolutely; it is very clean. When you saw Edward's talk earlier, if you were up there for IntraVert, you're able to just go in there and build on top of it with just a few hours of getting oriented, which is really neat. So maybe that's a bad thing, when you want to go in there and build your own trigger capability.
We weren't willing to wait, and we wanted this trigger capability, because what we were seeing was lots of different clients accessing Cassandra directly. We knew we needed to maintain some wide rows, some indexes, so that people could get at the data through other dimensions, and we did not want every client accessing Cassandra to have to know about those different dimensions. We hadn't yet gotten to the point where, as Eric was saying, everything went through a nice common service-oriented-architecture API. So we implemented our own trigger functionality. It's out there: you can go to GitHub, go to cassandra-triggers. I do not recommend doing this, but you could. It worked well initially, and to be honest, it's actually still in production right now; it's been there for a year, and it's doing well. But what happened is we started leveraging that too much. We said we need one more wide row, we need one other thing, and then our entire business process was starting to be articulated via triggers.
So we started to get this sick feeling in our stomachs that our business process should not be a side effect of writing to Cassandra. That's not good. Okay, what we did right: we did eventually look at REST APIs, and I think if you were in Edward's talk earlier, you know. We wrote Virgil, which is a REST layer on top of Cassandra, and then we got it right: okay, everything's got to go through services and an API. That's a really smart thing to do.
Let's do that. So we put Virgil out there; you can go and use it. It's actually working; ours has been in production for a long time. But I'm really excited about what we saw about IntraVert. If you want that HTTP and REST capability, but you also want to be able to plug in and do your own thing, IntraVert looks like it's going to be a really powerful thing to use.
Okay, so I've spent a lot of time. How am I doing on time? Oh man, okay, I'll go faster. Okay, so Kafka is great; I think a lot of people have heard lots of good things, way better than JMS. We were using ActiveMQ and Oracle Advanced Queuing, and they topple over once you start flooding them with messages. So we took a look at Kafka, and it fitted into the architecture, as I'll show in a few slides.
Likewise with Elasticsearch. Two years ago we were doing the comparison between Elasticsearch and Solr, and we figured Solr was the way to go, because Elasticsearch was too young. But if you've tried to scale Solr, you know it can be a bear: you're doing master-slave kinds of relationships, directing queries to certain nodes and updates to others. It's not very pleasant. So when Elasticsearch incorporated and became a little bit more mature, we took another look at it and said: that's the way to go.
I'm going to leave talking about Storm to Taylor, but basically, as I said before, it offers the primitives that we needed for distributed processing, and it has become our backbone. I think some of the questions around polyglot persistence are: how do I manage and control the data flow from one system to another? This is not the only thing we've got: we've got MySQL databases and Oracle databases, and we've got Elasticsearch and Cassandra.
So we need to control that data flow and make sure that all the systems get updated appropriately. That's what Storm has become to us. But the interesting thing there (and I still long for the day when everything was a trigger) is that all CRUD operations now have to go through this data flow. So I'll just go quickly through the system that we have.
The great thing here, talking about high availability, is that as long as you've done everything in an idempotent fashion, you can tolerate things going down. If you lose part of your Storm topology, no problem: Storm will replay the tuple someplace else, and it'll get run. If you lose a node in your Cassandra ring, no problem:
you've got a replication factor there. If you lose all of Storm for some time period, no problem; or if something was going wrong in your system, no problem, just rewind your offset in Kafka. So all of these things play really well with one another; we were really happy when it all came together. Just real quick, what's coming next: we have a desperate need for a graph database, as I mentioned.
We maintain all these affiliations between all these entities: all the practitioners, how they relate to one another, and all the organizations. So we're going to look at Titan, and I'm looking for anybody that has experience running Titan against Cassandra. Awesome, we'll talk. Okay, so we were looking at Neo4j, but obviously it'd be better if we can manage just the one persistence mechanism. All right, I'm going to leave this to you; you've got the slides. Okay, and here's Taylor. Taylor is the awesome author of storm-cassandra, so go for it.
Taylor Goetz: Okay, how many people here have heard of Storm? Awesome. Is anyone using it? Great. I'm going to go into a little bit of a discussion of how we use Storm and some of the open source software that we've developed for Storm. Storm is a real-time distributed computation system. What that means is that it processes data in real time, and that contrasts with a batch system like Hadoop, where you have a clear start and end to jobs. In Storm, your computations are open-ended; they run forever.
One of the things that Brian mentioned with Hadoop was that we used it for a while to do some of our batch processing. What we also found out was that we needed to react in real time. In some cases we may get data dropped off as a batch, so we could have a Hadoop job that would deal with putting that data into our system of record, but we also wanted to support real-time data coming in. So what we were finding was...
So what does a Storm cluster look like? This is actually a diagram of one of our development clusters, but Storm has an architecture very similar to Hadoop. There's Nimbus, which is the master node that takes care of assigning tasks out to the worker nodes; ZooKeeper is used for cluster coordination. Storm actually doesn't do a whole lot of writing to ZooKeeper,
so it's pretty easy on your ZooKeeper cluster. And then supervisors are your worker nodes; they're the ones that actually run the tasks that operate on your data. In our development environment (and I wouldn't suggest a cluster like this in production; this is what we use for a small development environment to test applications in a clustered environment), we co-locate things. In a production environment we probably wouldn't co-locate Storm and Cassandra, and we'd have a multi-node ZooKeeper cluster.
The next primitive is a spout, and a spout emits streams of tuples. Bolts are your unit of computation: that's where you do your operations on tuples, do calculations, that sort of thing. And then, if you combine spouts, bolts, and streams, they come together in what's called a topology. A topology is a combination of any number of spouts and bolts and defines your overall computation.
To dive in a little deeper with Storm: spouts, as I mentioned before, represent a stream of data, which can be queues like Kafka, JMS, or Kestrel, the Twitter firehose, sensor data like weather, that sort of thing. Spouts emit tuples into the topology. Bolts receive tuples from spouts or other bolts and operate on or react to that data. They can perform functions,
filters, joins, aggregations, and in some cases database writes and lookups, and we'll get into more of that a little later. They can also optionally emit additional tuples. Storm topologies define the computation of your application as a data flow between spouts and bolts, plus the parallelism of components. When I say a spout or a bolt, those operate as tasks in your topology, and tasks can be parallelized, so they can be fanned out across multiple nodes in your cluster. Groupings define how that data routes to individual tasks.
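The difference between a shuffle grouping and a fields grouping can be sketched outside Storm entirely. This is an illustration of the routing idea, not Storm's implementation; the simple character-sum hash is a stand-in for a real stable hash:

```python
def shuffle_grouping(tuples, num_tasks):
    # Round-robin: even load, but no guarantee about which task sees which key.
    return [tuples[i::num_tasks] for i in range(num_tasks)]

def fields_grouping(tuples, field, num_tasks):
    """Route every tuple with the same value of `field` to the same task,
    which is what lets a downstream bolt keep per-key state like counts."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        # Stable hash, so 'dr-42' always lands on the same task index.
        tasks[sum(ord(c) for c in str(t[field])) % num_tasks].append(t)
    return tasks

stream = [{"entity": "dr-42"}, {"entity": "dr-7"}, {"entity": "dr-42"}]
routed = fields_grouping(stream, "entity", num_tasks=2)
# Both 'dr-42' tuples end up in the same task's list.
assert any(task.count({"entity": "dr-42"}) == 2 for task in routed)
```

Shuffle grouping is the right choice for stateless, parallelizable work; fields grouping is the right choice whenever a task must see all tuples for a given key.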
The storm-cassandra project that Brian mentioned on GitHub started a little over a year ago. What it boils down to is that we have several Storm components that are bolts (and they're also Trident functions; if I have time, I'll talk a little bit about Trident later). We have these generic Cassandra bolts: one that allows you to write to Cassandra and one that allows you to look up.
A tuple mapper tells the Cassandra bolt how to write a tuple to an arbitrary data model. Your data model can be anything you want; all you really need to do is implement that interface. Given a Storm tuple, it maps to a column family, maps your row key, and then maps the data into Cassandra columns. The columns mapper interface does the exact opposite: it tells the Cassandra lookup bolt how to transform a Cassandra row into a Storm tuple.
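As a rough illustration of that mapper contract (the real storm-cassandra interfaces are Java; every name below is invented for the sketch), a mapper answers three questions about a tuple: which column family, which row key, and which columns.

```python
class EntityTupleMapper:
    """Illustrative mapper: tells a writer bolt where a tuple goes."""
    def column_family(self, tup):
        return "entities"

    def row_key(self, tup):
        return tup["entity_id"]

    def columns(self, tup):
        # Everything except the key field becomes Cassandra columns.
        return {k: v for k, v in tup.items() if k != "entity_id"}

def write_tuple(store, mapper, tup):
    # What a generic writer bolt does: it knows nothing about the data
    # model; the mapper supplies all three answers.
    cf = store.setdefault(mapper.column_family(tup), {})
    cf.setdefault(mapper.row_key(tup), {}).update(mapper.columns(tup))

store, mapper = {}, EntityTupleMapper()
write_tuple(store, mapper, {"entity_id": "dr-42", "name": "Robert Smith"})
write_tuple(store, mapper, {"entity_id": "dr-42", "state": "PA"})
print(store["entities"]["dr-42"])  # {'name': 'Robert Smith', 'state': 'PA'}
```

The bolt stays generic; swapping the mapper retargets the same bolt at a completely different data model.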
The current state of the project: we are on 0.4.2, and that should be released soon, maybe tonight, depending on whether I go to the drink-up or not. We're using the Astyanax client. For a while we were using both Astyanax and Hector and allowed you to switch between them at runtime, but when we started looking at supporting composite keys and composite column names, it became a lot easier to settle on a single client.
Another interesting feature of Storm is DRPC. I talked about how you can parallelize and horizontally scale your topologies. What DRPC does is take that power and add a DRPC server, a specialized spout, and a specialized bolt that does the coordination. What you can do is create a DRPC client that issues a query, and that query gets parallelized out into your topology; then, using transactional mechanisms,
it will actually return you a response. So you get all the power of a Storm topology with, essentially, a remote procedure call. One example of that is the reach computation. Let's define reach using Twitter as an example: if I tweet out a URL, and I have X number of followers, and each one of those has followers, the reach is the total number of people that have been exposed to that URL.
Let's say we are Twitter, and we have massive numbers of users and followers and a huge graph. What you can do with Storm and DRPC and our Cassandra implementation is break that down into several simpler functions that we can then parallelize. In this example, we have a DRPC spout that's just going to receive a URL; the URL is our query.
We want to find out how many users have been exposed to that. It gets passed down to a Cassandra bolt, and that Cassandra bolt is sitting on top of a wide row of valueless columns: the row key is just the URL, and the column names (it's valueless, so the values don't matter) are the users who have tweeted it. What that does is emit the URL and the user for each one of those, and send it to the followers bolt.
What the followers bolt does is, for each user it receives, look that user up and emit all the followers of that individual user. So now we've found that out and we pass it on. We do a fields grouping based on the ID and the follower, and that goes to a partial uniquer bolt, which gathers up partial batches of counts, and then, finally, to a count aggregator.
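Collapsed into one function, the reach pipeline above looks like this. This is a sequential toy with in-memory dicts standing in for the Cassandra wide rows; in the real topology, each step is a bolt and the follower lookups and uniquing are fanned out in parallel.

```python
# Stand-ins for the wide rows: who tweeted each URL, who follows each user.
TWEETED = {"http://x.co": ["alice", "bob"]}
FOLLOWERS = {"alice": {"carol", "dave"}, "bob": {"dave", "erin"}}

def reach(url):
    # Step 1 (Cassandra lookup bolt): wide row of users who tweeted the URL.
    users = TWEETED.get(url, [])
    # Step 2 (followers bolt): emit each user's followers; in Storm this
    # fan-out is parallelized, with a fields grouping for uniquing.
    exposed = set()
    for user in users:
        exposed |= FOLLOWERS.get(user, set())  # set union = uniquing
    # Step 3 (count aggregator): the final tally.
    return len(exposed)

print(reach("http://x.co"))  # 3: carol, dave, erin (dave counted once)
```

The set union is why the topology needs the partial-uniquer stage: dave follows both alice and bob but must only be counted once.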
This is a notional topology; it's not exactly what our solution looks like, but it gives you some idea of how we're using Storm. I mentioned earlier that we treat all incoming data as real-time data. So when we do batches, we read that data and write it to a Kafka queue, and real-time data also comes in the same way: it gets some transformation, goes into a Kafka queue, and then enters one of our topologies.
This is a small subset of the previous one; it's an example of what part of our loading process looks like. We have entity fragments coming in, and you can think of an entity fragment as a smaller piece of a larger entity: it might be address information or state license information. Fragments come in and they're written to Kafka, and we use a transactional Kafka spout to route them into the topology. There are a couple of things going on here.
We shuffle that so we can parallelize amongst multiple fragment functions. Several streams come out of the fragment function, which we group by, for example, fragment ID and entity ID. We do that so we can maintain counts and do analytics kinds of things: one simple count would be the number of fragments we have per entity, the total number of entities processed, the total number of fragments processed. That goes to a persistent aggregate function, and those counts,
B
Through a Trident state implementation, those counts get written to Cassandra. Then we also route another stream to a lookup function, and what that does is look things up in sort of reference tables, lookups to enhance the data. Then it goes to a Cassandra write function, which writes it to Cassandra and passes it on downstream.
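The shuffle, fields grouping, and counting described above can be modeled as a short Python sketch. This is a toy simulation, not Trident code; the field names and fragment records are assumptions for illustration:

```python
from collections import defaultdict

# Toy model of the loading flow: fragments are fields-grouped so the
# same entity always lands on the same counting task, then the partial
# counts are merged by a "persistent aggregate" step (which would
# write to Cassandra via Trident state in the real topology).

def fields_group(key, n_tasks):
    # Fields grouping: hash the grouping field to pick a task, so one
    # entity's fragments always reach the same task.
    return hash(key) % n_tasks

def run(fragments, n_tasks=2):
    per_task_counts = [defaultdict(int) for _ in range(n_tasks)]
    totals = {"fragments": 0, "entities": 0}
    seen_entities = set()
    for frag in fragments:
        task = fields_group(frag["entity_id"], n_tasks)
        per_task_counts[task][frag["entity_id"]] += 1  # fragments per entity
        totals["fragments"] += 1
        if frag["entity_id"] not in seen_entities:
            seen_entities.add(frag["entity_id"])
            totals["entities"] += 1
    # Persistent aggregate: merge each task's partial counts.
    merged = defaultdict(int)
    for task_counts in per_task_counts:
        for entity, c in task_counts.items():
            merged[entity] += c
    return dict(merged), totals

counts, totals = run([
    {"entity_id": "dr-1", "type": "address"},
    {"entity_id": "dr-1", "type": "license"},
    {"entity_id": "dr-2", "type": "address"},
])
print(counts, totals)
```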
B
A
Was that comprehensive? Okay, just one more note, if we can go back here real quick. Some of the things we're talking about are, I think, an evolving area: how to best integrate Cassandra with Storm. You know, we've got this function out there, but the exact composition of the topologies, I think, is something that we don't yet have patterns for. So take a look at this topology, and at some of our other topologies.
A
There are multiple write functions, right? We write to Cassandra here, we write to Cassandra here, and we write to Cassandra there. One of the patterns we've been evaluating is whether you instead construct the mutation as you go through the topology and do one single write at the end. That takes advantage of some of the mutation's properties, so that you can eliminate a little of that eventual consistency: when you write the thing, all the indexes are there immediately.
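The "single write at the end" pattern can be sketched as follows. This is a minimal Python illustration under assumed names (`Mutation`, `STORE`, the column keys); a real implementation would accumulate a Cassandra batch mutation as the tuple flows through the bolts:

```python
# Sketch of the pattern discussed above: instead of each stage writing
# to Cassandra independently, each stage appends to one mutation that
# is applied once, so the row and its index entries appear together.

class Mutation:
    def __init__(self, key):
        self.key = key
        self.columns = {}

    def add(self, column, value):
        self.columns[column] = value
        return self

STORE = {}  # stand-in for Cassandra

def apply_mutation(m):
    # One write at the end: data and index columns land together,
    # instead of trickling in from separate write functions.
    STORE[m.key] = dict(m.columns)

# Each "bolt" enriches the same mutation as the tuple flows through.
m = Mutation("entity:dr-1")
m.add("name", "Dr. Example")         # write function 1
m.add("geo", "40.0,-75.1")           # write function 2 (enrichment)
m.add("index:name", "dr. example")   # index entry built alongside
apply_mutation(m)                    # single write at the end
print(STORE["entity:dr-1"])
```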
A
Actually, we did, yes. And that's what I mean: I vacillate. Some days it's "everything is now in Storm; thou shalt go through the data flow for every CRUD operation that comes in," and that gives you some sanity. Other days I'm like, no, that's terrible, because now I've got to go through the data flow for every CRUD operation.
B
There are definitely trade-offs everywhere. The indexing solution we started with didn't scale for us from a business standpoint. Now, you could argue that doing everything through Storm involves a lot of network traffic, but in general Storm is extremely performant.
B
Because Nimbus is essentially the master. Basically, with Storm you have a master, which is responsible for assigning work out to the nodes, and then you have supervisors. We run both of those, and it's standard practice to run them under supervision; they're designed to fail fast, so if they go down they can be picked back up. If Nimbus goes down, it'll come back up, and if you lose the machine that Nimbus is running on, your topology will continue to flow and work.
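As an illustration of "run under supervision," here is a minimal supervisord config that restarts the fail-fast daemons when they exit. The paths and program names are assumptions; adjust them to your install:

```ini
; Hypothetical supervisord entries for the Storm daemons.
[program:storm-nimbus]
command=/opt/storm/bin/storm nimbus
autorestart=true

[program:storm-supervisor]
command=/opt/storm/bin/storm supervisor
autorestart=true
```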
A
A
A
B
My understanding is that that's coming up. There is replication, you can operate them as a cluster, and you get some partitioning, but I'd have to look at the current version of Kafka. I think that's coming in a new version that provides some enhanced fault tolerance. Yeah, I believe so.
A
Yeah, yep, yeah. It's a great point about Storm.
B
Topologies basically get packaged up as a jar file, a fat jar file, not unlike what you do with Hadoop. They get submitted to the cluster, and Nimbus will pass out the jar file to all the workers. As far as redeploying, yeah, changes to rules would be a redeployment of your topology. Well, yeah, in some ways.
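Submitting a topology as described is a one-line operation with the Storm CLI; a hedged example, where the jar path, main class, and topology name are made up:

```shell
# Package the topology as a fat jar, then submit it to the cluster;
# Nimbus distributes the jar to all the workers.
storm jar target/my-topology-0.1.jar com.example.MyTopology my-topology
```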
A
B
In some cases, but you can build things in. You can do anything you want in a bolt or spout, so you can build in something like that. There's also another project up on GitHub, storm-signals, and what that lets you do is send out-of-band signals to components of your topology, so they can essentially communicate with one another.
A
I guess in a previous presentation, and we've done the same thing: in our world, as tuples are flowing through the system, as the data is flowing through, we have a bolt that we call the processing engine that actually calls out to Ruby; it hands the tuple to the Ruby script.
A
The Ruby can do whatever the hell it wants to transform the data, do anything, and then it hands it back. That thing loads the Ruby from a URL, so we can replace the logic at runtime right out from underneath it. It has a five-minute expiration, so if we update the Ruby, it'll clear its cache, pull the new Ruby in, and the new logic will take effect immediately. In the earlier presentation they were using Drools for the same thing.
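The hot-swap mechanism described above (fetch the script, cache it, re-fetch after a TTL) can be sketched in Python. The `fetch` function and the injected clock are assumptions so the idea is testable; the real bolt would fetch Ruby source from a URL and execute it:

```python
import time

# Sketch of hot-swappable logic: the bolt caches the script it fetched
# and re-fetches once the TTL expires, so updated logic takes effect
# without redeploying the topology.

class CachedScript:
    def __init__(self, fetch, ttl_seconds=300, clock=time.time):
        self.fetch = fetch      # e.g. lambda: urlopen(url).read()
        self.ttl = ttl_seconds  # five minutes in the talk
        self.clock = clock
        self.cached = None
        self.fetched_at = None

    def get(self):
        now = self.clock()
        if self.cached is None or now - self.fetched_at >= self.ttl:
            self.cached = self.fetch()  # cache expired: pull new logic
            self.fetched_at = now
        return self.cached

# Simulated clock and "remote" script to show the expiry behavior.
fake_now = [0.0]
versions = iter(["logic-v1", "logic-v2"])
script = CachedScript(lambda: next(versions), ttl_seconds=300,
                      clock=lambda: fake_now[0])
print(script.get())   # fetches v1
fake_now[0] = 60
print(script.get())   # still cached: v1
fake_now[0] = 301
print(script.get())   # TTL passed: fetches v2
```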
A
B
Yeah, depending on what your topologies are doing, it's very easy to overload certain systems, especially with Kafka. For example, Kafka could probably outpace writes to an Oracle database. So what you can tell Storm to do: Storm has guaranteed processing, so you can say a tuple must pass through the entire bolt tree before it gets acknowledged, and if it doesn't, replay it, when you're using that type of topology.
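The ack-or-replay guarantee can be modeled with a short Python toy. This is not Storm's implementation (which tracks acks per tuple tree via ackers); it just shows the contract, with a hypothetical flaky bolt to force one replay:

```python
# Toy model of guaranteed processing: a tuple must be acknowledged by
# every bolt in the tree, or the spout replays it.

def process(tuple_, bolts, max_replays=3):
    for attempt in range(1 + max_replays):
        # Each bolt returns True (ack) or False (fail); the tuple is
        # fully acked only if the entire bolt tree acks it.
        if all(bolt(tuple_) for bolt in bolts):
            return attempt  # number of replays that were needed
    raise RuntimeError("tuple failed after replays")

calls = {"n": 0}
def flaky_bolt(t):
    calls["n"] += 1
    return calls["n"] >= 2   # fails the first time, then succeeds

replays = process("fragment-1", [lambda t: True, flaky_bolt])
print(replays)  # first pass failed in flaky_bolt, so one replay
```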