From YouTube: C* Summit 2013: Crossing the Chasm - SQL to NoSQL
Description
Speaker: Isaac Rieksts, Software Development at Health Market Science
Slides: http://www.slideshare.net/planetcassandra/1-isaac-rieksts
Over the past few years, Health Market Science has transitioned from traditional relational databases and enterprise systems to a massively scalable Big Data platform that combines Cassandra and Storm to ingest thousands of feeds of data from the health market industry to produce a single high-quality masterfile. Come hear the "Why?", "What for?" and "How?" of that evolution.
Alright, hello everyone. I'm a software developer at Health Market Science, and today I'm talking about how we made the transition from a relational database to a NoSQL database.
The first thing I want to talk about a little bit is what we do at Health Market Science. We manage a lot of practitioner data: we take in many sources and consolidate them down into a single view of a practitioner. We use this to work with pharmacies and other healthcare industries to cleanse their data and reduce fraud.
One example of this: if you go into your local pharmacy and ask for a prescription, we provide the data to check whether that doctor is able to prescribe or not. Another thing we do is reporting: if you have healthcare data and you need to report to the government, we'll help you generate a lot of those reports. The last thing I want to mention is that we also provide key market insight, such as drug targeting, for other healthcare industries.
A
Now
the
first
thing
when
talk
about
is
our
business
is
divided
into
two
segments.
The
first
segment
is
master
data
management,
the
second
one
being
our
claims
warehouse
now
on
the
both
of
these
segments
have
the
3
v's
variety
velocity
and
volume
on
the
master
data
management
side.
We
have
variety
and
velocity,
so
we
take
in
over
2,000
sources
of
data
with
all
with
different
schemas,
and
each
of
these
schemas
is
changing
over
time.
So
it
may
not
be
the
same
schema
between
two
different
files
we
receive.
Many of you may not be familiar with master data management, so I'm just going to give a quick overview of how the data flows through our system. The main idea behind master data management is that we have many sources and we want to provide a consistent view of that data, that is, precise information about a doctor. We get data in from our customers, who may contribute back to our data; we get information in from the web, crawling hospital websites; and we get data from the government.
We take all these sources, and then we validate and standardize the data: we throw out bad data, and then we standardize what's left into a standard format. Take phone numbers: some people write phone numbers with spaces, some have dashes, some have nothing in between at all, and we need to bring all those phone numbers into a standard format. We do this with many other fields as well.
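A minimal sketch of what a standardization step like this might look like. The digits-only canonical form and the class name are my own illustrative choices, not Health Market Science's actual code:

```java
public class PhoneStandardizer {
    /** Collapse spaces, dashes, or any other separators down to digits only. */
    public static String standardize(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(standardize("610 555 1234")); // spaces
        System.out.println(standardize("610-555-1234")); // dashes
        System.out.println(standardize("6105551234"));   // nothing in between
        // all three print 6105551234
    }
}
```

The same pattern (one pure function per field) extends to names, addresses, and the other fields mentioned.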
After we've standardized the data, we need to match data together: we're going to say that these two sources are talking about the same practitioner. One way, as a simple example, is last name. Obviously that's a very simple example that we would not actually use; our matching algorithms are proprietary. After we've determined that two (or more) sources are indeed talking about the same practitioner, we need to consolidate them into a single view of that practitioner, and we'll walk through an example.
Say we have a record John David Smith and a record Mike Steve Smith, and we've said these two matched and are the same practitioner; call them source A and source B. If source B has a good first name, we trust source B for its first name, so we bring Mike over into the consolidated record. From source A we know the middle name is good, so we bring that middle name over. The last name is the same, so it doesn't matter which one we bring over, and that is the consolidated view of our record.
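The step above can be sketched as a field-level merge, where each field of the consolidated record is taken from whichever source is trusted for that field. The trust table and record shape here are invented for illustration; the real survivorship rules are proprietary, as noted.

```java
import java.util.HashMap;
import java.util.Map;

public class Consolidator {
    /** Build the consolidated record by taking each field from its trusted source. */
    public static Map<String, String> consolidate(Map<String, String> sourceA,
                                                  Map<String, String> sourceB,
                                                  Map<String, String> trustPerField) {
        Map<String, String> merged = new HashMap<>();
        for (Map.Entry<String, String> e : trustPerField.entrySet()) {
            Map<String, String> winner = "A".equals(e.getValue()) ? sourceA : sourceB;
            merged.put(e.getKey(), winner.get(e.getKey()));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("first", "John", "middle", "David", "last", "Smith");
        Map<String, String> b = Map.of("first", "Mike", "middle", "Steve", "last", "Smith");
        // Trust B for the first name, A for the middle; the last name ties either way.
        Map<String, String> trust = Map.of("first", "B", "middle", "A", "last", "A");
        Map<String, String> merged = consolidate(a, b, trust);
        System.out.println(merged.get("first") + " " + merged.get("middle") + " " + merged.get("last"));
        // Mike David Smith
    }
}
```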
In our legacy system, this process took a week. We had a relational database and we used JBoss with JBoss MQ; a lot of the time taken was due to the data checks and analytics performed at the back end. To speed this up, we introduced Cassandra into our stack, and Cassandra is now our primary system of record. We use Dropwizard for our REST services, which I'll talk about a little bit in a few slides.
Alright, I wanted to talk briefly about our data model. When we went from a relational database to NoSQL, we switched from thinking about joins to thinking about the entity as a full entity. We build up that record over time: as these sources come in, there will be old pieces of that record, and then, when we fetch the data out of Cassandra, we get the full record. We found that the primary key determined a lot about how we query our database.
What it does is a two-phase fetch. In the first phase you get your primary keys based on the index, and then, based on those primary keys, you can do a direct fetch to get your data. I'm going to walk through an example of the flow. Say we have column one, column two as our key, and our index is column two, column one: you can fetch from your application to the index, get back your list of keys, and then you can do some filtering on those keys.
Now, a little more concrete example. The key is first name, last name; the index is on last name, first name. Our data is John Smith, Steve Smith, and David Jones. When you look at the index, you can ask for Smith, and you'll get back John Smith and Steve Smith, but you will not get back David Jones; if you ask for Jones, you get back a single record. We integrated this into our system here.
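A toy version of that two-phase flow, with plain maps standing in for the index and the data column families. The names and record strings are illustrative, not the actual schema:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseFetch {
    // index column family: last name -> primary keys ("first|last")
    private final Map<String, List<String>> index = new HashMap<>();
    // data column family: primary key -> full record
    private final Map<String, String> data = new HashMap<>();

    public void put(String first, String last, String record) {
        String key = first + "|" + last;
        data.put(key, record);
        index.computeIfAbsent(last, k -> new ArrayList<>()).add(key);
    }

    public List<String> fetchByLastName(String last) {
        List<String> results = new ArrayList<>();
        // phase 1: look up the primary keys in the index
        for (String key : index.getOrDefault(last, List.of())) {
            results.add(data.get(key)); // phase 2: direct fetch by key
        }
        return results;
    }

    public static void main(String[] args) {
        TwoPhaseFetch store = new TwoPhaseFetch();
        store.put("John", "Smith", "John Smith record");
        store.put("Steve", "Smith", "Steve Smith record");
        store.put("David", "Jones", "David Jones record");
        System.out.println(store.fetchByLastName("Smith")); // both Smiths, not Jones
        System.out.println(store.fetchByLastName("Jones")); // a single record
    }
}
```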
So we have our tab files, which come in, and then our relational database has triggers; we added a lot of triggers to that relational database. The data goes into Cassandra, and at that point we extract it into the client deliverable. Then, in our real-time system, we started adding Storm into the mix, so now Storm does the loading.
We either place the entire row onto the JMS queue with the system time and then write to Cassandra using that time, or we write just the ID to the JMS queue and then query back to Oracle for the actual data. Both approaches work pretty well; there are cons to both. In the first approach, you have to make sure your clocks are very closely aligned when you're on Oracle RAC, or else one node will tend to win the concurrency race over the other.
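The clock sensitivity in that first approach comes from last-write-wins resolution: the store keeps whichever write carries the higher client timestamp, so a RAC node whose clock runs ahead keeps winning races. A minimal sketch of that behavior, with all names illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class LastWriteWins {
    static final Map<String, String> values = new HashMap<>();
    static final Map<String, Long> timestamps = new HashMap<>();

    /** Keep a write only if its timestamp is at least as high as what we have. */
    static void write(String key, String value, long sysTime) {
        Long existing = timestamps.get(key);
        if (existing == null || sysTime >= existing) {
            values.put(key, value);
            timestamps.put(key, sysTime);
        }
    }

    public static void main(String[] args) {
        write("row1", "from node A", 1000); // node A, clock on time
        write("row1", "from node B", 950);  // node B wrote later, but its clock lags
        System.out.println(values.get("row1")); // from node A -- B's later write lost
    }
}
```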
Now that we have the system in place, I want to talk about how we test it. We use an in-memory mock Cassandra: we have found that Cassandra is very easily modeled in memory with a Map&lt;String, Map&lt;String, Map&lt;String, String&gt;&gt;&gt;, which is keyspace, column family, row key, and then the value is either a String or an Object, depending on the complexity. We handle a lot of strings, since our data is very unstructured.
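A sketch of that nested-map shape, with keyspace, column family, and row key as the three levels. The class and method names are my own, not the actual test double:

```java
import java.util.HashMap;
import java.util.Map;

public class MockCassandra {
    // keyspace -> column family -> row key -> value
    private final Map<String, Map<String, Map<String, String>>> store = new HashMap<>();

    public void put(String keyspace, String columnFamily, String rowKey, String value) {
        store.computeIfAbsent(keyspace, k -> new HashMap<>())
             .computeIfAbsent(columnFamily, k -> new HashMap<>())
             .put(rowKey, value);
    }

    public String get(String keyspace, String columnFamily, String rowKey) {
        return store.getOrDefault(keyspace, Map.of())
                    .getOrDefault(columnFamily, Map.of())
                    .get(rowKey);
    }

    public static void main(String[] args) {
        MockCassandra mock = new MockCassandra();
        mock.put("hms", "practitioners", "npi:123", "{\"last\":\"Smith\"}");
        System.out.println(mock.get("hms", "practitioners", "npi:123"));
    }
}
```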
For our integration tests, we have a superclass that starts up our Cassandra and migrates the schema, then goes into the unit test itself; after the test is done, control comes back to the superclass to clean up Cassandra and bring it back down.
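That setup/teardown flow can be sketched as a template method. The embedded-server calls here are stand-ins that just log their phase, since the real startup and schema-migration code wasn't shown; in practice this would hang off the test framework's lifecycle hooks.

```java
public class CassandraTestHarness {
    final StringBuilder log = new StringBuilder();

    void startEmbeddedCassandra() { log.append("start;"); }   // stand-in for real startup
    void migrateSchema()          { log.append("migrate;"); } // stand-in: create keyspaces/CFs
    void cleanUpAndShutDown()     { log.append("cleanup;"); } // stand-in for teardown

    /** Template method: setup, test body, teardown, as described in the talk. */
    String execute(Runnable testBody) {
        startEmbeddedCassandra();
        migrateSchema();
        try {
            testBody.run();
        } finally {
            cleanUpAndShutDown(); // always runs, even if the test fails
        }
        return log.toString();
    }

    public static void main(String[] args) {
        CassandraTestHarness harness = new CassandraTestHarness();
        System.out.println(harness.execute(() -> harness.log.append("test;")));
        // start;migrate;test;cleanup;
    }
}
```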
That leads me into how our quality assurance changed a little bit with this. We write REST services for almost all of our services, which allows for SoapUI integration very easily, so we use SoapUI.
QA may find bugs there, and we'll fix both the SoapUI bugs and the software bugs. Then, when that software is ready, it gets promoted to a test environment where we have Jenkins run the SoapUI suite through Maven, and we get automatic regression at that point. Now, as we've been using Cassandra, we found that team structure was very important. We added development operations to our existing team: in the early days, when we were first using Cassandra, we didn't have a dedicated DevOps resource, and we found this was a struggle.
We added a DevOps person to manage our Puppet scripts, and this helped smooth things out quite a bit. I also want to talk briefly about our infrastructure and how we actually deploy our software. We use VMware to deploy our boxes. I know virtualization is very much frowned upon, but the main reason we use virtualization is so we can spin up a cluster at any point.
The Puppet master places a message on the queue, which then comes in and installs the box, or you can use Jenkins, so at the push of a button the box comes up to whatever version we need. That means the same script that was run in the development environment is then run in production; the only thing that changes is the host names in the config. You may have a bigger box in production, so you would have some config tweaks.
Now I wanted to switch gears and talk about our real-time system and walk you through that architecture. We start with a REST API, built with Dropwizard, and that goes onto a Kafka queue: we put all the data into Kafka, and then Storm can rapidly pick that data up off the Kafka queue.
Storm writes it down to Cassandra, and then we also write a copy of that data in JSON format to Elasticsearch. What Elasticsearch does for us, out of the box, is give a Google-like search experience across all of our data and all of our practitioners; we also do some custom things with Elasticsearch as well.
Now, why did we choose Storm? It has guaranteed (at-least-once) delivery for all messages, and it's also a very fast distributed system: we write our business logic and our data flow against the Java API, and then Storm distributes that for you when you scale. It also has a lot of momentum in the community right now. To integrate Storm and Cassandra, we created an open source project, storm-cassandra.
This integration piece is done through an interface with a mapper: you implement one interface, and that interface determines how you translate your business object from Storm to Cassandra, and then back from Cassandra to Storm.
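A sketch of that mapper idea: one interface decides how a business object becomes a row (a key plus columns) on the write path and how a row becomes the object again on the read path. The interface and method names here are illustrative, not storm-cassandra's actual API.

```java
import java.util.Map;

public class MapperSketch {
    /** The single interface: object -> row for writes, row -> object for reads. */
    interface TupleMapper<T> {
        String rowKey(T obj);
        Map<String, String> columns(T obj);
        T fromRow(String rowKey, Map<String, String> columns);
    }

    static class Practitioner {
        final String npi;
        final String lastName;
        Practitioner(String npi, String lastName) { this.npi = npi; this.lastName = lastName; }
    }

    static final TupleMapper<Practitioner> MAPPER = new TupleMapper<Practitioner>() {
        public String rowKey(Practitioner p) { return p.npi; }
        public Map<String, String> columns(Practitioner p) {
            return Map.of("last_name", p.lastName);
        }
        public Practitioner fromRow(String rowKey, Map<String, String> cols) {
            return new Practitioner(rowKey, cols.get("last_name"));
        }
    };

    public static void main(String[] args) {
        Practitioner p = new Practitioner("1234567890", "Smith");
        // round-trip: object -> row -> object
        Practitioner back = MAPPER.fromRow(MAPPER.rowKey(p), MAPPER.columns(p));
        System.out.println(back.npi + " " + back.lastName); // 1234567890 Smith
    }
}
```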
I want to talk briefly about our vision for the future. We are currently experimenting with graph databases: we have Neo4j in a POC, and we're looking at Titan to see if we can have a single persistence layer, to help on the DevOps side, instead of having yet another store.
Right now we have Cassandra and Elasticsearch, and then we would either go with Neo4j or Titan; we have very related data, so that's what we're looking at. The last thing I want to say, just in summary: we have the cassandra-indexing project, which gives you alternate keys; we have Oracle Advanced Queuing, which helped with the integration between the relational database and NoSQL; and then we have our storm-cassandra project, which gave us the integration between Storm and Cassandra. With that, are there any questions? Yes?
The speed: Kafka is very performant; it's pretty much the fastest queue out there that we've found. And it's consumer-based: with Kafka, you put all the data into Kafka, and then the consumer keeps track of where it is in the queue. You have the offset, this offset here at the bottom left, and the consumer keeps track of that inside ZooKeeper; ZooKeeper makes sure that two consumers don't have the same offset, and that increases performance quite a bit. Any other questions?
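The consumer-tracked offset model can be sketched like this: the log only appends, and each consumer remembers its own position (in the real system, that position lives in ZooKeeper). Names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetLog {
    final List<String> log = new ArrayList<>();           // the Kafka-like append-only log
    final Map<String, Integer> offsets = new HashMap<>(); // consumer -> position

    void publish(String message) { log.add(message); }

    /** Return everything past this consumer's offset, then advance the offset. */
    List<String> poll(String consumer) {
        int from = offsets.getOrDefault(consumer, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        offsets.put(consumer, log.size()); // the consumer tracks its own place
        return batch;
    }

    public static void main(String[] args) {
        OffsetLog q = new OffsetLog();
        q.publish("a");
        q.publish("b");
        System.out.println(q.poll("storm"));  // [a, b]
        q.publish("c");
        System.out.println(q.poll("storm"));  // [c]
        System.out.println(q.poll("audit"));  // [a, b, c] -- independent offset
    }
}
```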
What I can tell you is that in production we have 107, and in UIT we have one-to-one. Yeah.
[Inaudible audience question]
We do consistency level QUORUM, and we haven't faced issues with it. We've certainly had Cassandra nodes go unavailable, and then Storm will replay those writes if there's a failure and just rewrite them. Storm guarantees at least one success: it will make sure that if there's a failure, the write gets retried, and if it completely fails out, then it goes to a dead-letter queue.
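A minimal sketch of that retry-then-dead-letter behavior. The retry count and names are my own; in the real system Storm's ack/fail mechanism drives the replay rather than a local loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class RetryWithDeadLetter {
    final List<String> deadLetter = new ArrayList<>();

    /** Retry a failing write a few times; if it completely fails out, dead-letter it. */
    void process(String message, int maxAttempts, Consumer<String> write) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                write.accept(message);
                return; // success: the message is acked
            } catch (RuntimeException e) {
                // failure: Storm would replay the tuple; here we just loop and retry
            }
        }
        deadLetter.add(message); // completely failed out, but not lost
    }

    public static void main(String[] args) {
        RetryWithDeadLetter r = new RetryWithDeadLetter();
        r.process("good", 3, m -> {});                                      // succeeds
        r.process("bad", 3, m -> { throw new RuntimeException("down"); });  // never succeeds
        System.out.println(r.deadLetter); // [bad]
    }
}
```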
[Inaudible audience question]
It's only current data. So, "current" is: we have six million active practitioners, but we keep track of them throughout the ten years, so given any point in time we can say what a practitioner looked like. The data tends not to grow much; there are not a lot of new practitioners entering. I hope that answers your question.
There's one in the back. I'm not sure how many questions we have time for; we think we have time for this one.
[Inaudible audience question]

In the old system, we would do data checks by running through the entire universe, looking for data that was inconsistent, didn't line up, or got changed wrong. In the new system, we use Storm to run those data checks as the data changes, which allows us to work on much smaller pieces of data. The other thing is just raw lookup time: we found the relational database was much slower for individual fetches, so Cassandra sped up the individual row fetches.
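The difference can be sketched as check-on-change versus a full scan. The specific check (a ten-digit phone field) and all names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataChecks {
    /** Example check: the field must be a ten-digit phone number. */
    static boolean passes(String value) {
        return value.matches("\\d{10}");
    }

    /** Old style: sweep the entire universe on every run. */
    static List<String> fullScan(Map<String, String> universe) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> e : universe.entrySet())
            if (!passes(e.getValue())) bad.add(e.getKey());
        return bad;
    }

    /** New style: check only the record that just changed. */
    static boolean checkOnChange(String key, String newValue, List<String> bad) {
        if (passes(newValue)) return true;
        bad.add(key);
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> universe = new HashMap<>();
        universe.put("npi:1", "6105551234");
        universe.put("npi:2", "555-1234");
        System.out.println(fullScan(universe)); // [npi:2]

        List<String> bad = new ArrayList<>();
        checkOnChange("npi:3", "302555", bad); // fails, and only one record was touched
        System.out.println(bad); // [npi:3]
    }
}
```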