From YouTube: C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to Cassandra
Description
Speaker: Mohit Anchlia, Architect at Intuit
Slides: http://www.slideshare.net/planetcassandra/3-mohit-anchlia
This session covers Intuit's journey with its Consumer Financial Platform, which is built to scale to petabytes of data. The original system used a major RDBMS; from there, we redesigned it to use the distributed nature of Cassandra. This talk walks through our transition, including the data model used for the final product. As with any large system transition, many hard lessons were learned, and we will discuss those and share our experiences.
I'll give some background on the products that Intuit develops, then we'll go into the problem statement, and we'll also talk about why we developed the Consumer Financial Platform and why we chose Cassandra. We'll look at the CFP stack — CFP is basically the Consumer Financial Platform — and I'll briefly go over the CFP data model. We'll also look at some of the learnings we had in production, and then we'll open up for Q&A. However, if you have any questions in between, just raise your hand and I can answer them.
So, to give you some background on Intuit: it is the maker of TurboTax, and we make QuickBooks and other small-business products. I'll pause here and ask a question: how many of you here file your own taxes? Okay. And how many of you here use TurboTax? Good. And how many of you here like TurboTax — say yes! Okay. So basically we have various services that work behind the scenes. They're obviously not visible to the end user, and they work together to deliver an awesome product experience.
But what has happened over the years? About two years back, we started to look at Intuit overall and took an account of all the services. What we found was that we had many services built over the years, and as these services came along, they created their own technology stacks. Basically, every service that got built had its own databases, its own operations team, its own machines and hosts, and all of those things; and as new sites came along, they also coded directly to these services. That led to some of the problems we started seeing. One of them is code duplication: there are certain core services that our products need, and mostly any new product, or even an existing product, would go to these services, and we had code duplication across all the sites.
This also created data silos, and it became really difficult. If you have to analyze massive amounts of data, how do you do that? Now you have to coordinate between different teams. You have to ask for a certain set of queries, then you have developers go and run those queries, and you have to coordinate across various teams. That's very difficult to do. Other problems arose if and when we had schema design changes — say, an XML schema change.
One of the other things we noticed was the time it took for new sites to come on board and build new products. Basically, there was added overhead: if you are trying to put something out in production, you first have to build against these core services. So the prototyping itself wasn't that fast, and we wanted to eliminate that as well. [Audience question] Go ahead — oops, I'm sorry! These are websites — so "sites" means websites, for instance TurboTax Online.
Okay, so looking at that, we knew we had to do something here. We had to make some changes so that it's easy for client teams, easy for sites, easy for products to integrate into the platform. So we said, why not introduce a platform? This platform is going to abstract most of the services, most of the functionality, out from the sites and from the products. This is similar to any other service-oriented architecture concept — I mean, we were using that concept somewhat before, but this one really integrates all the services and provides a services tier and a data tier, so sites and products and apps don't have to worry about creating their own services. They don't have to worry about having their own operations team. They don't have to worry about managing data, and so on. Developers just focus on writing business logic, and it also speeds up the process of developing new products.
So by doing that, the benefits we get are: now, since you have your data in one place and your logic in one place, you can provide highly personalized services to end users. You understand your end users' behavior even better, which helps in giving more personalized services, and it also helps in delivering secure services to them.
It also helps in bringing new sites and new concepts into production quickly. You don't need an operations team to do that; it's all managed by the platform. So overall, the reason we developed this platform is so that we can abstract out a lot of the services; the functionality that deals with orchestrating the services over the data, as well as some of the functionality between the site and the data, is completely abstracted out in this manner.
[In response to an audience question] Yes, we do have guiding principles for those, and depending on what kind of data it is, we follow different security rules for that type of data, and access is defined based on that. That's why I mentioned that by having this data together, you can do it much more effectively. You don't have to worry about coordinating between various services to make sure they are doing the right thing.
Yeah — so the platform's users are websites and products, but obviously our end users are the people filing taxes using those products. So what it really means is that, since it's a platform, we manage the entire infrastructure — we manage the infrastructure for our sites and for our products. That's what that means.
So this platform is kind of a mix of platform-as-a-service and software-as-a-service. I'll talk about the data platform tier, which is more relevant for this session. Some of the principles we laid out for the data platform tier were that it needs to be highly available, highly scalable, fast, easy to operate, and a software-only solution. Software-only is really important, because you might be operating in your own data center, or you might go out to Amazon or any other data center.
So appliances were out of the question, and we also needed to support structured and unstructured data — and these are massive amounts of data. The projection for just this platform alone, for the next two to three years, is multiple petabytes, and we are about to hit that in the first year. So it's massive amounts of data, and we have to support five nines of an SLA. Easy, right? You can do all that.
So we said, okay, let's look at a traditional RDBMS and see if we can find a solution there, because that's something everyone is familiar with. It provides a SQL interface, it has all the common tools, and it has been used for several years. So we looked at it. But based on our experience, we know there are challenges with a traditional RDBMS. When your database starts out new, all is good; as you add more data, your database grows, your indexes grow, and you start to have slowness. There are various reasons why you could have slowness: volume is one, locking is another, and you might have a slow I/O subsystem. In traditional databases, you would generally use either a NAS or a SAN solution, but that is still shared with other applications as well. And then you have a big day, and there's a crash.
Sharding is good, it works well, but there are some problems. When you have huge amounts of data, you obviously need a lot of instances, but that's not the problem. The problem is that you have to manage all the sharding logic in your code. Then you have to worry about how to avoid a single point of failure, so you add slaves. Once you add slaves, your operations change dramatically: you are just multiplying the same number of nodes, and you have slave nodes you're not even writing to — which itself is not a good pattern. And then, if you have a failure, you have to worry about failing that node over; once it recovers, you have to worry about failing it back; and then you have to worry about syncing it, not to mention everything else.
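The sharding burden described above — routing logic living in application code rather than the database — can be illustrated with a minimal sketch. The modulo routing and the in-memory "shards" here are hypothetical stand-ins, not Intuit's actual code:

```python
# Minimal illustration of application-managed sharding: the application,
# not the database, decides which shard holds a given user's rows.

def shard_for(user_id: int, num_shards: int) -> int:
    """Hash-style routing the application must maintain itself."""
    return user_id % num_shards

class ShardedStore:
    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        # Dicts stand in for connections to separate database instances.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, user_id: int, row: dict) -> None:
        self.shards[shard_for(user_id, self.num_shards)][user_id] = row

    def get(self, user_id: int):
        return self.shards[shard_for(user_id, self.num_shards)].get(user_id)

store = ShardedStore(num_shards=4)
store.put(42, {"name": "alice"})
print(store.get(42))  # → {'name': 'alice'}
```

Note that changing `num_shards` silently invalidates every routing decision already made — one concrete form of the operational pain the talk describes.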
So NoSQL — easy? Not that easy. You look at the NoSQL definition and, first of all, when we looked at it, we cringed. We said: how is this going to work? It doesn't have any ACID guarantees, right — atomicity, consistency, isolation, and durability. It's something that takes time to get over. Transactions have been used historically, but there are ways you can do the same thing without having transactions, and it works really well if your workload is a one-transaction, single-row-update type of workload, because in that case you can manage it one way or the other. It still is not the same as taking a lock and having a transaction, because it doesn't provide any rollback capabilities in Cassandra.
As you know, atomicity in Cassandra is limited to a given row. If you are making any changes within a single row, Cassandra makes sure it's atomic, but you cannot combine multiple inserts into one transaction — that's not there, and for good reason: transactions are inherently one of those things that limit your scalability. There are ways you can do transactions built on top of Cassandra — there are some tricks you can follow — but essentially it's something you should try to keep away from.
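The distinction above — one row updates atomically, multiple rows do not — can be sketched with a toy in-memory model. This is an illustration of the semantics, not the Cassandra API:

```python
# Toy model of Cassandra-style atomicity: all columns written to ONE row
# apply together, but writes spanning two rows are independent operations
# with no transaction (and no rollback) tying them to each other.

table = {}

def write_row(key: str, columns: dict) -> None:
    """Atomic per row: the row is swapped in whole, never half-updated."""
    new_row = dict(table.get(key, {}))
    new_row.update(columns)
    table[key] = new_row  # single assignment = all-or-nothing for this row

# One row, several columns: applied atomically.
write_row("user:1", {"balance": 100, "updated": "2013-06-11"})

# Two rows: two separate writes. If the process died between them,
# the first write would remain -- nothing rolls it back.
write_row("acct:a", {"balance": 50})
write_row("acct:b", {"balance": 150})
```

This is why workloads that touch one row at a time (as the talk describes next) fit Cassandra naturally, while cross-row invariants need client-side handling.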
So we looked at our core use cases, and what we found was that we actually don't really need transactions, because we have an OLTP type of workload: it's mostly single-row inserts and updates, and in most cases we don't actually need a transaction. What we did is put logic on the client side.
So, for instance, if we had a failure — if a change was made to the database that was invisible to the client, and there was a failure on the client — then on the next read we'd detect that. Basically, we check for it and throw an error back to our clients.
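One hedged reading of that client-side check: tag each write with a client-generated version, and on the next read compare what the store holds against what the client believes it wrote. The class and field names here are illustrative, not the actual CFP code:

```python
# Illustrative read-time validation: the client remembers the version it
# intended to write; a later read detects a mismatched write and surfaces
# an error instead of silently returning unexpected data.

class StaleWriteError(Exception):
    pass

class CheckedClient:
    def __init__(self, store: dict):
        self.store = store                 # stand-in for the database
        self.expected_versions = {}        # client-side bookkeeping

    def write(self, key: str, value, version: int) -> None:
        self.expected_versions[key] = version
        self.store[key] = {"value": value, "version": version}

    def read(self, key: str):
        record = self.store[key]
        expected = self.expected_versions.get(key)
        if expected is not None and record["version"] != expected:
            raise StaleWriteError(
                f"{key}: saw v{record['version']}, expected v{expected}")
        return record["value"]
```

A client that reads back a version it did not expect raises `StaleWriteError`, which is then propagated to the caller — mirroring the "detect on the second read and throw the error back" behavior described above.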
MongoDB is really good when you are starting up. It's easy to set up and pretty straightforward: you just have a simple JSON schema that you use to ingest data. But the problem really occurs when you have huge amounts of data and you have to scale your nodes; at that point you start to see issues with MongoDB.
The way MongoDB handles certain things — like setting up replication and sharding data — is, I would say, not ideal. Going back to our principles, we wanted something that's easy for our operations team to manage, so it's something we couldn't use because of that. HBase, again, is really good. It's equally fast — we did benchmarking between HBase and Cassandra, and they were pretty comparable — except that HBase favors consistency over availability, and one of our principles obviously was to have a highly available system.
So HBase is something we decided not to use, but we do use it for our analytics platform. Cassandra — so why Cassandra? You can scale it horizontally and it's highly available. You can design it such that it has no single point of failure, and it's easy to set up clusters between data centers.
You can just add nodes. All you have to do is define the configuration in a properties file and throw it out there; as long as you have the proper ACLs open, your cluster is up and running. It provides fast snapshots, and one of the other things is that you're able to do rolling upgrades — from, I think, Cassandra 1.0 onwards, you can do rolling upgrades without worrying about backward compatibility — and operationally, this is really key.
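The "define the configuration and throw it out there" step boils down to a handful of settings per node. A minimal sketch in the style of `cassandra.yaml` (the values are placeholders; the key names follow the standard Cassandra configuration file, which may differ from the properties file the talk refers to):

```yaml
# Minimal per-node settings, cassandra.yaml style (placeholder values).
cluster_name: 'cfp_cluster'
listen_address: 10.0.0.12            # this node's own address
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.10,10.0.0.11" # well-known nodes to bootstrap from
endpoint_snitch: PropertyFileSnitch  # maps nodes to data centers and racks
```

A new node pointed at the seed list discovers the rest of the ring by gossip, which is what makes "just add nodes" work.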
Once you are in production, it's really important that your operations team is able to look at Cassandra's performance, is able to do operations on Cassandra, is able to install and upgrade Cassandra quickly. And if you give it the right hardware, you can get really good low-latency response times out of Cassandra.
It helps us orchestrate business logic and business rules between services, so you can basically create a flow between various services. For instance, if a user is logging in, you might have to go to a different service to create a ticket, then you might have to log it in the database. All those services are orchestrated in Mule ESB. The queue service provides queuing of messages, and the cache service
basically caches static data. In the services platform, what we did is provide a framework. The idea was that at some point we'd open it up, so that anyone in Intuit — and maybe at some point even outside Intuit — is able to just drop in their Java business logic, just the business logic applicable to their services, without having to worry about writing the entire framework around orchestrating services. So developers just concentrate on writing their business logic and dropping the code into the framework.
There is an activity that goes on in the background called compaction. What happens is: as you're writing data, it's written in memory, then flushed out, creating SSTables, and at some point these SSTables merge — and that merge causes a lot of I/O. Now, if you look at the SSTables, they hold huge amounts of data, and compaction still has to traverse through the entire data set to merge the files, and that causes a huge I/O spike — that's an additional I/O spike.
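That merge step can be sketched as a multi-way merge over immutable runs where the newest timestamp wins — which makes clear why every compaction has to read through the full contents of the SSTables involved. This is a toy model, not Cassandra's implementation:

```python
# Toy compaction: merge several "SSTables", each a dict of
# key -> (timestamp, value). Every entry of every input must be read,
# which is the source of the I/O spike described above.

def compact(sstables):
    merged = {}
    for sstable in sstables:                  # full pass over each input
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)     # newest write wins
    return merged

sst1 = {"a": (1, "old"), "b": (2, "x")}
sst2 = {"a": (5, "new"), "c": (3, "y")}
print(compact([sst1, sst2]))
# → {'a': (5, 'new'), 'b': (2, 'x'), 'c': (3, 'y')}
```

Even though only `"a"` was overwritten, the merge still touched every entry in both inputs — at production scale, that full traversal is what shows up as the I/O spike.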
Well — so Cassandra is used for metadata storage. What we do is, say, for instance, there is a request to create a record in the data platform. With that request, we get a certain set of attributes that define that particular record, and we also get a document. The document goes into a separate document store, the metadata lives in Cassandra, and there's a pointer to the document in Cassandra.
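The metadata-plus-pointer split can be sketched as follows. Both stores are stand-in dicts here; in the real system the metadata rows live in Cassandra and the documents in a separate document store:

```python
import uuid

doc_store = {}   # stand-in for the separate document store
metadata = {}    # stand-in for the Cassandra metadata table

def create_record(attributes: dict, document: bytes) -> str:
    """Store the document blob separately; keep attributes + pointer as the record."""
    doc_id = str(uuid.uuid4())
    doc_store[doc_id] = document                              # blob lives outside
    record_id = str(uuid.uuid4())
    metadata[record_id] = {**attributes, "doc_ptr": doc_id}   # row + pointer
    return record_id

def fetch_document(record_id: str) -> bytes:
    """Resolve a record to its document by following the stored pointer."""
    return doc_store[metadata[record_id]["doc_ptr"]]

rid = create_record({"user": "u1", "type": "tax_return"}, b"%PDF...")
assert fetch_document(rid) == b"%PDF..."
```

Keeping only small metadata rows (plus a pointer) in Cassandra keeps the hot path fast while large blobs live where bulk storage is cheaper.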
So this is our stack, our architecture, for multiple data centers. We use active-active multi-data-center deployment, and the reason we do that is that we generally try to keep away from active-passive architectures: we generally scale in both data centers to support our peak activity, and if you have a passive data center that is not doing anything, you're just wasting resources. So we have active-active data centers.
There, Cassandra replication is really helpful, because replication at the Cassandra tier is really fast: as soon as you write, it replicates right away across data centers. And to avoid any consistency issues, we have a global load balancer and we provide 30-minute sticky sessions, so the user actually stays on one side for at least 30 minutes, and that gives us enough time to make sure the data is replicated between data centers.
So, some of our learnings in production. One of the biggest learnings we had in production was: monitor your heap closely and pay attention to your CPUs. You will see uneven CPU spikes, and when you see uneven CPU spikes, what it really means — if you can correlate it with the volume — is that it's most likely garbage collection. And as you can see here, we had that problem.
We had this problem in production, where we ran into garbage collection issues. At the time, as you can see, we were seeing something like 400 CPU ticks, and then, after we tuned garbage collection and added more nodes, it dropped down consistently and significantly, to 200 CPU ticks — and this is during peak. One of the other things that is really important is to monitor your disk. It's something that causes, I think, most of the pain, so keeping a close eye on it is really important.
Yeah, so bloom filters: there is a setting in Cassandra that you can put in place to tune the bloom filters. One of the important things is to look at your cache misses and then determine what the right ratio is for the bloom filters. Turning down the bloom filters is something that has other performance implications as well, so you have to be really careful, and it also obviously depends on what your problem is — like for us.
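The ratio being tuned here is the standard bloom-filter tradeoff: a lower false-positive chance (fewer wasted disk reads on cache misses) costs more memory per key. A quick sketch of the textbook sizing formula — this is the general result, not Cassandra's internal accounting:

```python
import math

def bits_per_key(fp_chance: float) -> float:
    """Optimal bloom-filter size for a target false-positive rate p:
    bits per key = -ln(p) / (ln 2)^2."""
    return -math.log(fp_chance) / (math.log(2) ** 2)

# Memory grows roughly linearly as the target false-positive rate
# drops by orders of magnitude:
for p in (0.1, 0.01, 0.001):
    print(f"fp={p}: {bits_per_key(p):.1f} bits/key")
```

So going from a 10% to a 1% false-positive target roughly doubles the filter's memory footprint per key — which is why the talk stresses checking cache-miss behavior before turning the ratio down.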
Yep, that's right — and that's exactly what we see. What you see on the left side is basically the first peak, which happens around the January time frame. What happened there was that we had a spike in memory — a spike in heap usage — and that's something you generally don't see; you only see it when you have high traffic. That's what we saw, and we had to immediately take some steps.
Oh yes, yeah — so one of the things this introduces, going back here, if we look at it, is that it's possible your data is not replicated at the Cassandra tier, right? So one of the things you have to do is run repairs continuously, but even then you might still run into such scenarios.
So we are entirely using Cassandra. However, Cassandra is good for direct lookups: if you have a key-based lookup, you can look it up really quickly. But if you want to do range queries, or some kind of fuzzy queries, it's difficult to do. So the thing we are looking at is adding a search tier in there, so that it helps our operations team and also helps us analyze that data. That's something we're getting to work on this year.
Yeah, so we have a queuing service in between: data gets written there, and the analytics platform fetches from it and writes to Hadoop — we use Flume there. There are some data sets that we get directly from sites, like clickstream-type data, which goes through our real-time processing engine, through Flume, into Hadoop; and then there is data that comes to us asynchronously via messaging.