Description
Speaker: Anastasia Zamyshlyaeva, VP Platform - Product Management
Cassandra's flexibility and scalability make it an ideal foundation for a modern data management architecture. Come hear how Reltio is using Cassandra, in combination with graph technologies and Spark, to deliver a new breed of data-driven applications.
In this presentation you'll find out:
- How we ended up selecting Cassandra
- The unique characteristics of data-driven applications
- The best practices we learned by combining Cassandra, graph technology, Spark and more
How many of you are using Cassandra for operational enterprise software, software for business people? Oh, pretty good group of people. But usually enterprise software is not a common application for Cassandra; it traditionally relies more on relational databases. At Reltio, though, we use Cassandra to build very powerful data-driven applications, and in this session I'll tell you why.
What do we mean by data-driven applications? For our customers, this means being right faster: being right faster with reliable data, relevant insights and recommended actions.
You can't be right if your data is wrong: garbage in, garbage out. It is impossible to have relevant insights if you are not factoring in the huge amount of information that builds the complete picture, and recommended actions come from learning from data: user behavior, data patterns; in other words, applying machine learning. In this session I'll be focusing on the first part of the Reltio equation, reliable data. This is the one where Cassandra plays a major role. But first, let me start with some introductions.
If you focus on the smallest details, you never get the big picture right, like in this picture. When I show you the first fragment, you can try to guess what it is. You can even make some decisions on top of it. Then, when I show you the next fragment of the data, your thoughts can go in a completely different direction. But the truth is that reality could be completely different still, and this is what is happening within the enterprise.
Enterprises have all this data, but business people are making decisions with just a smaller picture in front of them; they're limited to the application they're using. Let's dive into this in more detail with a company that most of you are familiar with. This is the enterprise company from the TV show The Office: Dunder Mifflin, for those who do not know, a paper and office supplies enterprise company. They have a sales department whose goal is to find customers and sign contracts to supply paper.
Dunder Mifflin bought an application for the supply team that allows them to effectively deliver paper to various customers in different regions. Of course they have a marketing department. They have support, and a lot of other departments. So in this picture there are just five of them: five departments, five different applications that perfectly address the needs of each department. But these five applications each have their own database, each their own data store. That's why all the data is isolated and kept in silos, while at the same time information about the same object could be stored in various places.
So, for example, on the company website there could be information about John Smith, the customer account he's maintaining; there could be information in the sales department, in supply, everywhere across the enterprise: information about the same account, showing data from different perspectives. And data can be updated in one place while the other applications have no idea about it. To bridge this gap, the IT team comes into play. They're asked to synchronize data from one application to another; so, for example, if the sales team signs a contract, they want to automatically create an account on the website.
Actually, companies are spending huge budgets on this kind of activity. A lot of the time they're trying to introduce keys, so as to listen to what is happening in one system, and when something happens there, they take the data, transform it into the format that the other application understands, and save it. And at the same time, what if that application is not available?
This is an actual diagram of such an enterprise system. You can see a lot of applications with their own databases, and there is synchronization between the various applications. The synchronization can be done with special tools that companies buy, tools that focus on bringing data from one place into another.
Can you believe it? Yeah, really, this is what's happening within the enterprise. And after this, the logical question is: is the data up to date? Another logical question: is the data correct? What if one application provided incorrect data? Then it just spreads across the whole enterprise, and the whole enterprise now has incorrect information. How do you roll it back? How do you understand what the current state of the object should be? The other logical question: is the data complete? Every application is limited to its data structure; of course, there's some flexibility to expand the schema.
To address this problem, fifteen years ago a new type of application appeared, to unify data from multiple sources. They're called master data management systems. Their goal is to consolidate data from multiple sources, bring it together, address any conflicts, blend it, and provide a unified view of data across the enterprise.
Since this type of application appeared fifteen years ago, they are traditionally based on relational databases, with the problems that relational databases have, such as a fixed structure. What if you have a new attribute that you want to bring in? Then you need to do an ALTER TABLE, which locks the database. Not cool. It's close to impossible to bring in big data, for example. Think about how we would realistically bring in information about all the emails across an enterprise, or about the click streams of clients on the website: almost impossible, because hardware is crazy expensive for such systems.
Cassandra doesn't have a very powerful data model that we would benefit from, and it doesn't have other bells and whistles such as a powerful query language or bulk operations, but it does what it promises. It has high performance, fault tolerance, linear scalability, multi-data-center availability: all the things that you hear a lot at this conference. And we use commodity hardware. That's why we've chosen Cassandra as our primary data store.
According to DataStax's recommendation, you need to think about how you want to use the data, how you want to expose it, what queries you want to make, and this will drive your modeling. In our case, the case of a platform, we have no idea what objects we'll be working with. Will it be organizations, individuals, products? Or maybe it will be a database of cats and dogs; anything could be there.
So here on the left side you can see various UIs that were generated automatically out of metadata, which help maintain doctors, hospitals, affiliations, hierarchies, historical data, and we provide insights and recommended actions for the sales or marketing team in these UIs. In another scenario, the same cloud, just a different configuration, we have a configuration for oil wells and other equipment, and with this application we are targeting a completely different set of users.
These are users who are interested in having a 360-degree view of oil and gas production across hundreds of thousands of wells, worldwide or in one country. Another application, another set of metadata that drives all the UIs and APIs: somebody wants to manage an asset catalog for movies, for songs, for TV shows, and blend all this information with social media, for example to do some sentiment analysis on top of various movies, and this is a different application again.
In Reltio you can store entities: organizations, individuals. You can store relationships, such as John's spousal relationship. You can store graph information, John's social graph; transactions, how John went to the website and what links he clicked, so that maybe we can predict what he's interested in; and we store historical information. Cassandra is the foundation for all of this. In the remaining session, let me focus on the challenges that we had using Cassandra while building it.
So the first challenge that we had was modeling complex documents, and the additional complexity that we have on the Reltio side is that every attribute, even a simple string, can come from multiple sources. That's why we need to support multi-valued attributes for everything, because sources, different systems, can't agree on having a certain value. An additional complexity is that we want to support very complex structures, and again these complex structures can come from multiple sources.
So let's take an example. We want to build an application that maintains individuals with emails, addresses, names, and this is the kind of business object that our users are configuring. There's no need for them to go down into Cassandra and all the underlying structures; they want to work with individuals with that structure, and we drive everything by ourselves.
One value for a simple attribute goes into one column, and the column name is metadata-driven, so in this case Name.1, and the value goes into the cell. You can see that even for simple attributes we have an index, and this is because data is coming from multiple sources and even simple attributes can have multiple values. Multi-valued attributes work in a very similar way to simple ones: each value goes into a separate column, the column name is metadata-driven and unique, and the value goes into the cell.
This approach allows us to update each element independently of the others, so we can just update Email.1 without touching the data that is stored in any other cell. All right, let's go into a more complex scenario where we have nested attributes. In this case, it is very important not to mix data between the nests. For example, we don't want to say that California is the billing address and New York is the shipping address when it is the other way around; we really need to preserve this.
That's why every top-level attribute has its own ID: the California shipping address is one, and the New York billing address is another. Then for nested elements, such as state or type, we also introduce IDs, because the data can come from multiple sources. This is how we model nested structures. What does it give us? We can build nested structures of any depth, we can have any number of attributes on each level, and we can update each element independently.
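As a rough sketch of this layout (the attribute names and the 1-based ID scheme below are invented for illustration, not Reltio's actual format), a nested business object can be flattened into independently addressable cells:

```python
def flatten(doc, prefix=""):
    """Flatten a nested document into (column_name, value) cells.

    Each list element gets a 1-based ID appended to the attribute name,
    so every value, at any nesting depth, is addressable on its own.
    """
    cells = []
    for attr, values in doc.items():
        if not isinstance(values, list):
            values = [values]
        for i, value in enumerate(values, start=1):
            name = f"{prefix}{attr}.{i}"
            if isinstance(value, dict):
                cells.extend(flatten(value, prefix=name + "."))
            else:
                cells.append((name, value))
    return cells

individual = {
    "Name": "John Smith",
    "Email": ["john@example.com", "jsmith@example.com"],
    "Address": [
        {"State": "CA", "Type": "Shipping"},
        {"State": "NY", "Type": "Billing"},
    ],
}
for cell in flatten(individual):
    print(cell)  # e.g. ('Address.1.State.1', 'CA')
```

Because each cell has a unique, ID-bearing name, updating Email.2 or Address.1.State.1 never touches any other cell, which is the independence property described above.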
We can retrieve just a subset of attributes; for example, in this query I'm interested only in addresses, and we can do that with such a model. We can support thousands of attributes that each have a lot of values, and with such a model we store only those attributes that have actual data, so this saves us disk space. The Reltio platform takes care of transforming this kind of structure into documents and will handle any complexity that you might have in the data. So this is the smart logic that Reltio puts on top of it.
As you can imagine, this is a tree structure, right? This is how we started, and we were quite happy with Thrift. Then, per DataStax's recommendation, we were supposed to move to the Cassandra client that uses CQL. Our first impression was that it would be close to impossible to support our very complex structure, which can have thousands of attributes where each attribute can be multi-valued. We even did some experiments.
We tried to model what we have right now with a static schema, and the result was performance degradation. Then we started thinking: does CQL really mean static, or can we continue having wide rows? And CQL does support wide rows. So this is the definition of the same schema I was talking about before: we have a column family for entities, which has the entity ID as the partition key.
We have the attribute name and the attribute value to model the same structure, and we can actually continue making the same requests as we did before, for example retrieving just some of the information, such as addresses. So basically everything that you could do with the Thrift API, you can continue doing with CQL3.
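A minimal in-memory sketch of that wide-row idea, assuming a layout where the entity ID plays the role of the partition key and the metadata-driven attribute name acts as a clustering column (the table shape and attribute names are assumptions for illustration, not the actual schema):

```python
from bisect import bisect_left, bisect_right

class WideRowStore:
    """Toy model of a wide row: one partition per entity, cells kept
    sorted by attribute name, like a clustering column in CQL."""

    def __init__(self):
        self.partitions = {}  # entity_id -> sorted list of (attr_name, value)

    def put(self, entity_id, attr_name, value):
        row = self.partitions.setdefault(entity_id, [])
        names = [n for n, _ in row]
        i = bisect_left(names, attr_name)
        if i < len(row) and row[i][0] == attr_name:
            row[i] = (attr_name, value)        # update one cell independently
        else:
            row.insert(i, (attr_name, value))

    def slice(self, entity_id, prefix):
        # Rough equivalent of a CQL range over the clustering column:
        # WHERE attr_name >= prefix AND attr_name < prefix + '\xff'
        row = self.partitions.get(entity_id, [])
        names = [n for n, _ in row]
        return row[bisect_left(names, prefix):bisect_right(names, prefix + "\xff")]

store = WideRowStore()
store.put("e1", "Address.1.State.1", "CA")
store.put("e1", "Address.1.Type.1", "Shipping")
store.put("e1", "Name.1", "John Smith")
print(store.slice("e1", "Address."))  # only the address cells come back
```

The prefix slice is what makes "give me only the addresses" cheap: it is one contiguous scan within a single partition, which is exactly the access pattern wide rows were designed for.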
Another scenario where we use Cassandra: we use it for hybrid graphs. I already walked you through how we can model entities with infinite attribution, but having information about entities is not enough.
It's not enough to know my name and my email address to know who I am. It is important to know what relationships I have, and what the strengths of those relationships are. That's why relationships are very important, and we support them through Cassandra. So here you can see various entities, such as organizations, employees, products and individuals, which can have infinite attribution, and we can also define in metadata the relationships between these objects.
So, for example, I can say that Dwight is an employee of Dunder Mifflin, and that John buys copy paper, and this is what you define in metadata. This is what drives the UI; it drives all the APIs. So from Cassandra we use the basic foundation to actually save our data, and on the Reltio side we have metadata-driven graphs with a very rich model for entities and relationships with infinite attribution, and we do partitioning and effective joins to provide high-performance graph operations.
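To make the metadata-driven idea concrete, here is a toy sketch (the metadata format and all names are invented for this example, not Reltio's): relationship types are declared as data, and edges are stored per source entity, roughly like a partition per vertex:

```python
# Entity types and relationship types declared as metadata, not code.
metadata = {
    "entities": ["Organization", "Individual", "Product"],
    "relationships": [
        {"name": "EmployeeOf", "from": "Individual", "to": "Organization"},
        {"name": "Buys",       "from": "Individual", "to": "Product"},
    ],
}

edges = {}  # source entity ID -> list of (relationship name, target ID)

def add_edge(rel_name, source_id, target_id):
    # Only relationships declared in metadata are allowed; the same
    # metadata can also drive UI forms and API shapes.
    names = {r["name"] for r in metadata["relationships"]}
    if rel_name not in names:
        raise ValueError(f"relationship {rel_name!r} not declared in metadata")
    edges.setdefault(source_id, []).append((rel_name, target_id))

add_edge("EmployeeOf", "dwight", "dunder-mifflin")
add_edge("Buys", "john", "copy-paper")
print(edges["dwight"])  # [('EmployeeOf', 'dunder-mifflin')]
```

Keeping all outgoing edges of a vertex under one key is what makes neighborhood traversals a single-partition read, which is the partitioning point made above.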
So that's why it is very important to build an effective deduplication mechanism, and without factoring in all the information, a high-performance, high-quality dedupe mechanism that needs no user intervention is impossible. Like in this example: I have two John Smiths with slightly different names.
So potentially these could be the same entity, but we can't guarantee it, and if we start making decisions just on top of attributes, we can end up with too many false merges. Then, after that, data stewards need to go and unmerge and understand what happened: pretty inconvenient. But what if we have additional information that we can start factoring in? For example, we have information about their addresses, and we know that they live within a two-mile radius; this really increases the chances of these two records being the same.
And what if I tell you that the one John Smith and the other John Smith have the same daughter, Stephanie? Then all this information together raises the probability of these records being the same to close to a hundred percent. So it's very important to factor in all the information to build an effective deduplication mechanism.
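The reasoning above can be sketched as a scoring function; the weights, thresholds and similarity measure here are made-up assumptions purely to illustrate how each extra signal shifts the match probability:

```python
import math

def name_similarity(a, b):
    # Crude token-overlap similarity; real matchers use fuzzy comparators.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def geo_bonus(miles_apart, radius=2.0):
    # Full bonus inside the radius, decaying quickly outside it.
    return 1.0 if miles_apart <= radius else math.exp(radius - miles_apart)

def match_score(name_a, name_b, miles_apart, shared_relative):
    score = 0.4 * name_similarity(name_a, name_b)
    score += 0.3 * geo_bonus(miles_apart)
    if shared_relative:          # e.g. both records link to daughter Stephanie
        score += 0.4
    return min(score, 1.0)

# Attributes alone leave the decision ambiguous.
print(match_score("John Smith", "Jon Smith", 50.0, False))
# Add a two-mile radius and a shared daughter, and it is close to certain.
print(match_score("John Smith", "Jon Smith", 1.5, True))
```

The point is not the particular numbers but the shape: attribute, geo and graph signals are independent evidence, and combining them is what pushes a borderline pair toward a confident merge without a data steward in the loop.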
The types of matches that we support using just Cassandra are matching by attributes, both fuzzy and exact, geo matching (as in this example) and graph matching. For incremental matching we use just Cassandra, but if you want to do very fast bulk matching across a whole tenant, then we use a combination of Cassandra and Spark.
One more interesting use case of Cassandra. Cassandra doesn't have a very powerful query language; it has some integrations, but they don't work perfectly well for us, because we need to transform our complex metadata into certain structures. For our use case we found that Elasticsearch is a better fit, and we do the transformation ourselves, so search via Elasticsearch works perfectly fine. Then we started to explore how we could use our cluster more efficiently.
What if we exclude document contents from Elasticsearch and use it only for indexing and returning the IDs of objects, and then use Cassandra to retrieve the whole document? This is what we call hybrid search, and we got some very interesting results. The first result was predictable: the size of the Elasticsearch index was cut in half.
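A minimal sketch of this split, with plain dicts standing in for Elasticsearch and Cassandra (all names here are illustrative): the index maps terms to object IDs only, and the full documents are fetched from the primary store by ID:

```python
index = {}   # stand-in for a search index that holds only IDs, no contents
store = {}   # stand-in for the primary store (Cassandra), keyed by object ID

def ingest(doc_id, doc):
    store[doc_id] = doc                             # full document -> store
    for token in doc["name"].lower().split():
        index.setdefault(token, set()).add(doc_id)  # only IDs -> index

def hybrid_search(term):
    ids = index.get(term.lower(), set())        # step 1: index returns IDs
    return [store[i] for i in sorted(ids)]      # step 2: fetch docs by ID

ingest("e1", {"name": "John Smith", "addresses": ["CA", "NY"]})
ingest("e2", {"name": "Jane Smith", "addresses": ["TX"]})
print(hybrid_search("smith"))  # both full documents, fetched from the store
```

Since the documents live only in the primary store, the index stays small and write-cheap, which is where the halved index size and faster indexing come from.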
We already have this information in Cassandra; that's why we removed it from the index, and it cut the cost of our cluster with zero loss of functionality. The second result is that Elasticsearch indexing performance doubled, which means that we can use the same Elasticsearch cluster longer without the need to scale up.
So, as I told you, we have a lot of interesting scenarios and use cases. We continue collecting data from multiple sources, blending it together, cleansing it with data providers, with cleanse mechanisms, with enrichment from social media. We manage relationships, entities, graphs; we do analytics on top of this; and we deliver insights and recommended actions. For all of this we are using Cassandra as our primary data store. Apart from Cassandra, we are heavily using Elasticsearch for indexing, and we use Spark for analytics such as segmentation, clustering and ranking.
We use Spark for machine learning and for bulk operations, and with Spark we are able to bring that data back to Cassandra. For the SQL interfaces that are very often used in the enterprise world, we use Amazon Redshift. Reltio allows you to simplify the architecture a lot, removing a lot of complex moving parts.
Questions? [Question about graph technologies.] So, we built our own graph technology because we needed that kind of control: to maintain the merging of graphs, to understand what information came from what sources. We started building the graph foundation for our solution four years ago, which is pretty much the same period when Titan appeared. Titan is a good tool for managing graphs, a good database.
So this is what we are using for tracking information about where all the wells are located. We track information from each well, from various pieces of equipment, like the Internet of Things, about what is happening within each well. Also, on a map, you can try to find possible locations for wells, and you can try to predict, from all the data that you have, where there could be potential breaks in the wells' equipment.
[Question:] For example, if they used Oracle Data Integrator to connect to Oracle or to SQL Server, how did they connect those databases?
We didn't have a lot of cases where Elasticsearch went down, because it is also a distributed, highly available component. If it does happen, then it is considered downtime on our side, and we have all the mechanisms to repair data from the moment when something happened, so we can repair all the information in Elasticsearch when it is up again.
Yes, we can do incremental reindexing. Now, another question was about the comparison with DSE, with the search that comes out of DSE. The problem is the complexity of our data structure: if you take all the information that I showed you in the column families and put it directly into Solr, it just doesn't make sense.
You need to combine it, you need to do a transformation on it, and we also bring data from multiple column families into one search document, which is not supported by DSE. This is just a very specific use case on our side; if all the data that you want available in the search index lives inside one column family, DSE is a perfect solution for that.