From YouTube: Scale Unlimited: Fuzzy Entity Matching at Scale
Description
Speaker: Ken Krugler, President of Scale Unlimited
Early Warning has information on hundreds of millions of people and companies. When a person wants to open a new bank account, they need to be able to accurately find similar entities in this large dataset, to provide a risk assessment. Using the combination of Cassandra & Solr via DSE, they can quickly find and evaluate all reasonable candidates.
So I realized too late that I had a really lousy title for my talk. It should have been something cool like: "So, Mr. Smith, you want to open a bank account?" That's really what this talk is about, but the underlying issue, or the technical name for it, is fuzzy entity matching.
So we've got the obligatory "why am I up here talking, and why should you listen to me?" I've got a boutique big data consulting company; boutique sounds way better than small. It's just five of us, but we help clients with issues around big data problems: Hadoop workflows that we implement using Cascading, which is an open-source API; work with Cassandra and with Solr on the search side; machine learning; etc. We do consulting on that, and we also do training. I actually created the training materials for DataStax for their DSE Solr search class, and occasionally I teach that class for them. I also teach classes on Hadoop and machine learning and other things like that.

I enjoy teaching so much that I volunteer teaching high school programming classes, which, if you've dealt with high school students, means I really like teaching.
All right. So when I go to a talk, I really enjoy having a concrete use case as the context. As the person's talking about things, I can get a sense of what they're trying to tell me and why it matters. So in this case we're going to start with the problem.
Let's say I'm going to a bank and I want to open an account. In this case, this guy who is sitting across the table from me is pleasantly chatting with me, but what's really going on in his head is: should I open an account for you? Would that be a big mistake? What would I regret about doing that? So it's about calculating the applicant's risk, trying to figure out: should I do this? Should I open the account for them?
So what's happened is that I, as the applicant, have provided a bunch of information: my name, date of birth, maybe my social security number, details like that. What they need to do is be able to look at my account history. They want to be able to find every single account that I've owned or had control over, and then whether there were problems associated with that account. That's a key part of it. So really it's about matching up who I say I am, the details I provided, with everybody that they know something about. Does that make sense?
They have to have data on all the bank account activity, but there's no way in hell Wells Fargo is going to give Bank of America their customer list and their account status, and vice versa. That's just not going to happen; these are competitors. So what do you do? Well, the solution to this problem is a company called Early Warning Services.
A
They
provide
it's
a
joint
venture,
so
they
provide
this
commonplace
for
the
banks
to
send
their
data
to
where
they
know
it's
not
going
to
get
passed
around
to
each
other
right.
It's
the
trusted
third
party,
so
joint
venture
of
the
biggest
five
US
banks
plus
then
like
a
800
plus
other
financial
institutions,
send
information.
So
they
have
data
on
pretty
much
everybody
in
the
US.
Who
has
a
bank
account?
So
that's
the
key
they've
got
the
data,
but
now
the
problem
is
you've
got
that
data.
So in general, what do I mean by fuzzy matching? If I've got something that I'm looking for, like this blue triangle over here, I want to get everything that I think is equivalent, and nothing that's different. That's my goal. Now, the key point is: how do I define what is similar or what is dissimilar? In this first row here, I've arbitrarily defined that rotation doesn't matter. The triangle is rotated; it's still the same triangle.
I've said that if the color is close, it doesn't matter; so here it's a slightly darker blue for that second triangle up there, and I'm saying, okay, that's equivalent. And I'm saying if it's slightly smaller, that's fine too. Now notice that I'm using terms like "slightly," so right away you get into this problem where it's not a step function. It's not like down here, where I say: okay, if it's a circle versus a triangle, forget it, it's not the same.
Whether a structural change like this to the triangle is significant or not: we're arbitrarily coming up with aspects, or attributes, that we decide are important, that matter and then have to match, or have to match closely enough, or that, if they don't match, we're done.
So why is it hard? Well, as we talked about, there are fuzzy areas. Like in this case right here: what if it's a really light blue color? Is that a match? I don't know. Or what if we've got some attribute like a really thick border around it; would I consider those two things equivalent? Or what if the triangle is really small compared to the other one?
All right, so again, we've got these slopes, and the problem is, when you decide that if it goes past this level then it's not the same, you're creating a step function, essentially. And then, if somebody's just a little bit past that edge, they fall off it, and now it's not similar: you've removed them from your list, which is a problem.
So that's one problem: lots of areas of grey. The second problem is you can't just use Cassandra. The fundamental problem is: Cassandra is great if I've got a row key. Then it's super fast for me to look up that row key, and if I look for that row key and it's not there, then I know I don't have it. So it's fast and accurate. But I don't have a row key; it's a fuzzy matching problem.
So fundamentally, I can't apply the key feature of Cassandra to the problem. And the third thing is that it's computationally intensive. Imagine I'm comparing two things that have, I don't know, 100 attributes, like strings, and I'm doing string edit distance to do the comparison between them. That's a computationally intensive problem; that can chew up some serious CPU cycles.
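To make that cost concrete, here is a minimal sketch (my illustration, not the speaker's implementation) of Levenshtein edit distance. The standard dynamic-programming version is O(m·n) per pair of strings, which adds up fast across 100 fields and hundreds of millions of records.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming: O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]
```

Even this small quadratic cost, multiplied by many fields and many candidate records, is why comparing against every record is off the table.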
All right, so we've been talking about triangles; now let's go to people. In this case right here, these are all pictures of people who are speakers at the Cassandra Summit, and there's me over there, a really bad headshot that somebody did a couple years ago; I'm getting a new one on Saturday. So I've got information on all these people, and now I'm being asked about a specific person. I'm being asked about me: am I in this group over here? And the thing is, I want to quickly find all the good matches.
One of the challenges here is that it's not like I can search until I find a good match. I need to find all the good matches, which means I've got to go through everybody; there aren't any real shortcuts here, or so it appears. Now, a point to make here: we're not doing batch matching. A lot of people, when you talk about fuzzy matching, are thinking about batch, which could be a self-join: I've got this long list of people with entries that are close, and I've got to deduplicate them. Or let's say I've got a list of people, I get data from another source, and I need to merge those together, but again there's going to be a bunch of duplicates or near-duplicates in there. How do I do that merge? I gave a talk a couple months ago at Hadoop Summit about doing that kind of batch fuzzy matching at scale; it's a different kind of problem.
So what's a good match? What we're doing here for people is: we've got attributes, basically fields with values, and in this case we've got name, address, city, state, zip. So, are these two people the same? If you looked at this right here, would you say these two are the same? Everybody says yes, because you look at it and go: oh, the zip code matches.
Okay, most people look at it and go: yeah, probably. But there are some issues here. For example, the fact that this one doesn't have a zip code and this one does; you look at that and go, whatever. Washington versus WA: that's an alias. The fact that there's no middle initial here is okay; sometimes you don't get the middle initial. Not having an apartment number: okay, sometimes you get the apartment number, sometimes you don't. But the fact that it's 3220 versus 220 there: is that a typo from when people were entering it? It's amazing how noisy data is. In your mind, or at least in my mind, when you think about bank data, you're like: wow, it's going to be spot-on. No. You get typos, weirdly enough, because you've got humans involved, entering data that people scratched onto forms in illegible ink. So you get typos.
So you look at these two things and you're like: well, you wouldn't be a hundred percent, but you'd say yeah, probably. Two things there. One is normalization: being able to know that Washington and WA are the same thing. They're different strings to a computer; knowing that they're the same thing is an issue of normalization. Similarly with Bob versus Robert: you could treat that as a normalization problem. The second thing, though, is: how do you know what features to focus on? What's important when it's the same, or when it's different?
So a typical approach here for calculating similarity, and there are many, many ways to do this; God knows how many research papers or PhDs have been done on similarity. But this is a common approach for record similarity, where you say: for each field, I can calculate a degree of similarity between them, and often how you calculate that winds up being field-specific. Like zip code: if the primary part matches, we're pretty close to being 1.0.
If the sub-part doesn't match, it's not so significant. And then, for each field, you give it a weight: how significant is it if it matches or doesn't match? Zip code matching is more significant than state matching, because zip code is more specific than state, so it has a higher weight. So then what you can do is give each field a weight such that the sum of those weights equals one.
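A minimal sketch of that weighted scheme (my illustration with made-up field functions and weights, not Early Warning's actual match logic): per-field similarity functions, field weights summing to 1.0, and an overall score that is the weighted sum.

```python
def zip_sim(a: str, b: str) -> float:
    """Field-specific similarity: the primary 5-digit part dominates."""
    if not a or not b:
        return 0.5  # missing data is neither a match nor a mismatch
    if a[:5] != b[:5]:
        return 0.0
    return 1.0 if a == b else 0.9  # zip+4 differs: still nearly a match

def exact_sim(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

# Weights sum to 1.0; more specific fields get higher weights.
FIELDS = [
    ("name",  exact_sim, 0.3),
    ("zip",   zip_sim,   0.4),
    ("state", exact_sim, 0.3),
]

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities, in [0.0, 1.0]."""
    return sum(w * sim(r1.get(f, ""), r2.get(f, "")) for f, sim, w in FIELDS)
```

With the weights summing to one, a perfect match scores 1.0 and every mismatched field pulls the score down in proportion to its weight.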
Ask the question: does it scale? The answer is no. The issue is, again, for a single person like me being matched against hundreds of millions of potential people, you have to do that comparison for every single record, and we talked about it being computationally intensive. It works for a couple hundred, maybe even a couple thousand people; it does not scale to hundreds of millions. So fundamentally, what you need to do here is figure out a way to do fewer comparisons.
How do you do fewer comparisons? That's where search comes in. So, search: who here has experience with search in general? Hands up, anyone? Quite a few people. How about specifically Solr? A couple of people, all right, great. How about using DSE with Solr? A few of you still, okay, all right. If I accidentally just say Solr in here, it's because I'll do that; it's an open source project that provides search, and we'll go into it in more detail in a little bit.
But what I'm going to talk about here is search in general. Search is fast, and similarity is what search is all about. It takes a query; say I'm looking for "scale unlimited blog". Let's say that's my query: three words. It actually turns that into a document, just like any other document that I've got in my search index. It turns it into a document, where a document is essentially a feature vector: a multi-dimensional vector where there's one dimension for each unique word.
So if my search was "scale unlimited blog", I'd wind up with a feature vector that has three dimensions to it, and the magnitude of each weight is based on something called TF-IDF: term frequency, inverse document frequency. What it really means is: how often does this word occur in this document? That's the term frequency part, how significant the word is in the document. And IDF is inverse document frequency.
It's how significant this word is across all documents. As an example, the word "the" is not very significant, because it's in every single document, so the fact that my query has the word "the" doesn't matter. Its document frequency is really high, so its inverse document frequency becomes very low. So it's sort of: how significant is a word across my entire corpus? Basically, you get a weight for each word in this feature vector, and that actually gives you a vector, because each dimension has a magnitude.
Okay, so that basically gives me this three-dimensional vector here; in this case, for the three-word query, it's three dimensions, so I've got three values, and each value has a weight, or magnitude, set according to term frequency times inverse document frequency. Think of it as: here's this term vector that I've got for this document; I can compare it to some other document's term vector, and there's an angle between them. Vectors have angles between them.
If that angle is zero, which means these two term vectors point in exactly the same direction, then the cosine of that angle is one: great, they're exactly the same. Cosine similarity gives me 1.0 for exactly the same. If they're 90 degrees apart, cosine will give me 0: there's nothing in common between them. And if they're opposite each other, so the angle is 180 degrees, it's minus 1: it's actually showing negative correlation between them.
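A toy version of what the search engine is doing under the hood (a sketch, not Lucene's actual scoring code): build TF-IDF term vectors for a few tokenized documents and compare them with cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list) -> list:
    """One TF-IDF weight per unique term, per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Note how a term that occurs in every document gets `log(n/n) = 0` weight, which is exactly the "the word 'the' doesn't matter" behavior described above.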
However, this cosine similarity is search similarity; it's not the same as the match similarity that we were talking about earlier, obviously. Match similarity had per-field weights and maybe edit distances and all kinds of crazy stuff. Search generally doesn't have the same level of sophistication, because it's really trying to be fast. So how do you deal with that?
You use search to narrow down the set of candidates that you then have to do match similarity with. That's the key point: if I can use search to say, out of the hundreds of millions of people that I know about, what are the hundred that are likely to be similar, to be good candidates? And if I have it down to a hundred, then I can do that match similarity and it's okay. So I'm using search to narrow down the candidate set.
But the issue here, as we said, is that search similarity isn't going to be exactly the same as the match similarity score, so the candidate set I get won't be ranked the same way. So what do you do? Well, you throw a bigger net. Say that, at most, I'm going to get ten matches; I expect I'm only going to get at most 10 matches, or maybe 10 matches is all I care about.
If I get more than that, I'm dealing with somebody who's creating a whole bunch of bogus IDs, and I don't want them anyway. But let's say I get at most 10 matches. Well then, I could say: okay, let's find a hundred candidates, or maybe a thousand candidates; I scale up that 10 by some amount, and that's my candidate list that I then do match similarity on. Does that make sense? So basically we're using that super fast search to narrow down the set of candidates to the level where I can actually do this match similarity on what's left.
So it's a two-step process. Here's the information being provided by some person: they've got a name, social security number, and maybe a date of birth; maybe I'll have address as well, but let's say we've got these three attributes from the form. So I'm going to turn it into a query, and one of the interesting things you can do with Solr, and other systems, is that you can add weights to the fields.
A
You
can
put
weights
in
there
that
are
similar
to
the
weights
you
put
on
fields
when
you're
doing
that
match
similarity
which
fields
matter
more.
So
in
this
case
right
here,
I'm
saying
if
the
social
security
number
matches,
let's
boost
the
importance
of
that
by
10.
If
the
date
of
birth
matches
that's
boosted
by
five,
if
the
name
matches,
let's
boost
it
by
three,
so
essentially,
I
can
provide
some
hints
to
search
where
I'm
using
weights
for
fields
that
are
similar
to
the
weights.
I
use
when
I'm
doing
matching.
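A boosted query along those lines might look like this; the `field:value^boost` form is standard Lucene/Solr query syntax, but the field names and boost values here are illustrative, not the actual schema from the talk.

```python
def build_boosted_query(ssn: str, dob: str, name: str) -> str:
    """Build a Lucene/Solr OR-query where each clause carries a boost."""
    clauses = [
        f'ssn:"{ssn}"^10',   # exact SSN match is the strongest signal
        f'dob:"{dob}"^5',    # date of birth is next
        f'name:"{name}"^3',  # name is the weakest of the three
    ]
    return " OR ".join(clauses)
```

Any clause can match on its own, but a candidate matching the heavily boosted fields will score much higher and float to the top of the results.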
So your search system then goes out and hits the index, and it comes back with its list of what it thinks the top candidates are. I say: give me ten, give me a hundred, whatever, and it'll come back with those, ordered. That's the first step. Then the second step is that I run the matching logic against those results, and the matching logic is going to come up with different numbers, based on its own way of comparing things.
So it's essentially re-ranking things: it reorders them, and then you can say, well, anything less than some number, say 0.7, I don't consider a match. So I'm going to be left with some subset of these things. Make sense? So it's two steps: search, then re-rank using the match similarity.
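Putting the two steps together (a sketch of the shape of the pipeline; `solr_search` and `record_similarity` are hypothetical callables standing in for the real search client and match logic, and the threshold and net sizes are made-up numbers):

```python
MATCH_THRESHOLD = 0.7  # scores below this don't count as matches

def find_matches(applicant: dict, solr_search, record_similarity,
                 expected_matches: int = 10, net_factor: int = 10) -> list:
    """Two-step match: fast search for candidates, then expensive re-rank."""
    # Step 1: throw a bigger net than the number of matches we expect.
    candidates = solr_search(applicant, rows=expected_matches * net_factor)
    # Step 2: re-rank with the expensive match similarity, then filter.
    scored = [(record_similarity(applicant, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, c) for score, c in scored if score >= MATCH_THRESHOLD]
```

The expensive comparison now runs against at most `expected_matches * net_factor` records instead of hundreds of millions.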
One of the things I should have said at the beginning of this talk is that there are some details about the implementation for Early Warning that I obviously can't talk about. Like, I'm not going to say:
"Oh, you know, if you put an umlaut over a vowel in your name, we can't match the name at all." And that actually isn't the problem, so don't bother trying that. But obviously I can't go into the details of exactly how their match logic works. I'm giving you a very simple approach; they've got way more sophisticated things, but you get the basic idea here.
You can take the zip code and say: let's break it into a zip field and a zip-plus-4 field, and we'll search on just the zip, or on the zip-plus-4 with a higher weight if we've got it. That way, if the person only gives you their zip and you have zip-plus-4, it still works, versus not finding it at all. If it didn't find it at all, that means your search similarity is being skewed relative to your match similarity. So the more normalization you can do, the better your search results can correlate with your match results.
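That zip/zip-plus-4 split might look something like this at index time (my illustration; the field names are made up):

```python
def split_zip(raw: str) -> dict:
    """Split a raw zip into separate 'zip' and 'zip4' index fields,
    so a 5-digit query can still hit a record stored with zip+4."""
    raw = raw.strip()
    if "-" in raw:
        zip5, zip4 = raw.split("-", 1)
        return {"zip": zip5, "zip4": f"{zip5}-{zip4}"}
    return {"zip": raw, "zip4": None}
```

Queries then hit the `zip` field always, and the more specific `zip4` field, with a higher boost, only when the applicant supplied it.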
Now, why do you care about N? Well, if N is too big, it's like this case right here, where I'm throwing this big red net, and the stuff up in purple is the stuff I actually care about. So if N is too big, we've got a performance issue, potentially, because we're putting more load on the search system to give us back this bigger set of results, and then I still have to go through that bigger set of results and run the matching logic against each one.
And if N is too small, you can miss the record that matters: the guy is off, you know, dribbling a basketball, you don't know about it, and so you're like, oh sure, let's open an account for this person. So how much you care depends on your use case, and the key here is tuning the search to mimic the match similarity. As we talked about before, you can do things like weighting the value of matching different fields when you make your search query; that's a common technique for getting these more in alignment, along with the normalization.
So, as I said, I mentioned Solr and that I would talk about it. Solr is an open source project out of the Apache Software Foundation. It's built on top of Lucene, which is a low-level information retrieval engine. The key things about Solr, and Lucene, are that it's highly scalable, up to billions of documents, and pretty darn fast even at that level, and you can customize it and configure it to your heart's content. So often you can make Solr do your bidding.
What you do in Solr is you have a schema. Lucene itself is almost like Cassandra, in a way, in that you can have these documents, and every document can have different fields in it, and you can put anything in there that you want; it's schema-less. Solr puts a schema on top of that. So you have fields, and you say what type they are, and by saying the type of each field, you say how it gets analyzed and how it gets searched.
A
So
this
is
the
schema
that
you're
putting
on
top
of
the
data
that's
going
into
the
index.
So,
for
example,
I
can
have
a
name
field,
they're
a
type
text,
and
I
can
define
the
text
field
type
to
say.
If
you
see
robert
treat
it
as
bob-
and
I
can
do
things
like
that,
I
can
do
synonyms.
I
can
control
how
it
gets
tokenized
how
it
gets
broken
up
into
individual
words,
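The kind of normalization such a text field type does can be sketched like this (a toy stand-in for Solr's analyzer chain, with a made-up synonym table, not Solr's actual code):

```python
import re

# Toy synonym table, analogous to a Solr synonyms file (illustrative only).
SYNONYMS = {"robert": "bob", "wa": "washington"}

def analyze(text: str) -> list:
    """Lowercase, tokenize on word characters, then map synonyms,
    so 'Robert' and 'bob' index to the same term."""
    tokens = re.findall(r"\w+", text.lower())
    return [SYNONYMS.get(tok, tok) for tok in tokens]
```

Because both the indexed data and the query go through the same analysis, "Robert" in a query matches "Bob" in a stored record.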
So: DSE search with Solr.
What is it? It's not part of the open source Cassandra project; it's an extension that is specific to the commercial product you get from DataStax. What it does is let you say: for this table in Cassandra, I'm going to have a Solr index that is automatically kept in sync with it. So if I write to Cassandra, my index gets updated; if I delete from Cassandra, my index gets updated. And likewise, since my Cassandra data is replicated and distributed across multiple servers, my Solr index is replicated and distributed across multiple servers.
Now, as I mentioned, it's leveraging the existing Cassandra replication, so you get reliability. A node goes down, you can still search, because you've got copies of the data on other nodes, just like with Cassandra; a node going down isn't a problem. And the data is distributed, so if I need more capacity, I can just scale up my cluster and I've got more search capacity. You also get replication between data centers, which, for enterprise solutions with Solr, is a beautiful thing. I can have a data center on the east coast, and so on.
There's something called a secondary index that you can use. Well, there's a hook there, a handy-dandy hook that's effectively undocumented, but it's there, added by a DataStax engineer, that lets them hook into it and do this thing where, as documents, as rows, get modified or added to Cassandra, the hook is in there and it says: okay, I'm going to queue up a Solr index change for every Cassandra row change. Okay, so that's how it keeps it in sync.
But a key point is that it's way slower than the Cassandra writes. A Cassandra write, you know, it's whatever, pick a number: twenty thousand writes per second per node on regular hardware, pick your number. Solr is nowhere close to that, for a couple of reasons. One is that when Solr writes, it does analysis on the data: it's sitting there tokenizing it and normalizing it and doing all this work, versus a Cassandra write, which is just: okay, here's some data in the memtable, here's some more data in the memtable.
Solr actually has to do work. The other problem with Solr is that when you write a document, like a row, into Solr, you need to have all the data to write that row. So what that means is that when you do a Solr update, when you update a single field, you know, a single column, in your Cassandra row, what it has to do behind the scenes is read that row, to get the whole row, so it can build that Solr document and write it.
So it's violating the "never read before you write" rule: when you're writing, it always has to read before it writes, to build the full Solr document. How much slower is it? I don't know. I use something like 10x as a rule of thumb, but I think it varies a lot based on your configuration, how much work Solr is doing, etc.
Well, the secondary index: the interesting thing is that the secondary indexing hook is not at the level where you're writing to some Cassandra node and then that gets distributed out to other nodes. It's actually on the individual node. So you're going to wind up doing this indexing process three times if you have three copies of the data; it's actually happening on each node.
So when a record gets flushed by TTL, you essentially have a row mutation, and then the regular process happens: yes, it will replicate into the Solr index; yes, you'll get the effects. The goal of the whole system is that if you have this in a Cassandra table, if you make a request to Cassandra and you get this, then you get the same thing on the Solr side; it's keeping those things in sync. So if a TTL triggers and you've got some data getting deleted, it will get deleted from Solr.
The Solr index is also saved on the same Cassandra nodes. The index isn't saved in a Cassandra table; the index is saved to disk, the local drive. But what's interesting is that they do leverage the Cassandra tables for other things. Like that Solr schema I was talking about, with the fields and the types: that essentially is stored in a Cassandra table, and that's how they replicate it out to all the nodes. So they leverage that a lot.
So the question is: how do you associate a column with specific Solr behavior? That Solr schema that I showed you: what happens is it associates the field names in there with the column names in Cassandra. So if I want to do a certain thing with the text in Solr, I just create a field in Solr that does whatever I want it to do, and I make sure the name of that field matches the Cassandra column name. All right, I'm going to keep going.
We can definitely talk afterwards if you have more questions. So, here are some performance numbers. Using mock data, we wrote 170 million records into an eight-node cluster. The writing took two and a half hours, so definitely much slower than if we were just slamming records into Cassandra without Solr. When the writing to Cassandra finished, we were about fifteen percent of the way done with indexing it; so the indexing is running behind, and when the indexing runs behind, it also slows down your Cassandra writes.
There's actually this back-pressure support that tries to prevent you from overloading the system, so it'll slow down your writes, which is why the writes took longer than they normally would. And then afterward, once the Cassandra writes finished, it actually took about another 12 hours for the index to be totally in sync. Now, there are definitely things you can do to reduce that amount of time. For example, you can offload work from Solr into whatever is generating the data you're putting into Cassandra.
After it does that, there's this workflow: we build our workflows using something called Cascading, which is kind of a Java API on top of Hadoop, so you can stay sane when you're working with things like joins and whatnot. And then, after it's done that, there's this thing of getting it into Cassandra, and we still use Hadoop for that. In Hadoop there are map and reduce tasks, you know, the two phases of MapReduce, and in reduce you can control the level of parallelism.
You can say: I want to run 3 reduce tasks in parallel; I want to run one; I want to run 100 in parallel. So the reduce task is actually where we talk to Cassandra, because we have better control over the amount of parallelism, like how hard we're hitting our Cassandra cluster. It's kind of interesting, because the actual Java driver for Cassandra also lets you have some level of parallelism in it.
So the question is just where's the right place to do it. And as I mentioned, the bottleneck, though, is the Solr indexing.
There is an approach that we tried which is kind of interesting. You can write to a Cassandra table where the table isn't hooked up to Solr: you just have a regular Cassandra table, you write into it, and it's super fast. And then, after you do that, you say: oh, now we want to associate this Solr index with the Cassandra table, and it starts doing the indexing in the background. Turns out that's currently single-threaded, so after about 18 hours we killed it; it was slower.
I mean, the Hadoop cluster resources got released faster, but the overall time to get a full index in there was significantly longer. All right, so we've got this Hadoop cluster writing things into this Cassandra-plus-Solr setup, and then we've got this actual API that's making Solr queries against it. Makes sense, right?
So, ingest performance: we talked about how you want to do writes without reads; that's the fundamental thing. But the problem here is, if you're adding an entry, what we tried first is: we've got to see if we already have this entry in there. So we do a Solr query, and if we've got the entry, then we just add the data to it.
Instead, we derive a hash from the searchable fields. What that means is: if all the searchable fields are the same, the hash is going to be the same, and we know we've got that same entry in there. Otherwise, it's going to be different. So what we wind up with is: you know, if the address changed, we'll get a different entry in there, but we just expand our net a little bit more to handle those duplicates, and then our ingest performance is way better.
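A sketch of that idea (my reconstruction, not their actual schema): derive a deterministic key from the searchable fields, so writing the same entry again is just an overwrite, with no read-before-write check.

```python
import hashlib

SEARCHABLE_FIELDS = ("name", "dob", "ssn", "address")  # illustrative field set

def entry_key(record: dict) -> str:
    """Deterministic key: same searchable fields -> same key, so re-writing
    the same entry overwrites it instead of needing a read-first check."""
    canonical = "|".join(str(record.get(f, "")).lower() for f in SEARCHABLE_FIELDS)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

A record with a changed searchable field (say, a new address) hashes to a different key and lands as a separate row, which is why the query-time net gets widened to absorb those near-duplicates.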
So if we try to add essentially the same entry but with different account data, which we don't search on, then we'll say: oh, we already have this one in there; we can just update the account data. If, on the other hand, any of those fields have changed, the date of birth, etc., then we get a different hash and we'll get a different row. Okay, so, summary. This is for ad-hoc kind of de-duping fuzzy matching, not batch level. I think my slides from Hadoop Summit are posted somewhere.
If you care about similarity at scale, like doing the batch-level dedupe, go look at those. The key is: we use search to get a small set of candidates that we then run that expensive match similarity against. The pain, like with all this stuff, is always the data prep: the normalization, cleaning up the data, dealing with messy data, that's the problem. And using essentially Cassandra plus Solr in DSE makes this actually pretty darn easy, architecturally, to handle.