From YouTube: Cassandra and Lucene
Description
presented by Jake Luciani of Riptano. Slides: http://bit.ly/9Rbuyp
The talk covers:
use cases for search and type of search applications
problems scaling and maintaining Lucene/Solr
Cassandra
Lucandra (Lucene + Cassandra)
A
I'm tjake on Twitter, if you guys use Twitter. I work for Riptano, which is the Apache Cassandra company, and Lucandra is the project this talk is about.
A
Questions: this was probably said already, but if you have any questions, just ask. So what kinds of search apps do we need to support today? Lucene is really great. It's been around for 10 years and it's highly optimized for lots of use cases, but they basically fall along two different dimensions. One is index size: you can have a really small index, like a couple hundred or a thousand documents, up to Wikipedia or anything really large, with millions or billions of documents. The other dimension is index freshness: how real-time does it need to be?
Wikipedia can update once a day or once a week or whatever, but an application like Twitter needs to stay as close to real time as possible, so you need to update really quickly. Actually, at Lucene Revolution (I didn't go to it, but I heard) Twitter is now using Lucene, a custom build, to do their searching, so this is a little inappropriate now. There are slides on that floating around on Twitter. Yeah, yeah.
A
I think Twitter's search problem is actually really straightforward. They don't use any faceting and everything is ordered by time. They do put relevant tweets at the top, but that doesn't necessarily need to come from Solr. So I think doing 12,000 searches a day, given their use case of 140 characters, is actually relatively doable with Lucene and how they're boxing the problem. Not out of the box, but with their modifications. But when you need a real-time system that does sorting on multiple dimensions and different kinds of scoring, it gets tricky, and it doesn't really handle lots and lots of indexes. In Lucene and Solr, an index is a physical directory.
A
So if you've got a million users, you've got to manage a million directories, which is not very fun, and that's what I'll go into here: some of the problems with Lucene and Solr today. These are all things they're working on as well, so this isn't a harsh critique, because I know they're all being addressed, but this is where I thought there could be some benefits from combining with Cassandra.
A
As you generate and index more and more documents, you need to merge your index files together, and then you need to reopen them.
A
Then there's your monitoring and your failures: you need a big ops team for this whole thing just to maintain your search engine. I feel a little déjà vu here, because you have a lot of the same operational problems with MySQL.
A
So, Cassandra. What is it?
B
A
Cassandra is a combination of two NoSQL ideas. There's the Bigtable paper from Google, which is what their Bigtable system and HBase are built on, and then there's the Dynamo paper from Amazon. They describe two separate systems, and Cassandra is sort of a superset that combines the best of those features and gives you a new kind of distributed database. The key thing about it is that it's peer-to-peer. It works sort of like, if you remember, Kazaa or, I guess, yeah, Gnutella: those systems where it's basically a distributed hash table and you can just join the cluster. You just connect to a seed and you join the cluster. Cassandra works exactly the same way.
It's configurable in terms of Eric Brewer's CAP theorem: you can pick and choose which dimensions you want to adhere to for every read or write, which makes it really interesting. And it's not just a key-value store. For any given key, you can have something like a TreeMap in Java: a sortable tree of data that you can keep under a given key, which is what makes it interesting in terms of how to model your data. So there are a lot of data modeling pieces you can think about when you're building something on Cassandra.
A
The pluggable replication and sorting is really cool too, because you can write your own type of sort. If you're indexing data and it's a timestamp, those sorts come built in, but let's say you want to index an object and sort it a certain way: if you write into different column families, you can use different ways to sort. The same thing goes for replication. It comes with different types of partitioning and replication strategies, and you can write your own and just drop it in as a class.
C
D
A
C
What do you mean by "writes are very fast"? Is it synchronous or asynchronous? Do you have to write on all nodes, and what...
A
It's very low friction. There's not a lot going on in a write: it's basically just appending to a log file, so your only lag is really the time it takes to write to disk, and you don't have to sync on your write. You can configure that too, but I'll walk through the specifics of a write and what I mean by that. It also integrates really well with Hadoop.
A
C
A
There's lots of adoption and there's lots of development. Like I said, there are a lot of big companies that use it, from Facebook to Twitter and all these guys, and Riptano, which is the commercial company, offers commercial support and training.
A
Actually, let me skip this and go into writes and all this stuff. When you do a write in Cassandra, let's say I'm writing two keys. Let's not talk about the internal data model yet, just about storing keys around a ring, so it's a distributed hash table. In this case I'm just using letters, from A to Z. You can see the first node up here is A to C, the next node goes from D to F, the next from G to I, and so on.
I'm showing two different partitioners: the ordered one, if you want to do an ordered row scan, versus the random one, if you want to write randomly into the ring so your data is distributed evenly. You can write to any node and it'll proxy the write to the appropriate node, and then it depends on what you set your write consistency to: if you want to write to all replicas or to a quorum of replicas, you can specify that.
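The ring described above can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the `Ring` class, the token values, and the node names are all hypothetical, and the "random" partitioner is approximated with an MD5 hash of the key.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: each node owns keys up to and including its token."""

    def __init__(self, tokens, token_for_key):
        # tokens: {token: node_name}; token_for_key maps a row key to a token.
        self.tokens = sorted(tokens)
        self.nodes = [tokens[t] for t in self.tokens]
        self.token_for_key = token_for_key

    def replicas(self, key, replication_factor=1):
        # First node whose token is >= the key's token, wrapping around the ring.
        token = self.token_for_key(key)
        start = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        count = min(replication_factor, len(self.nodes))
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(count)]


def ordered_token(key):
    # Order-preserving: the token is the upper-cased key itself.
    return key.upper()


def random_token(key):
    # Hash-based: spreads keys evenly but loses key ordering.
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Ranges from the slide: A-C, D-F, G-I; the rest of the alphabet is collapsed.
ordered_ring = Ring(
    {"C~": "node1", "F~": "node2", "I~": "node3", "Z~": "node4"}, ordered_token
)

# Four evenly spaced tokens over the 128-bit MD5 space.
random_ring = Ring(
    {f"{(i + 1) * 2**126 - 1:032x}": f"node{i + 1}" for i in range(4)}, random_token
)
```

With the ordered ring, `ordered_ring.replicas("dog", 2)` returns the node owning D to F plus the next node on the ring, so a range scan touches neighbouring nodes in key order. With the hashed ring the same key lands wherever its hash falls, which balances load but makes ordered scans meaningless.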
So you can make those writes as fast as you want. What a write is in Cassandra: it keeps what's called a binary memtable, which is an internal sorted set of the writes that come in. For every write that comes in, every insert, it sorts it into its current memtable and writes it to disk, and then once that memtable gets to a certain size, it writes it out to disk as an SSTable.
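As a rough sketch of that write path, the following toy node appends each write to a log, keeps it in an in-memory table, and flushes a sorted, immutable table once a size limit is hit. `ToyNode`, its method names, and the tiny `memtable_limit` are assumptions made for illustration; real Cassandra's files and thresholds look nothing like this.

```python
class ToyNode:
    """Write path sketch: append to a log, fill a memtable, flush to SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the append-only log file
        self.memtable = {}     # in-memory writes, sorted when flushed
        self.sstables = []     # immutable, sorted lists of (key, value) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append, no seeks
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable table and start fresh.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None
```

The point of the design is that a write never has to find and update existing data on disk; it only appends, which is why the write latency is roughly the cost of one sequential disk write.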
A
On the read side, there are three pieces. I can go through the read side if you want, but for every SSTable there's a Bloom filter, there's an index file, and then there's the actual data.
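To illustrate why those three pieces help, here is a small sketch of a read that consults a Bloom filter first, then an index, and only then the data. The `BloomFilter` and `ToySSTable` classes, the filter size, and the use of SHA-256 are choices made for this example, not details taken from the talk or from Cassandra.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToySSTable:
    """One immutable table: Bloom filter + index + sorted data."""

    def __init__(self, rows):
        self.data = sorted(rows.items())
        self.index = {key: i for i, (key, _) in enumerate(self.data)}
        self.bloom = BloomFilter()
        for key in rows:
            self.bloom.add(key)
        self.data_reads = 0  # counts how often the data itself was touched

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the index and the data
        position = self.index.get(key)
        if position is None:
            return None  # Bloom filter false positive
        self.data_reads += 1
        return self.data[position][1]
```

A read that misses in the Bloom filter costs no disk access at all in this model, which matters when a node holds many SSTables and most of them do not contain the requested key.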
A
That's how reads and writes work. Tell me if I'm going too fast; I'm just focused on time. So these are my kids. If I write one of their names as a key and I write it to this random node, it'll get written to the appropriate spot under the random partitioner.
A
And one of the cool things is that as you add and remove nodes from the system, Cassandra manages that for you. It redistributes the data appropriately to keep up with the replication factor you set, and it keeps track of what data should be on which node.
A
It uses an approach called gossip, which is defined in the Dynamo paper. When you want to join the cluster, you specify any seed node (any node can be a seed), so you point it at the seeds, and then that seed will tell a couple of guys in the ring, "Hey, this new guy's joined," and those guys tell a couple of other guys.
A
Eventually you end up with the ring being in sync. If you want to remove a node from the environment, you can do that as well. You can also move tokens for any given node: let's say you have a hotspot on certain data and you want to increase the replication factor, or you want to move some nodes around because your tokens aren't evenly distributed. You can do that as well. So it manages all the cluster problems for you, and I think one of the really cool things is being able to scale the system down. We always talk about scaling up, but especially in a world where you can buy on-demand hardware on EC2, it's really nice to be able to take nodes out of the cluster.
C
A
This is another configurable thing. One of the powerful things about Cassandra is that you can write across data centers. There's something called the endpoint snitch, which tells it where one rack ends and the next begins. You can write your own snitch; there are ones for EC2 that put replicas in different regions, or if you have your own racks, you can define them in a YAML file. What it's really for is replication. If you're doing a read within a certain data center, Cassandra will make sure there are replicas in your data center and it will do its read from that node. As for writes, with the rack-unaware strategy it just writes to the node next to it.
A
Okay, so that's sort of my Cassandra spiel, I guess. Oh sorry, I didn't go through the data model. The data model is nested. You have a keyspace, which is sort of like a namespace. If you have an application and you want to share the same cluster for your dev environment and your testing environment, let's say, you can create two different keyspaces. In a database, it's basically like a separate schema.
Within a keyspace you write keys, but each key lives within a column family, so this is a very hierarchical design, and you define it in the configuration. There are two types of column families. You can have a regular column family, where you have a key and underneath it a list of key-value pairs. And then you have this thing called a super column, which gives you another level of keys and values, so you can have a second dimension.
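One way to picture that hierarchy is as nested maps. The sketch below uses plain Python dictionaries; the keyspace names, column family names, row keys, and helper functions are all made up for illustration and are not a Cassandra client API.

```python
# keyspace -> column family -> row key -> columns
cluster = {
    "dev": {
        "Users": {  # regular column family: row key -> {column: value}
            "jake": {"email": "jake@example.com", "city": "NYC"},
        },
        "Timeline": {  # super column family: one extra level of keys
            "jake": {
                "2010-10-01": {"text": "hello", "source": "web"},
                "2010-10-02": {"text": "world", "source": "api"},
            },
        },
    },
    # Same layout in a separate keyspace, e.g. for a test environment.
    "test": {"Users": {}, "Timeline": {}},
}


def get_column(keyspace, column_family, key, column, super_column=None):
    """Look up one value, optionally descending through a super column."""
    row = cluster[keyspace][column_family][key]
    if super_column is not None:
        row = row[super_column]
    return row[column]


def slice_columns(keyspace, column_family, key):
    """Return a row's (super) column names in sorted order."""
    return sorted(cluster[keyspace][column_family][key])
```

For example, `get_column("dev", "Timeline", "jake", "text", super_column="2010-10-02")` walks four levels: keyspace, column family, row key, super column, and finally the column itself.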
A
As an example of this, I'll talk about the Lucene stuff. The way Lucene works out of the box, you have a searcher, you have a reader, you have a writer, and you have your disk.
A
B
A
Sorry, so Solr is the white box that wraps it in an HTTP layer. The way this Lucandra stuff works is that instead of storing it on disk, I'm actually storing the inverted index in Cassandra, using the Cassandra data model. So for any write or any read, it's going to Cassandra to get the data. The way this works is that there are two parts of a Lucene index.
A
There's your actual document data. In Lucene there's a document and there's a field, and a field is just a key-value pair. So you could say field "date" is this, field "title" is this, field "url" is this, and you specify how you want each field to be indexed, meaning how it's parsed. If it's just regular English, or if it's a UTF-8, internationalized language, there are specific analyzers for that. Then, when it parses that data, it creates what are called your term vectors. They're broken up into a number of pieces. There's the term frequency: for any given field, if you have, let's say, a bunch of text, say...
C
A
The number of times that the word "the" occurs will be represented in the term frequency. The key for Cassandra is a combination of a few fields: the index name, the field name, and then the term itself. So let's say my index name is "index1", my field name is "text", and the term is "the". Then, within that, the value under that key is going to be the document ID.
Every document has a unique ID, and underneath that ID you have a number of different pieces of information: the term frequency, the number of times that "the" occurred; the term positions, like where in that field "the" occurred; the term offsets, which, sorry, yeah, are in terms of what characters come before or after, including words that get thrown away because they're stop words; and then the normalization factor, which in terms of your scoring says how important this word is to this document.
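The layout just described (composite row key, document ID underneath, term information under that) can be sketched as follows. This is an illustration under stated assumptions, not Lucandra's actual code: the `/` delimiter, the function names, the one-word stop list, and the `1/sqrt(n)` length normalization are all choices made for this example.

```python
import math
import re
from collections import defaultdict

STOP_WORDS = {"of"}   # illustrative only
DELIMITER = "/"       # hypothetical separator for the composite key


def term_key(index_name, field, term):
    """Composite row key: index name + field name + term."""
    return DELIMITER.join((index_name, field, term))


def index_field(store, index_name, doc_id, field, text):
    """Add one field of one document to the inverted index held in `store`."""
    matches = list(re.finditer(r"\w+", text.lower()))
    kept = [m for m in matches if m.group() not in STOP_WORDS]
    # Assumed normalization: shorter fields weigh each term more heavily.
    norm = 1.0 / math.sqrt(len(kept)) if kept else 0.0
    for position, match in enumerate(matches):
        term = match.group()
        if term in STOP_WORDS:
            continue  # dropped, but it still advanced the position counter
        info = store[term_key(index_name, field, term)].setdefault(
            doc_id, {"freq": 0, "positions": [], "offsets": [], "norm": norm}
        )
        info["freq"] += 1
        info["positions"].append(position)
        info["offsets"].append((match.start(), match.end()))


store = defaultdict(dict)  # row key -> {doc_id: term info}
index_field(store, "index1", "doc42", "text", "The king of the hill")
```

After this runs, `store["index1/text/the"]["doc42"]` holds the frequency, positions, character offsets, and normalization factor for "the" in that document, which mirrors the super-column shape described in the talk.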
A
Okay, so once you deploy your index in Cassandra, you can do some cool things. You no longer have a single writer and a single reader. You can write to any node, you can read from any node, and you can have any number of indexes. You don't have to worry about replication or optimization or any of the operational stuff that comes with Solr and Lucene; you can let Cassandra do the work for you. I don't know how much time I have. How many minutes? ("I think you have 15 more minutes.") Oh good, cool. So I was going to do a demo, if you guys want to see it.
A
But before that, are there any questions on Cassandra or Lucene that you guys want to talk about first? I know this is probably a lot for the amount of time I had.
A
Yes, yes. All right, so, kind of fun.
A
What's my seed or list of seeds, how many concurrent readers and writers do I want, what port do I want to run on, and down here is where I define the actual column families. If you look here, the column family setup is really simple: you've got documents and you've got term info. I'm just comparing each column by bytes, and the super column is for the term info, as I described.
D
A
If you remember from the other slide, the keys are composite and start with the index name, so your index is just a part of the key. If I want everything under a particular index, I just do a row scan for everything containing that part of the key, so it's an ordered scan.
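Because the index name is the leading part of each row key, "everything under one index" becomes a contiguous range in an ordered key space. A minimal sketch of that idea, assuming a `/` delimiter and a hypothetical `scan_prefix` helper over an already sorted list of keys:

```python
import bisect


def scan_prefix(sorted_keys, prefix):
    """Return every key that starts with `prefix` from a sorted key list.

    Assumes no key contains the sentinel character U+FFFF.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


row_keys = sorted([
    "index1/text/cassandra",
    "index1/text/the",
    "index1/title/lucene",
    "index2/text/the",
])
```

Here `scan_prefix(row_keys, "index1/")` returns only the three `index1` rows, and narrowing the prefix to `"index1/text/"` returns just the terms of one field. This only works with an order-preserving partitioner; with hashed tokens the matching rows would be scattered around the ring.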
A
You know, the Thrift service is running on localhost. I'm actually a...
A
Oh, and actually one thing I've got to do. One of the new things with 0.7, the new version of Cassandra (I'm guessing most of you guys haven't used it): in the older version of Cassandra you had to define your keyspace information up front, but now in the new version you can create as many column families as you want, or drop, recreate, and modify them. It also supports secondary indexing, so instead of just indexing on column information, you can also index the values.
A
So if I do "ring" here, I can see what the ring is. There's only one node, but obviously you could get a whole list of nodes. It just picks a random token by default.
A
And now what I'm going to do is run the demo application that comes with it, I guess.
A
All right, I'll do one demo. What this does is basically act like a Delicious search engine. I took a tab-delimited text file with bookmark information from Delicious, so it's just the URL, the title of the URL, and then a list of tags.
A
B
A
So Solr comes with an example, so I just took their example and...
A
So it just searched for the word "stolen", and yes, it does work with faceting and everything else. I will talk about that, though. Let me just go back to the presentation. So there's the thing working, but what I wanted to show real quick is some examples of this in use. There are companies that use this. I wrote this initially just for this.
A
I built a toy to kind of show how it's used. It's called sparse.ly. It's a Twitter search engine kind of thing, though this was over a year ago. You go to sparse.ly and what you can do is log in, and it creates an index of just your Twitter stream. So if you follow 500 people, you can search just across those 500 people, and you don't have to search all the other stuff. I didn't like the Twitter search because you always have to search across everyone, and there's always lots of spam and garbage in there.
B
A
My wife may have kicked the cord. If you typed in www...
A
D
Could you explain a little more about when Solr and Lucene actually do get...
D
A
Sure. In Lucene, you need to reopen your index every time you do a write if you want that write to be found by the readers. What Lucene does is keep track of all of its terms and information from all the subreaders. Per file, there's a subreader. A reader just has a list of subreaders, and each of those subreaders has a list of the terms it has available and the last time it was updated.
A
So all that goes away with this. Because the data is stored under the same key, there are no longer multiple copies; there's the eventually consistent copy. Depending on how consistent you want your data to be, you can change that factor on your writes. By default it writes at consistency level ONE: it has to write to one replica for the write to succeed, but the minute that write is written on that one replica, it'll replicate to the others, so your data will eventually show up there and you don't have to worry about it. Every read is run against the cluster, so you don't need to reopen anything; you just re-query the same data from the cluster. And that brings me to my last point.
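The trade-off between write speed and consistency can be made concrete with a small sketch. The level names ONE, QUORUM, and ALL are Cassandra's; everything else here (the `required_acks` and `write` helpers, replicas modelled as dictionaries, `None` standing for a node that is down) is a simplification assumed for this example.

```python
def required_acks(consistency_level, replication_factor):
    """How many replicas must acknowledge before a write counts as successful."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {consistency_level}")


def write(replicas, key, value, consistency_level="ONE"):
    """Write to every live replica; report success per the consistency level.

    replicas: list of dicts, with None marking a replica that is down.
    """
    needed = required_acks(consistency_level, len(replicas))
    acks = 0
    for replica in replicas:
        if replica is not None:
            replica[key] = value
            acks += 1
    return acks >= needed
```

With three replicas and one of them down, a write at ONE or QUORUM succeeds while a write at ALL fails, which is the sense in which you can choose per operation how much consistency to pay for.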
A
The really strong thing about this: the sparse.ly use case works really well. At one point it was popular, so there are probably five or six thousand users, which means five or six thousand indexes running on a single box. There aren't five or six thousand directories, and each individual index has under a hundred thousand documents. That's a good use case for the current version of Lucandra.
A
The idea for the newer version is to really take advantage of Solr's sharding and make it sort of auto-shard for you, using Cassandra underneath. Since the ring in Cassandra knows where what data lives, it will create a maximum index of...
C
A
Let me describe the problem with the current version better. In the current version you have this problem where, let's say you have a million documents and you search for the word "the", which is contained in 90% of those documents. That means you've got to pull 900,000 document IDs over the wire. That's a performance problem.
With the newer version, what it does is embed Cassandra as part of Solr, so Solr becomes part of the ring, and each index has a maximum size of, let's say, 100,000 documents. So when I search for the word "love", the Cassandra part of Solr says, "All right, where are the other shards in the ring?" and it uses the Solr APIs to talk to the other parts of Solr.
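The auto-sharding idea can be sketched as a scatter-gather search over fixed-size shards. This is only an illustration of the general pattern: the `ShardedIndex` class, the tiny `MAX_DOCS_PER_SHARD`, and the term-count scoring are assumptions, and no Solr or Cassandra API is used.

```python
import heapq

MAX_DOCS_PER_SHARD = 3  # the talk suggests something like 100,000


class ShardedIndex:
    """Documents are split into shards with a capped number of documents."""

    def __init__(self, max_docs=MAX_DOCS_PER_SHARD):
        self.max_docs = max_docs
        self.shards = [{}]  # each shard maps doc_id -> text

    def add(self, doc_id, text):
        if len(self.shards[-1]) >= self.max_docs:
            self.shards.append({})  # current shard is full: start a new one
        self.shards[-1][doc_id] = text

    def search(self, term, top_n=2):
        # Scatter: each shard scores its own documents, returns only its top hits.
        partials = []
        for shard in self.shards:
            hits = [
                (text.lower().split().count(term), doc_id)
                for doc_id, text in shard.items()
            ]
            hits = [hit for hit in hits if hit[0] > 0]
            partials.extend(heapq.nlargest(top_n, hits))
        # Gather: merge the short per-shard lists instead of every matching ID.
        return [doc_id for _, doc_id in heapq.nlargest(top_n, partials)]
```

The coordinator here never sees more than `top_n` results per shard, so a very common term no longer means shipping hundreds of thousands of document IDs across the network to a single reader.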