Description
Speaker: Jason Rutherglen, Senior Big Data Engineer at DataStax
Slides: http://www.slideshare.net/planetcassandra/dse-solr-realtimeanalytics
The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, DataStax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A real-time financial application will be run for the audience, followed by a detailed look at how the application was built. An overview of DataStax Enterprise Solr features will be given, along with the many enhancements in DSE that make it unique in the marketplace.
I basically work on DataStax Enterprise. I've worked on the Solr integration for the most part, so I do a lot of support and that sort of thing, in addition to development. I did one book, Programming Hive, for O'Reilly, and I'm doing an introduction to Solr for O'Reilly. As for DataStax, you probably know what DataStax is: we do Cassandra and we do DataStax Enterprise. We've got DataStax Enterprise 3.x, so that's our main version right now.
Cassandra is, I would say, real time, but in Solr it's not efficient to do that. So when we talk about near real time, we're talking about roughly a second of latency: you submit a document and you'll be able to search on it within about a second or so.

So why not a relational database? Basically, Solr provides horizontal scalability, just like Cassandra, just like Hadoop. A lot of our customers are going from relational to big data, and Solr is really useful for converting SQL applications over to the big data space. Typically those applications are not the batch-based kind. They're the kind with a user interface where people want to run interactive, ad hoc queries. With Solr the query latency should be around 100 milliseconds or so, so it's pretty fast, just like Google, and the costs are a lot less.
You're probably familiar with SolrCloud and Elasticsearch and things like that, so you might ask: why Cassandra? Why are we integrating Solr with Cassandra? Basically, we leverage everything that's good about Cassandra, everything that's been built into it to do distributed work, multiple data centers and all that, and we let Solr do the queries and the indexing and that's it. We don't use SolrCloud or anything like that, because we don't need to. We get all of that with Cassandra.
As you're probably familiar, Cassandra uses a simple Dynamo model that works really well in distributed environments, and it's automatic: the Dynamo model takes care of deciding where the documents go and all that. So it's a fairly simple architecture in that regard. You're probably familiar with the whole Cassandra versus HBase debate, but Cassandra is just a lot easier to use.
Let's see, batch analytics. The way I look at things, there are two use cases. There's batch analytics, where you wait a little bit and you're probably going to do a join. Otherwise you can pretty much just use Solr, or you can use Cassandra, and then you're going to get near real time.
For real-time analytics, I think you can use Solr, you can do complex event processing, you can use Cassandra, and then there's newer stuff that I hesitated to put in here, like Impala and Stinger for Hive. The latency there is typically somewhere between five and 30 seconds, so I wasn't sure where to put that. Complex event processing is doing all the calculations in real time. It's a little different, because the data is never really at rest: all the queries are essentially computed as the data is streamed through. So it's a different architecture, whereas Solr actually iterates on the data at rest, typically in RAM.
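To make the contrast above concrete, here is a minimal Python sketch, not taken from the talk, of the two styles: a result that is updated as each event streams through, and a query that iterates over data already stored. The class and function names and the sample values are made up for illustration.

```python
class RunningAverage:
    """Streaming style: the result is updated as each event passes through."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


def average_at_rest(stored):
    """At-rest style: iterate over data that has already been stored."""
    return sum(stored) / len(stored)


prices = [10.0, 12.0, 14.0]  # made-up sample values
stream = RunningAverage()
latest = [stream.on_event(p) for p in prices][-1]
print(latest, average_at_rest(prices))
```

Both calls arrive at the same number here. The difference is when the work happens: per event in the first case, per query in the second.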
So Lucene is a Java library, and Lucene and Solr are one Apache project. In its basic form it's an inverted index, but it's grown to be a lot more. It was originally built for text analytics.
So what is an inverted index? It's important to know, because that's the basis for the whole Solr and Lucene ecosystem. It's very simple, really. It's just a terms dictionary: a sorted list of terms, or words, where each word points to a posting list. A posting list carries some metadata, but in its basic form it's a set of document IDs, which are integers. That's basically the inverted index.
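The structure described above can be sketched in a few lines of Python. This is a toy illustration under simple assumptions (a whitespace tokenizer, document IDs taken from list positions), not how Lucene stores its index on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted posting list of integer document IDs."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # naive whitespace tokenizer
            postings[term].add(doc_id)
    # Terms dictionary: terms in sorted order, each pointing to its posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = [
    "Solr indexes documents",
    "Cassandra stores documents",
    "Solr queries Cassandra",
]
index = build_inverted_index(docs)
print(index["documents"])
print(index["solr"])
```

Looking up a term is then a dictionary lookup followed by a walk over a list of integers, which is why term queries are cheap.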
Lucene will tokenize text, but a lot of our customers tend not to do much text analytics and are more focused on what I would call relational-database types of queries. So it's a little different from your typical text-query kind of thing, but we do support that, of course.
What's been missing for a number of years is the whole distributed, cloud type of capability, so in the Apache Solr community they started working on SolrCloud. In my opinion it's got a little ways to go to be totally useful, but it uses ZooKeeper and provides the missing cloud piece. In DataStax Enterprise, of course, by using Cassandra we get the cloud piece really easily. SolrCloud uses ZooKeeper, and I think that's kind of a fatal flaw.
ZooKeeper is yet another system you have to manage. Cassandra is peer-to-peer, so it's a lot easier to manage, and you get multiple data center replication. I think SolrCloud is playing catch-up with things like Elasticsearch. Elasticsearch has been out there a little longer, and they focused on the things Solr didn't provide for a number of years: near-real-time search and distributed operation, the cloud types of capabilities.
One of the things about Lucene and Cassandra that's really interesting is that they both implement the same type of log-structured merge tree. That means that if you're buying hardware for a DataStax Enterprise installation, pretty much whatever is going to work for your Cassandra nodes is also going to work for your Solr nodes. So, for example, SSDs are really good.
You want to be able to tune the heap, so there are a lot of similarities in running the two and in having Solr nodes in your system. Basically, all we do with DataStax Enterprise and Solr is store the data in Cassandra and index the data in Solr, and that's it. We let Cassandra place the data on given nodes and things like that.
We let Cassandra do all the replication and take care of which nodes are online and so on, so there's a very clean delineation between Solr and Cassandra. Basically, Solr is only a secondary index, and that's benefited us greatly in terms of not having to build distributed, SolrCloud-style features into Solr. I call it a separation of church and state.

Now, indexing. Typically you're doing a lot of indexing when you're putting data into Solr. It's a CPU-intensive task.
It's typically not an I/O-bound task, and we've done a lot of optimizations to make it fast in DataStax Enterprise. Queries, on the other hand, are typically I/O bound. If the index is not in RAM, queries are basically going to slow down by an order of magnitude, both in terms of how many queries per second you get and in the overall query latency, which is the raw query time.
Another thing we do, any time you have a distributed search, is automatically round-robin queries to different nodes. So if you have a replication factor greater than one, like two or three, you just hit a node and it's going to automatically balance the distributed query across nodes for you. You don't really have to worry about that, which is a nice feature.
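As a rough picture of what round-robin balancing across replicas means, here is a minimal Python sketch. It is only an illustration of the general technique, not DataStax Enterprise's actual routing code, and the class and node names are hypothetical.

```python
from itertools import cycle


class RoundRobinRouter:
    """Rotate incoming queries across the replica nodes in turn."""

    def __init__(self, replicas):
        self._replicas = cycle(replicas)

    def next_node(self):
        return next(self._replicas)


# Hypothetical node names, as if the replication factor were 3.
router = RoundRobinRouter(["node1", "node2", "node3"])
picks = [router.next_node() for _ in range(6)]
print(picks)
```

Each replica ends up serving an equal share of the queries, which is the load-spreading effect described above.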
So 3.0.1 and 3.0.2 are the current releases of DataStax Enterprise. We added a lot of cool features, like re-indexing. Because we have all the data stored in Cassandra, you can just run a command and it will reindex all your data. So if you change your schema, which is something people do in Solr a lot, say you want to change a string to an int or something like that, then you have to recreate the entire index, and that's true in typical Solr, SolrCloud, and Elasticsearch as well. Here there's no custom code required.

We also added some interesting features, like being able to view the heap usage. People typically run out of heap space when they're using Solr. From doing support, I'd say that's a very common problem. So we let you view the memory usage, and that lets you do capacity planning, which is something people usually miss, and then it turns into a firefight and we have to come in and try to fix it.
I'm not sure how much sense this will make, but I consider this a major pain point for Solr, and we needed to do it because we run range queries in Solr that correspond to the ring. If you're familiar with Cassandra, you know it's a ring model, and we need to narrow the query down to the part of the ring that the query should apply to, and then we need to cache those queries. So we use the per-segment filters for that, together with the near-real-time search.
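To illustrate what narrowing a query to part of the ring means, here is a small Python sketch. It assumes the usual Cassandra convention that a range covers (start, end] and may wrap around the end of the ring. The token values and function names are made up, and this is not the per-segment filter that DataStax Enterprise uses internally.

```python
def in_range(token, start, end):
    """True if token lies in the ring range (start, end], allowing wrap-around."""
    if start < end:
        return start < token <= end
    return token > start or token <= end  # range wraps past the end of the ring


def filter_to_range(docs, start, end):
    """Keep only the documents whose ring token falls inside the given range."""
    return [doc_id for doc_id, token in docs if in_range(token, start, end)]


docs = [("a", 10), ("b", 45), ("c", 80), ("d", 99)]  # (doc id, token), made up
ordinary = filter_to_range(docs, 40, 90)
wrapping = filter_to_range(docs, 90, 20)
print(ordinary, wrapping)
```

Because the same range is applied to every query a node answers, the result of this filter is a natural candidate for caching.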
We also support vnodes and composite keys. 3.1 is really going to be a fairly good achievement, I would say, because a lot of the problems that have been nagging Solr, and maybe DataStax Enterprise, for a while are going to be fixed.
Everything will be pretty smooth, I think. One feature that will be really good, and that we'll be adding really soon, is multiple data center re-indexing. I call it live re-indexing. If you've got a production app and you want to change the schema, you don't have to take any downtime.
If you have multiple data centers, you just take one data center out and reindex it, then take another data center out and reindex that one, and you should be able to stay online the whole time. That's fairly advanced, I would say, and nothing else is going to offer it. Elasticsearch and SolrCloud are not going to offer that, I don't think. We're also looking at making a way for you to actually write CQL.
One of the things I like to go over, which I find people don't talk about enough, is this: if you've got an existing application that's in SQL, on Oracle or something like that, how do you convert it to Solr? One of the things I've found is that there aren't really good guides for that, so we're going to go over some basic SQL queries and how those look in Solr.
It could be a little bit easier to use right now, but maybe we'll fix that in the future. There are soft commit times. A soft commit is basically committing an index into RAM first, and then later on it goes to disk. In DataStax Enterprise, the transaction log is held in Cassandra, so if a node blinks out you don't lose any data, not only because of the replication and the quorums and so on, but because of the Cassandra commit log.
This is what the auto soft commit looks like. I'm going super fast through this because there's not a lot of time, but the slides will be available later.

The field cache is a really important concept to know about. Any time you do a sort or a faceted query, it's typically going to load these heap-based data structures into RAM. In Solr 4.3 there's an option for keeping them on disk, or on the SSD.
This is an important concept to know about, because customers will typically try to run a sort or a facet query and then, all of a sudden, their whole system runs out of memory, and that's bad in production and bad in general.

Next, SolrJ and HTTP. Basically, everything with Solr is HTTP-based. Of course, we also support inserting data using CQL.
There's no way to do CQL queries against Solr, or I should say there is, but you shouldn't use it. So if you're doing queries, it's always HTTP-based, and you can insert data via CQL or you can use the native Solr API. Basically, if you've already got a Solr application, you can drop in DataStax Enterprise and everything will basically just work.
So yes, you can do Solr queries via CQL, but I don't recommend it. I've only used it for debugging, because it only hits one node. It doesn't distribute the query out, and it's very limited. We may address that in the future, but probably not in the near future. This is what a typical CQL insert looks like, if you're familiar with CQL.
We're just going to stream through some SQL examples really quickly. This is your most basic SQL query: SELECT * FROM a table WHERE type equals PDF, or whatever. So what does that query look like in Solr? Solr uses HTTP, so you've got the HTTP parameter q=type:pdf. Here type is the field as defined in the schema, which will also be a column in Cassandra, and we're looking for pdf.
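As a small illustration of the mapping just described, this Python sketch builds the HTTP query string for the basic WHERE clause. The host, port, and core name in the URL are placeholders chosen for the example, not values from the talk, and solr_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and core name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/demo.docs/select"


def solr_url(params):
    """Build a Solr select URL from a dict of HTTP query parameters."""
    return SOLR_SELECT + "?" + urlencode(params)


# SQL: SELECT * FROM docs WHERE type = 'pdf'
url = solr_url({"q": "type:pdf"})
print(url)
```

The colon is percent-encoded in the URL. Solr sees the decoded value type:pdf, that is, field name, colon, then the term to match.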
There's no need to create specific indexes. In SQL you would create an index on something, like a B-tree index. Lucene provides the indexing pretty much out of the box. I think Lucene probably supports more indexes, more fields having an index, than a SQL database would. I haven't tested that, but just indexing anything, even over-indexing, is fine with Lucene, I think.

And then, what if you want to select columns?
This is what it would look like if we want title and text only. To get all the data in Solr, the q would be *:*, and that returns everything. Then you just put in the fl parameter, which stands for fields, and you do title,text. So, some pretty basic stuff. If you want to do a COUNT(*), you would just run the query, and Solr returns the total number of matches along with the results.
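The column selection and the count can be sketched as query strings in Python. The q and fl parameters are as described above. Adding rows=0 to skip the documents and read only the match count (the numFound value in a Solr response) is a common Solr idiom that the talk does not mention, so treat it as an assumption of this sketch.

```python
from urllib.parse import urlencode, parse_qs

# SQL: SELECT title, text FROM docs
select_columns = urlencode({"q": "*:*", "fl": "title,text"})

# SQL: SELECT COUNT(*) FROM docs WHERE type = 'pdf'
# rows=0 asks for no documents back; the count is read from the response.
count_only = urlencode({"q": "type:pdf", "rows": 0})

# Decode again to show what Solr receives after URL decoding.
decoded = parse_qs(select_columns)
print(decoded)
print(count_only)
```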
If you want to add a GROUP BY, you add stats.facet, and that's going to do the same thing as a GROUP BY in SQL. And the most basic thing of all, I'd say: if you're converting a text-based app, you may use the LIKE operation, and the way to map that is, instead of using a %, you just use an asterisk.
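Here is a hedged Python sketch of both mappings. The like_to_solr helper is hypothetical and does no escaping of Solr special characters. Mapping the SQL single-character wildcard _ to ? goes beyond what the talk covers. The price and type field names are invented, and the stats and stats.field parameters are added on the assumption that stats.facet is used together with Solr's stats component.

```python
from urllib.parse import urlencode


def like_to_solr(field, pattern):
    """Translate a SQL LIKE pattern into a Solr wildcard query on one field."""
    # '%' becomes '*'. Mapping '_' to '?' (one character) is an addition
    # beyond what the talk mentions.
    wildcard = pattern.replace("%", "*").replace("_", "?")
    return f"{field}:{wildcard}"


# SQL: SELECT * FROM docs WHERE title LIKE 'cassandra%'
like_query = like_to_solr("title", "cassandra%")

# SQL (roughly): SELECT type, SUM(price), AVG(price) FROM docs GROUP BY type
# 'price' and 'type' are hypothetical field names.
group_by = urlencode({
    "q": "*:*",
    "stats": "true",
    "stats.field": "price",
    "stats.facet": "type",
})

print(like_query)
print(group_by)
```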
Right, so the question is: is there a different process or a different node for Solr? There are different nodes. You can have Cassandra nodes or Cassandra-plus-Solr nodes, but there's always Cassandra there. And then the question is whether there's a different process, and no, everything's in one process. We made a conscious effort to do that, because the two are just totally tied together.
Facets in Solr are pretty fast, so you'd have to test it out. You could keep the count, or you could use the facets. I think it depends on what works best. If you want to enable more ad hoc queries over different ranges, for example, then it's probably better just to keep the raw data and execute the counts that way.
We're keeping an independent index per node, and indexing is typically fairly fast these days, especially given that it's multi-threaded. So when a repair happens, a reindex occurs: not a full reindex, but the data that's moved is reindexed. A new index is created for that data.
It's faster and more efficient faceting. We made a conscious effort here: because most of our customers do near-real-time search, we wanted faceting to also be near real time. So we made sure that every type of faceting you want to do, and everything else, is tuned for near real time. It's optimized for near-real-time search, which is typically what people do when they're using Cassandra or NoSQL systems.
Even if it did, it would only be a single thread. I would say it's typically very difficult to do that. It may only happen if you're doing something crazy, like faceting with the older Solr, where it's trying to load these big data structures into RAM and it's just out of memory constantly.
That's a good point. I think we do need to provide better monitoring of low-level details. The question would be how. The simple answer is that it ends up looking like a real-time log analysis tool, which would be like Splunk, or, you know, Elasticsearch is actually investing in some of this.