From YouTube: C* Summit 2013: High Throughput Analytics with Cassandra
Description
Speaker: Aaron Stannard, Founder and CEO at Marked Up Analytics
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-high-throughput-analytics-with-cassandra-by-aaron-stannard
Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and a feat made even more challenging when analytic results have to be produced in real-time. In this presentation the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.
Okay, hello everyone. My name is Aaron Stannard and I'm the CEO of a company called MarkedUp. We're an early-stage technology company, and we do in-app analytics for native software, right now primarily on the Windows platform, although we're adding support for other technologies too. I'm ex-Microsoft, and so are a number of the other people on our team. So before I really get into the substance of our talk, which is really about building a real-time analytics cluster using Cassandra from the ground up, I want to talk a little bit about what our company does.
We help developers learn three types of things about their software after it gets deployed to the marketplace and is running on their end users' computers.
The first is we help them learn who their audience is: who's actually using the app every day, what types of devices they run it on, how often they come back and use the app over and over again. These are things you need in order to get a sense of who your audience is and what you should be testing your product on in terms of QA and hardware and so forth. So that's the first thing we do.
The second is we capture diagnostic information about our customers' apps: crashes, exceptions, performance monitoring, and so forth. We're trying to help the operations and development teams understand how well their app is actually running on other people's computers.
And the last and most important thing we do is help our customers make more money. We help them identify which customers have intent to buy versus which ones don't. So, you know, we work with some pretty big video game companies, and some of them will say, "Oh well, I think American kids between the ages of 18 and 25 will be great users for this app," and we'll say, "Actually, 13-year-old Japanese women are probably who you should be targeting with this application."
And we have the ability to back that up with data about our worldwide install base for Windows 8. So that's what we do. You've probably heard a number of talks today about companies like Netflix and KISSmetrics and others who are dealing with petabyte scale and all these sort of crazy scalability problems.
We're not there yet. We're a very early-stage technology company; we deal with millions of data points a day. But the types of problems I'm going to be talking about today are for people who are looking to build their first real-time analytics cluster, where you need to be able to measure some things in real time and want to be able to integrate Hive and Solr into it and so forth.
So I'm going to work from the bottom up and help give you a picture of how to get started with Cassandra and DataStax Enterprise in production.
It all begins with one question. Oh, this is our product. It all begins with one question: do you really need real-time analytics? I'm going to shatter your world right now and tell you that real-time analytics is a developer buzzword. It's the engineering equivalent of "big data."
You don't need real-time analytics for everything, and in fact there are a lot of use cases where it's a bad thing, where you shouldn't conflate the operational metrics you need to keep something running with the strategic metrics you need to decide how to do something in the future. So let me give you an example.
The way we break down analytics is really into three families. We have real-time, retrospective, and a third type that's not on the board here called predictive analytics, which is about what's going to happen in the future. Real-time analytics is designed to help you keep a pulse on things that are happening as they happen, and the only analytics that really need to be real time are the ones that you can respond to in real time.
All of you who are developers have probably used error-monitoring software at some point in your lives to keep tabs on the health of your servers in production. That's the perfect use case for real-time analytics, because if something goes down, you have a business obligation to respond to it in real time and try to fix the problem as it happens. Likewise, if you're a stock trader, you need to get pricing information about stocks in real time so you can make business decisions based off of it.
These are examples of the types of things you can respond to as they happen. But what about cases where you need retrospective or historical data? Imagine you're a scientist studying changes in solar flare intensity over the past 10 years, trying to determine if there's some change in the sun's behavior. You don't want that in real time, because you don't want a report every day, whenever there's a fluctuation or an outlier, telling everyone that we're all going to die from UV exposure one day and then, the next day, "oh, our bad, we had a problem with the server, silly us."
It's much more important to actually take a statistical sample over a period of days, months, or even years and produce a consistent result, using a tool like Hadoop for something like that. So what we're going to talk about today is how to build an analytics system that leverages both of these types of techniques to really drive valuable business insights,
whether for your company internally or for your customers, if you're building an external-facing product. I'm really going to try to help you think about it from both a technology and a business point of view. The point being: real-time analytics isn't inherently better than analytics that aren't real time. It's just that different types of metrics can and should be responded to in real time.
So here's how we look at real-time analytics at MarkedUp. I talked a little bit about what our product does.
Most of our metrics are operational. These are things like install rates for applications, the number of people who've used an app every day, error rates, custom events, and so forth. These are things our customers can respond to in real time.
One thing we can do that the Windows Store doesn't do very well for our customers is let them know exactly when their app gets approved in the store and how many people are installing it in different countries around the world. Because what if you're trying to time PR and marketing around that,
and you want to be able to hit the date exactly right? That's something that we can do, and our customers can respond to it. Likewise, what happens if your error rate suddenly spikes up in your application?
A lot of our developers are, you know, loyal Microsoft developers, and about six months ago, I think it was, Windows Azure had an SSL issue with storage, and we noticed about a 15 percent climb in the error rate across all of our applications.
When this happened, we didn't know what was going on. It turns out that a lot of our apps depended on Azure for storing images and other static content that they pulled down, so had it been their own back end, they could have done something about it; but since they were talking to Azure directly, they were kind of screwed for the most part. So these are the types of analytics that we measure and report on in real time, as they happen.
Otherwise, we have some metrics that are retrospective, like user retention: how long do you retain users after you've gotten them to install your app? That's something where we need to measure multiple distinct events over a window of 30, 60, or even 180 days, so we do that with Hive and Hadoop after we gather all of this raw origin data inside Cassandra.
So let's talk about how we actually build a system that's capable of doing both of these, retrospective and real-time analytics, really well.
We use DataStax Enterprise really heavily. Even though we're an early-stage technology company (we only have five full-time people at the moment, and I went full-time on it in August), we've been able to partner with DataStax, and they've been a really great business partner for us, so we've had a tremendous experience with their products.
We originally prototyped on RavenDB, which has Lucene under the hood and a bunch of really cool built-in indexing capabilities. It was a great tool for prototyping, but as soon as Thanksgiving rolled around, we went live with two of the largest apps in the Windows Store, including a number one or two video game, and it completely pegged our servers to the point where reports were running behind by three or four days. Reports that are supposed to be real time, mind you.
So we had a major problem on our hands. It took us about two months to move everything off of RavenDB and onto Cassandra, and we were able to do that because of DataStax specifically, so I can't say enough good things about their technology and what they were able to do for us. So, getting down to nuts and bolts:
let's talk about what you need to do to get a basic analytics cluster up and running on Amazon EC2. If you're a small team, or if you're doing this with your own resources, that's probably the first natural place for you to go and look: how to do it on Amazon Web Services.
From a VM perspective, DataStax actually has a really convenient auto-clustering AMI for DataStax Enterprise. What this will do is, you bring up four or five or six nodes and it will go and negotiate: okay, this node runs OpsCenter, this node has Hive, this node has Solr. It'll set up all the configuration settings for you and automatically form the ring, so a lot of the sysadmin work that would probably take you a week or two on your own is just done for you automatically. That's what I recommend using as your base AMI when you're just getting started. Now, there are some limitations to be aware of.
If you need to do multi-availability-zone or multi-region replication on Amazon, you're not going to be able to get that out of the box with their built-in AMI, so you'll have to roll your own eventually; but for just getting up and running, this takes like 30 minutes to set up, so it's really convenient. In terms of the VMs themselves, we highly recommend using Ubuntu 12.04 LTS as your operating system, which is what will usually ship by default in the Amazon AMI anyway.
This is a really good basic cluster setup, and the way you should design your application around it is to have it talk to the four writable nodes at any given time, and let the Hive and Solr nodes sit off on the side and do their own thing. Then, for actually setting up your first keyspace in Cassandra, these are a lot of the settings that we use in production today, with all due credit to Jay Patel from eBay:
I stole a lot of this from his talk last year, when I was trying to figure out how to set up my cluster for the first time. For consistency, we recommend setting writes to a consistency level of one: just hand it off to the server and bail. If you know that your write load is going to be really high, set it to two.
Three is a good replication factor to have, and then, if you're using the network topology placement strategy, what DataStax Enterprise will allow you to do is add new analytics and Solr nodes somewhat independently of your Cassandra ring. So if you want to distribute your Hive or Hadoop workload among multiple workers, this will allow you to scale that going forward; you sort of have mini clusters going on within your Cassandra ring, if you think about it that way.
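For illustration (this isn't from the talk, and the keyspace and data center names are assumptions), a minimal CQL 3 sketch of a keyspace created with the network topology placement strategy and a replication factor of three:

```sql
-- Minimal sketch; 'markedup_analytics' and 'DC1' are hypothetical names.
CREATE KEYSPACE markedup_analytics
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3
  };

-- In cqlsh, writes can then be issued at consistency level ONE,
-- per the write-consistency advice above.
CONSISTENCY ONE;
```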
And then partitioners: if you're using anything other than the random partitioner, you're probably way too sophisticated for this talk. Ordered partitioners are expert-mode-only sorts of features, and I haven't actually seen very many of them in production, ever, but they're there if you do need them. So let's go ahead and talk next about how we actually work with Cassandra in production; this is from the application's point of view.
With analytics, the write-to-read ratio is going to be astronomically high: you're going to have a thousand, ten thousand, maybe even a hundred thousand writes for every read. So what we tend to do in our setup is take advantage of the fact that Cassandra is, generally speaking, much more performant at handling writes than it is at handling lots of reads. We denormalize our data heavily at the API level before we write to Cassandra, and we use a batch mutation to go and change Cassandra all at once.
So here we have three column families. We have this logs column family on the bottom; think of that as the origin, raw data that we're going to process through Hadoop a little bit later. That's the raw object as it's being sent to the API. And then we have some counters which just roll up daily totals for each of these.
We have one counter that shows the total number of logs for each application on our platform, and then another set of counters that show the number of logs at the different log levels that we support. So how many of these logs represent fatal errors? How many represent just normal trace versus debug info? That sort of thing; those are the normal tracing levels.
So what we do is, a new log hits our API, we denormalize it and put it into a batch mutation, and that batch mutation will atomically make all of these changes throughout Cassandra at once. And the difference in the amount of time it takes to modify one column family versus modifying 40 is actually not that much to the client; it's actually been difficult for us to measure the difference in speed. So you're not really giving anything up by using batch mutations.
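For illustration only (the talk describes using the Thrift batch_mutate API; the CQL below is just a sketch of the same idea, and all table and column names are hypothetical): write the raw log once, then bump the denormalized daily counters together in a counter batch.

```sql
-- Raw origin data: one row per log entry (hypothetical schema).
INSERT INTO logs (log_id, app_id, log_level, message, logged_at)
VALUES (now(), 42, 'Error', 'NullReferenceException at Foo.Bar()', '2013-06-11 17:03:00');

-- Denormalized daily roll-ups. In CQL, counter updates go in their own
-- counter batch, separate from regular writes.
BEGIN COUNTER BATCH
  UPDATE daily_logs_by_app
    SET log_count = log_count + 1
    WHERE app_id = 42 AND log_date = '2013-06-11';
  UPDATE daily_logs_by_level
    SET log_count = log_count + 1
    WHERE app_id = 42 AND log_level = 'Error' AND log_date = '2013-06-11';
APPLY BATCH;
```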
That's the strategy we strongly recommend when you're just getting started. Now, for read strategy, interesting story: most of our developers are .NET guys, which means SQL is, like, in their blood. I actually had to go in and try to disable the CQL 3 driver when we were first getting started, because they kept wanting to gravitate toward their SQL habits while we were first getting used to Cassandra.
This was before we went live with it in production. So what we do, and what I strongly recommend for working with time series data, is use the Thrift APIs. You can use a tool that's called a slice range. Let's say we have this column family here, where we're counting the total number of logs by level every day: how many crashes versus errors versus traces do we have in this 30-day window for this application? What we can do is (is there a laser pointer on here?) grab that whole window at once. There's a value for each day, and if there isn't, we'll go and substitute a zero for it on the chart so it doesn't look weird. This is the sort of read strategy we use, and it's very fast.
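For illustration (not from the talk: the team used Thrift slice ranges, and the table and column names below are hypothetical), here is the equivalent read expressed as a CQL 3 range query that pulls a 30-day window of per-level counts in one round trip:

```sql
-- Hypothetical counter table: one partition per (app, level), one row per day.
SELECT log_date, log_count
FROM   daily_logs_by_level
WHERE  app_id    = 42
  AND  log_level = 'Error'
  AND  log_date >= '2013-05-12'
  AND  log_date <  '2013-06-11';
```

Missing days still need the zero-fill described above, since Cassandra only returns the columns that actually exist.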
So now we're going to talk a little bit about how to design a schema to support this, and what some of the principles and rules are that you need to think about. In terms of a schema strategy, the things I recommend doing: one is to make sure that your row keys are always predictable, i.e. you don't need a separate lookup table to figure out which rows you need to fetch in order to satisfy a read query. When you're trying to get some data out of Cassandra, design your keys in a way where you can always predict what the values are going to be; I'll show you a little bit more about that on the next slide. Then, on top of that, make sure you're leveraging the physical sortability of columns, particularly if you're managing time series data.
This will make your life so much easier. You'll essentially be able to look anything up in constant time, in terms of "I want to start from this column and go to that one," where column A is your start date and column B is your end date, and it'll grab everything in between. If you leverage that property, it makes it really easy to manage time series data inside your Cassandra cluster.
One other sort of gotcha: we totally redesigned our schema to make sure that all the column names are always a really simple type that sorts well. We'll talk about some of the other things on the right-hand side as we go. One recommendation I also make is that when you're just getting started with Cassandra, stick with distributed counters for your real-time analytics. As you get more sophisticated, you'll find that there are some issues with distributed counters; for instance, there's not really any good support for retry logic on them.
Counters are essentially atomic values that you can increment or decrement with individual commands: you can say "increment by this much" or "decrement by this much." But unlike the rest of Cassandra, where you can go and overwrite a value (you can say "I'm just going to overwrite the value of this row" and it'll reset everything, so the operation is fully idempotent), with counters you lose that ability: the counter will get incremented again if you retry an operation that had actually already gone through.
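To make the distinction concrete, a minimal CQL sketch with hypothetical table names, contrasting an overwritable (idempotent) write with a counter increment that is not safe to blindly retry:

```sql
-- Regular column: re-running this statement just overwrites the same value.
UPDATE app_settings
  SET display_name = 'My Game'
  WHERE app_id = 42;

-- Counter column: re-running this statement adds 1 again, so retrying after
-- a timeout can double-count if the original write actually landed.
UPDATE daily_logs_by_app
  SET log_count = log_count + 1
  WHERE app_id = 42 AND log_date = '2013-06-11';
```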
So there are some downsides to using counters, but for the most part they're really easy to set up and actually pretty easy to work with. It depends a lot on the types of data you're measuring; I would never use counters to measure, say, the outcome of a financial transaction. Five minutes left? Wow, I'd better step on it. So, for the first schema here: I'll go ahead and leave this up for you on SlideShare, so you can take a look at the notes.
This is a schema that you'd use for a totally predictable data structure, in this case daily app logs per log level. We know the ID of the app that we're looking for, because that's contained in the request we get from one of our customers, and we know the different log levels that we want, so our row key is totally predictable. We also know the date range: they want everything from this date to this date.
So our Cassandra column family has all the data it needs to satisfy the query sitting in a single row. This row will grow over time, and it could eventually become a really wide row, potentially, if we have millions of hours' worth of data on here; but there's also a naturally limiting factor to that, right? There are only so many days or hours per year, so you can capacity-plan around that really easily.
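For illustration (the actual schema is on the slides; the names below are hypothetical), a CQL 3 sketch of this "predictable row key" pattern, where the partition key is built entirely from values the API request already contains and the days sort as clustering columns:

```sql
-- One partition per (app, log level); one counter per UTC day within it.
CREATE TABLE daily_logs_by_level (
    app_id    int,
    log_level text,       -- 'Fatal', 'Error', 'Warn', 'Trace', ...
    log_date  timestamp,  -- truncated to the UTC day
    log_count counter,
    PRIMARY KEY ((app_id, log_level), log_date)
);
```

A 30-day chart then becomes the single-partition range query sketched earlier.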
The next sort of schema type that we use is what's called a bounded number of unknowns. The scenario we have in this case is, let's say one of our customers wants to know the number of users by version, or in this case the number of error logs by exception type. We have no way of knowing what the exception types are going to be, but we know that the number of different types of exceptions will be relatively low,
you know, maybe under a dozen usually, unless you have a really bad developer, in which case we probably don't want them as a customer. So you're not going to have more than, say, 10 types of exceptions: stack overflow, out of memory, etc. So what we tend to do for this is flip the schema on its side, where the known properties, like the app ID and the error log level, go into the key.
We also have the date as part of the composite key here, so we'll be fetching a much larger number of rows this time around, but all of the unknown values are contained as columns, and we just say, okay, go ahead and grab at least 200 columns for each row, and then we transform it back into a time series at the application level. This is a great technique when you have a relatively small number of unknowns to work with, because it's simple and doesn't require multiple round trips.
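Again for illustration only (hypothetical names), a CQL 3 sketch of that "flipped" layout: the knowns (app, level, day) form the partition key, and the unknown exception types become the sorted columns within each row:

```sql
-- One partition per (app, level, day); one counter per exception type seen that day.
CREATE TABLE daily_errors_by_exception (
    app_id         int,
    log_level      text,
    log_date       timestamp,   -- truncated to the UTC day
    exception_type text,        -- unknown ahead of time, but bounded in practice
    error_count    counter,
    PRIMARY KEY ((app_id, log_level, log_date), exception_type)
);

-- One read per day in the window; each returns every exception type seen that day.
SELECT exception_type, error_count
FROM   daily_errors_by_exception
WHERE  app_id = 42 AND log_level = 'Error' AND log_date = '2013-06-11'
LIMIT  200;
```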
Now, if you have a totally unbounded number of unknown things, we recommend using an index column family for that. So in this case, in our batch mutation, whenever a developer sends us a custom event (usually they have hundreds of these), we go and keep the names of the custom events in a separate column family. We're using the null value pattern for this, where the actual value of the column is null and the column name itself is the value that we want. Then we'll take that data and go and query against the actual column family that has all the time series data for each custom event in it. So it takes two network round trips, which is the one disadvantage of this pattern, but it really limits the number of rows we have to query.
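A minimal CQL 3 sketch of that index column family, with hypothetical names: the event names live as clustering columns with no separate value (the "column name is the value" pattern), and the second round trip hits the time-series data only for the names that were found:

```sql
-- Index column family: one partition per app, one column per custom event name.
CREATE TABLE custom_event_names (
    app_id     int,
    event_name text,
    PRIMARY KEY (app_id, event_name)
);

-- Round trip 1: which custom events does this app have?
SELECT event_name FROM custom_event_names WHERE app_id = 42;

-- Round trip 2: fetch the daily counters for each event name returned above
-- (hypothetical time-series table keyed by app, event name, and day).
SELECT log_date, event_count
FROM   daily_custom_events
WHERE  app_id = 42 AND event_name = 'level_completed' AND log_date >= '2013-05-12';
```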
Each one of these patterns I've shown you takes under 100 milliseconds for us to run in production, in terms of the amount of time it takes for the HTTP request to hit our API and for us to get a response back. Most of the time our clients spend waiting on data is actually spent downloading JSON objects from our server, more than anything else. So it's a really good way to curb some of the complexity.
You guys can look at this on the site. Now, adding Hive and Hadoop to the mix: I want to touch on this before we end our talk. When is Hadoop necessary? That's a question I found myself asking right when we were getting MarkedUp started, not having had a lot of experience with it before, and my answer is this: when you start getting into the 100-gig data set range, Hadoop becomes a more and more valuable tool over time.
There's sort of a minimum data set size you need before you really get any value out of Hadoop, and 100 gigs has been the threshold for us. But there are other things you should think about, from a requirements point of view, when deciding whether you need Hadoop. If consistency is really important to you, Hadoop is a great tool for the job, provided speed isn't a requirement. Hadoop is slow. It is really slow; you're not going to get real-time results from it.
It might take, you know, 30 minutes for it to go and produce a query result for you, but the results will be consistent and will touch all the data you needed. And the last thing is, if you have really complex query pipelines (for instance, counting the number of distinct items that fall under these different cohorts, and so on), Hadoop is actually a great tool for doing that. So if you need a really good MapReduce pipeline, Hadoop's the perfect tool. But we're lazy,
so we like to put Hadoop on easy mode, and we call that Hive. DataStax Enterprise Edition has bindings for Pig and Hive; Pig scared us, so we decided just to go with Hive instead, since we're all SQL developers. Hive was developed by Facebook originally, just like Cassandra, and it was meant to be a data warehousing technology that allows some of their non-technical people to go and get information about Facebook's users.
What's convenient about it is that there's actually a lot less deployment overhead that goes into running MapReduce queries if you're using Hive or Pig instead of raw Hadoop. You don't deploy any code to your analytics nodes, and if you want to set up a recurring Hive job, it's as easy as running a cron job that just invokes something on the command line. So it's really easy to deploy and get up and running.
To give you a sample of what the workflow looks like: let's say you have this column family, the logs column family, in Cassandra. If you want to start analyzing it in Hive, what you do is go ahead and create what's called an external table. I'm not going to read all the syntax on there, but that's actually the legit syntax for mapping a Cassandra column family with dynamic columns into a Hive table.
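The slide itself isn't reproduced in this transcript. As a rough, hedged approximation only (the storage handler class, SERDEPROPERTIES, and names below are assumptions that varied across DataStax Enterprise versions; check the linked slides or the DSE documentation for the exact syntax), an external-table mapping of that era looked something like this:

```sql
-- Approximate sketch; handler class, properties, and names are assumptions.
CREATE EXTERNAL TABLE logs (
    row_key     string,
    column_name string,
    value       string
)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    'cassandra.ks.name'         = 'markedup_analytics',
    'cassandra.cf.name'         = 'logs',
    'cassandra.columns.mapping' = ':key,:column,:value'
);
```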
A Hive table is very much a one-to-one mapping with what's in Cassandra, but it has different data types, so you have to account for some translation there; for all intents and purposes, though, it's pretty easy to work with. And you know what the best part is? Hive automatically fetches data back from Cassandra as it updates; that mapping is sort of perpetual. In other words, you don't have to go and run jobs to re-insert new data or anything else. It just runs and works. And then here's sort of what the query syntax looks like when we're getting data back out.
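The query on the slide isn't in the transcript. As a generic stand-in only (hypothetical table and column names, assuming the column family was mapped into Hive with named columns), a HiveQL query over such a table might look like:

```sql
-- Count error logs per app over the Hive-mapped Cassandra data.
SELECT app_id, COUNT(*) AS error_count
FROM   logs
WHERE  log_level = 'Error'
GROUP BY app_id
ORDER BY error_count DESC
LIMIT  20;
```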
As you can see, it's virtually identical to what you'd do at, like, the MySQL command line if you wanted to run it there. So, just some final tips and tricks about Hive. We're going to have to skip the Solr part, unfortunately. I'm sorry, guys; I let you down.
So, if you're reading and writing between Hive and Cassandra, make sure that only one of those two is doing the writing to any given column family. In other words, if you have a column family that's being updated by your application server, which writes directly to that Cassandra column family, don't let Hive write to it as well; otherwise, bad things will happen. So, for instance, our user retention and our average-time-spent-in-app reports: all of that data gets written back to Cassandra by Hive, but into dedicated column families in Cassandra that our application never writes to under any circumstances. We found that's the best way to make sure you don't have one service overwrite the other. And then the second tip is,
if you're trying to test Hive queries, you can use sampling. So, for instance, instead of looking at the entire data set, you can look at just the past 30 days' worth of data. That allows your Hive job to complete in, like, 10 minutes instead of two hours, depending on how big your data set is.
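For illustration (hypothetical column names, and assuming the timestamp is stored in a string-comparable ISO format), the simplest form of that trick is just constraining the query to a recent window so the job scans far less data:

```sql
-- Test the query shape against roughly 30 days of data instead of the full history.
SELECT app_id, COUNT(*) AS error_count
FROM   logs
WHERE  log_level = 'Error'
  AND  logged_at >= '2013-05-12'
GROUP BY app_id;
```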
So those are our tips and tricks for working with it. All right.
[Answering an audience question:] So, it depends on what roll-ups you want. Right now, the way we do date columns in our data set is that they're actually timestamps that are marshaled to the nearest UTC day. You could do it hourly, or even down to the minute if you wanted to, and that technique will still work. The thing to bear in mind is that when you're doing that column slice from one day to another, it will potentially get everything in the middle, so what I recommend is having different column families for different granularities: have one that's a daily roll-up, one that's hourly, one that's by the minute if you need it down to that level. That way you have an expectation of what the volume of data is going to be.
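A minimal CQL sketch of that suggestion (hypothetical names), adding an hourly column family alongside the daily one sketched earlier so that each query's data volume stays predictable:

```sql
-- Hourly roll-up: same shape as the daily table, keyed down to the hour.
CREATE TABLE hourly_logs_by_level (
    app_id    int,
    log_level text,
    log_hour  timestamp,   -- truncated to the hour instead of the day
    log_count counter,
    PRIMARY KEY ((app_id, log_level), log_hour)
);
```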
Any other questions? Interesting anecdotes? Yes. That's totally not fair; the answer is to use a faceted search in Solr. So, how do you count a million distinct items in real time? The answer is: use Solr.
Actually, I'm going to cheat on this one. Basically, you go ahead and define a Solr index (I'm not going to get through the syntax here), but essentially you can use faceted search, and one of the things Solr outputs is the number of records that match, and it's able to do this very quickly, in memory, if your indexed documents are really small. We actually used this technique on RavenDB to count millions of things (that uses Lucene under the hood) and it worked really well, and we're getting the same thing set up on Cassandra in production today, actually, so I'm excited about it.
But yeah, in development it's worked great. So that's how we do that. Any other questions? Yes: how do I deal with wide rows in Solr and Cassandra? Okay. The answer is, with Solr, you don't; you're supposed to run away screaming from wide rows if you're using Solr indexing. With Cassandra, wide rows are pretty straightforward: we basically know exactly which column we want to start with, and we just grab the slice that we need.