Description
Speaker: Andrew Noonan, Developer at Gnip
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-dude-wheres-my-tweet-taming-the-twitter-firehose-by-andrew-noonan
Gnip ingests and must serve out hundreds of millions of social activities every day, and social platforms are only growing. This makes the scalability of applications essential for Gnip. Enter Cassandra. Problem solved, right? Not exactly: Gnip's relationship with Cassandra was not all rainbows and unicorns. In this session we will walk you through why we began looking at Cassandra as a data store in the first place and the valuable lessons we learned with Cassandra that have made it an invaluable part of our infrastructure.
Welcome, everybody, to "Dude, Where's My Tweet: Taming the Twitter Firehose." I'm Andrew Noonan, a software engineer with Gnip. I'm controlling two laptops, so bear with me.
So first I'll get into who I am and what I'm going to talk about today. My name is Andrew Noonan and I'm a software engineer at Gnip. I'll talk a little bit about Gnip, what we do, what our business model is, and how Cassandra fits into all that; then about how Gnip and Cassandra first got together and what problem we used it to solve; a little bit about rainbows and unicorns; and hopefully there will be some time for questions at the end of it all. So who is Gnip? Well, Gnip is "ping" spelled backwards.
That has a lot to do with how the company started: we tried to reverse the pinging of social media APIs. We provide a centralized streaming HTTP connection for social media APIs. We're in beautiful Boulder, Colorado. It's a pretty awesome place with a lot of technical companies around.
All these different publishers present us with a unique problem. Twitter is a large publisher in terms of the number of activities we see, but each new publisher we bring on might have a different format or large payloads that we have to deal with, so there's a lot of variability in the scale we handle. And social media publishers are only growing. Social networks are only growing, so we as a business have a large need for scalable solutions.
Speaking to Twitter specifically, the Twitter firehose as it stands today is about five to seven thousand tweets a second on average, and we regularly see spikes of up to 20,000 tweets a second. This right here is a spike we saw recently that was up to, I think, fifteen thousand messages a second, completely unplanned; we didn't know it was coming.
Twitter just gets customer queues or user queues backed up somewhere in their system, and all of a sudden they flush and we see a downstream event. We need to have our systems planned so we can handle these events smoothly and not drop any messages. There are planned events that we see all the time. Michael Jackson's death was not very planned, but it was a social media event. The Super Bowl, for instance: last year that was something like twenty thousand messages a second sustained for several minutes, so we definitely have to build scalable solutions that can handle all of that.

We actually hold the entire Twitter archive in S3, and we save all of these messages into S3 as they come in. We ingest something like two terabytes of data a day; I think we have a full petabyte up in S3 at this point. We've got a mix of cloud and dedicated hardware. The cloud is a great solution for these scalable apps we're writing, but every now and then you need a little more horsepower out of a single machine.
So we've actually made some investments in that area. We serve over ninety-five percent of Fortune 500 companies with social data, so these are enterprise customers paying a lot of money for this data. They need reliability, they need availability, and they don't really tolerate systems going down, that kind of thing. And speaking to our scale some more, we deliver 120 billion social activities every month, and it's just growing every day.
We also require redundancy and reliability. Our customers are paying quite a lot of money for this, they've got big dollars riding on it, and so they require a reliable solution with redundancy behind it. There are always going to be failures, so we build redundancy and reliability into everything we look into.
Obviously that relates to Cassandra, and availability is very important to us. As we encounter real-time problems we need available apps, so we began searching for different solutions to our data problems, and we were looking at Cassandra. I was on the team tasked with coming up with the new technology we were going to use, something scalable that would meet all these requirements, so I'll tell you about how Gnip and Cassandra first met.
The system in question was seeing a regular, sustained write throughput of about 500 to 700 messages a second, and we were starting to see regular spikes up to 1,500 messages a second. We were also building a new product on top of this that was definitely going to increase the read requirements on the system. We were thinking about 250 clients at any given time; that has now scaled to more like 500 clients.
So it was very clear to us that the solution we had in place right then, which was not a scalable solution, wasn't going to work in the long term. We needed to come up with something else. These were the specs we were looking at. This is actually a graph from back then, and just as a for instance, we saw this random spike of 3,300 messages a second and we actually didn't know what the hell was going on upstream.
At the time we talked to the publisher and finally found out that there was a concerted effort upstream, batch processes that were causing these bursts of messages to come down every now and then, and they told us it was only going to keep growing. So we were like, all right, we should probably do something about this. So, enter Cassandra. Cassandra obviously seems to fit a lot of the things I've been talking about.
It definitely has high write throughput. We were going to continue to see growing volumes, both in this first use case, where we were really just inserting metadata, and later, ideally, when we would be shoving every activity we saw into Cassandra, so high write throughput was definitely covered. It's definitely a scalable solution, so as we grew out this new product that was going to throw a big read load onto the system, we would be able to scale out as we saw more customers come on.
So that was definitely a check for us. It's highly available, so as we see the network blips that regularly happen between our own nodes, our customers, and our system, we would be able to stay available and serve any requests coming in. And it's persistent, which is very important to us. We hold historical records of social activities, and it's very important that we persist those, because if we don't persist them, no one else really is, and we can't really get them back.
It's very hard for us to go back to publishers and say, hey, we were down for a little bit, can we get that back? But that regularly happens with our customers: they get disconnected from a real-time stream and they want that data back. They come back to us and say, hey, can we get it back? And we're in the position to say, yeah, you can; you'll have to pay us for it, though. So, Cassandra: is it all rainbows and unicorns? It fit these specifications perfectly.
It was a brand new technology that was everything we wanted it to be. We stood it up, starting with, I think, two nodes, maybe, in AWS. I set it up locally first and tested it, and it seemed cool by all accounts, so we put it in AWS at two nodes and started writing to it. The write throughput was pretty cool, but we figured we should grow to four nodes, so we built it out to four nodes, wrote a little test app, and threw as much mock data at it as we could.
So
let's
talk
about
the
road
bumps
that
we
did
see
so
first
we
stood
it
up.
Like
I
said
I
did
some
initial
testing.
It
was
mostly
with
mock
data
of
some
apps
that
we
just
kind
of
wrote
up
to
test
Cassandra,
but
then
we
actually
threw
it
into
our
real
system.
You
know
first
in
staging,
obviously
not
in
production
but
threw
it
up
there
and
started
writing
to
it.
Things
looked
alright.
We
did
not
have
any
maintenance
put
in
place.
This
is
a
bad
idea.
Because we started to load test the hell out of it, we started seeing dropped mutation messages. We read up on that, and it was like, yeah, you should probably have maintenance in place to make sure consistency actually happens in case messages are dropped between your nodes. So we put maintenance in place, and all of a sudden we saw a 2x growth in data on disk and had no idea why.
So we started getting kind of concerned about that and looking into why our data might have all of a sudden ballooned on disk; we had totally thought that, if anything, it was going to shrink. And obviously Cassandra is scalable, right? We just add more nodes, it's going to split the key space in half, and we'll have half the data on the disks. Obviously that's not really true either; we started seeing lots of data being streamed between the new nodes and the old nodes.
It wasn't really working out exactly as we planned, so we started to panic and freak out. Then we moved beyond that and started to think about whether we had made the wrong choice. Was Cassandra really the right choice for our datastore? Should we have gone with a different technology? And the answer is no. We had made a good choice; we just had to fully understand what we were dealing with. So we started to turn to the community.
A
It
was
around
us,
you
know,
ask
other
companies
in
the
area
that
it
we
know
use
Cassandra.
We
have
friends
in
the
tech
community,
so
talk
to
them.
Had
they
seen
these
problems
that
we
were
having
some
of
them
had
we
started
talking
to
datastax.
You
know
we
were
getting
a
little
bit
of
support
from
them.
We
started
to
really
hammer
them
pretty
hard
with
questions.
A
So
we
took
a
step
back
and
started
to
to
really
analyze
what
was
going
on.
We
found
out
that
you're
right
pattern
matters
just
as
much
as
your
read
pattern.
So
you
know
when
we
first
started
looking
at
at
Cassandra,
we
were,
we
heard
you
know
design
your
data
schema
around
how
you're
going
to
access
it
right.
It's not a relational database, so you want to be able to access it in a predictable manner if you want performance out of it, which was very important to us. But we found out later that your write pattern matters quite a bit because of your compaction strategy. We were using size-tiered compaction, and it turns out we were updating rows all across the cluster, all across the key space, on a regular basis.
If you keep updating a row, you will start to see fragments of that row across several SSTables, and then when you need to reconcile that, even just to do a read, you might have to read from several SSTables. And if you need to stream that data, or you want to do a compaction, with size-tiered compaction you need up to double the size of your largest SSTable free on disk. If those SSTables are starting to get fairly large, you need quite a bit of scratch space.
We learned that that was the problem we were seeing: when we put maintenance in place, a lot of compactions were taking place, and our data was essentially doubling on disk because it was trying to reconcile all of it; there were updates all across it. We also found out that how much data you store per node is extremely important when you want to grow your cluster or do any of these repairs or maintenance.
The amount of data you have on one node is going to greatly affect the amount of time it takes for that to happen. When we put maintenance in place, we were seeing that it could take up to a week to run maintenance around the entire cluster, because we needed performance out of the cluster at all times; we couldn't just take the whole cluster down to do repairs or whatever else we wanted to do.
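As a rough sketch of what that maintenance looks like in practice, a rolling repair runs against one node at a time so the cluster keeps serving traffic while each node's primary range is reconciled. The node addresses here are made up for illustration, and the script just shells out to nodetool.

```python
#!/usr/bin/env python3
"""Rolling repair sketch: repair one node at a time so the cluster stays
available throughout. Host addresses are hypothetical."""
import subprocess

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

for node in NODES:
    # -pr repairs only this node's primary range, so one pass over every
    # node covers the whole ring once instead of replication-factor times.
    print(f"repairing {node} ...")
    subprocess.run(["nodetool", "-h", node, "repair", "-pr"], check=True)
```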
So, the knobs we realized we could start tuning: one big one was the compaction strategy, now that we had learned what was going on with the one we were using. We decided not to go with leveled compaction, because we were worried it would cost us throughput; if that's a concern for you at all, it can put extra strain on your cluster. We decided not to switch; just understanding what was happening underneath was good enough for us.
The compaction throughput in megabytes per second was another knob we realized could really speed things up for us. The compactions were taking quite a while. It's recommended that it be 16 to 32 times your write throughput; out of the box I think it's 16 megabytes per second, and we were seeing our writes at somewhere around eight megabytes a second, so we definitely had to up it. We upped it to, I think, 150, and we saw compactions going smoothly.
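For reference, that knob can be changed on a live node without a restart; the persistent setting is compaction_throughput_mb_per_sec in cassandra.yaml. A minimal sketch, assuming nodetool is on the PATH and using the 150 MB/s figure mentioned above:

```python
#!/usr/bin/env python3
"""Raise compaction throughput on a running node, then check the backlog."""
import subprocess

# Default is 16 MB/s; with ~8 MB/s of sustained writes, compactions were
# falling behind, so the limit gets raised to roughly 150 MB/s here.
subprocess.run(["nodetool", "setcompactionthroughput", "150"], check=True)

# Show pending compactions to see whether the backlog is draining.
subprocess.run(["nodetool", "compactionstats"], check=True)
```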
Our reads were not getting too greatly affected, so we were happy, it made our cluster happy, and things were humming along. Later down the line we came up with another use case for Cassandra: an n-day archive of Twitter data. We were literally planning to throw the Twitter firehose at it, and like I said, we regularly see about five to seven thousand messages a second in the firehose, but with huge spikes up to 20,000 messages a second.
We decided we would do dynamic column families. Dynamic column families just means creating and expiring column families as you go. We bucketed them into basically three-hour chunks, and we wouldn't use the, I forget what the term is, the TTL: you can set a lifetime on your data in Cassandra and it will basically handle removing that data at some point, but your storage reclamation might take some time.
A
We
would
actually
simply
we
would
simply
go
ahead
and
just
remove
the
column,
family
and
then
three
hours
later
delete
the
data
on
disk.
We
add
an
ex
FS
have
an
ex
FS
file
system,
so
the
deletes
on
disk
or
blazingly
fast,
and
so
things
seem
to
be
looking
pretty
good
here.
You
know
we
weren't
doing
we
weren't
updating,
rose.
Ninety-Five
percent
of
our
rows
were
written
to
once,
and
that
was
it
and
we
were
going
to
be
rolling
rolling
these
column,
families
off,
and
so
that
would
save
us.
A
You
know
essentially
a
constant,
a
constant
amount
of
space
taken
up
on
disk.
Our
reads
were
going
to
be
much
lower
than
our
rights,
for
you
know
we're
throwing
the
fire
hose
at
it,
but
the
the
usage
you
know
at
the
time
of
design
was
kind
of
undefined,
but
we
were
able
to
scale
that
we
knew
it
would
be
kind
of
one-off
reads
at
user
requested
rather
than
anything
streaming
at
it.
So
so
things
were
looking
perfect
for
us,
everything
looked
fantastic
and
we
thought
we
had
already
been
through
the
trenches
with
Cassandra.
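A minimal sketch of that bucketing idea, written against the DataStax Python driver with CQL tables standing in for the Thrift-era column families; the keyspace, table names, schema, and retention numbers are illustrative rather than Gnip's actual setup:

```python
#!/usr/bin/env python3
"""Time-bucketed tables: write into a table per three-hour window and drop
whole buckets as they age out, instead of relying on per-row TTLs."""
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster  # DataStax Python driver

BUCKET_HOURS = 3
RETENTION_DAYS = 20  # n-day archive; 8 three-hour buckets per day

session = Cluster(["127.0.0.1"]).connect("archive")  # hypothetical keyspace

def bucket_name(ts: datetime) -> str:
    """Table name for the bucket containing ts, e.g. tweets_2013061512."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    start -= timedelta(hours=start.hour % BUCKET_HOURS)
    return "tweets_" + start.strftime("%Y%m%d%H")

now = datetime.now(timezone.utc)

# Make sure the current bucket exists before writing into it.
session.execute(
    f"CREATE TABLE IF NOT EXISTS {bucket_name(now)} "
    "(tweet_id bigint PRIMARY KEY, payload text)"
)

# Drop the bucket that just fell out of the retention window; dropping a
# whole table reclaims disk much faster than waiting on row-level expiry.
expired = bucket_name(now - timedelta(days=RETENTION_DAYS))
session.execute(f"DROP TABLE IF EXISTS {expired}")
```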
So, a little bit about the current cluster we have backing this product. We've got 16 nodes in the cluster, a replication factor of three, and each node holds about 2.5 terabytes of data; I'm trying to remember if that includes the replication factor or not. But we had 40 billion keys, which turned out to be an extremely important number that we hadn't really looked into, because as we were growing it, like I said, it was an n-day archive: at first we scaled it up to 15 days, then to 20 days.
Everything was looking fine, and then every time we tried to get past 20 days we'd start seeing GC thrashing. We were trying to figure out why, so we started to look into the memory consumption aspects of Cassandra. I don't know if you guys were at the keynote earlier when Jonathan spoke.
He spoke directly to these things actually being addressed in future versions of Cassandra, but there are a couple of key components. When a read happens, it first checks a bloom filter, and the bloom filter is basically a piece of memory that will grow based on the data you have on disk. The partition key cache is a constant-sized set of keys that lives in memory, and then the partition index is another piece of memory that will grow with your data on disk.
We knew about our key cache and our row cache, and we were tuning those to appropriate degrees, but we really hadn't taken much of a look at the partition index or the bloom filter, specifically the bloom filter; we weren't very concerned about it. The bloom filter false-positive chance, I think it's called, is a setting, and out of the box I think it's like .007 percent or something, so we hadn't really read into changing it. But it turns out that with 40 billion keys, and I think Jonathan actually had the stat earlier today, it's something like 12 gigabytes of memory per billion keys, so 40 billion keys was a significant amount of memory we were holding up just for the bloom filter. The next one is the index interval. Basically, on startup,
Cassandra will look through your keys, grab every nth key, and pull that index into memory for each SSTable. That way, when you go to do a read, it can basically say: okay, you're looking for key 10, and I know where key 5 is in this SSTable, so it can seek directly there on disk and then scan forward from there for your key. Out of the box I think that one is 128, so every 128th key is brought up into memory. Again, we had 40 billion keys.
We were looking at 315 million keys in memory, so we bumped that up to every 512th key and we were down to 78 million, which was a lot better on our memory. With the bloom filter, we brought it up to a .05 percent chance of a false positive, which is fine for us, because we can determine on the client side whether or not something looks like a valid key; we can tell whether it's a well-formed tweet ID.
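For concreteness, this is roughly how those two knobs get set. The bloom filter false-positive chance is a per-column-family property; the index (sampling) interval was a cassandra.yaml setting in the releases of that era and later became a per-table property, so both forms are shown. Table name and values are illustrative, not the exact production settings:

```python
#!/usr/bin/env python3
"""Sketch: loosen the bloom filter and sample fewer index entries to cut
heap usage. Table name and values are illustrative only."""
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("archive")  # hypothetical keyspace

# Accept more bloom-filter false positives in exchange for less memory;
# a false positive only costs an extra disk seek, and the client already
# rejects keys that are not well-formed tweet IDs.
session.execute(
    "ALTER TABLE tweets_2013061512 WITH bloom_filter_fp_chance = 0.05"
)

# Older releases: sample every 512th key instead of every 128th by setting
#   index_interval: 512
# in cassandra.yaml. Newer releases expose it per table instead:
session.execute(
    "ALTER TABLE tweets_2013061512 "
    "WITH min_index_interval = 512 AND max_index_interval = 2048"
)
```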
Given our write pattern, updating rows all over the cluster, it wasn't a perfect use of Cassandra, and our read pattern was fairly intensive; we were throwing a lot of clients at it. We thought it would scale out, and it has; it definitely solved our problems once we were able to tune it enough. But you're not always going to fit the perfect description of how Cassandra should be used.
But if you learn how it works underneath the covers, and you use the, you know, 400 configuration options it has to tune it to exactly your use case, you can really get a lot of horsepower out of it. So explore those options: there's that huge configuration file, there are a lot of startup options for Cassandra, and you can set a lot of properties on your column families as you create them. After we figured out what those settings really meant and what we could do with them,
we were able to leverage the power of Cassandra more than we ever thought we could. Also, understand the consequences of your choices. When we decided to just start running repairs and throwing new nodes at the cluster, we didn't quite understand why it was trying to stream almost all of the data from the first node when we thought it was just splitting the key space; it turns out it was just trying to reconcile the data after it computed the Merkle tree.
We didn't really realize why at the time, though. Another big one we hit was: keep your staging environment and your production environment identical if at all possible. There are a lot of things you might not think matter. You might think, if I have half the nodes and I keep just half the data of my production environment, I should see the same things happen at the same times, and that's not always the case.
Keeping the staging environment and the production environment identical will help you run stress tests that exactly mimic what you'll see in production when your customer, or your publisher, or something like that stress-tests your system one day. And that kind of wraps it up. If we have any questions, I can answer them.