From YouTube: C* Summit 2013: How Not to Use Cassandra
Description
Speaker: Axel Liljencrantz, Backend Developer at Spotify
Slides: http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252
At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
We accomplish this through a back-end of 70 or so services. Most of these are written in Python; a few were written in Java, one in C++, but yeah, we're mostly a Python and Java shop, as pretty much anyone who isn't a spectacular failure is doing.
We write many small, simple services. We try to keep them responsible for one thing and one thing only, and sometimes we fail, and that's when we get really badly functioning services.
So when we started out, we were pretty much exclusively a Postgres shop, and we had a really good experience with Postgres. It's a system that works very, very well, but once we started hitting really large scale, some time around when we moved to the US, we started to run into a lot of problems.
Postgres has, compared to say Cassandra, very poor cross-site replication support. If you get a network outage or a net split or something, then getting back in sync again is a painful process that often requires manual intervention, and that's not something you want to do on a regular basis.
And of course, if you need more than one node in your cluster to handle the load, then sharding throws most of the relational advantages out the window. So for certain workloads we felt that we couldn't really use Postgres anymore.
So a little bit more than two years ago we started playing around a bit with Cassandra, and by now we have something like 24 services that use it, 300 Cassandra nodes, and we store something like 50 terabytes of data; so overall, a pretty heavy Cassandra user. Back when we started using Cassandra there wasn't that much information available about how to use Cassandra, how to design your schemas, stuff like that, so we really managed to screw things up a lot, and that was quite boring.
I mean, most of you guys here don't have the same good excuses we do. There was this really good talk this morning by Patrick from DataStax about modeling; if you didn't manage to catch it, you should see it online, because everyone needs to know these things. But I'm also going to be talking a little bit about what not to do. So this brings us to the actual subject of our talk, and the first part: how not to use Cassandra.
This sounds like a pretty neat feature, right? Should be no problem doing this. Anyone who would? Okay. So it's an interesting property of read repair that it is performed across all data centers. If you, say, have the read repair chance set to one in a multi-DC setup, every single read request will send something like half a dozen requests across the Atlantic Ocean, and that can get boring quite quickly.
So we have made this mistake multiple times, and it was equally boring every time. In Cassandra 1.1 there is finally a DC-local read repair configuration item, so that you can configure read repair to only happen inside of your data center, which is probably the only type of read repair that should be allowed to exist.
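To make the multi-DC failure mode concrete, here is a minimal Python sketch of the read-repair fan-out being described. It is a toy model, not Cassandra's actual code: the replica list, the chance parameter (standing in for the read repair chance) and the dc_local flag (standing in for the DC-local setting from Cassandra 1.1) are all illustrative.

    import random

    # Toy model of a coordinator in "eu" choosing which replicas a read touches.
    REPLICAS = [("eu", "eu1"), ("eu", "eu2"), ("eu", "eu3"),
                ("us", "us1"), ("us", "us2"), ("us", "us3")]

    def read_targets(local_dc, chance, dc_local):
        # Normally one replica is enough to answer the read.
        targets = [r for r in REPLICAS if r[0] == local_dc][:1]
        if random.random() < chance:  # read repair triggered
            if dc_local:
                # Only compare replicas inside the coordinator's own DC.
                targets = [r for r in REPLICAS if r[0] == local_dc]
            else:
                # Global read repair: every replica in every DC is queried,
                # so chance=1.0 drags each read across the Atlantic.
                targets = list(REPLICAS)
        return targets

    print(read_targets("eu", chance=1.0, dc_local=False))  # six nodes, two DCs
    print(read_targets("eu", chance=1.0, dc_local=True))   # three nodes, one DC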
Okay, so another cool thing in Cassandra is the row cache. You guys all know about the key cache, right? That's where we store individual parts of the indices to speed things up, so that you don't have to go to the disk, look at the index file, and so on. Well, the row cache is even cooler: you can store an entire row there, so that if you hit the row cache, you will not touch the SSTables, not touch the disk. You can fulfill a request entirely out of memory. This was intended as something of a memcached alternative.
If you guys are putting memcached in front of Cassandra for performance purposes, you can just stop doing that, switch to enabling the row cache, and everything will just be faster, and you won't have to worry about consistency and cache invalidation, because that will just be handled transparently by Cassandra. Pretty awesome, right? Right? What do you say? Well, the catch is that the row cache always operates on entire rows.
That means that if you're doing a get request on a single column, that read, if it's a cache miss, will be promoted into a full row slice, and if you have big rows, that means your performance will probably go down by at least one, maybe two, orders of magnitude, and that can be bad. And to make things even more interesting: whenever you write to a row, the cached copy is invalidated, just thrown out the window. So in conclusion: never, ever, ever use the row cache
unless you know all of your data usage patterns, or you will just destroy your performance. We have accidentally enabled the row cache every once in a while on a few services and just wondered: why the hell did the performance go down like that? So that's a good thing to know.
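The trap is easier to see in code. Here is a minimal sketch of the row cache semantics just described, as a toy Python model (not Cassandra's implementation): rows plays the part of the on-disk SSTables, and row_cache caches whole rows only.

    rows = {}       # key -> {column: value}; stands in for the SSTables
    row_cache = {}  # key -> entire cached row

    def get_column(key, column):
        if key in row_cache:  # cache hit: served entirely from memory
            return row_cache[key].get(column)
        # Cache miss: the single-column get is promoted to a FULL row read,
        # so a huge row is read from disk just to serve one column.
        row_cache[key] = dict(rows.get(key, {}))
        return row_cache[key].get(column)

    def write_column(key, column, value):
        rows.setdefault(key, {})[column] = value
        # Any write throws the whole cached row out the window, so the next
        # read pays the full-row promotion cost all over again.
        row_cache.pop(key, None)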
So another cool, useful feature in Cassandra is compression. This is really nice: with Cassandra you can just toggle a switch and Cassandra will start compressing all of your SSTables. The compression algorithm, in this case Snappy, is really super fast: you can compress and decompress like a hundred times more data per second than Cassandra can deliver, because there's almost no overhead. So it's really super fast and efficient, and you could probably just enable it, and, well, you will have less data on disk.
But when you use compression, you pay for it on the read path. You will not notice this if you are using something slow, something that hits the disk most of the time, like a slow data set. But if you are actually handling thousands of requests per node per second, or even hundreds, you will get a quite severe performance degradation, and you will once again be a sad panda.
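For a feel of the Snappy trade-off itself, here is a small Python example, assuming the python-snappy package is installed. The point is that compression is cheap per byte, but on a node doing thousands of reads per second it still sits on every single read path.

    import snappy  # pip install python-snappy

    payload = b"some column data " * 1000
    compressed = snappy.compress(payload)
    print(len(payload), "->", len(compressed))  # big reduction on repetitive data
    assert snappy.decompress(compressed) == payload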
Okay, so that's it for how to misconfigure Cassandra. I'm now going to move into different, interesting ways to misuse Cassandra.
One tool you should know about is nodetool cfhistograms. It's part of nodetool, and what you can do with nodetool cfhistograms is basically find out: how many SSTables are my reads touching? So whenever I do a read, how many different seeks does the disk probably have to make, and so on. You can also find the time distribution and a bunch of other really useful things.
So I would strongly recommend to anyone who has to do any kind of performance work with Cassandra to really get to know cfhistograms, because there is a huge amount of very useful information tucked into that window that kind of looks like the Matrix, just a bunch of numbers. So know it and love it.
One thing that we have done at Spotify is a patch to make sure that for every single SSTable, you store the smallest and the largest column name written to any row in that SSTable. This means that specifically for things that are, or look like, time series data, or where you have very sharp time partitioning, you will get awesome performance increases, even if you're using size-tiered compaction: you will see that every single read you do will only have to touch the SSTables where your data is actually located.
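Here is a sketch of the idea behind that patch, as a toy Python model with made-up SSTable metadata: each SSTable remembers the smallest and largest column name it contains, so a slice query can skip SSTables whose range cannot overlap the requested columns. For time series data, where column names are timestamps, that turns most reads into a single-SSTable lookup.

    # Each SSTable tracks the min/max column name ever written into it.
    sstables = [
        {"name": "sst1", "min_col": "2013-01", "max_col": "2013-03"},
        {"name": "sst2", "min_col": "2013-04", "max_col": "2013-06"},
        {"name": "sst3", "min_col": "2013-07", "max_col": "2013-09"},
    ]

    def sstables_for_slice(start_col, end_col):
        # Only consult SSTables whose [min_col, max_col] overlaps the query.
        return [s["name"] for s in sstables
                if not (s["max_col"] < start_col or s["min_col"] > end_col)]

    print(sstables_for_slice("2013-08", "2013-09"))  # ['sst3']: one table, one seek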
So, I promised my coworker Marcus that I wouldn't make anyone applaud for him, but Marcus wrote that, and it's a really useful patch, so don't applaud or anything, but he's awesome. Alright!
So another thing that we have encountered is the problem that we are one of the few Cassandra users with big traffic and big sets of data spread across the Atlantic, or across a large ping chasm.
Basically, there aren't many cross-continent Cassandra users, which means that we have been more or less on our own when it comes to a bunch of issues that you only encounter there. We wrote a patch, and when I say "we", I don't mean the majestic plural this time, I mean Marcus wrote a patch, to disable TCP no delay. That way, when you are writing very quickly, the TCP network stack will actually batch up multiple writes, and that reduces the number of packets you send across the ocean.
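The knob itself is an ordinary socket option. Here is a minimal Python illustration (not Spotify's actual patch, which applied this inside Cassandra's inter-node connections):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # TCP_NODELAY = 1 disables Nagle's algorithm: each small write is sent
    # immediately in its own packet (good for latency on a local network).
    # TCP_NODELAY = 0 re-enables Nagle, letting the kernel coalesce many
    # small writes into fewer, fuller packets, which is what you want on a
    # high-latency cross-Atlantic link.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)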
Who doesn't love this image? I mean, it's beautiful. Now, one thing that should be noted here about Cassandra: I'm being kind of mean, saying mean things about how much Cassandra sucks and about how stupid we are and how bad we are at using Cassandra. But an important thing to note is that we have never, ever lost any data with Cassandra. We have hit every major bug that Cassandra ever had,
probably. We have had so many pains, and back in the 0.6, 0.7 days Cassandra was really buggy, and still we never lost any data. That is probably due to the immutability of the SSTables and the append-only way that compactions and everything work. Because of these properties, Cassandra is really something that, in spite of being a new and immature product compared to things that have been around for a few decades, like Oracle or Postgres or MySQL or whatever,
we have still decided to trust with our data, and we haven't ever had to regret that decision. Anyway, let's go back to being mean. The upgrade from 0.7 to 0.8 was the first major upgrade of a production cluster we did, and everyone on the internet said that rolling upgrades just work: take down your cluster, upgrade it, start it back up again, and everything will be super awesome. It was not. There was, in fact, a bug
that meant it failed to start up, and nothing worked. So we downgraded, we located the problem, we wrote a patch, we submitted the patch, and everything worked. You would expect that, given that we upgraded to 0.8.6, someone else would have already discovered this problem and ironed it out, and that turned out, obviously, to be the wrong conclusion.
So our takeaway was to always try rolling upgrades in a testing environment before doing them in production, and never believe what people tell you on the Internet. By the way, if you are watching this on YouTube or some kind of streaming over the Internet, you can make an exception for me, because I am a very trustworthy person. Anyway, the next upgrade we did was from 0.8 to 1.0, and, wise from our previous experience, we upgraded first in our test environment.
Everything worked fine, and then we tried it in production, and everything worked fine. We learn from our mistakes, don't we? Except for the last cluster: once we upgraded there, all of the data was gone. No keys, no nothing. It was just empty, just gone.
So what happened was that for very large clusters there was a bug in the bloom filter code that made Cassandra think that the bloom filter was somehow empty, and if the bloom filter is empty, that means there's nothing there. So for every read you would do, you would check the cluster and it would say: no, no data there. And so everything was just gone. We did a data scrub, and the data was back, and everything was fine and dandy, and we were on the road again.
After all of the previous experiences, we did the test with production data in the testing environment, everything worked, and then we did exactly the same steps all over again. I notice people giggling, so you're reading ahead of my slides: some nodes, well, some services, were reporting missing data. We're not talking about empty data this time; it was more like half the data is there, half the data is gone, and some things are all inconsistent and weird.
So we scrubbed all of the data, restarted, and everything was well again, and since then we have never been able to reproduce the problem. We actually snapshotted the SSTables of the Cassandra clusters before we did the scrub, so we have the exact files exhibiting the problem, and when we moved those over to a new server to just try to reproduce it, everything just works. We don't know; this might have been a classical problem-exists-between-keyboard-and-chair situation.
What happens in Cassandra if a single node decides to be super slow? And when I say slow, I mean slow, not dead, because there's this thing called the snitch in Cassandra that will notice: oh, this guy is dead, I'm going to take him out, and I'm not going to send data there. But if a node is slow, and there can be many reasons why this would happen?
One of our Spotify favorites is bad RAID controller batteries, because you have to replace your RAID controller batteries roughly once a year. If you don't, they will switch to a slower mode: they no longer dare to cache writes as much, because then you might lose data that you thought was fsynced, and so on. So that can slow you down, and, well, apparently we suck at replacing our batteries.
Another reason for sudden slowness can be bursts of compactions or repairs, or a bursty load, or a network hiccup, or maybe you're doing a major garbage collection. In reality, slowness just happens in a 60-to-a-hundred-node cluster, or whatever. Maybe you barely ever encounter it with three nodes, but in a big cluster this will be a thing that happens many times a day.
So what happens when one node is slow? Basically, the coordinator queue will just start filling up with more and more requests to this one single slow node, and the queue will become full, and then, if you have a 60-node cluster, you will have one node that's hosed and 59 nodes that aren't receiving any traffic, because all of the coordinator queues are full. Has anyone ever told you that Cassandra has no single point of failure? They lied.
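A toy illustration of that failure mode, with made-up numbers: requests destined for one slow node pile up in a coordinator's bounded queue until requests for perfectly healthy nodes are rejected as well.

    from collections import deque

    QUEUE_LIMIT = 5
    queue = deque()  # one shared outbound queue on a coordinator node

    def enqueue(request, target_node):
        if len(queue) >= QUEUE_LIMIT:
            raise RuntimeError("queue full; healthy nodes are unreachable too")
        queue.append((request, target_node))

    # node-7 is slow, so its requests are never drained from the queue...
    for i in range(QUEUE_LIMIT):
        enqueue("req-%d" % i, "node-7")

    # ...and now a request for a perfectly healthy node is rejected too.
    try:
        enqueue("req-x", "node-1")
    except RuntimeError as e:
        print("healthy-node request dropped:", e)  # the coordinator is wedged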
So one way to solve this particular single point of failure is to use a partitioner-aware client, that is, a client that knows the partitioning schema and knows where all of the data is located, so that when it sends a request, it only sends it to one of the nodes that actually holds the data. This is actually also a performance improvement, because there will be less cross-node chatter and you can stay within the same process, so things will be slightly faster; but mostly I would say it's a reliability improvement, and it means that because of things like this, at most three nodes can go down instead of the entire cluster, and that's quite a big improvement. Astyanax, the Java client for Cassandra, has this feature, and you should always use Astyanax; you should not use Hector, among other things because of this. So a big shout-out to Netflix for writing an awesome Cassandra client.
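Here is a sketch of what "partitioner-aware" means, as a toy Python model: a hypothetical three-node ring where, as with the RandomPartitioner, a row key's token comes from its MD5 hash. Real clients such as Astyanax keep this ring metadata up to date for you.

    import bisect
    import hashlib

    # Hypothetical ring: sorted (token, node) pairs.
    RING = sorted([(2**125, "node-a"), (2**126, "node-b"), (2**127, "node-c")])
    TOKENS = [t for t, _ in RING]

    def token(row_key):
        # Toy tokens: the key's MD5 hash read as a big integer.
        return int.from_bytes(hashlib.md5(row_key).digest(), "big")

    def owner(row_key):
        # First node whose token is >= the key's token, wrapping the ring.
        i = bisect.bisect_left(TOKENS, token(row_key)) % len(RING)
        return RING[i][1]

    # A token-aware client sends the request straight to a node that owns
    # the data, so a wedged, unrelated coordinator never sits in the path.
    print(owner(b"spotify:user:alice:playlists"))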
Alright, our next topic is how not to delete data, and I could just make this really short and say: don't delete data. But let's go into some more detail. Specifically, the way Cassandra deletes data is kind of messy, the reason of course being the much-valued immutability of SSTables. If, once data hits disk, you never ever write to or mutate that SSTable again, how can you remove data from it? The obvious way to do it is through tombstones.
That is, you put in a write to the same column as the previous thing that you're trying to delete, with some kind of special flag saying: this is a tombstone, this data has been deleted. And this is what Cassandra does. Tombstones are basically like any other column write: they have the same kind of timestamps for versioning and so on, so we can look at them as any other write that can override other writes. And the question then becomes: tombstones are kind of special, right?
I mean, they are things that we would like to go away: once the original data is gone, the tombstones can go away. So does that actually ever happen? Well, the answer is (and this is really complicated to reason about) that tombstones can get merged into the SSTables that hold the original data, and finally the tombstones can then become redundant. But even once that has happened, because of the possibility that some nodes in the cluster have gone down and might come back up, there's also a grace time that has to pass before a tombstone may be dropped.
So when can tombstones actually be removed? I'm going to have to read this, because it's so complicated, at least to me, to express: tombstones can only be deleted if all SSTables holding values for the specified row are being compacted together. So if you have performed writes to the same row as the column you're trying to delete, and at some point in time they landed in an SSTable that you're not currently compacting, then that tombstone can never go away.
And this is the important distinction here, because maybe you only wrote to that specific column once and then deleted it, and that doesn't matter: if that column is in a row that is kind of popular, that gets written to every once in a while, then that tombstone will pretty much never, ever go away with size-tiered compaction. It's just not going to happen.
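The rule is easier to see in code than in prose. Here is a toy Python model (assuming the grace time has already passed) of when a compaction may drop a tombstone:

    # Each SSTable is modelled as a set of (row, column) pairs it holds data
    # or tombstones for.  A tombstone is only safe to drop when EVERY SSTable
    # containing that row takes part in the compaction; otherwise an older
    # value left behind could "resurrect" once the tombstone is gone.

    def can_drop_tombstones(row, compacting, all_sstables):
        with_row = [s for s in all_sstables if any(r == row for r, _ in s)]
        return all(s in compacting for s in with_row)

    sst_old = {("playlist:123", "rev-1")}            # old value, rarely compacted
    sst_new = {("playlist:123", "rev-1.tombstone")}  # the deletion

    print(can_drop_tombstones("playlist:123", [sst_new],
                              [sst_old, sst_new]))   # False: sst_old not included
    print(can_drop_tombstones("playlist:123", [sst_old, sst_new],
                              [sst_old, sst_new]))   # True: a major compaction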
So let's talk a little bit more about leveled compaction, because leveled compaction is the savior of Cassandra.
It should be, because in theory they say that 90% of all rows live in a single SSTable with leveled compaction, and this is assuming a bunch of things about the distribution of writes and so on. What we've found in our production clusters is that, depending on the type of data we're storing, somewhere between 50 and 80 percent of all reads only hit one SSTable; the others hit multiple SSTables.
That's generally not good enough, because if you think about it, with leveled compaction you will only ever be compacting separate SSTables that hold the same data space, because of the partitioning scheme used in leveled compaction. So the result of this is: if you're doing a lot of writes to a specific row, then that row will exist in all, or most, of the levels in the leveled compaction scheme.
And if that's the case, then the tombstones will stick around, because you won't be compacting all of the levels at the same time. So even with leveled compaction, this can definitely still be a problem.
The conclusion you should take away from this is that deletions in Cassandra are super messy, and the only surefire way to get rid of tombstones is to do major compactions, and, as you probably know, those have their own huge set of problems. Also, the problem will generally be much bigger with popular rows: if you have a very skewed workload, all of the rows that you don't care about will have no tombstones, but the ones that you really need to be quick will be slow. Avoid schemas that delete data.
So all of the smart people will now be thinking: well, what about TTL data? Once the TTL of a column expires, you can just compact it away. You know that you don't need it, you know there's no resurrection thing, you don't need a tombstone, right? We don't need the data anymore, we can just drop it, and TTL data will be super fast, right? Come on. Yeah, right. Yeah, awesome, cool. No. Okay.
If you happen to have old non-TTL data that you haven't compacted away, then that might reappear, and that would be so terrible, right? So because of this very esoteric use case, where people are writing the same values into the same column family sometimes with and sometimes without TTLs, they decided to do the whole tombstone song-and-dance thing that breaks everything, even for TTL data. So TTLs will not help you at all; they just won't. One thing that does help, though...
Alright, I'm going to go back a little bit more to cool ways to fail at using Cassandra. Specifically, I'm going to go into a very specific way to fail with Cassandra, which is our playlist service. Now, this is a really big service: we have something like 1 billion playlists in there, we are receiving a load of somewhere around 40,000 reads per second and a few hundred writes per second, and 22 terabytes of data compressed, almost double that if you uncompress it.
Backing that up would just take, say, several weeks, and even that didn't help us after a while, because we had so many files. So we just had a long period of time where we had no backups for one of our most important core services. That was in the old days, when we were a lot smaller and more naive. Anyway, we also had a home-brewed replication model that basically combined the worst of all worlds.
If a single node went down, we lost data, but we still had to write to a bunch of machines, and so on; it was an interesting time. We had frequent downtimes back then, though none of our users noticed this all that much. Anyway, we figured that this would be the perfect test case for Cassandra.
Let's take our most important piece of user-generated data, the one we have so, so many problems with, and let's just use that as a testbed for an entirely untested, cool new product that we don't know anything about. Isn't that a great idea? All right.
Here's an interesting thing that not a lot of people know about Spotify playlists: every playlist is actually a revision object. You can basically think about it as a slightly simplified git repository; it's a distributed version control system.
On a side note, it should be noted that this is a design decision that has kind of managed to save us from ourselves, because it is obviously in many ways silly, but on the plus side, whenever the playlist system would go down, well, people had these revision objects in their clients, and they wouldn't actually notice all that much, unless they had multiple devices, which people didn't back in the old days.
This is a rendering from a web interface of how a playlist actually looks. You can see here that we have a fork, with a bunch of concurrent modifications that have then been merged; simple stuff. Anyway, we have had huge amounts of issues with playlists and Cassandra, and I'm going to mostly be talking about one of them, which had to do with the HEAD column family. Now, the HEAD column family stores something very much akin to HEAD in git:
that is, the pointer to the latest revision ID. This actually serves somewhere around 90% of all requests to the playlist service, the reason being that whenever a client logs in, it will say: I have this version of this playlist, what is the latest version? Usually no one else has modified the playlist in question, and it will just get back: oh, you have the latest version, cool. If you get a mismatch, then you have to figure out and read the changes that have happened since, and things like that.
But 90% of the time you have the latest version, so all you need to do is access HEAD. It should be noted here that we actually, for a long time, mlocked HEAD into RAM, because it was so frequently used and we had performance issues because of page cache misses and so on, and this is something that might be a good idea to do, depending on your use case.
So, what I will talk about now is that even though we had HEAD locked completely into RAM, so that accessing HEAD was known to never cause disk I/O, we noticed that there were requests that actually took several seconds to complete, and we could reproduce this in the Cassandra CLI. We would issue a get to the right row, and what we would get back was 15 seconds of waiting before getting the data.
Why? Well, it turns out that the way we store data in HEAD is that we store the revision ID of the latest revision in the column name and have an empty column value. Why do we do this? Because when we do two simultaneous writes, remember your lovely vacation in Cambodia, we don't want the writes to clobber each other.
So when that happens, when you do two writes like that, you simply end up with two values in the HEAD column family, and then when you do a read, you do a slice request across everything in HEAD, and you will almost always get back just one result. If you get back two results, then you know: oops, I need to do a merge, and then you do that. So it's just a fork-detection thing. But of course, this turned out not to be a really optimal thing for Cassandra to store.
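Here is a sketch of that scheme as a toy Python model; the names are hypothetical, but the shape is as described: the revision ID lives in the column name, the value is empty, and a slice read doubles as fork detection.

    # Toy model of the HEAD column family: one row per playlist,
    # column NAME = revision id, column VALUE = empty.
    head = {}  # playlist_id -> {revision_id: b""}

    def write_head(playlist_id, revision_id):
        # Two concurrent writers never clobber each other; they simply
        # leave two distinct column names in the same row.
        head.setdefault(playlist_id, {})[revision_id] = b""

    def read_head(playlist_id):
        # A slice across the whole row; almost always exactly one column.
        revisions = sorted(head.get(playlist_id, {}))
        if len(revisions) > 1:
            return merge(playlist_id, revisions)  # fork detected
        return revisions[0] if revisions else None

    def merge(playlist_id, revisions):
        # Stand-in for the real revision merge: keep one winner and drop
        # the losers, exactly the kind of delete-heavy, hot-row write
        # pattern that the tombstone discussion above warns about.
        winner = revisions[-1]
        head[playlist_id] = {winner: b""}
        return winner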
We hadn't really thought about this. We designed it, like I said, two-plus years ago, back when we didn't know everything about Cassandra. We still don't know everything about Cassandra, but we do know more. We didn't realize that if you keep writing to the same row, there will be values for that row in a bunch of different SSTables, and then the tombstones can in fact never be compacted away. So for all of our most popular playlists, the tombstones just piled up.
Our solution to this, which is a bit humbling, is to perform major compactions, because with a major compaction you are actually merging all of the SSTables, and that means you know for a fact that you don't have any stray values elsewhere. So if there is no value left for a specific column, then you can just delete the tombstone.
So that's the only plausible way to safely remove tombstones. We did that, and things started looking up, and things were speedy again, for about a week, until we noticed that everything was slow again, and that was boring. Why? Well, it turns out that we were running repairs in between the major compactions.
We solved this by doing repairs on weekdays and major compactions on weekends. Hopefully you already know that you should never, ever use Cassandra to store a queue; and of course, if you think about it, the HEAD column family was a queue. A queue that always tried to hold only one value, but still some kind of queue, and Cassandra is terrible at storing queues. Don't do that.
Okay, so another interesting Cassandra feature: distributed counters. In the Spotify UI there are a bunch of things that we want to count. We want to count how many people are following a playlist, we want to count how many people are following an artist, how many times has this song been streamed, that kind of thing. And distributed counters in Cassandra are supposed to solve this: just transparently count things, you bump the counter, and everything will just work. So it's awesome. Yeah, yeah, sure it is.
The really important thing here, and this is something YouTube has not realized, is that if I click reload, then the number has to go up by one, and that will pretty much work with distributed counters. They can be off by a bunch of things: sometimes a count will go up by a thousand, or down by a thousand, but no one cares; as long as it's not the same number, people will just be happy.
So, look, another dog picture. I wonder how that got in there. Anyway: how not to fail, and I should note here that this is only a partial list. One thing that you should do, if you want to use Cassandra at scale, is treat it as a utility belt: a set of really well-tested, cool, interesting algorithms.
If you don't know what the algorithms do, you will poke it, and you will lose a finger, and you will be sad. But if you know what the algorithms do, what they do well and what they do poorly, then you can put together a very interesting storage system that will solve your problems in a very efficient and cool manner. Also, flash storage is completely awesome.
We do a lot of these things; I mentioned most of them, but I'm just going to reiterate: we do a bunch of really weird things in our clusters. We have clusters where we do major compactions on a weekly basis. We have clusters where we delete all of the data, physically delete the SSTables, and just recreate things from scratch. We mlock frequently used SSTables into memory. We do weird things, and you need to understand what the algorithms are.
You need to understand what the performance profile of these things is. If you don't, when you hit scale, you will be very sad. And this is actually a stark difference between Cassandra and, say, Postgres. In Postgres, the things that Postgres does well, it does well, and the things Postgres does poorly, it does okay. Cassandra is really good at the things it does well, and it's terrible at the things it does badly.
So you need to keep that in mind, and that pretty much goes for all the SQL things, all the old-school databases. Cassandra read performance is heavily dependent on the temporal patterns of your writes. Specifically, if you write to the same row over a long period of time, you will get bad performance.
And whenever someone shows you a benchmark claiming this or that database is faster: all of these benchmarks are pretty much useless, because they didn't actually run the benchmark for two years before measuring things. They ran it for like a week or something, and that's not relevant; that just shows you what Cassandra will do for the first few months. So don't believe the benchmarks.
Also, there is still a bunch of esoteric problems working with large-scale Cassandra installations. I think the next few years in storage are going to be super interesting, because we've probably gotten roughly as much performance as we can out of using flash in the form of SSDs that still use SATA or whatever protocol. In order to really start reaping the benefits, we will have to come up with new hardware protocols to talk to storage.
We will suddenly have really cool new abilities when database products can go down and manipulate things in the lower layers, and we're in for a really interesting time in the next few years. And if you agree with that statement, you should totally come work with us, because we're going to have a blast: spotify.com/jobs. I think there's supposed to be a microphone somewhere around here if anyone wants to ask a question. There's a hand over there, and there's a man with a microphone; they seem like a good fit.
[Inaudible audience question.] I would say that, in general, if you have performance problems with Cassandra, you probably need to change your data schema. Specifically, in the case of the user row: if you're deleting columns in your user database, then what on earth are you doing? You should not do that. If you need to remove and add columns a bunch of times, you should probably serialize all of that data into some kind of binary blob, protobuf or JSON or whatever, I don't really care, and then put that into one column with a constant name.
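A minimal illustration of that advice in Python, assuming JSON as the codec: all of the churn happens inside the blob, so Cassandra sees one stable column instead of a stream of column additions and deletions (and their tombstones).

    import json

    # Instead of adding and removing one Cassandra column per attribute
    # (each removal producing a tombstone), pack everything into a single
    # blob stored under one constant column name such as "data".
    profile = {"name": "alice", "country": "SE", "beta_features": ["x", "y"]}
    blob = json.dumps(profile).encode("utf-8")  # write this to column "data"

    # Dropping an attribute is now just an overwrite of the same column:
    # no per-column delete, and therefore no tombstone.
    profile.pop("beta_features")
    blob = json.dumps(profile).encode("utf-8")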
[Inaudible audience question about compression.] It's been a significant difference for us in some column families, and we are happily using compression in other column families. So the way to detect that is to turn on compression and measure the performance difference. Once again, nodetool cfhistograms is your friend, because it will show not only the average access time but also the distribution, so you have a lot of data at your fingertips. Just notice: is this a problem for me or not?
[Audience question:] To handle this tombstone problem, have you looked at writing your own compaction strategy? The compaction strategy in Cassandra is pluggable, right? And since you seem to have submitted a couple of patches to Cassandra: have you looked at it, and what is your experience?
The question was whether we have thought about writing our own compaction strategies, and the answer is definitely yes. My awesome coworker Marcus is still sitting over there, and he'll turn redder every time I mention him. He has a plan for a very useful compaction strategy which is based on time:
it's for time series data. You basically just compact things that have happened concurrently, and that will make for very efficient usage with, specifically, time series data, because for any given time point you will never have to check more than one SSTable, if what you're storing is time series data. We have also thought about a bunch of different things.
Honestly, though, leveled compaction is a really, really clever thing; it solves a bunch of problems of size-tiered compaction. I don't think people in general understand how it works well enough. I'm probably going, at some future conference, to try and give a talk on exactly how it works, because I find that most people only know that it's something about sharding or whatever. It's a very good thing, and it has some drawbacks, but if you know about them you can often work around them, and it's a very interesting compaction strategy.
[Inaudible audience question.] Well, what we want to do there is basically replace all of the old data: we have data that we regenerate from scratch every day. If we would just insert it, the old data would have to be compacted away; because we are just deleting the SSTables instead, we can turn off compaction entirely, just write the data, and be happy, and that works very well for us, for that use case.
[Inaudible audience question.] That is a very good question, and I cannot give you a short answer better than: it varies. We have very many different services, and they all have different storage needs, and we really try to stay away from doing any kind of one-size-fits-all solutions. We have some data that lives only in one data center; we have some that is replicated to all data centers. We are moving towards making more and more data home-sited, that is, located only in the data center where the user accesses it the most, but we're not doing that a lot currently, because we tried to do it and we failed spectacularly. I don't think there can be one good answer to that: it depends on how many writes you're seeing, how big your data set is, and a bunch of other factors; also, of course, how much of the data in question is shared between people. So it's a really complex question, and I don't think there's one single good answer. Sorry about that.