From YouTube: C* Summit 2013: Time-Series Metrics with Cassandra
Description
Speaker: Mike Heffner, Engineer & Co-Founder at Librato
Slides: https://speakerdeck.com/mheffner/time-series-metrics-with-cassandra
Librato's Metrics platform relies on Cassandra as its sole data storage platform for time-series data. This session will discuss how we have scaled from a single six node Cassandra ring two years ago to the multiple storage rings that handle over 150,000 writes/second today. We'll cover the steps we have taken to scale the platform including the evolution of our underlying schema, operational tricks, and client-library improvements. The session will finish with our suggestions on how we believe Cassandra as a project and its community can be improved.
Let's get started here. Thanks for coming out to the talk. I'm Mike Heffner, and my talk today is time-series metrics with Cassandra. It's going to be a brief overview of what we've been doing with Cassandra at Librato, some of the things we've learned over time, and some of the problems we've faced. So, just a quick background on what we do.
Librato is a time-series metrics platform. We run it as a hosted service and take data in through a fairly simple API.
We started off around mid-2011, and the first decision we made was driven by knowing that the measurement data was going to be our largest problem, and we wanted to make it scalable. So we did start with the decision to put all of that data into Cassandra, and we started on EC2 with a six-node m1.large ring.
This was around the 0.8 milestone when we started, I think about midway through that release, and the key takeaway is that we really did not know what we were doing at the time. So, fast forward to today.
We actually run multiple sharded rings today. We shard on a couple of different aspects; there are a few very heavy users that we put onto their own rings. We also store different resolutions of data, so we do roll-ups for extended durations, and we have different TTLs depending on the roll-up.
We put those onto different sharded rings as well. We're still on EC2, though we've upgraded our node class a little, to extra-large instances, the m2.4xlarge. We are on the 1.1 milestone and hoping to get to 1.2 very soon. Our load is a little over 200,000 writes per second across our rings, and the takeaway is that we have a very small read load, so we really have to optimize for the write path.
So, some highlights of what I'm going to talk about today. I'll show how we iteratively improved our schema as we came to understand the performance issues we'd had with the storage. We do TTL expiration for all of our data, so a large part of our problem is getting data out of the system fast and efficiently. And I'll briefly touch on how we monitor our rings and their performance. So: adapting the schema to the storage.
Just a quick background on what a measurement is for us. We're a little different from some other time-series storage systems in that we have a split: we have metrics that we identify by name, and we use a MySQL system to map those names to an integer ID, so inside of our Cassandra store everything is identified by an integer ID. But we also have the concept of a source, so you can have multiple sources publishing a single metric to us; a source might be a host name.
We have our metric ID, and epoch column keys that identify the timestamps of our data. As you can see, we can have a number of sources publishing for a particular metric, and that's a dimension that can expand fairly widely: some customers may have thousands of sources, some may have only a few. We rotate this row by the base epoch, and we do that fairly frequently to ensure we don't end up with giant rows. Our standard data is encoded into a JSON format for storage.
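Roughly, the row layout described above can be sketched like this; the helper names, the plain-tuple representation, and the one-day rotation width are illustrative assumptions (the talk doesn't give the real values), but the shape is as described: a row key rotated on a base epoch, columns keyed by timestamp and source, and JSON-encoded values.

```python
import json
import time

# Assumed rotation width: how much wall-clock time one row covers before the
# row key rolls over to a new base epoch (the actual value isn't given).
ROW_WIDTH = 24 * 60 * 60


def row_key(metric_id, epoch):
    """Row key = (integer metric ID, base epoch of the rotation window)."""
    return (metric_id, epoch - (epoch % ROW_WIDTH))


def column_for(epoch, source, value):
    """Column name carries the timestamp and source; the value is JSON."""
    return (epoch, source), json.dumps({"value": value})


# Two sources publishing the same metric land in the same row as two columns.
now = int(time.time())
print(row_key(42, now))
print(column_for(now, "web-1.example.com", 0.93))
print(column_for(now, "web-2.example.com", 1.07))
```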
The part that actually caused the most problems for us was: how do we identify where these rows are? Say you have a metric that publishes fairly sporadically; over a week it may publish for a few hours here and there, so we need some way to efficiently identify where that metric actually published.
So we have another column family, keyed just by the metric ID, and every time that metric is published we stick a timestamp into this table. To look up data for a metric, we do a column slice on this row, identify the unique time bases where that metric published, and then go and do a multiget, you know, a parallel get, to find the actual raw data.
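That lookup path can be sketched with plain in-memory dictionaries standing in for the two column families (all names here are hypothetical; the point is the shape of the algorithm: slice the per-metric index row, reduce it to distinct time bases, then fetch only those raw-data rows in parallel).

```python
from concurrent.futures import ThreadPoolExecutor

ROW_WIDTH = 24 * 60 * 60  # assumed rotation width of the raw-data rows

# index_cf[metric_id] -> epochs at which that metric published
# data_cf[(metric_id, base_epoch)] -> raw measurement columns for that row
index_cf = {42: [1368000000, 1368000060, 1368604800]}
data_cf = {(42, 1367971200): {(1368000000, "web-1"): '{"value": 0.9}'}}


def read_range(metric_id, start, end):
    # 1. Column-slice the index row to find when this metric published.
    epochs = [e for e in index_cf.get(metric_id, []) if start <= e <= end]

    # 2. Reduce those epochs to the unique time bases (row keys) they cover.
    bases = sorted({e - (e % ROW_WIDTH) for e in epochs})

    # 3. Parallel-get (multiget) only the raw rows for those bases.
    with ThreadPoolExecutor(max_workers=8) as pool:
        rows = pool.map(lambda b: data_cf.get((metric_id, b), {}), bases)

    # Merge and trim to the requested window.
    result = {}
    for row in rows:
        result.update({k: v for k, v in row.items() if start <= k[0] <= end})
    return result


print(read_range(42, 1368000000, 1368010000))
```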
When we were building this, we sort of did the math on it. These are our one-minute records, which we've defined to have a one-week TTL, so we can figure out the maximum row size we're talking about. For us it was about ten thousand entries (a week of one-minute data is 7 × 24 × 60 = 10,080 points) and about 320 KB for tracking this, so not terribly bad, and we should be able to do efficient slices on it. We ran with this for several months before we started to hit some fairly major performance problems trying to read from this table.
We were seeing multi-second latencies trying to read column slices out of it, and the important part is that we were doing fairly small column slices inside the app: most of our views into the one-week, one-minute data are three or six hour periods, so over a week we're slicing only about one to two percent of the full row.
To understand what was going on, we dug into the nodetool utility, specifically cfhistograms. This prints out how many SSTables your reads are actually consulting when you do an operation. When we ran it against this particular column family, we found we were touching anywhere from 1 to 10 SSTables to serve these fairly small column slices.
On average it was about 4 to 6, which obviously was not what we wanted, and it was what was leading to the performance issues we were seeing. To understand why, we took a look at how the data was coming into our system and how we were storing it. We're getting thousands of metrics in over time, and as they come in they're getting dumped into SSTables, and some of those get compacted together.
So you have varying sizes of SSTables, but in general the SSTables follow the progression of time across that one week of TTL we maintain for this column family. That means the one row identifying where a metric is gets spread across these various SSTables.
So even though we just needed a very small column slice of this row, when we did those column slices we had to seek to the row's various locations in all the SSTables, doing a lot of random I/O, basically just to pick out a very small portion of that row.
That led us to our next design: even though this is a fairly small, defined-size row, let's start rotating it on a time basis as well. So we started rotating by adding a time-base rotation onto the key. This was a fixed rotation that we defined for this column family, and whenever we needed to do a lookup we just figured out what the covered time bases were between the epoch times we needed to query.
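What that rotation means for a lookup is just enumerating the covered time bases for the query window, something like the sketch below (the one-week width and function names are assumptions for illustration):

```python
INDEX_ROW_WIDTH = 7 * 24 * 60 * 60  # assumed fixed rotation for this CF


def covered_time_bases(start, end, width=INDEX_ROW_WIDTH):
    """Yield the base epochs whose rotation windows overlap [start, end]."""
    base = start - (start % width)
    while base <= end:
        yield base
        base += width


def index_row_keys(metric_id, start, end):
    """Row keys to slice instead of one ever-growing per-metric row."""
    return [(metric_id, base) for base in covered_time_bases(start, end)]


# A six-hour query window typically touches a single, much smaller index row.
print(index_row_keys(42, 1368000000, 1368021600))
```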
Running the same cfhistograms tool afterwards, we can see that we improved significantly: we're now reading no more than five SSTables at a time, averaging right around two to three for almost all reads. Additionally, not only are we reading from fewer SSTables, but we can also spread that load across the nodes of the ring now that we're rotating those keys more frequently.
This is just the graph of it: once we put this in place we saw a significant improvement. That was something we only identified later, as we were building the system: actually understanding how the data comes into the system and then progresses through time really helped us improve the schema in terms of how we store it on disk.
The next part I want to talk about is the expiration of data in our system. Like I said, we TTL-expire everything out of our system; we don't do any deletes, and we do not update data. We churn about 750 gigs a day across our rings, which is about 6 percent of our total data set that we're trying to expire every single day, and we set gc_grace_seconds to zero. Once data TTL-expires, we're not keeping tombstones around for extended periods of time, and we found that with our write load, size-tiered compaction worked best for us.
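For illustration, the two knobs described here (a TTL on every write, and gc_grace_seconds of zero) look like this in modern CQL via the Python driver. This is a hedged sketch with made-up keyspace and table names; the schema actually described in the talk is a Thrift-era wide-row layout rather than this CQL table.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # hypothetical local node
session = cluster.connect("metrics")      # hypothetical keyspace

ONE_WEEK = 7 * 24 * 60 * 60

# Every write carries its own TTL, so data ages out with no deletes/updates.
session.execute(
    "INSERT INTO measures (metric_id, base_epoch, epoch, source, value) "
    "VALUES (%s, %s, %s, %s, %s) USING TTL %s",
    (42, 1367971200, 1368000060, "web-1", '{"value": 0.93}', ONE_WEEK),
)

# With no deletes or overwrites there is no repair-related reason to keep
# tombstones around, so gc_grace_seconds can be dropped to zero.
session.execute("ALTER TABLE measures WITH gc_grace_seconds = 0")
```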
A problem we found, though, was that we had periods where we would end up with a lot of large, uncompacted SSTables building up on the nodes until they finally hit the threshold where there were enough SSTables to do a large compaction. So we stepped through a number of different things, trying to figure out how we could break up the synchronization of when these nodes would go through their large compactions, but also to see if we could force-expire data from the ring earlier, closer to when it actually expired and was no longer useful.
A few things we looked at. Obviously nodetool compact: we could do a major compaction on our data. However, this generally leads to the larger problem of ending up with a giant single SSTable that is then much harder to compact in the future, and it's also significantly more overhead than we were looking for, so generally not a great idea.
Beyond that, we actually run that frequently on some of our rings, and when it goes through an SSTable it looks for rows that are no longer active in that particular SSTable. Particularly with our new row key rotation scheme, that helps reduce the references a single row key has across the SSTables in our system. So when it goes through and cleans up SSTables, we actually throw out a lot of TTL-expired data, reducing the load on the particular nodes in our ring.
Now, the downside is that it still has to go through every SSTable, which is still not insignificant overhead on the ring. So we wanted to go further: can we take the properties of the SSTables themselves, and of the data that we're putting in them, to figure out whether there's a way to do this a little more efficiently?
So, for example, in this case it's May 17th, and this is data with a one-week TTL. By definition, we can go in and completely remove this SSTable from the system, because we know all the data in it has expired. This is nice: it has fairly low overhead and quickly removes entire SSTables from the system. The downside is that we don't want to just rip these out from underneath Cassandra.
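The check behind that is simple: if the newest write in an SSTable is older than the TTL (plus gc_grace, which is zero here), every cell in the file has already expired. A sketch of just that predicate follows; where the newest-write timestamp comes from (e.g. SSTable metadata) is left as an assumption, and as noted you wouldn't delete files out from under a running node this way.

```python
import time

ONE_WEEK = 7 * 24 * 60 * 60


def sstable_fully_expired(newest_write_ts, ttl=ONE_WEEK, gc_grace=0, now=None):
    """True if every cell written to this SSTable must already have expired.

    newest_write_ts: the most recent write timestamp recorded for the file,
    in seconds since the epoch (assumed to be available from its metadata).
    """
    now = now if now is not None else time.time()
    return newest_write_ts + ttl + gc_grace < now


# Example: a file whose newest write is eight days old, under a one-week TTL,
# contributes nothing to reads and can be expired in its entirety.
print(sstable_fully_expired(time.time() - 8 * 24 * 60 * 60))  # True
```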
Another policy we've put in place on some of our rings: we use size-tiered compaction, and by default that requires a minimum of four SSTables to actually trigger a compaction. So we end up with a lot of large SSTables that don't have enough peers to compact with, and they'll sit around for quite a long time without actually being compacted. We've experimented with dropping that minimum down to two on some of our rings, particularly rings that have high volume but lower TTLs.
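As a rough illustration of that knob in today's CQL syntax (on the 1.0/1.1 clusters described in the talk this was the column family's min_compaction_threshold setting; keyspace and table names are made up):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")

# Let size-tiered compaction trigger with only two similarly sized SSTables,
# so large tables on high-volume, short-TTL rings don't sit around waiting
# for four peers to accumulate.
session.execute(
    "ALTER TABLE measures WITH compaction = "
    "{'class': 'SizeTieredCompactionStrategy', 'min_threshold': 2}"
)
```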
There are two things I want to mention that we're looking forward to for this particular problem. In 1.2, like Jonathan said, a lot of the bloom filters and compression metadata have been moved off-heap. A lot of the problems we have show up once we pass about a 400-gig limit per node: we start to see a lot of heap contention and a lot of issues with the garbage collector. So moving that stuff off-heap, we believe, is going to help us a lot in reducing that issue.
We believe that's going to help us a lot in the future as well, letting us do those frequent compactions without having to run at the lower size-tiered compaction minimum threshold. So yeah, there are a number of ways we reap old data out of the system frequently; it varies based on the ring we're running and the properties of the data we're storing there. But we do spend a lot of time trying to get data out of the system as quickly as possible.
We are a monitoring company, so we have a few different things we use for monitoring. Ring dashboards: I highly recommend setting up a ring dashboard to monitor the key properties of your ring. We snapshot JMX attributes from Cassandra, and our ops guys spend a lot of time looking at these graphs. We've been able to detect a lot of issues just by watching the trends on the dashboards, and we've been able to go in and fix issues before they became a much larger problem.
A
Discouraged
this
is
pretty
straightforward,
but
we
actually
weren't
looking
at
this
originally.
But
if
you
ever
see
in
your
D
message
logs
an
end
request,
IO
air
for
the
device
that
there
Cassandra's
running
on
I
highly
recommend
throwing
that
nat
anode
out
very
quickly.
It
is
likely
to
die
on
you
soon.
So,
interestingly,
we
have
found
that
there's
no
metric
easily
to
track
this
in
Linux
systems,
at
least
using
collecti
and
their
typical
system
metrics.
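Lacking a ready-made collectd metric, a crude watcher can just scan the kernel log for I/O errors on the data volume; a sketch along those lines is below (the device name and message pattern are assumptions, since the exact wording varies by kernel version):

```python
import re
import subprocess

DEVICE = "xvdb"  # hypothetical EBS/ephemeral volume holding Cassandra data
PATTERN = re.compile(r"I/O error.*\b" + re.escape(DEVICE) + r"\b")


def io_error_count():
    """Count kernel-log lines reporting I/O errors on the data device."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines() if PATTERN.search(line))


# Emit as a gauge; anything non-zero means rotate the node out soon.
print("cassandra.dmesg.io_errors", io_error_count())
```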
The Cassandra system log is also important: tracking the volume of messages per node over time can give you a sort of overall health indicator for the node. We started doing this recently: every 10 minutes we scrape the number of log lines we've seen in the system log, submit that as a metric, and track it comparatively over time across the nodes in the system.
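That scrape can be as simple as remembering a byte offset into system.log and counting the lines appended since the last run; a minimal sketch follows (paths and metric names are hypothetical, and the actual submission to a metrics API is reduced to a print):

```python
import os

SYSTEM_LOG = "/var/log/cassandra/system.log"       # hypothetical paths
STATE_FILE = "/var/tmp/cassandra_log_offset"


def new_log_lines():
    """Return the lines appended to system.log since the previous scrape."""
    offset = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            offset = int(f.read().strip() or 0)

    if os.path.getsize(SYSTEM_LOG) < offset:
        offset = 0                                  # log was rotated

    with open(SYSTEM_LOG) as f:
        f.seek(offset)
        lines = f.readlines()
        new_offset = f.tell()

    with open(STATE_FILE, "w") as f:
        f.write(str(new_offset))
    return lines


# Run every 10 minutes (e.g. from cron) and submit the counts per node.
lines = new_log_lines()
print("cassandra.log.lines", len(lines))
# Lines without a ".java" source reference usually belong to a stack trace.
print("cassandra.log.non_java_lines", sum(1 for l in lines if ".java" not in l))
```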
A few of the things we've been able to see with this: unbalanced node load, and schema disagreements. We started our upgrade from 1.0 to 1.1, I think maybe two minor releases into the 1.1 release line. We rolled two nodes over to 1.1 and found there was a schema disagreement that was causing the two nodes to flush in almost an infinite-loop scenario, and we saw the log volumes on those two nodes shoot up pretty drastically.
When we saw that and investigated a little further, we backed out and downgraded back to 1.0. That ended up being a JIRA issue that we filed, and I think it got fixed later on. So that was something we detected purely from the log volume; we wouldn't normally have seen that issue.
A
Yes,
phantom
gossip
nodes,
we've
had
numerous
situations
where
we've
removed
a
node
and
several
days
later,
but
the
cluster
is
still
gossiping
about
it,
either
causing
the
hinted
handoffs
to
flush,
frequently
or
flushing
of
some
of
the
other
system
tables.
So
we
actually
see
that
spike
in
log
volume
as
well,
which
causes
us
to
go
in
and
look
and
then
also
garbage
collector
activity,
if
that's
being
a
bit
more
overactive
than
I,
know
we'll
see
that
a
lot
of
ahlian
change
will
go
in
and
potentially
restart
that
node
to
see.
And then, yes, all the messages from Cassandra have the source file name in them, so we found it's useful to also track any log messages that don't have a .java extension in them, because that will usually point to an exception on a particular machine. Even if that exception hasn't actually taken down the machine, we find it useful to go in and figure out why that node threw an exception. So, just a general health metric that we've found quite useful to track.
So we have two rings of a little over 24 nodes, and then a 12-node ring. We've done a few cases where we've added a few much larger, higher-CPU instance nodes into a ring. We found that we can direct writes to them disproportionately; CPU is usually our bounding resource, so we'll actually push more writes to those nodes as coordinators, even if they don't own a lot of the token space.
A
So
we
actually
I
think
we,
through
the
nose
out
and
just
bootstrap
back,
are
like
one
zero,
a
my
that
we
had
so
I
I'd
have
to
talk
to
our
opposite
engineer,
but
I
think
I
think
it
went
fairly
smoothly
and
there.
Actually,
there
was
a
couple
issues
that
we
then
tested
offline
and
one
one
with
that.
After
having
that
experience,
so
we
actually
waited
I
think
a
couple
patch
releases
past
what
we
had
been
trying
to
do
at
the
time.
A
A
A
A
Yeah, so actually one of the things about measurements: I put the measurement definition up there as a simple value, but we actually support a complex measurement. When you send data to us, you can send in a min, a max, a count, and the sum for that particular measurement. The use case we have is that a lot of the time, if you're tracking something like requests to your API, you may have thousands of those a second, so you're already doing some client-side aggregation of that, and you publish it up to us as a pre-aggregated value. Then, when we roll up your raw data to higher resolutions (the hour being the coarsest granularity we store), we actually do the aggregation when we do that roll-up. But we are looking to do more advanced aggregates across metrics.
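Client-side pre-aggregation of that kind boils down to collapsing all the samples in a reporting interval into the four fields mentioned (min, max, count, sum); a tiny sketch, with field names chosen for illustration rather than quoted from the API:

```python
def complex_measurement(samples):
    """Collapse raw samples from one interval into a single aggregate."""
    return {
        "count": len(samples),
        "sum": sum(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Thousands of per-request latencies become one publishable value; later
# roll-ups stay exact, since sums/counts add and min/max compose.
latencies_ms = [12.0, 48.5, 7.3, 103.2, 19.9]
print(complex_measurement(latencies_ms))
```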
Right, so we have two other things on that, just quickly. We can also aggregate on the server side: if you're sending in your various aggregates, we aggregate them on each timescale and only store that aggregate, which is sort of a stream-based model that we have as well. We also have an index into our data, so if you want to search for only a given set of sources, we can get into that data fairly efficiently.