From YouTube: C* Summit 2013: Cassandra at Instagram
Description
Speaker: Rick Branson, Infrastructure Engineer at Instagram
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-cassandra-at-instagram-23756207
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.
We had — we were storing some data in Redis and it was filling up; we were, you know, running out of memory, like Redis does. In combination with that, it was kind of the end of the little love affair we had with Redis, for a bunch of different reasons. So why does Redis suck sometimes? That's why it sucks. Oh.
Maybe caps lock was a reason, I don't know. All right, let's go back here. Where were we — well, obviously, the most obvious thing about Redis that sucks is that memory is expensive. Redis is an in-memory data store, and if you're storing stuff in it that you're not reading all the time, it just kind of falls apart for those use cases. It's actually really easy to prototype stuff in Redis.
A lot of times you don't have to worry about how fast you're putting data in or how fast you're getting it out, because it just uses memory, and memory is very fast. The problem is that in-memory degrades poorly, and this is actually a little less obvious.
So yeah, the cliff: there are always cliffs in many systems; this is just a particularly bad cliff. It also has other problems. It's got a flat namespace. A lot of people kind of poo-poo the whole idea of having schemas, but schemas are great for finding out how much memory, how much CPU, and how much disk space is being used by certain types of things.
If you just have a big flat namespace, you don't know what's in there, and that just sucks. Heap fragmentation sucks too: Redis is all in memory, obviously, and the general rule of thumb is 30 to 40 percent overhead on top of the data just to deal with heap fragmentation issues that build up over time. It's also single-threaded, which is really painful.
You get on a box, you're starting to see some errors on reads, and it's just pegging one core — you're sort of SOL at that point, which is no fun. There's also the snapshot process. Whenever you do a snapshot, which is either saving stuff to disk or syncing a new slave you've connected to the master, Redis has to perform this BGSAVE process, where it streams all the data that's in memory to disk.
It does this by doing a fork, which is a really simple way to implement that. The problem is that you're accruing all these changes in memory, and a lot of times you'll eventually just run out of memory. So it's kind of like speeding towards — you know, going off a bridge. I digress.
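(As an aside, this failure mode is visible from the outside. Here's a minimal monitoring sketch with the redis-py client — the INFO fields are standard Redis; the connection details and thresholds are assumptions:)

```python
import redis

r = redis.Redis(host="localhost", port=6379)

mem = r.info("memory")
persist = r.info("persistence")

# BGSAVE forks, so copy-on-write can roughly double resident memory while a
# snapshot is in flight; alert well before used_memory nears the box size.
print("used_memory:", mem["used_memory_human"])
print("bgsave in progress:", persist["rdb_bgsave_in_progress"])
print("last fork took (usec):", persist.get("latest_fork_usec"))
```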
So, the data that we moved from Redis to Cassandra the first time basically boils down to centralized logging: a very high skew of writes to reads, an ever-growing data set, and durability very highly valued.
The main reason I think we chose Cassandra specifically for this use case was that it was really in its wheelhouse — you know, a very high volume of writes. There's really nothing like Cassandra for that. This was our initial setup: Cassandra 1.1, three nodes, RAIDed ephemerals, RF=3, six gig heap, etc., etc.
We used HSHA, the Thrift server that allows you to take a lot of connections. At the time I think we had something like fifteen thousand connections open to the cluster, and we needed to run this so that it wasn't one thread per connection. So that's a little tip on that.
It worked — mostly. Who likes to hear people come and tell them they put something into production and it worked perfectly and nothing went wrong? I hate going to talks where people do that, so I'm going to tell you all the things that went wrong.
One really cool thing about Chef is that you usually have this nice log of disaster — basically a little captain's log of your disaster as you went through and navigated some of these issues — so I was kind of able to go through and see what we did. This was just an example: we did something really stupid and didn't configure the listeners right.
Then there was this guy. There's this thing where, if you're running a certain version of the JVM, it needs a larger stack size, and that didn't get put into the /etc config that was part of the Debian packaging. The stack size was too small, and it manifested itself as nodes just randomly dropping out of the ring, not being able to talk to each other for no reason; they would join the ring and then sit there and do nothing.
It was really very scary, but once we realized what the fix was, it was really easy and it magically worked. Most people who are just installing Cassandra will never have that problem, but it was fun to reminisce about the good old days.
November — so this is about six weeks later. We doubled the cluster to six nodes. We saw that it was spending a lot of CPU time just dealing with connections; we were up to 18,000 connections at the time. 2000 — nope, 2012.
Yep — little do you know, we talked about time in the last one. So, we ran into some heap pressure, so we dropped the key cache size. We were running like a one gig key cache, which is apparently freaking huge. We dropped that down to 256 megabytes and that helped. There's really not a lot of documentation I found around what your key cache size should be — so, the ratio of our heap, which was six gigs, to 256 megs of key cache. We looked at our hit rate.
A
It
was
fine,
didn't
really
go
down
very
much,
so
we
were
fine
with
that.
We
also
had
more
gc
pressure.
We
lowered
our
mem
table
sizes.
This
was
before
we
upgraded
to
1.2,
which
has
moved
some
more
stuff
off
heaps,
so
you
don't
run
into
some
of
those
things,
so
I
did
that
and
everything
was
stable
at
that
point,
good
pun.
Until, of course, I got the urge to, you know, fix things that aren't broken — which was to upgrade to a very early release of the next major release of Cassandra. Actually, I was really surprised at how well it went: so surprised that I got really confident, really screwed myself over, and the horses just kind of ran everywhere. It really boiled down to these two commits, which you can see are pretty far apart from each other.
So that's like a week's worth of on-and-off trying to migrate this cluster — trying to move a cluster that was not using vnodes to vnodes, on a very old, early version of that 1.2 release. We'll probably try it again, but that was fun. Takeaway: let it bake; you know, let other people deal with this stuff first.
Everybody says, you know, don't use 1.2.1 — use like 1.2.6 or 1.2.7. Caveat emptor. We doubled it in February to 12 nodes, we replaced a node — just various things. This is our exceptions graph for the last six months; I feel pretty good about that.
This is what the CPU looks like: pretty much everything in that pink line is I/O, so it's mostly I/O bound, which is great. 3.4 terabytes of data, and we'll try the vnode migration again. The takeaway from this: adopt technology by understanding what it's good at, like I said earlier, and do that first. Don't throw it at something that is really new or that you don't understand — use it for what you understand first, and you'll be much more successful at it.
This next one was in a sharded Redis cluster: 32 nodes, 68 gigs a node, 16 masters, 16 slaves. We found that we were just memory-bound with this, and we started to get alarms. So we figured we could either spend a bunch of time manually resharding this, or we could try to port it to Cassandra and get some benefit out of it. So we decided to port it.
Redis does this operation to basically bound the size — we want to keep, I don't know what the exact number is, but it's like a hundred of these items. We don't really want to go above that; they become irrelevant at a certain point. And we also wanted to be able to undo.
So we want to be able to say: if somebody liked a photo and then they undid that, we didn't want it to show up in that view. Kind of a basic way to implement this in Cassandra — kind of the most naive way, which is what we started with — was just to use a TimeUUID column.
This is basic time series in Cassandra. It's reverse-ordered, so the newest stuff is at the head of the row. We just take the Thrift-serialized activities and stick them in as they are, and the problems start to come when we get into details like bounding the size.
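(For orientation, here's a minimal sketch of that reverse-ordered time-series model, written as CQL via the modern DataStax Python driver rather than the Thrift interface the talk describes; the keyspace, table, and column names are hypothetical:)

```python
from uuid import uuid1
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feeds")  # hypothetical keyspace

# Reverse clustering order keeps the newest activity at the head of the row.
session.execute("""
    CREATE TABLE IF NOT EXISTS activity (
        user_id     bigint,
        activity_id timeuuid,
        payload     blob,        -- the serialized activity, stored as-is
        PRIMARY KEY (user_id, activity_id)
    ) WITH CLUSTERING ORDER BY (activity_id DESC)
""")

# uuid1() is a time-based UUID, so a column's position encodes its time.
session.execute(
    "INSERT INTO activity (user_id, activity_id, payload) VALUES (%s, %s, %s)",
    (42, uuid1(), b"<serialized activity>"))
```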
The naive way to bound the size would be to read the row, look at all the columns that are past 100, and just start dropping them one at a time. The problem is that Justin Bieber uses Instagram.
This photo in particular has 921,000 likes on it, so what happens over time, if you're constantly trimming one column at a time, is that you end up with millions of tombstones in his row, and he'll never be able to load his app. Obviously this is an extreme case, but there are lots and lots of users — you know, in the tens of thousands, maybe hundreds of thousands now — that would run into this issue.
So this is nice: basically, the time component of our TimeUUID can be the same as our timestamp, and what that lets us do is a row delete. Normally, when you do a row delete, you say "delete this row" and you use whatever the timestamp is right now. The thing is, if you use a timestamp in the past, you can actually say: delete everything before that timestamp.
So this is a really good way of trimming old data off of a row. It also gives you a little bit of read optimization: when Cassandra is reading a bunch of SSTables on disk, it can actually start skipping some of them, because it knows anything older than that timestamp can be discarded. Each file on disk is marked with the highest timestamp it contains, so Cassandra can completely ignore a file using data in memory.
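(In CQL terms, the trick looks something like the sketch below — a partition delete issued USING TIMESTAMP in the past only shadows data written before that timestamp. This reuses the hypothetical `activity` table from the earlier sketch:)

```python
import time

def trim_activities_before(session, user_id, age_seconds):
    # Cassandra write timestamps are microseconds since the epoch.
    cutoff_us = int((time.time() - age_seconds) * 1_000_000)
    # Everything in this partition written before cutoff_us is now deleted;
    # newer columns survive, and whole SSTables whose max timestamp is older
    # than the cutoff can be skipped entirely at read time.
    session.execute(
        f"DELETE FROM activity USING TIMESTAMP {cutoff_us} WHERE user_id = %s",
        (user_id,))
```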
The other thing is we have to do undos, and — surprising to me when I looked at this — 10 percent of actions are undos. That's the "I liked a photo, and then I unliked it," and we have to fix the data that we stored. The initial way you might think about doing this would be to get the row as it is, find the matching activity, and then just delete it.
A
This
actually
kind
of
creates
a
race
condition
which
in
any
time
you
see
a
get
and
then
a
write,
a
read
and
then
write
in
a
system
within
the
same
sort
of
you
know,
one
after
the
other.
You
really
start
to
it's
kind
of
a
code
smell
in
a
way,
so
you
cannot
guarantee
that
things
didn't
change
between
your
git
and
your
delete
and
when
you're
talking
about
a
system
with
a
lot
of
data
coming
in
it's
very
it's
almost
certain
that
you'll
experience
this
all
the
time.
You also have issues with diverging replicas: a write can fail. If I like something and it inserts and everything's fine, one of the writes can actually fail, even if I'm writing at quorum. And we like to use a read consistency level of one, because the people who are actually writing this data — the actors in the system — are not the same people who are consuming it. If I like your photo, it doesn't matter that it might be delayed by, you know, five seconds or something like that. So we want to keep using CL1, but then you run into these kinds of issues. So what do we do?
A
We
decided
that
super
column
was
really
confusing,
so
we
decided
to
come
up
with
this
thing
called
anti-column
and
an
anti-column
was,
is
basically
just
an
md5
hash
of
the
data
that
you
want
to
mark
as
deleted
similar
to
kind
of
a
tombstone
in
in
the
basic
use
of
cassandra,
but
it's
specifically
for
by
value
instead
of
by
by
name.
So
the
idea
is
that
the
first
component
of
this
column
name
is
either
a
zero
or
one
a
zero
is
for
empty
columns,
and
it
makes
sure
that
they're
ordered
at
the
beginning.
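(A minimal sketch of the anti-column idea — the composite layout and helper names here are illustrative, not the exact schema from the talk:)

```python
import hashlib

# Composite column names sort by their first component: 0 puts anti-columns
# (delete markers) at the head of the row, ahead of live activities at 1.
ANTI, LIVE = 0, 1

def live_column(activity_timeuuid):
    return (LIVE, activity_timeuuid)

def anti_column(serialized_activity: bytes):
    # Keyed by a hash of the value being removed, so an undo is a blind
    # write: no read-before-write, no race. At read time, any live activity
    # whose MD5 matches an anti-column is filtered out of the result.
    return (ANTI, hashlib.md5(serialized_activity).digest())
```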
So eventually everything converges and ends up with A, B, C in this case. Read-before-write is almost always a smell. Try to model your data as a log of user intent; thinking about things that way is a really helpful way to model data correctly for these types of situations.
Some random notes: we keep a 30 percent buffer for trims — this whole scheme is what allows us to do an undo without a read. By "buffer" I mean there's a little bit of extra room in there: when we're doing deletes, we don't try to keep the row at exactly 100. We say our target's 100; we might keep 125.
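(Illustratively, the trim logic might look like this — `timeuuid_to_micros` is a hypothetical helper, and the numbers come straight from the talk:)

```python
TARGET = 100  # "our target's 100"
SLACK = 25    # "we might keep 125"

def maybe_trim(session, user_id, columns_newest_first):
    if len(columns_newest_first) <= TARGET + SLACK:
        return  # still inside the buffer; trims stay infrequent
    # Take the write time of the oldest column we want to keep, then drop
    # everything older in one past-timestamp delete (no per-column tombstones).
    oldest_kept = columns_newest_first[TARGET - 1]
    cutoff_us = timeuuid_to_micros(oldest_kept)  # hypothetical helper
    session.execute(
        f"DELETE FROM activity USING TIMESTAMP {cutoff_us - 1} WHERE user_id = %s",
        (user_id,))
```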
This doesn't work for large lists, unfortunately — anything over a few hundred items, it just starts really breaking down. This is why I opened CASSANDRA-5527, to try to maybe get delete-by-value pushed into the core. That would obviously be great. We built this in two days; experience is really important, and experience from our first implementation was a huge reason why we were able to get this done.
So what was the actual rollout? We did it with Cassandra 1.2.3, and we did it with vnodes, because vnodes are great: they let you add and subtract nodes, and it's just so much better than doing it without them. Leveled compaction: the idea with leveled compaction is that it really helps boost your read performance, and one of the other benefits is that you can cram a lot more stuff onto a node.
That's in terms of space, because you don't need all that extra headroom that size-tiered compaction requires. Plus, we wanted to try it out for further use cases, so this was a way to prove that type of compaction out. It was a 12-node cluster; we used the SSD instances in Amazon, spread over three AZs, replicated three times. We wrote with an RF of two — I'm sorry, a consistency level of two. The idea there is we want durability.
If one node completely dies — it's EC2, so sometimes you can't just go back and get the disks — you can simply lose data if it lives on a single node entirely. So we write with a CL of two and read with one. 8 gig heap, 800 meg new size.
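(In the modern Python driver, that write-at-two / read-at-one split looks roughly like this; the statements reuse the hypothetical `activity` table from the earlier sketches:)

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Durability: the write must land on two replicas before it is acknowledged.
insert_activity = SimpleStatement(
    "INSERT INTO activity (user_id, activity_id, payload) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.TWO)

# Latency: one replica is enough for reads; a few seconds of staleness on
# someone else's like is acceptable here.
read_feed = SimpleStatement(
    "SELECT payload FROM activity WHERE user_id = %s LIMIT 100",
    consistency_level=ConsistencyLevel.ONE)
```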
Our rollout: we did this by basically turning on double writes while we were migrating data, and we tested it all with shadow reads — we were reading from both Cassandra and Redis at the same time, and we could crank that up and crank it down. So we got a really good idea of how this was going to behave in production before we cut it over. And then obviously we verified that the cluster could take the reads and that everything was fine and responding correctly.
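(A sketch of that shadow-read pattern; `redis_get_feed`, `cassandra_get_feed`, and `record_shadow_mismatch` are hypothetical stand-ins for the app's own functions:)

```python
import random

SHADOW_READ_RATE = 0.10  # the knob that got cranked up and cranked down

def get_feed(user_id):
    primary = redis_get_feed(user_id)         # still the system of record
    if random.random() < SHADOW_READ_RATE:
        shadow = cassandra_get_feed(user_id)  # exercised, but never served
        record_shadow_mismatch(user_id, primary, shadow)
    return primary
```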
Everything went pretty well. We ended up bumping our heap size to 10 gigs because of heap pressure. That's probably not a problem anymore with 1.2.5; there are some changes in there that reduce heap usage.
A couple of things we ran into: bootstrapping sucked, because the default SSTable size is five megs, and pushing ten thousand SSTables through compaction takes like a day. You can cut that down a lot by going to larger SSTables — everybody I've talked to who's using leveled compaction in production is doing something like 100 meg SSTables — so we'll probably eventually bump ours up to 50 megs.
After that, a little change: I came in on Monday and found a node that was unable to flush and had spawned 8,000 commit log segments. I still haven't figured that one out — I wish I had — but the bigger problem was that I did a node rebuild incorrectly. This is how you normally do a node rebuild, especially in the old world without vnodes: you stop the node.
You move the data out of the way and you start it again, and everything's fine: it'll re-bootstrap its data because it has the initial token — the best practice is to save this in your cassandra.yaml file — and it'll come back up and everything will basically be the same way it was. The problem: when you start a node in a vnode cluster, it randomly generates its tokens at startup, and this is by design — it's a good thing.
A
The
problem
is
that
cassandra
uses
the
ip
address
of
the
node
as
basically
as
the
primary
key.
So
if
I
connect
to
a
cassandra
cluster-
and
I
start
talking
to
the
nodes-
it's
going
to
say-
oh
rick
is
1.2.3.4-
I'm
going
to
trust
him
for
all
of
the
any
of
the
gossip
data
about
what
data
he
has
so
story
short.
What
happened?
So we did these kind of evasive maneuvers. We added a stat for people that had empty inboxes, so we'd basically be collecting those metrics, and at our app level we cranked the read repair chance up to 1.0, so that every single time you did a read, it would repair some of the data in the background. Again, these were very much evasive maneuvers.
A
So
we
kicked
off
no
tool
repair
to
try
to
permanently
fix
all
these
issues
and
we
waited
and
we
waited
and
we
waited,
and
we
found
that
this
doesn't
work
when
you're
using
level
compaction
plus
v
nodes
repair
just
is
completely
falls
apart
and
there's
good
reason
for
this.
I
was
wondering
why,
and
I
was
seeing
a
bunch
of
of
these
tasks,
queue
up
on
the
thread,
pull
stats
and
I
saw
it
you
can
do
this
thing.
Basically,
where
you
issue
a
kill
three
to
the
cassandra,
node
and
it'll.
That does what's called a thread dump, where it tells you what every single thread is doing in the JVM, and I found this guy. I did this on several nodes, over a period of time, and I kept seeing it over and over. It turns out that every repair task was trying to scan basically every SSTable file, which, as you can imagine, is not the most efficient way to do things.
A
The
reason
why
this
is
a
problem
with
size
to
your
compaction
is
you
only
have
a
few
dozen
asses
tables
so
scanning
them
all
is
no
big
deal.
The
reason
why
it
does
it
doesn't
matter
when
you're
not
using
v
nodes
is
that
you
are
doing
repair
once
per
token.
If
you
only
have
one
token
like
you
do,
when
you
don't
have
v
nodes,
it
doesn't
really
matter
because
you
really
have
to
scan
all
the
data.
A
You
have
to
really
scan
the
whole
data
set
anyway,
when
you
do
have
v
nodes,
it
means
you're
doing
it
256
times.
So,
basically,
we
built
the
patch
submitted
it.
It's
in
1.2.6,
20x
increase
in
repair
performance,
and
that's
probably
a
pretty
it
actually
works.
Basically
is
what
that
means,
instead
of
not
working
so
take
away
from
this.
If
you
want
v
nodes
in
lcs
use
1.2.6
when
that
patch
has
merged
in
so
where
were
we,
it
was.
A
It
was
the
the
takeaway
for
me,
for
this
was
that
it
was
a
really
bad
thing
that
we,
the
first
people
that
noticed
and
reported
the
problem.
Were
our
users
like,
we
should
be
able
to
notice
this
stuff
before
our
users
do,
and
so
it
was
kind
of
a
silent
it.
It
wasn't
like
it
logged
a
message
and
said:
hey
idiot,
like
you're,
you
just
did
something
stupid,
so
we
wanted
to
kind
of
catch
this
in
a
bunch
of
different
ways.
A
So
what
we
did
is
we
created
a
patch
that
met
that
measures
and
reports
the
number
of
times
data
when
you
do
a
read
repair
the
number
of
times
that
there's
a
mismatch
of
data
between
nodes.
We
set
basically
a
one
percent
sample
effectively
a
sample
rate
of
read
repair
so
that
one
percent
of
the
reads
that
we
do
do
this.
It turns out that in production, pretty steadily, we get a 99.63 percent rate of consistency on reads — and I don't want to get too deep into that figure anyway. So we alert on this now. We would have noticed the problem, because the rate would have dropped to like 90 or 95 percent, so we basically alert at around 97 and a half.
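(The alerting side of that is simple enough to sketch; the threshold is the one from the talk, and the metric plumbing, including `emit_alert`, is hypothetical:)

```python
SAMPLE_RATE = 0.01        # 1% of reads also check the other replicas
ALERT_THRESHOLD = 0.975   # "we basically alert at around 97 and a half"

def check_consistency_rate(matched_reads, sampled_reads):
    # matched_reads: sampled reads where all replicas agreed.
    rate = matched_reads / max(sampled_reads, 1)
    if rate < ALERT_THRESHOLD:
        emit_alert(f"replica consistency dropped to {rate:.2%}")  # hypothetical
    return rate
```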
Takeaway: if you rebuild a vnode box in a vnode cluster — they're working on making this a little bit better, and there are some better ways coming — you really want to build a new node and remove the old one with nodetool removenode. This is easier on EC2 than it is on real hardware.
A
Just
be
careful
when
you
do
this
in
general,
it's
running
very
well.
This
is
a
cpu
graph
for
the
cluster
there's
very
little
io,
because
their
ssds
are
awesome.
It's
saved
us
money
from
the
redis
cluster.
This
is
our
fetch
and
deserialize
time.
Measured
from
the
app
this
is
mean
versus
p90
trough
to
peak.
So
our
trough
is
something
like
10
milliseconds
and
our
peak
is
something
like
20
millisecond
mean
times.
This is what our cfstats look like. Read latency is just under two milliseconds, and write latency is something Ganglia doesn't even want to graph — 0.031, or let's just say 31 microseconds. Compacted row mean size is 33,020 bytes, which just means the rows are relatively small. So, all right — thank you. That was quick.