From YouTube: BigStep: Cassandra’s Scaling Economics
Description
Speaker: Alex Bordei, Product Manager at BigStep
We all know Cassandra is supposed to scale, but what is its exact scaling pattern? How much faster does it get if you add an extra node? Is it truly linear? How sensitive is it to hardware constraints? Bigstep has benchmarked Cassandra in an attempt to understand how to size underlying infrastructure for optimum performance. We've tested using a custom JMeter sampler built with the Java driver on beefy bare metal machines with 192 GB of RAM, 10 Gbps networking and SSDs. We'll share our findings and other best practices for scaling NoSQL DBs in cloud environments, found by working with some of big data's most popular DBs.
So, I'm Alex from Bigstep, and this talk is going to be about the economics of scaling Cassandra.
So the question is: to what extent does this hold true for a particular set of technologies? Today this analysis is going to focus on Cassandra, or specifically DataStax. We're going to go over a little bit about the way we did the benchmarks, about scaling horizontally, scaling vertically, a little bit about virtualization and about Docker. I would really love to make this a conversation rather than, you know, a boring monologue, so please stop me at any time, raise your hand and ask me anything.
That's all we do: we're essentially a bare metal cloud, but we're different from SoftLayer or Rackspace in the sense that we have compute nodes and storage nodes, and we separate the compute nodes from the storage nodes. The storage nodes are all SSDs and they're distributed across the entire network; there's no direct connection between them. We have a very fast, high-throughput, low-latency network that uses cut-through switching, so we only look at the first few bytes of a packet before forwarding it to the appropriate port.
Because of this architecture, we can do things like cloning and snapshotting, and we can do them without any hypervisor. So, essentially, you have pretty much the same flexibility that you have in virtualized environments, but with bare metal. This means we're very good for essentially everything that runs in memory, because if you take virtualization out, you get a better performance level, and I'm going to show you exactly how much you can gain, not necessarily on our platform but in general, on virtualized versus bare metal machines.
We also partner with and deploy a lot of other technologies like Datameer, Elasticsearch and Splunk, and we've done a lot of benchmarking with them to try to understand what happens. So, essentially, we like to say that we're the fastest cloud in the world. It's a bold statement, but we are willing to put it to the test.
So, as I said, we have been doing a lot of testing to understand the technologies, partly because it's quite cool to play with them, and also because we get this constant, recurring question from our clients.
So, let's get into the more techy stuff. The benchmarks were done on fairly big machines. The target ones, the ones that actually hold the data, have 192 GB of RAM; the other ones have 128 GB. Both are dual-socket configurations with very fast CPUs and 10G networking: essentially two 10-gigabit networks in parallel, one for the RPC traffic and the other for the regular traffic. We used CentOS, and we used /dev/shm.
We also did a little bit of optimization with irqbalance and disabled transparent huge pages, and we used JMeter instead of Cassandra's built-in benchmarking tool. Have you used JMeter before, by the way? All right, cool. So we wrote our own custom sampler that uses the Java driver.
What we did, essentially, was insert about 10 million records into Cassandra, read them back, and then change every one of them. So we're doing three operations, an insert, a select and an update on those records, using 2,000 concurrent connections from all eight machines.
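The three operations in the workload can be sketched as CQL statements. This is a minimal illustration only: the table name `bench.records` and its columns are hypothetical, and the talk's actual load was driven by a custom JMeter sampler using the DataStax Java driver, not by string-built statements.

```python
# Hedged sketch of the benchmark's insert/select/update trio as CQL.
# "bench.records" and its columns are assumed names for illustration.

def make_statements(record_id: int, payload: str):
    """Return the three statements run against each of the ~10M records."""
    insert = (f"INSERT INTO bench.records (id, payload) "
              f"VALUES ({record_id}, '{payload}')")
    select = f"SELECT payload FROM bench.records WHERE id = {record_id}"
    update = (f"UPDATE bench.records SET payload = '{payload}-v2' "
              f"WHERE id = {record_id}")
    return insert, select, update

# The talk ran this trio over ~10 million ids, with 2,000 concurrent
# connections spread across eight load-generator machines.
ops = make_statements(42, "abc")
```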
So there are a lot of JVMs at work, and essentially this is what we got. The absolute figures are not necessarily important, although they are quite fast: those are milliseconds, single-digit and two-digit milliseconds, for the entire data set and 2,000 concurrent connections.
But if you add an extra node, you're not reducing that a whole lot; for four nodes the latency actually increases, and the same happens for selects and updates. So adding one extra node increases performance, but it doesn't necessarily double it. The same is true for throughput, and the throughput actually is amazing: we got up to 250,000 operations per second with this cluster of four nodes, which is very good, and we pretty much see the same ratios there.
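One way to quantify "increases performance but doesn't double it" is a scaling-efficiency ratio. The 250,000 ops/s four-node figure is from the talk; the single-node baseline below is an illustrative assumption, not a number the speaker gave.

```python
# Scaling efficiency: how close does an n-node cluster come to n times
# the single-node throughput? 1.0 means perfectly linear scaling.

def scaling_efficiency(single_node_tput: float, n_nodes: int,
                       cluster_tput: float) -> float:
    return cluster_tput / (n_nodes * single_node_tput)

# Assuming a hypothetical 80,000 ops/s on one node, the talk's
# 250,000 ops/s on four nodes would be sub-linear:
eff = scaling_efficiency(80_000, 4, 250_000)  # 0.78125, i.e. ~78%
```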
Okay, so keep that in mind for a second. Now, scaling vertically: essentially, we are comparing two instance types that we have. One is the 4-32, the smaller one, which has four cores and 32 GB of RAM, against the one that has 20 cores and 192 GB of RAM. And there is an increase: you would say almost twice the performance for the bigger one as opposed to the one with 32 GB and a single CPU.
Again, you have to remember this for a bit: roughly twice the power. Now, in order to really understand how that translates into money, we did what we call an analysis of the performance-to-price ratio, and in order to get a feeling for performance we had to build a composite metric out of all the other metrics that we got.
We ended up with something like this, and we divided it by the price of that particular configuration, and, long story short, it looks like this. By the way, this looks the same for other technologies as well: it looks the same for Couchbase, for Impala, for all the NoSQL technologies out there. You have this curve that drops, which tells you that the more nodes you have, the less efficiently you are spending your money, unfortunately. So actually the best-priced configuration is the one with the smaller CPU and less RAM, because that one costs about half a pound per hour, while the next one, the one with 192 GB of RAM, costs about two pounds per hour. So it's four times more expensive, but only twice the performance; hence the smaller one has a better price-to-performance ratio. Now, why does this happen?
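The argument above reduces to simple arithmetic: a composite performance score divided by hourly price. The two prices (roughly 0.5 and 2 pounds per hour) are from the talk; the performance scores below are illustrative placeholders for the talk's composite metric.

```python
# Performance-to-price ratio: composite score divided by hourly price.
# Scores are assumed (bigger box ~2x the smaller, per the talk);
# prices are the approximate figures given in the talk.

def perf_per_pound(composite_score: float, price_per_hour: float) -> float:
    return composite_score / price_per_hour

small = perf_per_pound(1.0, 0.5)   # 4-core / 32 GB box at ~0.5 GBP/h
big   = perf_per_pound(2.0, 2.0)   # 20-core / 192 GB box at ~2 GBP/h

# Twice the performance at four times the price means the small box
# is twice as cost-efficient:
ratio = small / big  # 2.0
```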
You might be asking, and some of you already know about Amdahl's law, and maybe some of you already know about hardware prices. Amdahl's law essentially tells you that, regardless of the technology you use, you'll always have something in common, something that cannot be parallelized, because no problem is, you know, embarrassingly parallelizable, so you will never be able to parallelize completely. The other factor is hardware prices. This happens for every component, but for the CPU it's easily verifiable.
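Amdahl's law, which the talk invokes here, puts a hard ceiling on how much n nodes can help when a fraction of the work stays serial. The 95% figure below is an illustrative assumption.

```python
# Amdahl's law: if only a fraction p of the work can be parallelized,
# n workers give at most 1 / ((1 - p) + p / n) speedup.

def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with an assumed 95% parallelizable workload, four nodes fall
# well short of a 4x speedup:
s = amdahl_speedup(0.95, 4)  # ~3.48x, not 4x
```

This is one reason the latency and throughput curves earlier in the talk bend away from linear as nodes are added.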
There's this website called CPU Mark, which benchmarks all the CPUs on the market and rates them, and you can also find the prices for the particular CPUs, either on Intel's site or on that same site. And if you take all the CPUs in this E5 family and you try to plot them, you can ask something like: hey,
A
If
I
have,
if
I
have
a
cpu
which
costs
twice,
will
I
have
you
know
twice
the
amount
of
power
and,
of
course
the
answer
is
not
by
you
know
not
by
far
apparently
the
you
know,
the
more
you
know
the
more
power
the
specific
cpu
has
you
get
like
an
exponential
increase
in
price
right,
so,
for
instance,
for
10k
whatever?
That
is
it's
a
10k
mark
for
the
e5
two
six
three:
it
costs
about
600
bucks
and
for
about
twice
that
much
for
the
e52670.
A
It
costs
31.
You
know
three
thousand
dollars,
so
it's
a
huge
huge
change
in
in
prices
and
the
same
happens
for
every
other
type
of
component
in
in
you
know
in
a
pc
in
the
server
all
right.
So
this
is
essentially
how
money
spending
efficiency
looks
like
you're
best
off
using
the
single
node.
If
you
can't
use
a
single
node,
because
you
shouldn't
use
a
single
node,
don't
try
to
overkill
it
with
two
big
machines.
So there's a tool that you can download and install, it's called sysbench, and you can use it to read and write one terabyte of data to and from memory, and you get something like this on a virtualized host versus a native one. This is on the latest hardware we could find. As for the reason, I'm going to get a bit lower level here.
I hope you don't mind. If you've ever written any C application, you know that the number you work with through pointers is not an actual address in memory; it's actually a reference into the virtual memory address space. Every process has its own virtual memory. Now, in order to do this translation between the virtual memory space and the actual physical space, the CPU and the operating system work together.
If you did this translation naively, you would need at least two requests to memory in order to actually access one memory location, so essentially it would double your latency to memory. Of course it doesn't happen this way; operating systems have implemented something called pages, or memory pages, which are segments of contiguous memory blocks that can be allocated to applications.
The typical size for a page is about 4K. Now, in order to do the same translation, but at page granularity, the operating system maintains something called the page table, where each entry has about 32 bits and describes the translation between the physical memory and the virtual memory.
Now, the problem with this approach is that, as I said, a page is 4K long, so mapping a 4 GB memory segment onto these pages means that your page table needs about 4 GB divided by 4K, which is about 1 million entries. So for every 4 GB of RAM you have to have about a million entries in the page table; for 40 GB of RAM that's 10 million entries of a few bytes each. So the page table grows bigger and bigger and bigger.
Whenever you have to access memory, you would have to look up this entire page table and do the translation, which is not efficient. Back in the 70s a solution to this was introduced, which is called the TLB, the translation lookaside buffer, and this is essentially a small cache in the CPU which helps the CPU do the translation without looking into the entire page table.
I know it's a lot to take in right now, but I'm getting somewhere. The thing with the TLB is that whenever you hit the TLB, it's fine: you do the translation.
The CPU does the translation for you and it's very, very fast. But whenever you hit something which is not in the TLB (and by the way, the TLB's typical size is about 200 entries, which is very small compared to the amount of RAM that we currently have) you get something called a TLB miss, and a TLB miss incurs a penalty which is proportional to the size of the page table, normally about 150 cycles, and this is on bare metal.
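The cost of misses compounds quickly. A sketch of the average translation cost: the roughly 150-cycle bare-metal miss penalty is from the talk, while the 1-cycle hit cost and the two miss ratios are illustrative assumptions.

```python
# Average address-translation cost as a function of TLB miss ratio.
# hit cost and miss ratios are assumptions; ~150-cycle miss penalty
# on bare metal is the figure given in the talk.

def avg_translation_cycles(miss_ratio: float, hit: float = 1.0,
                           miss_penalty: float = 150.0) -> float:
    return (1.0 - miss_ratio) * hit + miss_ratio * miss_penalty

low  = avg_translation_cycles(0.001)  # well-behaved, cache-friendly app
high = avg_translation_cycles(0.05)   # random-read big-data workload

# A 5% miss ratio makes translation several times more expensive
# on average, which is why random-read workloads suffer most:
factor = high / low  # ~7.4x
```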
So we are hugely inefficient for applications that do a lot of random reads, and guess which applications have the highest TLB miss ratio in the world. There's a study done by Intel called "Memory System Characterization for Big Data Workloads"; it's fairly new, and I encourage you to read it. They analyzed Hadoop tasks and NoSQL applications to see which ones have the highest TLB miss ratio, and they did something even more amazing: they put a chip between the actual motherboard and the memory DIMM to record the traffic.
The problem with this is that when virtualization came along, you had a virtual-to-virtual-to-physical memory translation. It's a third level of indirection, which of course blows up every performance benefit that you would get from the regular TLB cache, and that's why the first implementations of virtualization software, like QEMU, were incredibly slow. Now, people thought about this and tried to solve the problem in hardware.
Now that's good, but when you do context switching between virtual machines, from one VM to another VM, you have to flush the entire TLB, because the entries don't make sense for another VM; one VM has one state and the other has another. And whenever you flush the TLB you get a huge performance penalty: you have to rebuild the TLB, or at least repopulate it. So that didn't work either.
So now they have this technology called VT-x (by the way, the earlier one was still VT-x, just another version). What this VT-x addition does is modify the entries in the TLB cache so that they carry an additional tag by which you can identify which VM a particular entry belongs to.
So you don't have to flush everything out every time you do context switching between VMs. But the problem with that is that TLB misses now take something like 10 to 12 times longer to rebuild that particular entry, because it's a lot more complex to build a TLB cache entry. So essentially they sacrificed a particular type of application, and a particular type of memory access pattern, to get performance for smaller applications that use a smaller memory space, and this is something which is documented by VMware as well as by Intel and other vendors.
Essentially, it's a trade-off. So now, whenever you have a high TLB miss ratio, it's actually a lot more dramatic: you get even less performance from virtualized environments for big in-memory applications such as Cassandra, or Spark for that matter. So, long story short, I encourage you to avoid virtualization.
If you can't, there are solutions. If you want to have something else, of course Docker is one potential solution, not necessarily covering the same area as virtualization, but in certain cases it could help you, and we did a benchmark on this as well. Because people say that Docker runs at native speed, I wanted to make sure that's the case. We did the test by running a native node, then a native node with one Docker container running the same benchmark in it, then with two containers and four containers.
Apparently it's a bad idea to run multiple containers on an already saturated node, which of course I could have thought about before I did it. Anyway, you do get a bit of a performance penalty with Docker. I don't know yet why; we did a lot of optimizations, and we ran the Docker container in privileged mode.
I think it's because of the additional routing that has to happen in the machines, but it's there. I mean, it's almost at the same level: you get about 21 versus 19 milliseconds for the inserts, 18 versus 19 on the other operations, so it's barely noticeable, especially when you're not driving the CPU close to 100%. So I encourage you to use Docker if you have to use something. Now, this is the throughput, this is how the throughput looks, and it's pretty much the same story. Now, the conclusions: Cassandra is actually very fast.
The only thing that was faster in some particular workloads was Couchbase, but you can't really compare the two. Again, Cassandra is very fast, and it was fast on reads as well. People usually associate Cassandra with writes, you know, write performance and writes in general, but it's very good on reads as well.
I've also tested the in-memory feature that has been recently introduced, and it gets to pretty much the same numbers as I have here, because, remember, I used /dev/shm. But the problem with the in-memory option of DataStax is that it's limited to one gig.
So if you have a table which is bigger than one gig, you can't use the out-of-the-box in-memory option, at least right now, because it runs in heap, so you would also get a lot of garbage-collection-related issues. Maybe later on it could work, or you could do what I did with /dev/shm.
Use Docker if you want; it's not a problem, but don't try to use more than two containers. Again, the more nodes you add, the less efficiently you are spending your money. But we haven't taken into account IOPS and disk in general, and that's very much platform specific and workload specific, and if you have a very write-intensive or read-intensive, IOPS-bound application, things change.
Of course, you might want to consider looking at the disks a lot more, but that's a different, separate discussion. Now, again: the bigger the nodes, the less efficient you are on the performance-to-price ratio, but if you need the increased performance for that particular node, maybe reducing latency across the entire platform, go for those.
No, I haven't done any NUMA pinning for the Docker containers, but I know that Cassandra does it natively anyway, I think because we had numactl installed.
What happens is that you have to traverse this QPI link. We did tests on this previously, and there's a talk on O'Reilly that I did that explains how the QPI link affects performance.
But the idea is that whenever you're doing this cross-QPI-link memory access, you're at least doubling the latency of the memory access, if not making it four times slower. I've seen occasions, if you do that memory test with the sysbench tool that I've been showing you, where on a two-socket machine you can get way lower performance in accessing memory than you do on a single-socket machine, except if you do NUMA pinning. That essentially means there's this utility called numactl, and there's a daemon called numad, which you can use to pin a process to a specific NUMA node. NUMA stands for non-uniform memory access; it's the architecture of multi-socket machines. So whenever you do pinning, you will only access memory from that particular node, so you get a lot more performance, and happily enough Cassandra does it automatically if you have numactl installed, which is a great thing to have. And by the way, I also encourage you to use something like irqbalance, because by default operating systems don't distribute the IRQs, the interrupts, whenever you have, for example, network interrupts.
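The NUMA pinning described above can be sketched as a numactl invocation. The `--cpunodebind` and `--membind` flags are standard numactl options; the `java -jar app.jar` program is a hypothetical placeholder, and, as the talk notes, Cassandra arranges the equivalent itself when numactl is present.

```python
# Hedged sketch: build a numactl command that pins a process's CPUs and
# memory allocations to one NUMA node, so it never pays the cross-QPI
# memory-access penalty.

def numactl_cmd(node: int, program: list) -> list:
    return ["numactl",
            f"--cpunodebind={node}",  # run only on this node's cores
            f"--membind={node}",      # allocate only this node's memory
            *program]

cmd = " ".join(numactl_cmd(0, ["java", "-jar", "app.jar"]))
```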
There are driver stacks for the network cards which can be used to move access from kernel space to user space, so you're not doing context switching between kernel space and user space, and apparently that really pumps up performance, because you're not using interrupts anymore; you're using a poll model, or an event-based model if you like. It increases your performance on the network side, lowers the latency and reduces CPU usage.
So you can use this new generation of cloud providers like us, which essentially provide you with pretty much the same level of flexibility. As I said, you can do cloning; you can hit a button and in 10 minutes you have an instance up and running, and then after 10 minutes you can shut it down. You have access to the actual physical machine via a KVM-like interface; you get to see the screen of the server.
If anything fails, that is. So there's this new breed of cloud providers which has come along to try to solve this problem of performance, the performance penalties induced by virtualization, while giving you pretty much the same level of flexibility. I can't say for certain that it is exactly the same level of flexibility. For instance, what you can't do right now is live migration between hosts with physical machines; actually, we're working on that.
But right now there's no way of doing live migration; you have to do restarts or reboots. What you can do, though, is upgrades. Say your machine has 192 GB of RAM; if you want to switch to another box with, say, 256 GB, you can do just a restart and reboot on another machine. So essentially what happens is that the virtualization layer tends to move closer and closer to the hardware.
I mean, the things that virtualization accomplishes now will be done in hardware later on anyway; not in the sense that the hardware will help the software do this, but rather that the hardware itself will enable virtualization-related paradigms to happen. And we already see server providers that automate servers with REST APIs.