From YouTube: FamilySearch: Performance Tuning Apache Cassandra in AWS
Description
Speaker: Michael Nelson, Development Manager at FamilySearch
A recent research project at FamilySearch.org pushed Cassandra to very high scale and performance limits in AWS using a real application. Come see how we achieved 250K reads/sec with latencies under 5 milliseconds on a 400-core cluster holding 6 TB of data while maintaining transactional consistency for users. We'll cover tuning of Cassandra's caches, other server-side settings, client driver, AWS cluster placement and instance types, and the tradeoffs between regular & SSD storage.
Let's see. So I would be remiss if I didn't make just a short plug for why we do this. The Mormons have a doctrine that's a little bit different from many churches: we believe that all of mankind, even those who did not have the opportunity to hear about Jesus Christ, can be saved. And I'll say one more thing on that: if you have ever been up in the middle of the night and wondered what the purpose of life is, why we're here, whether there really is a God, whether he knows you and cares, I assure you that the answer to all of those is yes, and that there are logical, rational answers to those questions that bring peace. If you want more on that: mormon.org, or come see me afterwards.
Okay, so the Family Tree application currently has just over 900 million persons in it, with about 500 million relationships. We record things by changes, and I'll talk about the data model here in just a moment. There are 8.4 billion change-log entries, and about a million changes a day come into those. All of that equates to currently about seven terabytes in Cassandra, 13 terabytes in Oracle.
Our current system is on Oracle, and we have had an intimate and painful relationship with Oracle for quite a while now. When you're in a pinch and you need to scale Oracle, it is possible to fork over lots and lots of money for hardware and scale up, and that's what we have done for a number of years. So my group is a research project to get off of Oracle onto Cassandra.
So we went through a progression here. We first did a bunch of testing, decided we needed to do a side-by-side with Cassandra, and did that. We went with Cassandra for two main reasons: better performance, and administratively it was better; it was more stable.
So we are an online transaction processing (OLTP) system. Users expect their changes to show up right away, at least for them, so there's a transactional side that we have to maintain, and we have some interesting data-related performance issues. You might think you could assume that people usually don't have more than, say, 20 children.
However, everybody here probably has stories about crazy data that you've encountered. Well, we have people with thousands of children, and yes, there are interesting histories on how that happens, but there you go. Similarly, people have one set of biological parents, but again, through oddities in how people use the system, we come up with dozens of parents. So we have these issues that have been painful to resolve, because people still want the page to render and at least show them what's going on.
So, a couple of screenshots. This is what we call a fan chart. You see, there's me down there in the middle; that's the circle. I have two parents; that's the next ring. They each have two, and this goes back nine generations, so there are 511 slots on there. We want it to render like that, everybody's used to that, and you have to walk those relationships to generate that chart.
This is actually done through a partner; we provide APIs for everything we do, so we have many partners that offer things based on our data. Here's another screenshot. This is the interactive pedigree; the other one kind of gives you the forest, and now you drop down to the trees. There are two red rectangles here: the one on the left is kind of the default view when you come on, and then you can expand.
We currently have big performance issues trying to render this, just in pulling enough records out of Oracle and walking them to render it, and that's what we are out to fix, and scale at the same time. Here's a shot of the person page; again, those families can be an issue. It's not uncommon to have to load dozens of people just to show the spouse, children, parents, and siblings of a person on there. And then we've had some real issues with the change history.
Okay, the data model we use. This is our new data model, and it's event sourced. That means there are actually two data models: one for writing and one for reading. The journal, pictured here on the left, is a series of journal entries, and we always write to that. An example of a journal entry would be a change like "change the name of this person from Mike Nelson to Michael Nelson," or "add this relationship to a person."
Any kind of change always gets written to the journal as journal entries. They are immutable; they never change. So if you want to fix the record, you always append another journal entry. That does a couple of things: we're using time-based UUIDs as the key on those, and that means you can be inserting into the journal.
Multiple application servers can be inserting at the same time, and the entries will always resolve to a coherent view, so there's no locking at all necessary. So that's the writing. When you go to read, that always comes out of what we call views; there are two of them shown here for person one, views A and B. We actually have three views currently that we compute. One is a full person view.
That's the current snapshot of the person: all the information about them and everyone they're related to. There's a summary view, which is just basic information (name, birth, death, and the places for those things) and, again, who they're related to. And there's a change history view that we compute. So what that does is put all the business logic into when we write and when we compute the views; then, when there's a read, Cassandra has exactly what we need, and it's a simple lookup-and-return in every case. So we get the benefits of the denormalization there: the view has exactly what we need.
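As a rough illustration, here is a minimal sketch of that journal-plus-views shape using the DataStax Java driver. The keyspace, tables, and columns are invented for the example; they are not FamilySearch's actual schema.

```java
// Minimal sketch of the event-sourced model described above (illustrative
// schema, not FamilySearch's). Requires the DataStax Java driver 2.x.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class JournalSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("tree"); // hypothetical keyspace

        // Writes always append an immutable journal entry. The time-based
        // UUID key lets many app servers insert concurrently, no locking.
        session.execute(
            "INSERT INTO journal (person_id, entry_id, change_json) VALUES (?, ?, ?)",
            "person1", UUIDs.timeBased(),
            "{\"op\":\"rename\",\"from\":\"mike nelson\",\"to\":\"michael nelson\"}");

        // Reads never replay the journal; they hit a precomputed view row,
        // so each read is a single lookup-and-return.
        Row summary = session.execute(
            "SELECT summary_json FROM person_summary_view WHERE person_id = ?",
            "person1").one();
        System.out.println(summary.getString("summary_json"));

        cluster.close();
    }
}
```

The time-based UUID (`UUIDs.timeBased()`) is what lets concurrent application servers append without coordinating.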
Okay, so we're talking performance, so you need to know some basic information about us. Seventy-seven percent of our traffic is reads and 23 percent is writes, so that's pretty typical; we're very read-heavy. Our reads are almost always at consistency level ONE; there are a few cases where we're not. And our reads are very simple, while the writes are much more complicated: we're using atomic batches, and QUORUM as the consistency level for the writes.
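A hedged sketch of what that write path might look like with the DataStax Java driver: a logged ("atomic") batch executed at QUORUM. The tables and the JSON payload are illustrative assumptions, not the real schema.

```java
// Sketch (not FamilySearch's actual code) of the write path: several
// related rows updated in one logged, "atomic" batch at QUORUM.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class AtomicWriteSketch {
    static void addParent(Session session, String childId, String parentId) {
        PreparedStatement journal = session.prepare(
            "INSERT INTO journal (person_id, entry_id, change_json) VALUES (?, ?, ?)");
        PreparedStatement views = session.prepare(
            "UPDATE person_view SET parents = parents + ? WHERE person_id = ?");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        // One logical change appends a journal entry and touches the views
        // of every affected person; just the child is shown, for brevity.
        batch.add(journal.bind(childId, UUIDs.timeBased(),
            "{\"op\":\"addParent\",\"parent\":\"" + parentId + "\"}"));
        batch.add(views.bind(java.util.Collections.singletonList(parentId), childId));
        batch.setConsistencyLevel(ConsistencyLevel.QUORUM); // QUORUM writes, CL.ONE reads
        session.execute(batch);
    }
}
```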
So it's not uncommon for some operations to involve a dozen rows. For example, if you're adding a parent on a person, it's going to affect multiple people and multiple relationships, and the views for all of them, so there are a number of tables involved. And then there's business logic: there are some pretty basic things that we have to enforce, and they can be quite complicated to actually enforce. For example, we will not allow you to enter a relationship that will make you your own grandparent; that's obviously incorrect. We won't allow that.
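That rule is essentially cycle detection over the ancestor graph. Here is a minimal sketch of the kind of check it implies; `parentsOf` is a hypothetical lookup standing in for the real relationship views, and the actual business logic is richer than this.

```java
// Sketch of the "you can't be your own ancestor" rule: before linking
// childId -> newParentId, walk up from the proposed parent and reject the
// edge if the walk ever reaches the child.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class AncestryCheck {
    static boolean wouldCreateCycle(String childId, String newParentId,
                                    Function<String, List<String>> parentsOf) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        toVisit.push(newParentId);
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            if (current.equals(childId)) {
                return true; // the child would become its own ancestor
            }
            if (seen.add(current)) { // dozens of parents happen; visit each once
                toVisit.addAll(parentsOf.apply(current));
            }
        }
        return false;
    }
}
```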
Okay, so we've done a series of tests earlier this year on a 28-node cluster, learned a lot of things, and did a lot of optimizations on the application, and I'm just going to talk here about the optimizations that were Cassandra-related. Overall, we ended up with about a 10x improvement in what we could scale to, but I'm not going to share (maybe you'd be interested, but I'm not going to share) the application-specific optimizations.
So the tests earlier this year were on a big cluster: 28 nodes on Amazon, all of them hi1.4xlarge instances, so that's 450 cores in that cluster, and we got to 250,000 ops per second. Then I went back and did a series of tests in preparation for this presentation on an eight-node cluster, with all of the optimizations that I'm going to show you here.
At least most of them; most of them were not application-level. I was able to get up to 200,000 ops per second with just the eight nodes, so a little optimization goes a long way here, and those hi1.4xlarge instances are very expensive, so there's a lot of cost savings there. Okay, so here's the test system that I used.
I put the app servers up here, and purposely used more app servers and more load agents than were required, because the purpose was not to test the app here; I was trying to push Cassandra, so I wanted to make sure there was plenty of CPU and network for the app. And I should mention we're using Borland SilkPerformer to generate the load; we've also done some stuff with JMeter.
What I found overall: the two big gains for us were the row cache and making the driver token aware. And overall, Cassandra has been pretty stable for us. We've had a few incidents; we have our first Cassandra cluster in production right now (that is not Family Tree, not this app, it's a different one) and no big incidents yet. Overall, I've been pretty impressed with the stability. We're using SSDs for everything here, and the big advantage that we've found with SSDs is that repairs become a non-event.
So overall, it's about a 2x throughput increase. Okay, so: row cache. Yeah? I'm sorry? It's about seven terabytes on here, and replication factor three.
Yes, yes. So, the row cache. We have a good-sized data set, 900 million persons, but the working set is not nearly that size, and so on an eight-node cluster we can fit the working set in RAM; that's why I believe the row cache was such a big benefit for us. We put the views, which are small, in the row cache, not the journals themselves, and the reads are consistency level ONE. So that means all of that can live in the row cache, and we get hits and a nice bump from the row cache.
Latency went from 11 milliseconds down to about seven, with a 30-35 percent bump in how much the cluster could handle. And I should mention I was pushing the cluster as hard as I could in every case here: I took the load agents and said, push it just as hard as you possibly can. So all the nodes go yellow and OpsCenter thinks we're sick, but we get as much out as we can.
How many columns? In the views there are typically only five columns, I think. The journals are different: they have a column for every journal entry, so there can be thousands in those. What was your first question? Row size? For the views, it depends on the column family.
Okay, so the row cache has worked. The question was: DataStax has discouraged using the row cache, so what's our experience? Our experience has been great: a 35 percent bump in throughput, latency is down, and we haven't experienced any problems with it. A lot of that is that (they talked about this this morning, Jonathan did) it'll bring the whole row in, so if you have really big rows, you have an issue; we only put our views, which are small and contained, in there.
So it's worked great for us; I've been surprised to hear them talk it down. What was that? Yeah. Yes, we are very fortunate that we fit into it well. And the underlying theme, I think, of this whole thing ought to be: you need to test your app. Every app is different, and you need to go try it.
So here's how to configure the row cache; there are two sides to it. In cassandra.yaml you need to enable it: you have to tell each node how much memory to set aside for the row cache, and it keeps it off-heap. We did 32 gigs, and so at eight nodes times 32 gigs we had roughly 256 gigabytes of cache to use. And then you have to turn it on for the individual tables, and so, as I mentioned, we did it on our tables that have small rows, not on the big ones.
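Concretely, the two sides map to settings like these. The yaml key is the standard Cassandra 2.0-era setting; the table name is an illustrative stand-in, and note that the per-table syntax changed in Cassandra 2.1.

```
# cassandra.yaml, per node: memory set aside for the off-heap row cache
row_cache_size_in_mb: 32768

-- and per table, in CQL (Cassandra 2.0-era syntax; illustrative table name):
ALTER TABLE person_summary_view WITH caching = 'rows_only';
```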
And then here's a screenshot from OpsCenter when we were running. Okay, so one point to make here: you see we're getting about a 90 percent hit rate on the row cache, and the disk utilization, over on the top left, is zero, because everything fit in memory. Now, the interesting thing is that everything fit in memory before we turned on the row cache as well, so it was all coming out of the buffer cache.
Linux had cached all of the files into memory, so before we turned on the row cache, every read was hitting the key cache; Cassandra would know where the row was and go straight to the right spot in the files, and those were already in memory because Linux had cached them. So everything was in memory both before and after, and yet we got a 30-35 percent increase in throughput and a big reduction in latency with the row cache.
Okay, so that's the row cache; that was the first big win on Cassandra tuning. The next one was when we went token aware. This is another one where, although I've had DataStax people talk down going token aware, we saw a big improvement from it. With Cassandra, as I'm sure you know, when you go to do a read, the default load-balancing strategy is round robin: the client will pick a node, contact that node, and give it its query.
That node will then act as a coordinator: it contacts the replicas of that data, gathers the results, and returns them to the client. That network hop out to the replicas, and just the coordination of talking to multiple nodes, means you're adding to the load of those machines and introducing network latency. So cutting that out gave, again, a big reduction in latency, from seven-millisecond reads down to two-millisecond reads, and fifty percent more throughput.
That was when we went token aware. With token awareness, the client gets from the cluster which ranges of the keys are in which vnodes and where those live, and then directs the query to one of the replicas. That means the node it contacts has the data and, if it's consistency level ONE, can answer the question without consulting any other nodes. So you get a big gain on that one. So here's how to configure it.
You wrap the token-aware policy around a round-robin policy: the outer wrapper provides the token awareness that's needed, and then the round robin inside of that means round-robining between those three replicas. We've had a little bit of back and forth with the other team that's using Cassandra internally around round robin versus latency aware, where, with that one, the client will keep track of the latency to each node and direct the query to the one that has been responding the fastest.
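In the DataStax Java driver, that wiring looks roughly like the sketch below. The contact point is a placeholder, and the commented-out line is the latency-aware alternative he mentions.

```java
// Sketch of the client-side change: token-aware routing wrapped around
// round robin, so queries go straight to a replica for the partition key.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("10.0.0.1") // placeholder seed node
            .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
            // The debated alternative, which tracks per-node response times:
            // .withLoadBalancingPolicy(
            //     LatencyAwarePolicy.builder(new RoundRobinPolicy()).build())
            .build();
        // With token awareness, a CL.ONE read can be answered by the replica
        // it lands on, skipping the extra coordinator hop entirely.
        System.out.println(cluster.getMetadata().getClusterName());
        cluster.close();
    }
}
```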
Okay, finally, the last thing, and this was an incremental gain, a five percent throughput improvement: when we bumped up the concurrent reads. This one really surprised me; I thought for sure we would need to bump these up earlier. We went from 32 to 250 (I'm sorry, to 128; nope, 256) concurrent reads and concurrent writes allowed, and bumped up the native transport max threads to 256.
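In cassandra.yaml terms, the knobs he's describing are these; the values are the ones from the talk, and concurrent_reads and concurrent_writes default to 32.

```
# cassandra.yaml: concurrency settings raised for this workload
concurrent_reads: 256
concurrent_writes: 256
native_transport_max_threads: 256
```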
We have not changed that number; nope, we haven't. I was talking to the Sony guys earlier, and they've had some real issues with vnodes and lots of stuff. Are you from Sony? You're smiling.
Okay, so this has led to a little bit of a dilemma now, in that we're getting adequate throughput and we're probably not going to push a whole lot harder on this. But still a mystery for me is: where is the bottleneck? I haven't gotten to the bottom of it.
The network side has been a bit of a mystery. The highest I have ever gotten Cassandra to use in these clusters has been 800 megabits of the network, and hi1.4xlarge instances have a 10-gigabit network. They were deployed in the same, hmm, deployment request? That's not the right term. Placement group, there we go: the same placement group, so we really can get high throughput there. I've done iperf between all the different nodes and measured it.
Okay, so a little more on the network; and yeah, the network seems suspicious to me. There's this five-second cycle that I keep seeing in netstat, and I have tried, you know, they talk about tweaking these parameters, and I've tweaked them and have not been able to measure any impact from any of those.
And the client machines seem okay at the same time. So I'll show you that here. Let's see: the top of this is from tc, the traffic control utility, and you can see right here, this means that at this moment there are 870 packets in the queue waiting to be processed (that's the backlog) and two and a half megabytes.
The send queues are almost empty, and the receive queues have a bunch backed up in them. A couple of seconds later, it's looking fairly normal; this is kind of how you'd expect a healthy running network connection to look. I wish I could explain it; I can't. Amazon has their new enhanced networking, and I haven't played with that yet, so maybe that will help with this type of thing.
This is a hi1.4xlarge, so there are 16 CPUs, 60 gigs of RAM, and, oh yeah, the 10-gigabit network. The application machines are all moderate-network machines, but there are plenty of those, so there should be enough, and I can put iperf between them and the cluster and get lots of bandwidth. So I do not believe it's a simple lack of bandwidth.
Okay, so that is pretty much the results of what we have found as far as the Cassandra optimizations. We tried some other things that didn't get us anything, so I didn't include those. And I would emphasize again that your mileage will vary. Your application is different: the mix of queries, how many writes, the types of writes, the complexity of the writes, the types of data, the access patterns on the data. All of those things vary, and so you've really got to try your app and see what you find. Any questions?
Does any part of the application use a blob? Yes; almost all of our data is blob-like. We're storing it in text columns, but it is JSON: hex-encoded, compressed JSON that we're putting in there.
One at a time? We have found good success doing many parallel asynchronous reads; that's what we've done.
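A minimal sketch of that pattern with the DataStax Java driver: fire all the view lookups at once with executeAsync and collect the futures, instead of loading the dozens of related persons one at a time. The table and column names are illustrative assumptions.

```java
// Sketch of "many parallel asynchronous reads" against the person views.
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class ParallelReadsSketch {
    static List<Row> loadPersons(Session session, List<String> personIds) {
        PreparedStatement stmt = session.prepare(
            "SELECT person_json FROM person_view WHERE person_id = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (String id : personIds) {
            futures.add(session.executeAsync(stmt.bind(id))); // non-blocking
        }
        List<Row> rows = new ArrayList<>();
        for (ResultSetFuture future : futures) {
            rows.add(future.getUninterruptibly().one()); // block only to collect
        }
        return rows;
    }
}
```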