Description
Speakers: Seán O Sullivan, Service Reliability Engineer, & Tim Czerniak, Software Engineer, at Demonware
This presentation covers the eight-month evaluation process we underwent to migrate some of Call of Duty’s core services from MySQL to Cassandra. We will outline our requirements, the process we followed for the evaluation, decisions we made around our schema, configuration and hardware, and some issues we encountered.
A: So, what have we been working on this year? We did a bunch of stuff for Call of Duty: Advanced Warfare, which released not very long ago and is doing pretty well. Diablo III: Reaper of Souls on the consoles — we didn't work on the PC version, only on the console versions — we did a lot of stuff for that earlier this year. You've got Skylanders Trap Team, which came out not too long ago; we do a lot of stuff for the little figures that go on the portals and appear in the game, and all the kids love it. And then there's also a title in China based on Call of Duty, which is releasing just for the Chinese market early next year, that we've been working on as well. And there's also Destiny, which isn't up there, but Activision works with Bungie, who originally did Halo, and they're working on Destiny, which is like a big MMO. We also helped out with that.
A: So, some of the stuff we've done in the past: every single Call of Duty since Call of Duty 3 — there's one every year and we've had a big hand in all of them — a bunch of Guitar Hero games, DJ Hero, Band Hero, all the Heroes, GoldenEye and all the 007 games, Skylanders, and a bunch of other stuff.
A: I think we're at around our hundred-and-second title or something that we've been involved with at this point — lots of stuff. So we provide services, things like matchmaking. You want to play a game: when I shoot someone in the face online, I say "I want to shoot someone in the face", and someone else goes "hey, I want to shoot someone in the face", and they both talk to the matchmaking service, and then we match them up and they go shoot each other in the face.
A: We also do things like leaderboards, file storage, progress storage — which we'll hear about a little bit — leagues, social network integration, etc. We have about a hundred services or more; I can't even remember at this stage. Yeah, lots of stuff.

So, some of the technologies we use: we use C++ for the client, because of course games are all written in C++. We give the studio that's making the game a library written in C++, they integrate it into their game, and hey presto.
A: We also use a lot of HTTP-based stuff: we integrate with websites, and we have lots of RESTful web services internally, that kind of thing. On the server side we use a lot of Python — mostly Python, Erlang as well — MySQL mostly for databases, and we run everything on CentOS and use Puppet for automation. Those are the main players; we have a lot of other technologies in our stack. So we have a slightly unusual use case.
A: Most services — you know, you start a company, you start small, you get a few users, you might gain a bit of popularity, you gradually grow and gradually get more and more users, more and more people online. You might get a couple of spikes if someone writes about you on some famous website or something, but generally speaking it's a slow, long, uphill scaling-up. For us it's the exact opposite: on day one, a game launches.
A: Everybody goes online, and then gradually, over a long period of time — maybe a few years — it tails off. So you can see here, this is an online-users graph for one of the Call of Duty games. There's the release day; that first weekend is the peak number of online users the game will ever have, and that's Christmas, so you can actually see the weekends there as well, and the run-up to Christmas. So, in the words of Benjamin Franklin: by failing to prepare, you are preparing to fail.
B: So our predicament is that we typically have one title per DC — or, I should say, titles per DC — in separate clusters, so we don't generally share data between titles. That changed for Ghosts, where you had new platforms: PS3 and PS4, Xbox 360 and Xbox One. People would often want to play on the 360, then later buy the new copy on Xbox One, migrate, and keep on playing from where they left off. So we needed to share data across DCs, and MySQL is not great at doing that.
B: So we looked at using Cassandra — at using a non-relational database for some of our services — and we targeted it at services which were mostly non-relational, key-value-type services. The first service was the progress store. This is a kind of blob of data — we don't really know what's in it — but it's typically used for what level you are, some of your stats, and your loadout, so what guns you've equipped, that kind of stuff.
B: The read/write ratio for that is about 1:24, the value size is pretty small — around four kilobytes — and it's persistent data: it never goes away. You always want the loadout to be there, the level to be there, all the information about the player to be there.
B: The next one is presence. The presence service is: when you go online, you tell the service "I am online", and you write to that service a few times a minute; when people check which of their friends are online, they check the presence service. So it's essentially: is this guy online, which of my friends are online? Again, the read/write ratio for that is about 1:10 — you're reading quite frequently, but you're writing all the time to make sure the information is correct.
B: The data size is minimal — it's pretty much just the user's online status and some very small metadata — and it's transient data as well, of course. The messaging service is quite generic: it can be used for messaging, for game invites, for mails, for any of that kind of messaging-type stuff. Again, the read/write ratio is about 50:1, so you're constantly checking: do I have emails, do I have invites, has something changed, has someone sent me a message?
B: The next part is consolidation and expansion. As Tim mentioned, we typically have our biggest growth in the first few weeks and then it slowly eases off. That means we spend a large portion of our time consolidating hardware and making clusters smaller, because once Christmas is gone, for certain titles the biggest boom is over, so we need to reclaim hardware from them. And then expanding again: sometimes we'll add titles to current clusters, so we need to make the cluster larger.
B: So we needed to make sure that whatever we looked at, we could easily consolidate and expand the clusters without much effort. We've been using MySQL for approximately 10 or 11 years at this stage, so we're pretty confident we can do that well: we've got a lot of tooling around it and we've got a sharded setup, which means we can just keep on adding shards, so write throughput isn't an issue. We needed to make sure that whatever system we chose, we could manage it just as easily.
B: We need to be able to automate to the point where you can click a button and it will deploy a new server, add it to the cluster, and the cluster will expand — that's where it needs to be. And also, for the operations teams, we needed the case where, if a server died, they could easily replace it without much effort. And then of course the last requirement was throughput.
B: So we looked at the previous titles we had and estimated what the new one would require. Our performance requirements were: for the progress store, a million and a half requests a minute; for presence, a quarter of a million requests a minute; and for messaging, 850,000 a minute. Now, those are requests at the application layer, so each one could actually turn into two, three, four or more requests at the Cassandra level — the database level.
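As a rough illustration of that arithmetic, here is a tiny back-of-the-envelope sketch; the two-to-four fan-out comes from the talk, while the node count used for the per-node figure is purely hypothetical:

```python
# Rough capacity arithmetic for the figures quoted above.
APP_REQUESTS_PER_MINUTE = {
    "progress": 1_500_000,
    "presence": 250_000,
    "messaging": 850_000,
}
DB_OPS_PER_APP_REQUEST = 3   # midpoint of the 2-4 fan-out mentioned in the talk
NODES = 30                   # hypothetical cluster size, for illustration only

total_app_rps = sum(APP_REQUESTS_PER_MINUTE.values()) / 60
total_db_ops = total_app_rps * DB_OPS_PER_APP_REQUEST
print(f"~{total_app_rps:,.0f} app requests/s -> ~{total_db_ops:,.0f} DB ops/s "
      f"(~{total_db_ops / NODES:,.0f} ops/s per node across {NODES} nodes)")
```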
B: During the evaluation, one of the operations engineers came into the office with mismatched footwear, so the joke went around that he was highly available but not consistent. And this did require a shift in thinking — going from a relational database to a non-relational one, and the whole CAP theorem. We're used to MySQL; we're used to large, heavy, beefy boxes, and we know how to configure it, we know how to tune MySQL.
B: When you start doing NoSQL — Cassandra, or Riak, or DynamoDB-type things — you have to really think differently and realise this isn't just MySQL with a different API; you have to understand how the architecture actually works. So we shortlisted two of the available options: Riak and Cassandra. The main reason we shortlisted those was that they both — Riak with its ring and LevelDB backend, Cassandra with vnodes — make consolidation and expansion easy, and the tooling was pretty good.
B: We rewrote our application backend — it already supported MySQL, so we added Riak and Cassandra support. That made load testing much easier. For load testing we built two clusters on pretty production-like hardware. The first cluster was single-CPU, with SSDs — fast disks — and average memory, I think about 32 gigs. The second cluster, in a different DC, was dual-CPU, with spindles — so larger but slower disks — and high memory, around 96 gigs I think it was.
B: This was a deliberate decision we took, because we wanted to see where the bottlenecks were. If we just had one cluster and made a tweak or a change, we wouldn't know exactly whether it had fixed one bottleneck, and we wouldn't know where the next bottleneck was — whether it would be IO or CPU. Having two clusters meant we could actually see where certain bottlenecks were, make a configuration change, and see how that affected both types of hardware.
B: It also meant, since we hadn't yet bought the hardware for this project, that by comparing the two we could see what hardware we actually needed. We used our own software stack to load test — we didn't have special test software driving the load; we used our own, the normal software we run in production, modified to use Cassandra or Riak, so our load tests would be accurate. I think the most important thing we learned from load testing is that you really have to use realistic data and realistic user profiles.
B: You can check your development environment to see what calls are made, check your previous production environments to get a rough idea of the quantity of calls, and then actually emulate that. We have our own load-test clients which act as users and do the whole login/logout process. One issue: we didn't include peaks and troughs — users logging in, users logging out — and that bit us pretty hard, as Tim will go into later.
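A minimal sketch of the kind of load-test client being described: simulated users log in, issue reads and writes in roughly the ratios quoted earlier, and log out. The function names and timings are illustrative assumptions, not Demonware's actual harness, and — as the speakers note they should have done — a realistic version would also model peaks and troughs:

```python
import random
import threading
import time

# Illustrative read probabilities per service, loosely based on the ratios
# quoted earlier (progress ~1:24, presence ~1:10, messaging ~50:1).
SERVICE_READ_PROBABILITY = {"progress": 1 / 25, "presence": 1 / 11, "messaging": 50 / 51}

def fake_call(service: str, op: str) -> None:
    """Stand-in for a real service call; replace with actual client code."""
    time.sleep(0.001)

def simulated_user(user_id: int, service: str, session_seconds: int = 60) -> None:
    """One fake user: log in, issue a request mix, log out."""
    fake_call(service, "login")
    deadline = time.time() + session_seconds
    while time.time() < deadline:
        op = "read" if random.random() < SERVICE_READ_PROBABILITY[service] else "write"
        fake_call(service, op)
        time.sleep(random.uniform(0.1, 1.0))  # think time between calls
    fake_call(service, "logout")

if __name__ == "__main__":
    users = [threading.Thread(target=simulated_user, args=(i, "progress"))
             for i in range(100)]
    for t in users:
        t.start()
    for t in users:
        t.join()
```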
A: The tooling for Riak is actually excellent. It was the first one we did the load tests on, so we thought, okay, this is looking pretty good — everything is performing well — and we had also previously evaluated it for something else, so we thought, okay, this will be good, this looks pretty decent. But then we did Cassandra and it was way better, so it won in the end: the write performance was about four times better than Riak for our particular use case.
A: It also had more features: it had compound primary keys, it had partition and clustering keys, and that added to the available query set you could use. It had CQL, which is a bit more friendly — it was better for our developers — and it was a bit more mature. Riak was fairly early in its cycle at the time; this is maybe a year and a half ago.
A: So we were a little bit happier with Cassandra being a bit more mature. The only thing, I guess, that we were wary of with Cassandra was Java — a lot of our developers, and engineers in general, would run a mile if you mentioned Java. But the nice thing about Cassandra is that it really doesn't reveal its innards at all; you don't really have to deal with Java, you don't need to know that it's Java, which is great — it's the way it should be.
A: So that was good, and that was it — we just continued testing on the Cassandra cluster, twenty-four/seven. We have three offices — one in Dublin, one in Vancouver and one in Shanghai — so we were able to use that round-the-clock coverage to monitor the thing and do loads of soak testing, load testing, tweaking, etc.
A: You know, new technologies make you a little bit paranoid, so we planned: okay, we'll have a MySQL cluster but also a Cassandra cluster; we'll write to both, read from MySQL, and then maybe dial up the Cassandra reads; and we had this fancy disaster recovery plan where, if something terrible happened to the Cassandra cluster, we could eventually migrate the MySQL data back over into Cassandra. And it ended up being way too much — way too much overhead.
A: It was overly complex — too much overhead for our development and ops teams — and in the end we had load-tested so well that we trusted Cassandra, so we dropped that idea and just went straight to Cassandra.
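For illustration, the dual-write plan that was eventually dropped could be sketched roughly like this — a hypothetical wrapper, not code from the talk: writes go to both stores, reads come from MySQL, and a ratio lets you dial reads over to Cassandra.

```python
import random

class DualStore:
    """Hypothetical dual-write wrapper: writes go to both backends, reads are
    dialled over from MySQL to Cassandra via a ratio (0.0 = all MySQL,
    1.0 = all Cassandra)."""

    def __init__(self, mysql_store, cassandra_store, cassandra_read_ratio=0.0):
        self.mysql = mysql_store
        self.cassandra = cassandra_store
        self.cassandra_read_ratio = cassandra_read_ratio

    def put(self, key, value):
        # Dual write: keeping both stores in sync is exactly the operational
        # overhead the speakers decided wasn't worth it.
        self.mysql.put(key, value)
        self.cassandra.put(key, value)

    def get(self, key):
        if random.random() < self.cassandra_read_ratio:
            return self.cassandra.get(key)
        return self.mysql.get(key)
```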
So, the schema. For the progress store it was pretty easy — a perfect fit: it's key-value, you always know the key, it's mostly writes, and Cassandra loves that. Presence was a bit more relational — we had two tables — and we had an issue with TTLs.
A: You're writing that data, deleting it again a minute later, doing the exact same thing over and over and over — keep writing and deleting — so it was building up a load of tombstones in one of the partitions. And we actually hit a Cassandra bug there, which bit us pretty hard, but it turned out it had already been fixed — we just didn't have it on our cluster yet — so we duly updated our cluster.
A: That was another interesting thing we came across. So, the lessons we learned from the schema: keep it simple — it's not relational, you need to relearn things a little bit. Get your partition keys and your clustering keys right, because if you do, Cassandra will work very, very well and you won't have to worry about it. You need to understand how the data will be stored and how it will be accessed before you design your schema. Listen to Patrick McFadin.
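To make the partition-key and clustering-key point concrete, here is a hypothetical sketch of the two table shapes discussed — a key-value progress store and a presence table whose rows expire via TTL — issued through the current DataStax Python driver (the talk mentions they were still on the older driver, and all keyspace, table and column names here are made up; the real schema isn't shown):

```python
import uuid
from cassandra.cluster import Cluster  # DataStax Python driver

# Assumes a keyspace named "game" already exists.
session = Cluster(["cassandra-host"]).connect("game")

# Key-value progress store: user + platform form the partition key and the
# payload is an opaque blob -- a natural fit for Cassandra.
session.execute("""
    CREATE TABLE progress (
        user_id   bigint,
        platform  text,
        payload   blob,
        PRIMARY KEY ((user_id, platform))
    )
""")

# Presence: partition per user, one clustering row per session.
session.execute("""
    CREATE TABLE presence (
        user_id   bigint,
        session   uuid,
        status    text,
        PRIMARY KEY (user_id, session)
    )
""")

# Presence rows are refreshed a few times a minute with a short TTL, so they
# simply expire if the client stops writing; this constant write-then-expire
# cycle is the pattern that produced the tombstone build-up mentioned above.
session.execute(
    "INSERT INTO presence (user_id, session, status) VALUES (%s, %s, %s) USING TTL 120",
    (42, uuid.uuid4(), "online"),
)
```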
B: So, again, we started on 1.2 — DSC 3.0.4, I think it was — and at that stage at least, a lot of the settings that were there by default aren't really right for anyone. The hardware we were using was relatively high-spec and the defaults weren't right for that; and from the VM point of view — we also ran a dev environment on VMs — the default settings are wrong for that too. So don't just trust the defaults; definitely go and look at all the settings.
B: We took one pass through the config and changed the settings that seemed reasonable to us to change — one example being multi-threaded versus single-threaded compaction — and it mostly went fine, except for a fair few changes that didn't go so well, such as that compaction setting, where multi-threaded compaction was actually slower than single-threaded in our testing. It's a case of making one change at a time, doing a full load test, and comparing the results and graphs to your next test.
B: One other example — one of the early changes — was the SSTable size, where I think the default was 12 megs or something. We load-tested everything from 12 megs up to 512 megs and stuck with 192 megs as the best for us. I think the Cassandra guys have since changed their default to 160, so it's nice to see it ended up close anyway.
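For reference, the SSTable size being tuned here is presumably the sstable_size_in_mb option of leveled compaction (an assumption — the talk doesn't name the exact setting). Applying the 192 MB value they settled on might look like this, again on a hypothetical table via the Python driver:

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("game")

# Assumes leveled compaction; 192 MB is the value the speakers say tested
# best for them (they tried everything from 12 MB up to 512 MB).
session.execute("""
    ALTER TABLE progress
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 192
    }
""")
```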
B: For the actual hardware, we chose a kind of mixture of the two standard specs of hardware we run: two CPUs, two SSDs in RAID 1, 32 gigs of RAM and one-gig networking, and that's mostly due to the infrastructure we have in the data centre. We typically have an app-spec host and a DB-spec host. The app spec is typically high CPU, high memory and low disk — spindles or smaller disks — whereas the DB spec has large disks, lower memory (oddly) and also lower CPU.
B: So we merged those two specs to create this spec, and with this spec, for us, disk capacity is the factor we have to scale by — we already scale by disk capacity — but it does mean that in the future we can easily swap the disks, add in larger disks, and increase the capacity of the cluster.
B: There was a question earlier about RAID 1 versus RAID 0. The Cassandra guys, the DataStax guys and many people advise RAID 0 — you know, your nodes are expendable, don't rely on any individual node surviving. We decided that we don't actually lose nodes that often, and disks are probably the component that fails most frequently, so we went with RAID 1: should a disk go down, we get reduced performance on one node for a while, but we can survive it.
B: For monitoring, there's a check on GitHub — I think it's a Cassandra check which uses Jolokia — and that initial check is okay, but by deploying Jolokia, which exposes JMX metrics over HTTP, we can easily write our own Nagios checks in Python, and then quite easily do things like long-term and short-term checks. We actually look at Graphite data for short-term and long-term usage, averaged out over the day, and then do Nagios checks on top of that — has our read latency increased or decreased, that kind of thing.
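A minimal Nagios-style check in the spirit of what's described: poll Jolokia's HTTP read endpoint for a Cassandra JMX metric and exit with the usual 0/1/2/3 status codes. The port, MBean, attribute and thresholds below are assumptions for illustration, not values given in the talk:

```python
#!/usr/bin/env python
"""Nagios-style check: read a Cassandra metric over Jolokia's HTTP API."""
import json
import sys
import urllib.request

# Assumed values -- adjust for your environment.
JOLOKIA = "http://localhost:8778/jolokia"
MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"
ATTRIBUTE = "OneMinuteRate"          # reads/s over the last minute (assumed attribute)
WARN, CRIT = 5000.0, 10000.0         # made-up thresholds

url = "%s/read/%s/%s" % (JOLOKIA, MBEAN, ATTRIBUTE)
try:
    payload = json.load(urllib.request.urlopen(url, timeout=5))
    value = float(payload["value"])
except Exception as exc:             # any failure -> Nagios UNKNOWN
    print("UNKNOWN: %s" % exc)
    sys.exit(3)

if value >= CRIT:
    print("CRITICAL: %s=%s" % (ATTRIBUTE, value)); sys.exit(2)
if value >= WARN:
    print("WARNING: %s=%s" % (ATTRIBUTE, value)); sys.exit(1)
print("OK: %s=%s" % (ATTRIBUTE, value)); sys.exit(0)
```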
B: Jolokia is a fantastic way — for non-Java houses especially — a really great way to get at metrics and monitoring for Cassandra, because you can get everything over HTTP. You can also make changes: you can do JMX MBean writes, config or setting changes — you can tell it to flush something, all over the Jolokia endpoint. One of the interesting graphs we hit, which we found pretty weird, was the key cache hit rate; we monitored this for a fair while.
B: We knew it to hover around 90-plus percent and we were very happy with that, so we put in monitoring checks to alert if it dropped below 90. It wasn't until a few months after launch that we realised this metric is actually wrong — or at least it doesn't show you what you think it's showing: when you're doing compaction, every single read during the compaction hits the key cache.
B: So for periods of time, while compaction is running, you get a hundred percent key cache hit rate, which then skews your actual percentage — maybe you're really at seventy percent. The only way to get the real figure is by looking at the metrics per keyspace: looking at your reads per keyspace and your cache hits per keyspace, you can compute the actual rate. The single key-cache-hit-rate metric doesn't really work.
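A hedged sketch of that idea: sample the read count and the key-cache hit count twice and compute the rate from the deltas, rather than trusting the single gauge. The MBean names below are assumptions for illustration and will vary by Cassandra version:

```python
import json
import time
import urllib.request

JOLOKIA = "http://localhost:8778/jolokia"
# Assumed MBean/attribute paths -- check the names for your Cassandra version.
READS = "org.apache.cassandra.metrics:type=Keyspace,keyspace=game,name=ReadLatency/Count"
HITS = "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Hits/Count"

def read_counter(mbean_and_attr):
    url = "%s/read/%s" % (JOLOKIA, mbean_and_attr)
    return float(json.load(urllib.request.urlopen(url, timeout=5))["value"])

def hit_rate(interval=60):
    """Key-cache hit rate over an interval, computed from counter deltas."""
    r0, h0 = read_counter(READS), read_counter(HITS)
    time.sleep(interval)
    r1, h1 = read_counter(READS), read_counter(HITS)
    reads, hits = r1 - r0, h1 - h0
    return hits / reads if reads else float("nan")

if __name__ == "__main__":
    print("key cache hit rate over the last minute: %.1f%%" % (100 * hit_rate()))
```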
Alright — some of the gotchas we ran into. The first one was vnodes and rack awareness.
B: I don't know if this is still the case now — we're still on 3.0.6 at the moment — but vnodes and rack awareness don't, as far as we know, work very well together. We saw up to a thirty percent difference in data distribution when we had the two enabled, and again, as I mentioned, disk capacity was and remains our scaling factor, the limiting factor. A thirty percent disk-usage difference across nodes was a no for us. We were using blades.
B: Because of that, we had hoped to treat each chassis as a kind of rack and get data moved around that way, carefully, but we ended up having to abandon that and just go with vnodes without rack awareness in the end. Load balancers: we don't use the new Python driver yet — we're currently using the old Python driver from GitHub — and because of that we have a load balancer in front of Cassandra; all reads and writes go through the load balancer.
B: Because of that, we needed to make sure each client would disconnect after approximately 10 minutes and then reconnect, because otherwise, if we expanded the cluster, we'd start seeing problems: without the clients reconnecting, a client just keeps its connection open and never uses the new nodes.
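The periodic reconnect can be sketched as a small wrapper around whatever connection factory the client uses; nothing here is the actual Demonware code, and make_connection is a placeholder for the real (old-driver) connect call:

```python
import time

MAX_CONNECTION_AGE = 600  # seconds; the ~10 minutes mentioned above

class ReconnectingClient:
    """Recycles its connection periodically so that, after the cluster grows,
    traffic ends up spread across the new nodes behind the load balancer."""

    def __init__(self, make_connection):
        self._make_connection = make_connection  # placeholder factory
        self._conn = make_connection()
        self._connected_at = time.time()

    def _connection(self):
        if time.time() - self._connected_at > MAX_CONNECTION_AGE:
            try:
                self._conn.close()
            finally:
                self._conn = self._make_connection()
                self._connected_at = time.time()
        return self._conn

    def execute(self, query, params=None):
        return self._connection().execute(query, params)
```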
And of course — I'm sure a lot of people here have this — dev differs from production. Our development environment was VMs; our production environment was real machines, because of the cost of hardware.
B: So, close to launch, when we were pretty busy with things, all of a sudden we started seeing problems in our dev environment — a few nodes were crashing — and it ended up being that we were using the same settings in dev and prod. We needed to modify the settings pretty heavily in dev: reduce memory, put in compaction thresholds — because we had no thresholds on compaction — so we made a few changes there.
B: Eventually we moved our development environment onto production-class machines, so at least it would more closely match the production environment, and into a new data centre as well. We hadn't had a chance to load test the new data centre — it was still being built at the time — so we ran into a few issues there, where our Linux builds were running CPU frequency scaling by default. We were wondering why one was slower than the other — the newer, faster machines were actually slower than the older machines — and it was because the CPUs were being throttled down to one gigahertz or so.
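A small sketch of the sanity check that would have caught this: read the scaling governor and current frequency from the standard Linux cpufreq sysfs paths on each box (whether those paths exist depends on the kernel and driver):

```python
import glob

# Standard Linux cpufreq sysfs paths; present when frequency scaling is active.
for gov_path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
    cpu = gov_path.split("/")[5]
    with open(gov_path) as f:
        governor = f.read().strip()
    with open(gov_path.replace("scaling_governor", "scaling_cur_freq")) as f:
        cur_mhz = int(f.read()) / 1000  # sysfs reports kHz
    # A power-saving governor idling at ~1 GHz on a database node is the
    # symptom described above.
    print("%s: governor=%s current=%.0f MHz" % (cpu, governor, cur_mhz))
```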
And between data centres we're doing geo-replication.
B: So, launch. Launch was boring, thankfully. The week of launch, one of the development managers came over and requested that we simulate a node failure — he was a bit concerned: yes, we'd tested some of this in load testing, but we'd never actually done it on our production systems. So I was quite pleased to tell him that in the first week after launch we had a real node failure and there was no impact — users didn't notice. It was nice, then, to get two dead nodes for Christmas.
B: That did impact us a little bit, and we expected it to — not the node failures themselves; we didn't expect those to cause any problem or impact. The impact came from the repairs. We started repairs on the Cassandra cluster — again, we run two clusters: 30 nodes in each DC for the large cluster, and 16 nodes per DC for the smaller one. So when two nodes failed out of 30, we had repairs running in both data centres.
B: The repairs started hanging, which was quite common in the 1.2 branch, and we started seeing latency increase noticeably. At first that was fine, because our application can handle it, but it set off alarm bells for a while, and it meant that during Christmas, instead of saying "it's a failure we can deal with after Christmas", we had to actually go in there, fix it, and keep an eye on it while those nodes were repaired.
B: Also, onto other titles: this Cassandra setup was built for Call of Duty: Ghosts, and we've expanded since then. Diablo III: Reaper of Souls also uses it now, Call of Duty: Advanced Warfare uses it too, and we're planning to expand more titles onto the same clusters. Any new title we build will push more and more to Cassandra, because from a manageability and operations point of view it's a lot nicer and easier to use: you don't worry about sharding, and expanding and consolidating a cluster comes easy. Questions?
B: Yes — the repairs, how do we go about running them? We kind of break the canonical advice here. They say run repair once a week — run a repair at least once within gc_grace_seconds. We don't do that. Our nodes don't change that often, so we typically run repairs only if there's a change to the topology or we change a node; then we'll run repairs. We don't run repair very frequently.
B: When we do, we have it scripted ourselves — a script that we run from one node. It creates a screen session, creates a lock file, gets a node list, hops onto every node, does a repair there, and if that finishes successfully it moves on to the next node. I think OpsCenter has a repair service, which is kind of designed to run all the time; we don't really want that, because it impacts latency pretty heavily, so instead we just have this script that we run as needed.
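A minimal sketch of that kind of rolling-repair driver: take a lock, walk a node list, run nodetool repair on each node over SSH, and stop on the first failure. The host list, lock path and SSH invocation are placeholders, and the real script's screen-session handling isn't shown here:

```python
#!/usr/bin/env python
"""Rolling repair: one node at a time, stop on the first failure."""
import os
import subprocess
import sys

LOCK_FILE = "/var/run/rolling_repair.lock"   # placeholder path
NODES = ["cass-01", "cass-02", "cass-03"]    # placeholder node list

if os.path.exists(LOCK_FILE):
    sys.exit("another rolling repair appears to be running")
open(LOCK_FILE, "w").close()

try:
    for node in NODES:
        print("repairing %s ..." % node)
        # '-pr' repairs only the node's primary ranges, so work isn't
        # duplicated as we walk the ring node by node.
        result = subprocess.call(["ssh", node, "nodetool", "repair", "-pr"])
        if result != 0:
            sys.exit("repair failed on %s; stopping" % node)
finally:
    os.remove(LOCK_FILE)
```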
Okay — thank you.