Description
Speakers: Seán O Sullivan, Service Reliability Engineer & Tim Czerniak, Software Engineer, at Demonware
This presentation covers the eight-month evaluation process we underwent to migrate some of Call of Duty’s core services from MySQL to Cassandra. We will outline our requirements, the process we followed for the evaluation, decisions we made around our schema, configuration and hardware, and some issues we encountered.
Tim: So that's me, I'm Tim, I'm a software engineer. That's Seán, he's an operations engineer, and we both work for Demonware. So who is Demonware? Demonware is a subsidiary of Activision Blizzard; we're owned by Activision Blizzard, and we write, deploy and maintain client and server applications for Activision and Blizzard games.
We've been around for about 11 years at this point; we were bought by Activision in 2008, or 2007 actually, I think, so we've been doing it for a little while. Here are some of the titles we've been involved in this year. There's also Destiny, which is not up there, which released on Tuesday.
So we've got Advanced Warfare and Diablo, which it turns out was our 100th title; we've been involved with 100 titles. There's Call of Duty in China, which is releasing later this year, and Skylanders, which is releasing, I guess, next month. We've been involved in a lot of titles: Skylanders, every Call of Duty since Call of Duty 3, loads of Guitar Heroes and DJ Heroes and other Heroes, Bond games, mobile apps, etc.
We have a lot of services, and we do things like matchmaking, which is: if you want to play a game with someone, you have to say 'hey, I want a game', someone else has to come along and say 'hey, I want a game', and then we match you up and off you go. Leaderboards, chat, file storage, leagues, social network integration, uploading things to YouTube, streaming them, big content servers. We have a lot of stuff.
We have about 100 services that we use in various configurations for each game. Some of the technologies we use: on the client side we've got C++, of course, because consoles run C++ and need to run fast, and HTTP for web-based applications and websites. On the server side, these are the main things we use: we're a big Python house, we use Erlang as well, MySQL mostly, we run everything on CentOS, and we use Puppet for automation. There's plenty more, but those are the main things.
So we have a bit of an unusual use case. Most people in this industry talk about ramp-up: you'll start out small, you might have a few customers, you'll gradually maybe gain some popularity, and eventually, maybe with a couple of spikes, over the years you'll gradually ramp up and up and up. We have the exact opposite case. So here's one of our games; you can see there, that's release day.
That is the peak. (Is this not working?) That is Christmas, and what happens is that over the next year or two these numbers will gradually just tail off. So we have the exact opposite case: straight up, and then tail off. This means that, because we're in America I thought I'd use an American quote, we need to be prepared on day one.
Seán: The presence service is high write, high read. It's constantly pinging, saying 'I'm online, I'm online', and your friends every so often check who's online. So it's transient data, and it's a very small data size. The last one we looked at was messaging. Messaging is mail-style messaging, so it's sending mails or messages between users. It's low read, low write; it doesn't happen too often in game. It's things like in-game invites, 'join my party', that kind of thing, and again transient.
We also looked at a fourth service, but after about two or three weeks we decided it was too relational and kept it on MySQL. That one ended up with two MySQL DBs, one in each DC, doing cross-DC replication. It works, but it's not pleasant to maintain or administer. So, the requirements. I mentioned cross-DC: that was our main push, so whatever we chose had to do that, and happily master-master, with masters in both DCs and writes to both. The next was consolidation and expansion.
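To illustrate what a cross-DC, master-master setup like that typically maps to in Cassandra, here is a minimal sketch; the keyspace, table, data-centre names and replication factors are assumptions, and it uses the modern DataStax Python driver purely for illustration:

    # Hypothetical sketch: an active-active keyspace replicated to two data
    # centres, with LOCAL_QUORUM so each DC serves its own reads and writes
    # while replication to the remote DC happens asynchronously.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])  # contact point in the local DC (made up)
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS game
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'dc_west': 3, 'dc_east': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS game.progress (
            user_id  bigint PRIMARY KEY,
            progress blob
        )
    """)

    # A write succeeds once a quorum of replicas in the *local* DC acknowledge,
    # which is what makes "masters in both DCs, writes to both" workable.
    write = SimpleStatement(
        "INSERT INTO game.progress (user_id, progress) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(write, (42, b"save-data"))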
As Tim mentioned earlier, we have to build high early on, and then months or a year later we consolidate and reduce our server footprint as the user base drops off. We typically might spend three-ish months a year doing consolidations, maybe less, maybe more, depending on the year. It's a lot of ops time: we have to get capacity planning involved, operations involved, development involved, do a lot of planning, and then actually do the work itself, and it's MySQL.
The way we typically do it is by title. We have a single cluster, and that single cluster serves a single title. If we're getting another title, we bring up a new cluster, and the new cluster serves that new title. We keep things very separate, because we never want a situation where one title impacts another title. Again, this doesn't help with the whole MySQL situation: you have many, many more SQL hosts for sure, and consolidation then becomes a bigger task as well.
So another requirement was that, whatever we picked, consolidation and expansion had to be easy. Then manageability: we have an operations team of about 15 people, 15 to 20 people, and they're all used to MySQL.
So we're going to rip out MySQL for some of our core services. The progress store especially is a required service: if that's down, the game doesn't work, you don't get on. So if we're going to replace one of our core services with an alternative, we need to make sure the alternative is easily manageable.
Now, each of the rates I mentioned is at the application level. If you convert this to NoSQL, non-relational databases, you may be turning each of those into two to five different requests.
One of our operations engineers was getting introduced to the whole CAP theorem, and as this process went on he got the AP part, sort of, but not too much of the C bit.
We shortlisted suitable options, and Riak and Cassandra came out of that list. They're both Dynamo-based, they both roughly fit our data model, and looking into the available options they came closest to what we wanted. They both make expansion easy, both with vnodes; we deployed on Cassandra 1.2, and vnodes were a major reason for that.
We also wrote our application back end twice. We believe in testing our stack: we don't want to use some automated testing tool where you put numbers in and see how Cassandra performs; we want our complete stack built out, to see how that actually performs. So we built our stack to use Cassandra and Riak in the back end, and then we did load testing of each for approximately six to eight weeks as an evaluation, before actually doing proper, full load testing.
The reason we used two very different clusters was to identify bottlenecks. If we had used the same cluster and you see a bottleneck in I/O or a bottleneck in memory, there's very little you can do other than shipping hardware in or trying to upgrade your hardware, which is a pain. By doing this we could actually see: OK, CPU is fine here, but memory is not so good there; we could tweak it and get a rough idea of what we were looking at.
We also ran a soak test. From the very start we decided a soak test was one of the most important things we had to do. Typically our load tests run approximately three to four hours, maybe five or six, but because Cassandra changes a lot over time, we decided we needed at the very least three to four days, ideally two weeks. I think we ended up with five days because, as always, we ran out of time. These things gave us some interesting numbers.
Some of the numbers we load tested at: we loaded it with five million users, and at that it was about six thousand reads for a standard Cassandra node and about fifteen thousand writes (these are per-node figures), and at that it was about 2.5 millisecond read latency at the 98th percentile.
Tim: So, as we said, we load tested Riak and Cassandra, and the winner was... Riak. No, it wasn't Riak. We thought Riak would be a slam dunk because, well, it's Erlang-based and we know Erlang.
The tooling is actually excellent for Riak, and it performed very well. We did the Riak testing first, and so we thought, OK, this is looking pretty good. Oh yeah, and we had previously evaluated it as well: one of our engineers had gone and evaluated it for another product previously. However, Cassandra was actually the winner in the end, otherwise we wouldn't be here. (Is this just not working?)
So, the write performance seriously beat Riak; it was about four times better, for our workload that is. Cassandra also had a richer feature set, and it has compound keys: you've got the partition key and the clustering columns, and that enabled us to do a bit more. Riak is quite simple, really; it's mostly just key-value, or at least it was at that time.
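To make the compound-key point concrete, here is a rough sketch of a partition key plus clustering columns; this is our own illustrative table, not Demonware's schema:

    # Hypothetical example of a compound primary key: user_id is the partition
    # key (it decides which nodes own the data), while msg_time and msg_id are
    # clustering columns (they order rows on disk within the partition).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("game")

    session.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            user_id  bigint,
            msg_time timestamp,
            msg_id   uuid,
            body     text,
            PRIMARY KEY ((user_id), msg_time, msg_id)
        ) WITH CLUSTERING ORDER BY (msg_time DESC, msg_id DESC)
    """)

    # Because rows are clustered by time, "latest 20 messages for this user"
    # is a single-partition slice, something a plain key-value store cannot
    # express without extra work in the application.
    rows = session.execute(
        "SELECT msg_time, body FROM messages WHERE user_id = %s LIMIT 20", (42,))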
So from a developer perspective and a feature perspective, Cassandra was much more inviting at that time. The maturity of the code base was another factor.
It was a bit more mature, there was a bit more community around it, and it seemed like it had gone through the wars a little bit and was a bit better off for it. (And is this not working?)
(I might have to use yours in a second.) So we continued testing 24/7 until launch. Oh yeah,
the other thing I meant to mention there: generally speaking, if you mention Java to our engineers we run a mile, because we're not Java heads. But one thing about Cassandra is that it doesn't really reveal too much of the technology on which it's built, which is really nice, whereas Riak shows its guts a little bit more. Even though we are Erlang heads, it's always nice to have something that's a bit more rounded, a bit more,
you know, something you don't have to get too into if you don't want to. So we continued testing until launch. We have two offices doing this testing, one office in Vancouver and one office in Dublin, in Ireland, so that obviously helped; we were able to do very big soak tests. As Seán pointed out, we had also planned a MySQL/Cassandra hybrid for one of our services, just because it was one of our really, really, really critical services.
That's the progress store. We had kind of planned to, you know, write to both and then have a percentage of reads from one, and then gradually move over. It turned out it was overly complex and just had too much overhead, operationally and developmentally, so we dropped it and just went with Cassandra. So, a bit about schema.
The progress store was a perfect fit. You always know the key, because it's user-based; it's just a user's progress. It's mostly writes: you just read when you log in, and then you just update and save as you go along. So it's perfect, really, really simple; the schema just wrote itself. Sorry.
Then we have presence, which is a bit more relational. It was kind of a little bit tricky, because there were various ways to index into the data and we had to figure out the best schema there. It's a high-throughput service, so we had a lot of tombstones, and as Seán was mentioning, the peaks and troughs of, you know, daily logins and logouts actually helped us iron out a couple of issues with that.
So we had an issue. We had to use TTLs because, obviously, it's high throughput, and if someone logs off, or doesn't log off, or disconnects uncleanly, you need to just kill the data eventually. We had an issue where we had two keyspaces, or sorry, two tables, with the same TTL, but one was being written more often than the other, and so what happened was that eventually one just died, or just went away, and so we had all these errors where one thing was indexing into the other. We learned a lesson from it, but that was just an interesting issue we encountered there.
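The mechanism involved looks roughly like this; a generic sketch of TTL-based expiry, not their actual tables:

    # Hypothetical sketch of TTL-based expiry for presence-style data. Every
    # heartbeat rewrites the row and refreshes its TTL; if the client
    # disconnects uncleanly and the writes stop, the row simply expires
    # (leaving tombstones behind until compaction clears them).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("game")
    session.execute("""
        CREATE TABLE IF NOT EXISTS presence (
            user_id bigint PRIMARY KEY,
            status  text
        )
    """)

    def heartbeat(user_id, status="online"):
        # If two tables carry related data but one is written far less often,
        # the quieter table's rows expire first: the mismatch described above.
        session.execute(
            "INSERT INTO presence (user_id, status) VALUES (%s, %s) USING TTL 300",
            (user_id, status))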
And then for the messaging service: it was time-series data, so well suited; it's literally just 'you've got some messages, read the messages'. But again we had an issue with tombstones, where our Nagios check actually caused it.
Interestingly enough, the Nagios check was using the same partition key every time for its check, and it built up a load of tombstones in one partition, and gradually the performance of the cluster went down, until it eventually hit a point we couldn't sustain. So we were like, what is going on? We figured it out: it was all the tombstones in the one partition, and we had actually hit a Cassandra bug, which had actually been fixed already, but we had not deployed the fix. So yeah.
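The usual way out of that kind of monitoring-induced tombstone buildup is simply to stop hammering a single partition. A hedged sketch of the idea, our own illustration rather than Demonware's actual check:

    # Hypothetical health check: write, read and delete a canary row, but use
    # a fresh partition key on every run so the resulting tombstones are
    # spread across many partitions instead of piling up in one.
    import uuid
    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("game")
    session.execute("""
        CREATE TABLE IF NOT EXISTS healthcheck (
            check_id uuid PRIMARY KEY,
            ts       timestamp
        )
    """)

    def run_check():
        check_id = uuid.uuid4()  # new partition every time
        session.execute(
            "INSERT INTO healthcheck (check_id, ts) VALUES (%s, %s)",
            (check_id, datetime.now(timezone.utc)))
        row = session.execute(
            "SELECT ts FROM healthcheck WHERE check_id = %s", (check_id,)).one()
        session.execute(
            "DELETE FROM healthcheck WHERE check_id = %s", (check_id,))
        return row is not None  # maps to OK / CRITICAL in the Nagios wrapper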
That was an interesting issue we encountered. So, lessons we learned from schema: keep it simple. There's a shift in thinking, coming from a MySQL background or an SQL background: it's not a relational DB, and you need to rethink the way you model your data. Listen to Patrick, he knows what he's talking about; we worked with him on it and figured it out, and it was really, really good. So keep it simple.
Some issues are not evident in unit tests, and they will not show their faces until you do things at scale, or, as we said, with peaks and troughs and whatever else. That's it from me; over to Seán.
Seán: So, config. The default settings when you install Cassandra, at least in the packages for DSE 3.1.4 and 3.1.6, are probably not what you want: they're somewhere in between a VM and a high-performance server, and they didn't really work very well for either, for us at least. We took a quick run through the config, changed a ton of settings based on common sense, and then reverted a few of them.
After that quick run through the config with the common-sense settings, we then did a cycle of: change one setting, load test for three to four hours, check the results, compare to the previous load test, and then load test again. We never touched the heap. If we'd had more time, we possibly would have looked at it, especially after seeing some of the talks here and other YouTube videos we looked at.
It does look like possibly a worthwhile thing to do, but again it takes a fair bit of time, I would imagine. There is an appendix at the end of the talk where we list all the config changes we made, so when you get the slides you can see them.
So, the hardware we ended up going with. Some background: in Demonware we typically have two specs of hardware each year. We have an app spec, which is high CPU, high memory and pretty crappy disks, and a DB spec, which is low CPU, low memory and very good disks. For Cassandra we ended up merging the two, pretty much. We went with two Intel CPUs at two gigahertz.
We went with RAID 1 with two 40-gig SSDs, 32 gigs of memory and a one-gigabit network.
Some of the explanation for this: talking to Patrick and DataStax, they're really pushing the RAID 0 thing, and it got us thinking about the CAP theorem a bit more, and about traditional relational databases versus something like Cassandra with high availability, where typically you don't care if nodes die. We're not really used to that. So there was some internal discussion, as well as discussion with DataStax, and it ended up being a case of: we trust our hardware.
Our hardware vendor is pretty good and very stable; we get very low failure rates in hardware, and disk failure rates are probably the highest we get, and even they're pretty low. So we decided RAID 1 would suit us better, because we'd actually have to replace hardware less often, which is less effort and better for the game.
Disk capacity was our bottleneck, so we ended up pretty much having to scale the cluster based on disk size. That's kind of why you've got high CPU there: the plan is we can increase our disk size when required, going up to one terabyte, or maybe two terabytes when they come out, and when we do that we don't touch CPU or memory or anything else; they'll be fine.
So, monitoring. When we first started doing our load tests, we ended up load testing not only Cassandra but Graphite as well. One of the settings for the Graphite reporter is how often to send metrics. We didn't really notice or see this too quickly, and it was sending metrics once a second. We had a 60-node cluster for load testing, and we were sending from each node 170 metrics every second, which brought down Graphite, so we quickly changed that to our normal interval, which is one-minute metrics.
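The arithmetic is straightforward, using the node and metric counts quoted above:

    # Back-of-the-envelope figures for the Graphite load described above.
    nodes = 60
    metrics_per_node = 170

    per_second_reporting = nodes * metrics_per_node         # 10,200 datapoints/s
    per_minute_reporting = nodes * metrics_per_node / 60.0  # ~170 datapoints/s
    print(per_second_reporting, per_minute_reporting)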
The key cache hit rate is an interesting metric. We used it a lot and thought we were doing very well, until we understood it better. On the key cache hit rate we were typically seeing about 90 percent, low 90s, under load, and we thought, this is great, we're doing well. Then one of the other operations engineers, after launch, looked into it a bit and noticed we were actually only at about 60 or 70 percent.
What's happening is that when compaction runs, compaction reads all the SSTables, and that would bump our key cache hit rate to 100 percent. So the average, which is the number you generally see, shows close to a hundred percent. To get the actual key cache hit rate, you have to take the column family key cache hits versus the actual queries done, and from that you can get the accurate figure. That didn't bite us, but it could have.
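Put another way, the honest figure is the per-column-family hit counter divided by the request counter over the same window, not the averaged hit-rate gauge. A minimal sketch of that calculation, with made-up counter values:

    # Hypothetical calculation of the "real" key cache hit rate for a column
    # family from raw hit/request counters sampled at two points in time,
    # rather than the averaged gauge that compaction pushes towards 100%.
    def key_cache_hit_rate(hits_before, requests_before, hits_after, requests_after):
        hits = hits_after - hits_before
        requests = requests_after - requests_before
        return hits / requests if requests else float("nan")

    # Counters sampled a minute apart: roughly 64%, not the ~100% the gauge showed.
    print(key_cache_hit_rate(1_000_000, 1_500_000, 1_090_000, 1_640_000))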
It could have been bad if we had been relying on it heavily. Next thing: we generally use Nagios for our monitoring, and typically when a new service comes along you Google 'Nagios' plus the service name and you get a lot of plugins. That didn't work so well for Cassandra: there were very few plugins available, and we ended up writing a lot of them ourselves. Jolokia helped a lot with this. Jolokia exposes JMX metrics, and JMX in general, over HTTP, and then you can quite easily write your own checks, or your own tools to actually change JMX settings, all over the HTTP interface. Again, we're a Python and Erlang shop, so getting into Java, Java tooling and JMX wasn't something anyone really wanted to do, so the fact that we could easily do it over HTTP was really good.
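As a sketch of what such a check can look like, here is a small example that reads a Cassandra metric through Jolokia's HTTP read endpoint; the MBean name and threshold are illustrative assumptions and vary between Cassandra versions:

    # Hypothetical Nagios-style check that reads a Cassandra JMX metric via
    # Jolokia's HTTP /read endpoint instead of speaking JMX from Python.
    import sys
    import requests

    JOLOKIA = "http://cassandra-node:8778/jolokia"  # default Jolokia agent port
    # Illustrative MBean; check what your Cassandra version actually exposes.
    MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"

    resp = requests.get(f"{JOLOKIA}/read/{MBEAN}", timeout=5).json()
    p99 = resp["value"]["99thPercentile"]

    # Classic Nagios exit codes: 0 = OK, 2 = CRITICAL.
    if p99 > 50_000:
        print(f"CRITICAL - read p99 {p99}")
        sys.exit(2)
    print(f"OK - read p99 {p99}")
    sys.exit(0)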
We're still changing, excuse me, we're still changing what we actually monitor and alert on. We've gone through several iterations of long-term and short-term read-latency and write-latency checks, and the problem is that once you deploy, the cluster changes over time. So if you look at a metric during load testing, for instance, one metric may never have gone over half a millisecond, so we set the threshold based on that.
We had problems with rack awareness, where enabling rack awareness when using vnodes led to a thirty percent data-distribution difference on some nodes. As I mentioned, our main capacity issue was actually disk capacity, so when we started seeing some of our nodes with 30 percent more data than other nodes, that was pretty bad for us, so we decided to turn rack awareness off. We weren't really using rack awareness anyway.
We actually have a blade system, so what we were trying to do was use a kind of chassis awareness based on rack awareness, but we eventually didn't end up going with it. Load balancers weren't really a gotcha, but because we're using the old Python driver, doing CQL over Thrift, we have no node awareness or token awareness, so we put load balancers in front of it all. About halfway through our load testing, we realised
that we must get the clients to reconnect every 10 minutes, because otherwise, in a production situation, if you need to expand your cluster and the clients aren't actually reconnecting but are keeping persistent connections open, then you're not going to use any new nodes you add to your cluster. So you have to get clients to reconnect, for every application server spec, for every type of VM.
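A rough sketch of that workaround, our own illustration; it is shown with the modern Python driver for brevity, whereas the setup described here used the old Thrift-based driver behind a TCP load balancer:

    # Hypothetical periodic-reconnect wrapper for a client with no node or
    # token awareness sitting behind a load balancer: tear the session down
    # every 10 minutes so new connections can land on newly added nodes.
    import time
    from cassandra.cluster import Cluster

    RECONNECT_AFTER = 600  # seconds

    class ReconnectingSession:
        def __init__(self, contact_points):
            self._contact_points = contact_points  # e.g. the load balancer VIP
            self._connect()

        def _connect(self):
            self._cluster = Cluster(self._contact_points)
            self._session = self._cluster.connect("game")
            self._connected_at = time.monotonic()

        def execute(self, query, params=None):
            if time.monotonic() - self._connected_at > RECONNECT_AFTER:
                self._cluster.shutdown()  # drop the old persistent connections
                self._connect()
            return self._session.execute(query, params)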
And then, our dev cluster differs from production, or it did. Dev, for us, is production as well, because game studios actually develop against dev. So dev is prod, cert, the certification environment, is also prod, and prod is also prod; we have very few real testing environments. But in our development environment we typically use VMs, so we built CentOS VMs, and that was pretty much it after load testing.
Now, on those VMs compaction throttling was turned off, the heap memory size and a lot of those sorts of settings were modified, and it hit us pretty hard. Eventually we actually did push for getting dev to be the same as production, at least for Cassandra, because I think we ended up wasting about a week or two debugging issues which weren't, and wouldn't be, affecting the game, but were affecting the development environment, which is kind of important.
We also ran into a couple of different issues initially, before launch, around the network. During the evaluation period, where we had tested across two different DCs, we had a problem with network capacity. Not network capacity as such, but each individual connection was only pushing up to about one or two megabytes, while the overall link was two gigs. I think we spent about a month debugging that, and eventually we figured out the cause.
We ended up doing much the same thing after another month of testing; we weren't doing exactly the same thing again, this time it was the GRE tunnel rather than turning off IPsec. We test each individual component of the network, and each individual area, and it all works fine, except when you do the full link, and then they all fail horribly.
So launch was boring, thankfully. As Patrick mentioned in his talk: in the first week the dev manager came over to my desk and asked us to simulate a node failure, and in the second week he came over to my desk again to simulate a node failure. Because, you know, we'd put the practice in, we've got documentation, we did runbooks, we'd gone through it all, but you never really know what it's going to be like until you're actually doing it in prod. So, you know, take a node down; I was quite pleased to do it.
We did lose two of the nodes over Christmas, and that didn't go as well. The nodes dying themselves were fine; that didn't cause any problems for the cluster. What did cause the problem was replacing the nodes: replacing the nodes and running repairs, and that caused a lot of compaction.
Then the repair would actually stall in some cases; you'd have to kill the new node, put the node in again, start the repair again, and hope that the repair worked this time. It didn't affect users too badly. I think latency went up about 10 or 20 times, but we had a pretty good buffer there, so that wasn't a problem for us. But it did cause Nagios to spam alerts and the NOC to start ringing us every 10 minutes or so, and we had to explain it to other titles.
So this talk was again about launching Call of Duty with Cassandra, and now that it has worked well, and we've actually seen that it works well and reduces operations time, we're actually planning to put other titles onto it. We've put a certain aspect of Diablo onto it, the newest Diablo, Reaper of Souls, and we're hoping next year to put more titles onto it.