From YouTube: Sony: PlayStation4 and Cassandra - Journey Continues
Description
Speaker: Alexander Filipchik, Principal Software Engineer
It has been 2 years and 20 million+ consoles sold since the PlayStation 4 launch, and Cassandra is still alive and well within our infrastructure. We will cover various aspects of running Cassandra at large scale, share our findings, and discuss some tricks that can make your lives easier. We will share how we handle varying use cases, from batch analytics using Spark to real-time personalized search. And just like before, we will be having a raffle (last time, one lucky attendee walked away with a brand new PS4 Destiny edition!).
B: Thanks, everyone, for coming. I know why you guys are really here, but it's all right: you have to sit through our presentation first, and hopefully you'll get something out of it. So, we launched the PS4 probably less than two years ago, and we've done a lot with Cassandra. We probably made every mistake in the book, and we've learned from it: we've been able to scale, handle crazy amounts of traffic, and really learn a lot. We want to share what we've learned with you all, especially from this past year.
C: And also, you'll get this really beautiful raffle ticket, so keep it with you until the presentation ends; and if you want, just keep it afterwards, like art. Note the number on it: at the end we'll try to pick a winner by generating a random number. I think it will be fun. And be careful with this thing, by the way: I spent two hours yesterday stamping them manually, and I'm not really good at it.
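The draw they describe, generating a random number over the stamped tickets, amounts to something like this sketch (the ticket numbers here are made up for illustration):

```python
import random

def pick_winner(ticket_numbers, seed=None):
    """Pick a raffle winner by generating a random index into the ticket list.

    `seed` is only there to make a draw reproducible for testing; a live draw
    would leave it unset.
    """
    rng = random.Random(seed)
    return rng.choice(ticket_numbers)

# Hypothetical stamped tickets handed out at the door.
tickets = [101, 102, 114, 230, 517]
winner = pick_winner(tickets)
```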
C: Why would you want to listen to us, actually? Slides! That's what people put up slides for, right? So, I'm Alex, a principal software engineer with Sony Network Entertainment. Here's my PlayStation Network ID, Laser Toy, so you guys feel free to add me as a friend: there's functionality that allows you to add me as a friend and find me on PSN, and then we'll talk about it later.
B: The PS4 has really risen in popularity; it's one of the fastest-growing consoles ever. This gives you a feel for the kind of user base PlayStation Network deals with: we deal with over 65 million active users, and on record we have several hundred million users. So you can imagine the amount of interactions that happen, and the amount of user-level content that we have to store in our Cassandra clusters.
B: It's quite a large task, and when you think about what PlayStation Network is, you've got to start thinking outside of PS4 and PS3. We're very much a multi-platform set of services that provides a broad set of experiences: PS4, PS3, Vita, phones, and even televisions, such as Samsung TVs. We have experiences on those as well, and we'll continue to expand and innovate in that direction.
B: Just imagine having two million users logging into your services around midnight and immediately seeing that traffic spike; it was quite an interesting challenge. And as you can see, our growth has continued: we're well over 20 million consoles right now, and we've grown to ten times the amount that we started with. One thing you don't see here is the traffic spikes themselves; this chart is only PS4s sold, but we also deal with traffic spikes.
C: Yes, that is a very sensitive topic. We have a really, really big competitor, and there are a lot of very smart people working for Microsoft. It's very hard to compete with Microsoft and its bright engineers, but we were actually able to do really, really well. To give you a sense of the competition, of how tough it is:
C: So what does it mean for us? It means that we need to iterate fast. We need to bring new features, and we also want to make sure that our services are up and running, that they can scale, and that our back end doesn't fall apart every month or every year. We don't want to see any outages with 100% impact; we want good latencies and a really great customer experience, or rather, gamer experience. Okay.
B: So what do we develop? If you think about all the functionality that we provide, it's not just saying, okay, we have a PS4, or some services inside the PS4; outside the PS4 there's quite a bit of experience too. For example, we actually have a very large social network. Like I mentioned before, we have several hundred million users and over 65 million active users, so you can imagine a social network of that scale. We keep track of all the user activities: what they do, what they play, who their friends are.
B: We have a pretty rich social graph at Sony. We also do things like maintaining the game and video libraries and being able to search through those libraries; Cassandra is actually part of that search solution, which Alex will talk about shortly. We do authentication, we have new features like communities coming out, there are parts that power the store, and we use it for general-purpose things like caching. And there's a lot more. If you look, we have a whole suite of products that we provide, and some are very innovative.
B: For example, we were one of the first to do live streaming TV over IP: we have the PlayStation Vue product, which is extremely innovative, always on the cutting edge. We have game streaming. We have an assortment of products; we've partnered with Spotify for our music business. As you see, there are a lot of use cases we have to handle. What kind of back end do we need to support all those use cases? When we tried to make the tough decision a couple of years ago, the question was:
B: What database solution can provide us with scalability and low latency, and also let us build all these new experiences? There were a lot of debates, a lot of fights, a lot of technologies that we looked over, and Cassandra was the winner. I look back at those meetings and those discussions, and I'm pretty happy with the decision. So that kind of gives you an idea.
B: If you look at all these experiences on the PS4, Cassandra is in there. It may not be obvious, but it's actually behind so many of the experiences on the PlayStation, and this is just the PS4; we're not including other platforms, which also use it. In terms of traffic: our Cassandra clusters are based around serving the end user. We're not talking about analytics data here; we're talking about serving the customer live. So we have a whole bunch of clusters, and we have many nodes per cluster.
C: Right, so let's now talk about real stuff; the advertising part is over, guys. So, friend search: what is that, actually? It's personalized search. We had several use cases where we needed it. For example, as a PSN user, I want to be able to find a friend of mine on the platform.
C: I want to be able to search by name; I want to be able to search by online ID. And if you think about it as just general search: how many users are there on this planet? Seven billion possible gamers. So the solution seems pretty obvious: you just put everything in Solr, scale it up, and then you'll be able to search it. But we had several interesting requirements.
C: Give me all the PlayStation 4 games; give me all the preordered PlayStation 4 games; and it can go on and on and on. So Solr is not really a good solution, let's say, to handle such use cases. We had to build something else.
C: I'll be talking about the friends use case only here. So we have the PlayStation 4 and we have Cassandra; the problem is, as you probably noticed, this big empty area on the slide. I'd love to tell you there's just one arrow going down, but it will not be that simple. The first problem that we needed to solve is: how do we get the social graph? A social graph can be very, very big. It could be billions and billions of edges. It cannot fit in one box's memory; it has to be distributed.
C: You probably want to be able to access it really fast, in parallel. So the obvious solution was to build a microservice on top of it, and Cassandra was a good choice to store this data. The problem is, there are several problems. The first one: you can have those power users with a lot and a lot of friends, and then how do you store that? Good question. How many of you use CQL? Thrift?
C: Right, so if you use CQL, you probably know that there's this partition thing which lives on only one node. With Thrift it's similar: it's just one row, it lives on the same box, on the same node, and it can be replicated. But if you keep adding friends, if you model it as an account ID plus a bunch of friends, then this thing gets really big, and loading it into memory
C: all the time can be expensive, and it will be slow. Also, people create new connections and people break connections, so a lot of updates go into this thing, and a lot of tombstones. We had very interesting challenges there. So we decided in the end: let's just build another service on top of it, let's cache everything locally, and let's have a way to evict the cache. The truth is, memory is much, much faster than going over the network, so if the data is local, access is just blazingly fast.
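The "cache locally, with a way to evict" idea can be sketched as a small LRU cache in front of the database. This is a minimal illustration, not their actual service; `loader` stands in for whatever call fetches a friend list from Cassandra:

```python
from collections import OrderedDict

class LocalGraphCache:
    """Tiny LRU cache for friend lists: keeps hot partitions in local memory
    and evicts the least-recently-used entry when the cache is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, account_id, loader):
        if account_id in self._data:
            self._data.move_to_end(account_id)   # mark as recently used
            return self._data[account_id]
        friends = loader(account_id)             # fall back to the database
        self._data[account_id] = friends
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)       # evict the coldest entry
        return friends
```

A real deployment would add TTLs and invalidation on friend add/remove; the point is just that repeat traversals stay in local memory.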
C: You can probably go through the whole graph in a matter of seconds, rather than hours if you were trying to load it from a database. So, next thing: the real-time indexer. Why did we need it? When you search for something among your friends, for example you want to find someone who is a friend of your friend, by name, sorted, and all that stuff, you could store those personalized indices somewhere. We tried it, and we failed, actually: you'll notice very soon that the data size grows exponentially.
C: We have these layers that can give us the whole graph really fast, or pieces of the graph, partitions, let's say, and then we just index it in real time. And there are good technologies to do it, like Lucene, for example: a very good one, used inside Solr, and Elasticsearch uses it too. So we can do the same thing, for sure. And then the last piece: you actually don't want to be re-indexing every time, because indexing is CPU-intensive.
C: So here goes the last piece of the puzzle: the in-memory personal index. It's a cache, a distributed cache. It keeps user sessions, let's say, for some time. So when you go to the platform and you search, we're not rebuilding indices. How fast is this thing, and how many accounts are we processing? We're processing millions of accounts per second (that's the number of accounts being added to the index), and search latency is less than one millisecond on average after the index is built. So we learned several good lessons, and we found several bugs.
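The per-user, session-scoped index described above can be sketched as a cache keyed by account, where the expensive build happens once and is reused until a TTL expires. Everything here is illustrative (the build function, the TTL, the substring search); it only shows the shape of the trick, not their implementation:

```python
import time

class PersonalIndex:
    """Per-user in-memory index with a TTL: build once per session, then serve
    repeated searches from memory instead of re-walking the social graph."""

    def __init__(self, ttl, build_fn, clock=time.monotonic):
        self.ttl = ttl
        self.build_fn = build_fn        # expensive: walks the user's graph
        self.clock = clock
        self._entries = {}              # account_id -> (expires_at, index)

    def search(self, account_id, query):
        now = self.clock()
        entry = self._entries.get(account_id)
        if entry is None or entry[0] < now:
            index = self.build_fn(account_id)
            self._entries[account_id] = (now + self.ttl, index)
        else:
            index = entry[1]
        # Stand-in for a real Lucene-style lookup: simple substring match.
        return [name for name in index if query.lower() in name.lower()]
```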
C: So, if you use Astyanax (actually, as I saw, a lot of people don't use Astyanax, but that doesn't mean the CQL drivers might not have a similar issue): in Astyanax you can specify the number of connections per node which your application will try to open, and there is an issue with it. The number of connections always grows; there's no mechanism to shrink the pool. When load spikes, the app server will open as many connections as possible; when the load goes away, the connections stay open.
C: And then, inside the Thrift implementation, there are buffers, and buffers always grow; they never shrink. When we tried to store indices, which can be big, here is what we saw: load spikes, a lot of connections get opened, connections get recycled, and sooner or later a huge payload is fetched or written through a connection. The buffer gets bigger and bigger and bigger, and the application just leaks memory. So there's a fix for it: just evict connections.
C: Watch your connections, and if they are growing too big, or if you see too many idle connections, just kill them. Maybe kill connections every several minutes, for example, to make sure there are no leaks in the buffers.
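The eviction policy just described, watch the connections and drop the ones that sit idle or whose buffers have grown too large, might look like this sketch. The pool internals and thresholds are illustrative, not Astyanax's actual API:

```python
import time

class EvictingPool:
    """Toy connection pool that tracks per-connection idle time and buffer
    size, so a periodic sweep can close idle or bloated connections."""

    def __init__(self, max_idle_s, max_buffer_bytes, clock=time.monotonic):
        self.max_idle_s = max_idle_s
        self.max_buffer_bytes = max_buffer_bytes
        self.clock = clock
        self.connections = []   # each: {"last_used": t, "buffer": n_bytes}

    def checkin(self, conn):
        conn["last_used"] = self.clock()
        self.connections.append(conn)

    def evict(self):
        """Run every few minutes: drop idle or bloated connections."""
        now = self.clock()
        kept = [c for c in self.connections
                if now - c["last_used"] <= self.max_idle_s
                and c["buffer"] <= self.max_buffer_bytes]
        evicted = len(self.connections) - len(kept)
        self.connections = kept
        return evicted
```

A real pool would also close the underlying sockets on eviction; the sketch only shows the bookkeeping.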
C: Another one, which might apply to CQL as well. When you want to fetch friends of friends, if you store it as account to friends, the first query will be "give me all of my friends," and the second query will be "give me all the friends of my friends." It feels like a range query, where you do a row slicing. The default implementation in Astyanax goes like this: even if you use token awareness, it will go to a random node. Actually, if you go into their code, the method that is supposed to find the nodes responsible for a range of keys just returns a no-op implementation, so the client picks a random coordinator.
C: The coordinator then does all this magic and sends the result back. Is that optimal? Not really: it's an extra network hop. So let's go to the slide. The improvement would be to do something like this: you know the token ranges, so you can technically find all the right coordinators in your ring, because you know the ring topology; Astyanax gives you this information. Then you can query those nodes directly, so it will be one network hop.
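The token-aware routing they describe, using the ring topology to find the node that owns a key and querying it directly, reduces to a consistent-hashing lookup. This is a minimal sketch with a stand-in hash, not Cassandra's actual partitioner or the Astyanax API:

```python
import bisect
import hashlib

class TokenRing:
    """Hash a partition key onto the ring and walk clockwise to the first
    token, giving the owning node, so a client can skip the random
    coordinator and go to the right replica in one network hop."""

    def __init__(self, nodes_with_tokens):
        # list of (token, node) pairs, as exposed by the driver's ring metadata
        self.ring = sorted(nodes_with_tokens)

    def token_for(self, key):
        # Stand-in for the partitioner's hash function.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 64)

    def owner(self, key):
        tokens = [t for t, _ in self.ring]
        i = bisect.bisect_left(tokens, self.token_for(key)) % len(self.ring)
        return self.ring[i][1]
```

With replication, the next replicas clockwise from the owner would be valid targets too; the sketch only routes to the primary.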
C: Cassandra has an interesting mechanism for redistributing load, and, going back to the previous slide, that suggests another idea. As I said, a row can grow; a partition can grow. We found that, just as Cassandra uses multiple tokens per node and distributes load by splitting the token range, you can do the same thing yourself inside Cassandra, and that's what we did for several big rows.
C: We were just splitting them into multiple buckets based on column names, for example based on the identifiers here. The problem with it: you'll need to do more reads to fetch everything. With CQL you can probably do a sequential scan, but with Thrift you'll need to do at least N reads to fetch all those rows.
C: But writing is fast, and you won't blow up your memory when a row is extremely huge. The worst case would be, for example, a friendly account who knows everyone on PlayStation: that would be a really, really big row. All right, then, I'll skip this one.
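The bucket-splitting trick, appending a hash-derived suffix to the row key so one logical wide row becomes N physical partitions, can be sketched like this. The bucket count and key format are illustrative assumptions, not their schema:

```python
import zlib

NUM_BUCKETS = 16

def bucket_key(account_id, friend_id):
    """Spread one logical wide row across NUM_BUCKETS physical partitions by
    hashing the column name (here, the friend id) into a bucket suffix,
    e.g. 'acct42_7'. Writes stay fast and no single partition grows unbounded."""
    bucket = zlib.crc32(friend_id.encode()) % NUM_BUCKETS
    return f"{account_id}_{bucket}"

def all_bucket_keys(account_id):
    """Reading the whole friend list back requires N reads, one per bucket,
    which is the trade-off noted above."""
    return [f"{account_id}_{b}" for b in range(NUM_BUCKETS)]
```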
C: So, the row cache seems like a good idea. I mean, why not utilize memory? Stuff in memory is very fast: Redis is fast, memcached is fast, and we thought it could be a good idea to use it. So we tried it, and it was painful.
C: Honestly, it didn't go really well, even though we thought we had a really good use case for it. We got into GC hell. It wasn't complete GC hell, but after a day of running, instances were going into a weird GC mode, firing GCs one after another, one after another, and CPU kept climbing.
C: Yes, and it takes longer to restart: if you configure it to store the cache on disk, then when you restart or bounce a node, it takes time to rewarm all the caches. So we just decided that there are better ways to spend memory than to use it on the row cache. Spark, yeah.
C: So we thought we could use Spark to run analytics on top of our production data; or maybe we could have a second DC only for analytics, stream data into it in real time, and run Spark on top of that separate Cassandra. We ran some tests, and we didn't really like it. So right now we use Spark with Cassandra differently: we use Spark to monitor Cassandra. We use Spark to process all the metrics, and then all those metrics go into our logging infrastructure.
B: Okay, so I want to talk about another kind of use case, which is designing for migration. What do I mean by that? Well, change is inevitable. We launched a little under two years ago, but the user base has grown, a lot of new features have been added, and things don't always scale the way you initially thought. A column family that exists in a keyspace may have grown more than you expected.
B: Different sets of clusters may have more traffic than you anticipated when you first designed everything. And the thing is, because we have so many interactions, interaction between users, interaction between games and users, users interacting with different features at different times, game launches, totally new things are happening all the time. What does this do? It may put more load on different aspects of your cluster. So what really is the strategy here, and how did we handle it? I think people who've done
B: data migration have probably done very similar things, but when you deal with massive amounts of data, it's very crucial to know what your strategy is and how to make sure you don't affect the end user. So here is the sense of it. Let's say you have an application that handles multiple use cases: these mixed use cases kind of fit under one feature, so it makes sense to put them in this application. Okay.
B: Well, this feature should have its own Cassandra cluster, because that's what we should do: have a cluster per feature, and it manages the different use cases, which is good. But then what happens is that sometimes you'll have unknown or unexpected waves of traffic coming in, which basically bloats the load on a particular user journey, a user flow, and there's an impact on the whole cluster.
B: You may have some content that is a little too wide, and it basically affects other use cases, other flows inside the user experience. What's even worse is, if there's a critical path in one of those other use cases, you're really compromising that flow. So there's a need to change. We've had situations like this, where we reached a point where our cluster's usage was badly imbalanced, and we needed to do something about it. So what do we do? We do a migration.
B: What that means is that we take that aspect, that use case, and move it out: move its data out, move its back end out, so that we can scale it independently and not affect the other use cases' uptime. So what do we do? We basically create a second cluster. They're not connected; they're actually completely independent. We create a second cluster, and then we restore from backup.
B: We take consistent backups and restore them using sstableloader, and from the application point of view we do double writes and double reads while this is happening. It is a little more expensive, but when you're trying to move a large amount of data, make sure the end user is not affected, keep uptime, and take no maintenance window, you've got to do this kind of in-place approach.
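The double-write, double-read pattern during the migration window can be sketched as a thin wrapper in front of both clusters. The dict-backed "clusters" and the read-preference rule here are illustrative assumptions, not their actual client code:

```python
class DualWriteStore:
    """Migration-window wrapper: every write goes to both the old and the new
    cluster; reads prefer the old (still authoritative) cluster and fall back
    to the new one for data that only exists there."""

    def __init__(self, old_cluster, new_cluster):
        self.old = old_cluster
        self.new = new_cluster

    def write(self, key, value):
        self.old[key] = value
        self.new[key] = value   # extra cost, but keeps the clusters converging

    def read(self, key):
        if key in self.old:
            return self.old[key]
        return self.new.get(key)
```

Once the clusters are validated as consistent, "cutting the cord" is just pointing reads and writes at the new cluster and dropping the wrapper.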
B: What we also do is run integrity scripts: things where we can iterate through both clusters and just validate, and see how far apart they are. The idea is that over time these clusters should eventually become very consistent with each other. Then, over time, when we see that it's okay, we cut the cord, and now the application is a lot happier, the database is a lot happier, and in general we're in a good spot with moving our data around.
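An integrity script of the kind just described, iterate through both clusters and count how far apart they are, might look like this sketch (keys and values stand in for whatever rows the script samples):

```python
def compare_clusters(old, new, sample_keys):
    """Compare a sample of keys across the old and new clusters and report
    the differences. Run repeatedly during a migration; the mismatch counts
    should trend toward zero before cutting the cord."""
    missing, different = [], []
    for key in sample_keys:
        if key not in new:
            missing.append(key)
        elif old.get(key) != new.get(key):
            different.append(key)
    return {"checked": len(sample_keys),
            "missing_in_new": missing,
            "mismatched": different}
```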
B: And I think the key problem with managing this by just putting up new DCs is that it means transferring so much data, and it pretty much kills your live cluster's network I/O. Okay, so, moving on to how we can take this a little bit further: obviously, in that previous approach you're still bound by the application itself. Here we're doing the same thing, but the main point is that you want to totally isolate that user flow. So we do the same thing: restore from backup.
B: In some cases, when you first looked at a feature, you may have thought you knew everything that should happen, but sometimes it doesn't happen that way; use cases change. So when I say design for migration, there are some things that we do to evaluate how we can design our applications to account for this potential scenario. What you want to do is really anticipate the critical areas in your feature, and basically push them into a separate keyspace.
B: When you isolate things in a separate keyspace, it's so much easier to restore backups, so much easier to set up a second DC if you have to, and so much easier to move the data around; with everything in one column family it gets a lot harder. Okay, so some critical areas: if you have different kinds of security requirements around your data, if it potentially has some kind of user information that you don't want to share with the whole family of services, segment it out.
B: If you have anything that's unbounded, or potentially a lot of large data, or something that creates a lot of edges, you should split it out into its own area. And if a lot of different services depend on a particular set of data, maybe put it in a spot where you can create new clusters if need be, and just separate out the data.
B: So for our applications, we always separate our connection pools for each critical section. From the application point of view, that makes it a lot easier to make updates and to split things out into separate applications if we have to. And most of all, just monitor everything: always monitor the load on the different keyspaces and different column families, and, above all, monitor while you're doing the migration itself and make sure that your consistency is in a good state.
C: All right, my favorite section: issues. Everyone in this room probably attended multiple sessions yesterday, so you know that Cassandra is rock-solid, bulletproof: no bugs, no issues. You cannot bring the thing down; it's not possible, period. Well, you can, and I think it can be fun. Some of you probably attended the presentation about PagerDuty's experiences with Cassandra. Cassandra is a really good database.
C: Also, you should know what is deployed in your production right now, because if you're going to get the call on a Friday night, you might not be in the best condition to do the troubleshooting and figure it out by yourself. So it might be better to at least know ahead of time that, well, we deployed this version, and these are all its open issues, so we might face them in production. And then, when a new version is released,
C: it's much easier to think about: if we upgrade, what are we getting from the upgrade? That's also why running multiple versions of Cassandra in your production makes it very, very hard to track all the changes, because now you need to know, well, we have versions 2.0.15, 2.0.14, 2.0.13, 1.2, and it just gets hard. So upgrades are a good idea. You probably don't even want to run two-year-old Cassandra clusters; or at least, if you're running them, you should know what is going on with them
C: and what the possible issues inside are. Yeah, this is my favorite one. Two Cassandra rings, right? They should not connect if you don't want them to connect; when you have two rings set up as one DC and a second DC, those will be communicating, but in this example you see this confused node in between. What happened: we had one cluster, we had a second cluster, and at some point this node, which was being decommissioned, crashed and died. We removed it, and those clusters were no longer using it. Later,
C: we brought it up again. It's all elastic; we use Amazon, all EC2 instances, so we just assigned it to the other cluster, and we thought everything was okay, until we saw that the first cluster still thinks that node belongs to it. The first cluster was showing it as down; the second cluster was showing it as up. And we thought, well, okay: at least Cassandra is not streaming data, so everything should be fine. We were planning to actually deal with it, and then I was running
C: a sequential scan to prepare data for a Spark analysis, and I found that when I queried that first cluster, some nodes responded with "keyspace not found." I thought, well, how is that possible? It's definitely there. I started digging, and I found that our client library wasn't aware that the node was down: from the client's point of view, it calls describe, it gets all the nodes, the node looks fine, so it just thinks this one belongs to that cluster.
C: It queries it, the latency is okay, so it was actually going to the wrong cluster. We didn't have any data loss, because the first cluster wasn't really mission-critical, but I was really surprised that it's even possible. Then there was a second problem that we saw, and it's a puzzle, so help me with it. At some point in time, X, we saw a spike in load averages across our Cassandra instances.
C: Right, it looked like a lot of users had just come online with a brand-new game, like Destiny or something; TCP connections from the application to the nodes spiked. You probably see the thing, right? We had a cluster that we were ready to decommission; we had streamed the data off, and it was just sitting there, ready to be decommissioned. And then, definitely, a spike of connections came around.
C: Yeah, but no, that wasn't it this time. All right, so this is interesting; it's another issue: MemoryMeter. It's the thing that monitors how much memory is available to the JVM on the instance, so Cassandra knows when to start flushing memtables, and apparently there is an issue with it. What happened: Amazon killed our node, and we had to decommission it. We started the decommission, it streamed data to other nodes, and the streaming part can trigger it: MemoryMeter goes into an endless loop.
C: It consumes one hundred percent of CPU, and the Cassandra nodes effectively hang: they cannot process data, and they become really, really slow. And we were using vnodes, which made the issue even worse: instead of streaming to just two neighbors, with vnodes it blasted data to all of our Cassandra nodes, so a lot of them went into this hell. For the solution, we had a DataStax engineer
C: who helped us a lot with it, and the solution was to just do rolling restarts until the issue goes away, and then do an upgrade to the latest DataStax version (we use DataStax). So that's why I said you want to know what's deployed. Another interesting one: last time, when we presented last year, we said we went through RAID 0 and RAID 1 and ended up using two disks. We thought it was a good idea. It's not: there are two problems with it. First, a compaction starts, right:
C: the red here is data, the blue is empty space, this one is used space. The first compaction compacts onto disk 2; then the cluster can't start another compaction, because it's not scheduling them really well, and at some point the disk is full. The current behavior is that the node will just go down; but at that point we were running a version that did an interesting thing: it actually decommissioned itself. The node said, fine,
C: "I'm done here, sorry, I can't handle it anymore," streamed its data to the other nodes, and just left. And we were like, well, I thought I had 20 nodes in this cluster; why is it only 19, and then 18? What is going on here? It's fixed now; the default behavior is to just stop. Yeah, so that's actually what was going on: this health checker was removing the node from the ring because the disk was close to full. Not a very nice message. Yeah, EBS: EBS is what we were using to save the other nodes.
D: I had a question about your trick. It looked like what you did was basically take the account ID and append an underscore and a number to it, and I was curious, since you did it that way, whether you considered adding another column for that bucket and then basically making the partition key the two of them combined. It seems
[exchange inaudible]
B: Okay, so the migration: a lot of it was in the planning, and a lot of it was interesting, but the actual execution was probably within a week's time. Then, for data validation, we let it run for a couple of weeks, until we saw that the difference in the data wasn't there.
B: We're using EBS primarily to offload these large compactions. It depends on what kind of compaction strategy you use, size-tiered or leveled or whatever, but if you don't have that fifty percent of free space, sometimes you will get into that situation. You don't want to get into that situation, but if you do, you need somewhere for it to compact, so we throw EBS on there and let it compact. And at least in our case, what we've had is that data is being deleted but not really being cleaned up.
G: Good review, yes. So the migration strategy you talk about sounds pretty similar to how we've been thinking about it. One kind of unanswered question was how we might deal with multi-region replication there, because you can do double writes in the same region, and that is great, but as soon as you have another region, multi-master, that's replicating, it seemed...
B: Well, we've had situations where we did multi-datacenter, and one of the big killers was just the network. For example, we had one datacenter here and one in Japan, and with transferring data over the wire, once you get behind on data, you will never catch up; you just gather hints, for example. So the mechanism you always end up with is a queue: you queue the writes and then use the queue to apply them.
B
Then
you
we
still
a
ways
to
validate
the
data,
because
you
know
you'll,
never
with
the
amount
of
Rights
come
in
you'll,
never
be
able
to
be
one
hundred
percent,
confident
that
the
data
is
exactly
the
same
and
the
way
we're
approaching
migrations.
We
don't
necessarily
use
the
data
center
connection
and
we
don't
run
repairs
like
that.
Basically
amount
of
data
we
have,
we
don't
want
to
take
down
a
live
class
or
so
yeah.
You
use
a
QE
recognize
anything.
B: Yeah, so initially we actually went through a lot of graph databases. We tried OrientDB; we tried to look into Neo4j and things like that. For our particular use case, the thing is, we wanted to go for a very personalized search, but we were very greedy: along with personalized search, we also wanted global search. We want to be able to search across all PSN users, or to search within my local graph.
B: I want to search within my own library, for example. And the problem is, when you have this kind of personalized aspect in one big index, you can imagine how big that index grows and the performance hits you'll run into: your reads take an incredible hit, and you're always re-indexing the entire thing. So I think, for us, on the graph question,
B: we moved out of it because the number of users we had, and the number of connections between users, exceeded what the technologies we were looking at could handle. For example, Neo4j at the time had that kind of master-slave architecture, and without building your own sharding mechanism it was very difficult to do. That's why we leaned on the Cassandra distributed approach, and on doing the personal index to handle searching within someone's own social graph. Yeah.
B: The thing is, the way we back up, we push everything into S3. If we do increments, we actually have our own hand-written scripts that do incremental backups. We have nodes that have over 500 gigs; we have 700-gig nodes, even things that get close to a terabyte, and you can imagine that just trying to back up the entire thing for a hundred-node cluster would take forever. So we do incremental backups, and backing up that way is quite fast. As for restoration, it could actually take days to do that.
B: The idea is that we're not doing any kind of weird replication or anything like that that puts strain on the cluster; we're doing it completely separately, so we have a little bit of luxury in the time. Usually we'd take those backups from S3, put them out, and depending on how big that is, that's the data transfer time. Thank you.
C: Yeah, but it's a little different. Okay: 114.