Description
Speaker: Jason Brown, Senior Software Engineer at Netflix and Apache Cassandra Committer
Slides: http://www.slideshare.net/planetcassandra/6-jason-brown
This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.
Hi everybody, thank you for coming out. These days I spend a lot of time thinking about data consistency. Back when I started my career, and even for a lot of my time until recently, I was dealing a lot with things like Oracle and other old-school, big-iron databases. Of course, with those you really didn't have to worry much about data consistency, because it was pretty much just one big massive node. If you were unfortunate enough to have to deal with things like multi-master replication, then you did have to worry.
In addition to that, there's also master-slave replication in the classic kind of setup with MySQL. We thought a little bit about data consistency in those cases, but not so much; we just always assumed that log shipping was usually pretty successful. But now that we're in the new world of this NoSQL kind of land, where essentially everybody's a peer and everybody's a write master, data consistency becomes a lot more interesting and a lot more tricky.
Inconsistencies can really creep in in many, many different ways, and I'm going to be talking about anti-entropy protocols in Cassandra and the way that we deal with alleviating a lot of these inconsistencies. A little bit about me: my name is Jason Brown. I currently work at Netflix as a senior software engineer on the cloud database engineering team, which essentially means I work with Cassandra, and I'm also a committer on the Apache Cassandra project. In my previous life I worked as an e-commerce architect at Major League Baseball.
I also did some wireless development way back in what seems like the dinosaur ages now. So, maintaining consistent state in a distributed system is hard. Essentially, the CAP theorem really comes into play and really gets in your way. Just a very quick, brief summary of the CAP theorem: C is for consistency, A for availability and P for partition tolerance. Consistency means data consistency, which is what I'm going to talk a lot about here.
Availability is just the ability for a client application to get some kind of response, positive or negative, from a distributed system, and P is partition tolerance, typically in reference to a network partition. In the case I'm going to be talking about, that's Cassandra: if a Cassandra node goes down, do you really want the entire application going away, or can you deal with partial failure?
Inconsistencies can creep into distributed databases in a lot of different ways. I'm going to pick on Cassandra, since it's the system I've been thinking about mostly and the one that my employer pays me to think about. The most common thing is that a node goes down. We happen to run inside of EC2, so nodes can frequently just disappear without warning, and often do, late at night. There's also network partitioning.
Dropped mutations are kind of interesting. That can happen when a node gets really overburdened and has so much traffic coming in that it just can't swallow and absorb all of the incoming requests, so the node chooses to drop requests on the floor because it needs to keep going. It could either try to swallow everything that's coming its way, or it can pace itself and just drop some mutations.
The way that we deal with a lot of these problems is by introducing anti-entropy protocols. Essentially, anti-entropy tries to resolve all the inconsistencies that just happen to occur because you're running a distributed system and things will crash. The major things that we use in Cassandra are, at write time, tunable consistency, atomic batches and hinted handoff.
At read time we can do things like consistent reads and read repair, and then at maintenance time there's node repair. I'm going to do a deep dive into each one of these topics so we can see how Cassandra will recover from the problems that creep in. Let's start off with write time, because that's how all data gets into the system.
A client application will send a request to what's called the coordinator node. That coordinator node will figure out all the replica nodes that the request needs to go to in every DC that's in the Cassandra ring. Then it's going to send the mutation to all the replicas in the local DC.
To any remote data centers, it's just going to send the request to one of the replica nodes, and that replica node is responsible for forwarding the request on to its peers within that data center. Then the coordinator node waits for everyone to respond, all the nodes in all DCs. So in my example here, the client application sends a request over to replica one in the first data center, and this is assuming a replication factor of three, and it's going to ship the mutation off to the other replicas.
With tunable consistency, essentially the coordinator will block for a specified number of replicas to respond to it. There are a lot of different levels you can have here for the consistency level. ANY basically means I'm going to wait for anybody to respond to my write; it could be one of the replicas, it could be anybody else. Typically this is really only useful when you're doing something where a degree of data loss is acceptable.
A
So
if
you're
doing
some
kind
of
logging
that
that,
if
you
missed
a
few
rights,
you
may
be
okay,
however,
for
most
other
use
cases
and
in
fact,
or
everything
in
netflix,
we
never
use
any.
Then
one,
two
and
three
I'm
pretty
straightforward,
you're,
essentially
waiting
for
one
two
or
three
replicas
to
respond
back
to
that
coordinator.
So
he
can
so
it
can
consider
that
right,
successful
local
quorum
is
essentially
within
a
within
that
coordinators.
Data
center
I'm
going
to
wait
for
a
quorum
of
replica
nodes
to
respond
I'm sure you probably all know this by now, but a quorum is essentially the replication factor divided by two, plus one (using integer division). To go back to my earlier slide, the first one: replica node one is going to ship off the request to nodes four and two as well as itself; I just made it simple in this picture. It will wait for two of those guys to respond successfully before it considers the local quorum write successful.
EACH_QUORUM is another consistency level; essentially EACH_QUORUM just waits for a quorum from each data center, pretty straightforward. And then ALL basically says I'm going to wait for every replica everywhere to respond successfully to me. All pretty straightforward. Now, the real trick with tunable consistency inside of Cassandra is that, while you can choose things like ALL and EACH_QUORUM, you're really going to run into real problems.
A
You
may
think
you're
actually
going
to
get
a
stronger
consistency
level
with
your
rights,
but
you're
more
likely
to
just
cause
other
sorts
of
problems
in
your
application
that
you'll
have
to
deal
with
the
likelihood
that
you'll
have
all
of
your
nodes
up
all
the
time
and
never
have
to
worry
about
anybody
being
down
is
really
just
a
fallacy.
Somebody's
going
to
go
down,
something's,
going
to
fail
or
you'll,
be
doing
maintenance
and
and
your
rights
will
fail
at
that
time,
EACH_QUORUM is almost as bad, in my opinion, because essentially what that's really saying is that you're going to trust the network between all of your data centers. If you happen to be running in something like EC2, where the guarantees between regions are about zero, and that's not a knock on Amazon, that's just the reality of how the public internet works between those data centers, you can really get into problems by expecting that network connection to be up all the time.
Then your mutation essentially has a hard-coded dependency amongst all of your regions or data centers, and this is actually a problem we deal with at Netflix; there are one or two use cases of EACH_QUORUM and it does get into problems. In my personal opinion it's not highly advisable, but it is a possibility.
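As a rough illustration (not from the talk; the keyspace, table and values are made up), choosing a consistency level per request from cqlsh looks something like this:

    -- CONSISTENCY is a cqlsh command; LOCAL_QUORUM only waits on replicas in the local data center
    CONSISTENCY LOCAL_QUORUM;
    INSERT INTO demo_ks.users (user_id, email) VALUES (42, 'someone@example.com');
    -- ANY is only sensible for fire-and-forget writes where losing a few is acceptable
    CONSISTENCY ANY;

Client drivers expose the same levels per statement rather than per session.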
So, hinted handoff: what happens when one of the replica nodes that the mutation is supposed to go to is down?
On the coordinator, as we're figuring out all the replica nodes that this mutation needs to go to, we'll first figure out who all those nodes are, and then we'll check to see if any of them are down.
If some of them are down and you can still fulfill the consistency level that was part of the request, we can still send off the mutation to the nodes that are alive; however, for the node or nodes that are down, we'll store a hint. Also, if after we send off that mutation we don't get a response back from one of the nodes, say it went down or there's some kind of network congestion or whatever, if we don't get a response within the write request timeout in milliseconds, we'll also store a hint. I believe the default value for that timeout is ten seconds; it's a value in the cassandra.yaml.
One of the other interesting things about hinted handoff is that there's another parameter in the cassandra.yaml called max hint window in milliseconds, and that's basically the amount of time for which a coordinator will store up hints for a node that's down. The default is three hours. So if a node is down for one hour, we'll store an hour's worth of hints, because the default for that value is three hours, and then when the node comes back up, we'll replay all the hints.
A
However,
if
they
note
happens
to
be
down
for
four
hours
after
the
third
hour,
the
coordinator
will
stop
storing
up
hints
and
then
we'll
just
drop
anything
for
that
that
node
and
then
essentially
what
you're
going
to
have
to
do
at
that
point
is
when
that
node
finally
does
come
up.
It
gets
all
the
hints
replayed
to
it,
it's
out
of
sync,
and
it's
not
going
to
get
the
data
until
you
do
something.
A
little
bit
more
involved,
which
I'll
talk
about
later.
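For reference, a sketch of the cassandra.yaml settings being described (Cassandra 1.2-era names; the values shown are the defaults mentioned in the talk, so check your own version):

    write_request_timeout_in_ms: 10000    # no response within this window -> the coordinator stores a hint
    max_hint_window_in_ms: 10800000       # 3 hours; stop storing hints for a node that has been down longer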
Hinted handoff replay is pretty straightforward: it just replays the hints back to the nodes that came back up. It runs every ten minutes. As of Cassandra 1.2 it's now a multi-threaded operation, so if you have multiple nodes that you're trying to replay hints to and one of them is slow, you're not going to block the hints being replayed to the other nodes that are also back up. And one of the best things for us is that this is throttleable.
It's throttleable in kilobytes per second, and why this is interesting is that if you have a node that comes up and there's a bunch of hints backed up for it, this will help the coordinator not just blow through all of its I/O trying to read all the hints off the disk and then shove those bytes out through the network card.
So it's really to help the coordinator node not get bogged down in hints, and instead keep serving the actual reads and writes that it's getting through normal production traffic. Of course, for the node that's coming back up, if the hints are delayed getting to it, obviously it's going to be out of sync for a little bit longer, but you already have that problem anyway since it was down, and now it's getting hints, so the extra time spent shipping them is a minor cost.
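The knob being described is, roughly, this cassandra.yaml setting (1.2-era name; the value is the shipped default, not a recommendation):

    hinted_handoff_throttle_in_kb: 1024   # cap hint replay in KB per second so it doesn't starve live traffic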
In this example, node R2 is down when the request comes in, so the coordinator node, replica one, will store a hint; as soon as R2 comes back up, it replays the hint. Pretty straightforward. However, you get into an interesting problem, and I didn't explain the setup for this intentionally, because it makes it a little bit more complicated.
A
The
way
the
Cassandra
will
do,
mutate
is,
if
it's
got
a
batch,
mutate
I
should
actually
say
if
it's
got
to
mutate,
that's
going
to
be
applied
to
three
keys.
Three
row
keys:
it's
going
to
ship
those
out
it's
going
to
handle
each
one
of
those
mutates
one
at
a
time,
so
serially
soldered
to
the
first
mutate
second
and
then
third,
So the question becomes: what happens if the first one succeeds, the second one succeeds, and then the coordinator fails? It's then up to the client to replay the rest of that mutation, which is a little unfortunate. That's why in Cassandra 1.2 we introduced atomic batches. What that is: the coordinator, upon getting that mutation, the first thing it's going to do is send a copy of the mutation to two peers in its local data center. That way, if it goes down, its peers can replay that batch at that point.
So the coordinator sends copies of those mutations off to its peers, it performs the actual, real job of sending out the mutations to all the real replica nodes, and then it's going to call its peers and ask them to delete that batch.
The peers, if they don't get that delete request within 60 seconds, meaning the batch is older than 60 seconds, are going to replay those mutations. Internally, as of Cassandra 1.2, all mutations are using atomic batches, so it does add a little bit of latency to your write path, but not a whole lot, especially when you consider that Cassandra's write story is actually pretty excellent; this is a very small tax on top of that.
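As a minimal sketch (keyspace, tables and values invented for illustration), a logged batch in CQL, which is the kind that goes through the batchlog just described, looks something like:

    BEGIN BATCH
      INSERT INTO demo_ks.users (user_id, email) VALUES (1, 'someone@example.com');
      UPDATE demo_ks.user_counts SET login_count = 5 WHERE user_id = 1;
    APPLY BATCH;

An UNLOGGED batch skips the batchlog copy, and with it the replay guarantee.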
So let's jump over to what happens at read time.
At read time there are also consistency levels, but we'll get to that in a second. First of all, we determine who all the replica nodes are that we could be contacting in all the data centers. The coordinator node will send a request to the closest replica node and ask it for the full data set; all the other replicas it's going to ask for a digest, and the reason we do that is to optimize on network traffic and also garbage collection on the coordinator.
If the digests don't match, the out-of-date nodes get sent the full, most current set of that data, and the coordinator will block until those guys respond; once they respond, everyone is up to date. The coordinator will do the merge of all those rows and columns and then return that to the caller. That's a consistent read. If there is a mismatch, obviously you are going to pay a little bit in terms of latency, but this is a way to make sure that your data is more consistent.
Read repair attempts to synchronize the client-requested data across all the replicas. How it does this is by piggybacking off of the normal read path, but it's going to wait for all the replicas to respond back to the coordinator, and it does that asynchronously. Similar to the consistent read, it's going to compare the digests of all those replicas, and if anybody is out of date, it's going to ask for the full data set, do the compare, send out the up-to-date information, and then it's done.
So, an example of read repair. The green lines here are the nodes we're going to use for the local quorum read; the lines in blue are the ones we're using for read repair. Our coordinator here is asking replicas four and three for the data set that it will essentially use for returning back to the client app. However, it's also going to send the request over to the other four nodes, shown in pink.
It'll get all those responses back and fix up anybody who's out of date. Also, for a read repair it's only going to wait for the read timeout in milliseconds, to use the actual name of the yaml parameter; essentially it's not going to wait forever, and it's not going to keep some token around waiting for all these nodes to come back to figure out the read repair. It's going to wait for the standard timeout and then try to do its thing.
The way this is configured is that it's a setting per column family, so not every column family needs to have it on, and how it actually works is that it's a percentage of all the reads that come into that column family. Either using cqlsh or the old-school command-line CLI, you set it up as a value between 0 and 1.
That essentially just becomes the percentage chance of a particular read triggering the read repair. You can set this at a local-DC level or globally, and where that actually becomes interesting is that local DC will just make it faster and less involved, whereas if you've got many, many data centers in your Cassandra cluster, it's going to take that much more time and that much more computation to fix up all that data.
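As an illustration only (the table name is made up and the values are arbitrary), the per-column-family setting looks roughly like this in CQL:

    ALTER TABLE demo_ks.users
      WITH read_repair_chance = 0.1            -- 10% of reads repair across all data centers
      AND dclocal_read_repair_chance = 0.0;    -- or use this one to confine the repair to the local DC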
Read repair is great in that it repairs data that's actually requested. There is, of course, the problem of what about data that isn't requested. You get a bunch of mutations; let's say some percentage of those are actually requested by clients or callers, but then you've got some set of data that isn't actually requested, so it never goes through the read repair path. It's just sitting there, potentially out of sync.
What node repair is going to do is essentially repair inconsistencies across all the replicas for a given range. How it actually executes is that on a given node you connect to the Cassandra process with nodetool and execute a repair. How it works is that the Cassandra process will figure out the ranges around the token ring that it owns; the parameters you can pass in on the command line to nodetool are the column families, and they have to be within a given keyspace.
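A hypothetical invocation (the host, keyspace and column family names are made up) would look something like:

    nodetool -h 10.0.0.1 repair demo_ks users user_counts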
A couple of cautions before we dive into how node repair actually works. It should really be part of the standard operations for a Cassandra cluster, especially if you're doing deletes, because what you can run into is the problem of resurrected data. Say you've got a bunch of writes going on and you do some deletes, but those deletes, those tombstones, never percolate over to the other replica nodes.
If the node that got those deletes then goes past the GC grace period, which I think defaults to 10 days, and you didn't sync that data over to the out-of-sync node, the node with the tombstones could compact those tombstones plus the original data out of that node. But then you still have the live data on the guy who never got the tombstone.
So essentially what you've just done is resurrect the data you tried to delete, because if you then run something like node repair or read repair involving the guy who was out of sync, he'll put that data back onto everyone else, since the data will be missing on the other nodes, and now you've just resurrected data that you didn't want there in the first place. So it's pretty important to run node repair.
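The grace period itself is a per-column-family setting; as a sketch (made-up table name, value is the 10-day default mentioned above):

    ALTER TABLE demo_ks.users WITH gc_grace_seconds = 864000;   -- 864000 seconds = 10 days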
However, there's a pretty big drawback, and that's that it's disk-I/O and CPU intensive. Working at Netflix, running in EC2, this is a big problem, and it's really important to schedule these node repairs off-peak. Depending on the size of the cluster, the amount of data it needs to compare, and how frequently you actually run node repair, it can take upwards of six to ten hours.
It actually makes this discussion a little bit easier not to deal with vnodes, but the same basic principles apply. The initiator node, the one you trigger a repair on, is going to contact all of its peers that we discovered in that first step, and it's going to trigger a major compaction on those guys. A major compaction in this sense is going to read all the existing SSTables on those remote nodes, and essentially it's going to build up something called a Merkle tree.
A Merkle tree is essentially a binary tree of ranges and hashes; the hashes are at the leaf nodes, whereas the ranges are the intermediary nodes in the tree. Essentially, each node is going to build up this tree by reading every row in its data set, which means it needs to churn through a bunch of I/O, reading every SSTable off disk and merging. It doesn't write anything back out to disk, but it does have to read everything so it can build up that Merkle tree.
Then it's going to return that tree back to the initiator. The initiator essentially waits for all the trees to come back, including calculating its own Merkle tree, and then it's going to compare every tree that it received to every other tree. Essentially, what it's doing is comparing the hashes that are in the trees, not the actual data.
That's an important point to remember here, because if in that comparison any differences are detected, it's going to ask one of the two nodes that are differing to ship all the ranges that differ over to the other guy, and then the other guy is asked to ship all of its data for those ranges back to the first node. All of that gets written out as new SSTables on those disks, and I'll show that with an example here.
Let's say we've got a six-node cluster here, with a replication factor of three, so each node has three ranges. Again, this is going back to a cluster that is not using vnodes; it's a little bit easier to see. The node at the top is responsible for token ranges A, B and C; the next guy down the line is responsible for B, C and D; this guy for C, D and E; and so on.
The three guys at the top all share range B, and you can see where this is going next: all the guys on the right there share range C. So at the end of the day, what you're going to get is all five of these nodes participating in this repair, meaning they're all going to do a major compaction, build up their Merkle tree, and ship it back to our initiator node at the top.
The node at the top is going to tell this guy, hey, why don't you send all of your data for that range over to this other guy, and then that guy is going to ship all of his data for that same range back over here. At that point the data is technically not merged, but what actually happens is that, in exchanging that data, it gets written out as a local SSTable, so the next reads that come through on those nodes will essentially act like any other read.
The read will merge everything, and if some of the data is out of date, it won't even bother to use it. So those are the anti-entropy protocols and operations that we have inside of Cassandra. At the end of the day, the CAP theorem lives on, and we just need to understand the trade-offs involved and make reasonable decisions based upon what we're trying to do. In Cassandra, we've decided to trade a little bit of consistency for greater availability.
Of course, to deal with the inconsistency you need some processes to help get you back to consistency, and Cassandra has those processes to fix things up. As I mentioned, tunable controls exist at write time, at read time, and at on-demand maintenance time inside of a Cassandra cluster. That's all for now, so please, if you have any questions, I'm happy to take them. Thank you.
What's going to happen is that we'll figure out all the peer nodes that share that same range, and it just won't be as even as in the diagram I put up; it'll be scattered nodes around the cluster. But because of the nature of vnodes, every node could be sharing a range with every other node, so it could happen that you ask every node in the cluster to participate in that node repair, and what's going to happen is that every node is doing a full major compaction.
Well, how it's actually going to execute, to back up to the slide: all five nodes are going to participate when you execute a node repair on that top node. All five of these nodes are going to generate their local tree, ship them back to the original guy, and then life goes on.
As for the window between repairing the first node and the second node, during which inconsistencies could creep in: I don't have a great answer; perhaps Brandon or any of the other committers here in the room have a better answer. But node repair is definitely a very heavy-handed operation, and it definitely does work in getting data consistent.
Audience: Just a quick follow-up: when the Merkle tree is generated, it's generated across all of the data on that node, I believe?

Jason: Not really; the way that the code actually executes doesn't take that into account. And then the other trick that's going to play into account here is vnodes. With vnodes, as far as I remember, there is no real primary range, since primary range really referred back to when you were assigning a single token to a given node. Now that a given node can have up to n number of tokens, primary range is a little bit more fluid.
It depends on what you're doing with that actual cluster. We can take that specific Netflix use case offline, but there could be some benefit there, yeah.
I can't give you a real answer, because to a degree, when we run on SSDs we're running on Amazon's hi1.4xlarge instance, which you really shouldn't think about as being SSDs, because there are n layers of virtualization between your application and the actual metal, so thinking about it as SSDs is not really doing it justice. You can get higher IOPS, but it's not like a real, rockin' SSD.
The use case that I know about off the top of my head is actually not fully representative, because for that specific use case we actually switched from leveled compaction to size-tiered compaction in the middle, after we switched to SSDs, so I don't know if it would reflect the more current timings. Plus, at that time we were still running a version of Cassandra that had a bug doing repairs on leveled-compaction clusters, so it's a little squirrely.
If those six nodes happen to have the updated data set and are all consistent with each other, and if you're not executing read repair, or even if you are, those six will respond to the coordinator, the coordinator will check all those digests, they'll check out, and it will return back to the calling application saying, hey, everything's cool. It's only when those other nodes are involved in that quorum read that you'll trigger the consistent-read algorithm to fix up that data. Again, it's just the luck of the draw which nodes you happen to hit.
On the EACH_QUORUM use case: my specific beef with that is that it essentially is now tying your write to the availability of every data center and the connectivity to every data center, which for Netflix is now US West, US East 1, Europe; if we launched in Brazil, if we launched in Singapore, you've now just tied that write to five regions and a quorum in all five, and your availability is definitely going to suffer.
Internally, we're really trying to push a notion of just accepting eventual consistency, especially in these use cases. There's actually one use case at Netflix where we actually use EACH_QUORUM, and there's an argument about whether it's even really necessary. Where things get interesting is with the compare-and-swap feature that's being introduced in Cassandra 2.0. I still need to do a lot of reading up on that, but it could potentially have helped this use case; I'm not sure yet.
Let's say you have a cluster and there's a rogue application that's essentially a batch application, and for some kind of consistency policy, or referential integrity, if it fails it's not going to move the message to a dead-letter queue or put it off to the side so that you can do some special processing on it.
It's just going to keep processing the thing over and over and over and over again, and this thing happened to do a bunch of deletes. So essentially what happened is that it would do a bunch of deletes and try to read the same row. Within the span of about two hours we would accumulate about 200,000 tombstones in a given row, while trying to read it at the same time, so essentially you have all these nodes trying to read the data as fast as possible.
Oh, and read repair chance was also set to one, by the way, globally. Well, the first time it was on, the second time it wasn't; either way it still screwed us. So essentially you've got this conflict of trying to read a bunch of tombstones out of the SSTables and merge that with the one live column that's alive, and it was just choking on GC, as it had to have all those tombstones and TTL'd columns in memory.
Was the order compaction and then repair? Yeah; basically we lowered the GC grace, ran the repair, and then we did a major compaction to make sure all the data was gone. At that point we had thrown away all the tombstones, and once all those compactions finished, the JVM recovered within about a minute. It no longer had to worry about all that garbage sitting around; basically the pauses had just been killing it.
Audience: We rolled it out very fast, going from 1.1.9 and then upgrading to 1.2, because of the atomic batch feature introduced in 1.2, and it turns out the CPU overhead is something like nine hundred percent, and it was just killing our Cassandra clusters. I realized that if I reduce the batch number to around 100 it behaves much better; it turns out the CPU cost rises as the batch size rises. So to me it seems like atomic batches might not...
You know, you can run into problems, but it's going to be the same advice that I would tell anyone: you really need to test major versions before pushing them out. Unfortunately I don't have a better answer than that. I'm not aware of any bugs off the top of my head with that, but there certainly could have been.
Audience: Because I checked the usual case; the typical batch number we're using is around 100 or something, and I chose five thousand before, and it turns out it doesn't stand up to that. So it seems that probably not many people realize that this is a problem for 1.2, essentially when you have a batch number which is very large.
Audience: So typically that's not been an issue in your experience, although I've seen it in other similar types of systems. All right. The other thing was, you mentioned file corruption, but you didn't say much about it after that. Is there much in Cassandra to detect such corruptions and repair them automatically, or...
Yeah. The reason why I brought up compaction strategies is that in Cassandra 1.0 there was a bug doing repair on leveled-compaction clusters. I don't remember all the details, but there was a bug that prevented us from running repair, so what we did is just swap that cluster from using leveled compaction to size-tiered; then we were able to do repair and everything was happy after that. So that's the only place where that angle comes in on this side.
Disk space shouldn't jump, because we're actually not writing out new files; we're just reading all the existing ones.