From YouTube: Apple Inc.: Cassandra Internals — Understanding Gossip
Description
In this talk we'll dive into how Cassandra nodes discover and communicate with each other, and share global state information via gossip. As the gossip subsystem seems shrouded in mystery to many folks, we'll peel back the layers and learn how it powers the underbelly of Cassandra.
Hi everybody, thanks for coming out. My name is Jason; I'm a committer on the Apache Cassandra project and an employee over at Apple. Just a quick note before we start: this presentation is not a contribution to the Apache Foundation.

So, let's dive in. Let's say that there's something big, new, and exciting happening in your life: you're going to get married, or you're going to have a new baby.
How would you go about letting everyone in your social circle know that this is going to happen? These days, what you'd probably do is put something up on Facebook or send an email to all of your friends. But what you're probably not going to do is sit around at home just waiting for everybody to send you that congratulations message. Instead, what's probably going to happen is your cousin will see your Facebook post and say:
Congratulations! Then the cousin will tell the aunt, the aunt will tell the uncle, the uncle will not do anything, the aunt will then proceed to tell everyone else in the family, and so on. So, essentially, everyone then knows that you're getting married, and this piece of knowledge has essentially been broadcast out through your social network. There's a similar thing that happens inside of Cassandra. It's called the gossip subsystem, and that's what I'm here today to talk about. So the first thing I'm going to do is talk about, well:
So, first of all, what is gossip? It's a broadcast protocol for disseminating data, and it's decentralized and peer-to-peer. What that basically means is that there's no centralized server holding all the state that a cluster node needs to contact in order to learn everything it should know about.
It's peer-to-peer: everyone is sharing information, replicating what it has heard about other nodes, from those nodes, on to new peer nodes. In that style it's called an epidemic broadcast, and what that means, similar to my potentially trivial example at the beginning, is: I told one person, that person told another person, that person told another, and so on. This is much the same way that biology works, where you have some infected organism which infects another, which infects another, and so on.
However, in computer science it's actually a good thing that everything gets infected, unless it's a security worm. Gossip is a fault-tolerant kind of protocol: once that knowledge has been disseminated across a couple of nodes, even if one or two or most of those nodes go down, a few are still infected and can continue broadcasting that information out to the rest of the cluster. And gossip is an efficient and reliable broadcast protocol.
It's got pretty minimal overhead in terms of the mechanisms with which we broadcast these messages, and it's pretty lightweight. Back in the '80s there was an excellent paper called "Epidemic Algorithms for Replicated Database Maintenance" by Demers et al. They built a system called Clearinghouse, and they were the first to really talk about this whole concept of epidemic broadcast and to give us the language for talking about this style of replication. An excellent read, by the way.
So let's look at a quick example of how this actually works. Let's say you've got a 16-node cluster (or 16 of anything), and this first node has some piece of information it wants to share. It invokes another peer, and now that peer has the data. Both of those peers invoke two more, and now four out of the sixteen are infected. Can you guess what's going to happen next? Those four are going to call four more.
They now have the data, and lastly those eight call eight more, and now the entire cluster has that data. This is just one node talking to one other node on each round, so it's a log-base-two kind of operation. You can have a wider fan-out: instead of talking to just one peer, you can talk to five, six, ten, whatever you want, but there's extra network cost involved in sharing.
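The doubling just described can be sketched with a tiny simulation. This is illustrative Python, not Cassandra code, and it assumes the best case where every contact reaches a node that doesn't yet have the data.

```python
# Illustrative sketch, not Cassandra code: simulate an epidemic broadcast
# where every node that has the data contacts `fanout` peers per round,
# assuming (best case) each contact reaches a node that lacks the data.

def rounds_to_infect(cluster_size: int, fanout: int = 1) -> int:
    """Number of gossip rounds until the whole cluster has the data."""
    infected = 1          # one node starts with the new information
    rounds = 0
    while infected < cluster_size:
        infected = min(cluster_size, infected * (1 + fanout))
        rounds += 1
    return rounds

print(rounds_to_infect(16))     # 1 -> 2 -> 4 -> 8 -> 16: four rounds
print(rounds_to_infect(16, 3))  # wider fan-out: fewer rounds, more traffic
```

With a fan-out of one, a 16-node cluster converges in four rounds; tripling the fan-out halves the rounds but multiplies the per-round network cost.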
So why do we use this in Cassandra? That's probably the more interesting question. We use it to disseminate node metadata across peers. Those types of metadata are: cluster membership (basically, what nodes are in this cluster), heartbeat, the node status, and other metadata points about a given node. Each node maintains a view of all of its peers in the cluster, including itself.
This information is, of course, much as I just described, disseminated out in an epidemic sort of fashion, each node talking to another node and spreading it out. Now that I've talked about what we actually use it for in Cassandra, let's take a quick second to talk about what it's not used for. It's not used for streaming, repairs, reads or writes, compaction, CQL, or any other fancy things, and it's not responsible for them either.
So, let's dive into the details. There are three main data structures inside of the gossip subsystem. I'm not going to talk so much about code or dig into that, but talking about the data structures helps us structure the discussion. There's a heartbeat state, an application state, and an endpoint state inside of each node. The endpoint state is more or less a wrapper around a heartbeat and a collection of application states, and inside of each node in Cassandra, for every peer node in the cluster,
there's a map of endpoint states, one for each peer. The heartbeat state holds two pieces of information: a generation and a heartbeat. The generation is essentially the timestamp of when the process was launched. It's largely immutable during the lifetime of the Cassandra process, but there are some special tricks, which I'll talk about at the very end, around why we would actually violate this. The heartbeat, then, is just a periodically updated, monotonically incrementing value; in layman's terms, a counter that only goes up.
A
Four
endpoints
data
collection
for
each
peer
in
the
cluster,
so
I
will
talk
about
Union
in
names
in
a
second,
but
then
the
version
is
essentially
a
version
of
value
for
that
meta
data
point.
So
every
time
we
increment
the
value
will
also
increment
that
version
number
that
way
it
will
help
with
out
comparisons
later
on
to
help
assure
convergence
of
all
this
data.
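The three structures can be sketched like this. This is Python for brevity (Cassandra itself is Java), and the class and field names mirror the talk, not Cassandra's exact source.

```python
# Sketch of the three gossip data structures described above. This is an
# illustration in Python, not Cassandra's (Java) implementation.
from dataclasses import dataclass, field

@dataclass
class HeartBeatState:
    generation: int       # set at process launch, fixed for its lifetime
    version: int = 0      # periodically bumped, monotonically incrementing

    def update_heartbeat(self) -> None:
        self.version += 1

@dataclass
class VersionedValue:
    value: str
    version: int          # bumped on every change, used for reconciliation

@dataclass
class EndpointState:
    # wrapper around one heartbeat plus a collection of application states,
    # keyed by enum-like names: STATUS, DC, RACK, SCHEMA, ...
    heartbeat: HeartBeatState
    app_states: dict = field(default_factory=dict)

# each node keeps one EndpointState per peer in the cluster, itself included
endpoint_state_map = {}
```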
So, the enums of this application state: there are probably a dozen or so inside of Cassandra, but I've only put up the most common ones here, or at least the ones I find to be the most interesting. Data center and rack are pretty straightforward: you want other peers to know where you are physically, or where you are logically. There's schema,
one used with the dynamic snitch, which we'll talk about at the end of this talk, and the status. The status is actually pretty important, probably one of the most important of all of these. Bootstrap: when a node is launched, it's going to set its state to bootstrap and then gossip that out. It doesn't last very long, assuming that the node has been launched before and has data.
Normal means that the node is a normal part of the cluster and is behaving as it should. Leaving and left have to do with decommissioning a node out of a cluster. Removing and removed have to do with when you remove a node that you no longer have physical access to, so you can't run decommission on that machine; you have somebody else, essentially, remove it from the cluster.
The thing that's really important about the status is that it's the status a node has declared about itself; it's not an evaluation by any other node. So this is what a certain node says about itself: "I'm bootstrapping" or "I'm normal." Up and down is a rather different concept that we'll talk about soon.
So, let's take a quick little example of just the gossip metadata points out of a small, tiny, three-node cluster I launched on my laptop. As you can see, I've got nodes A, B, and C. The generations are slightly different. The generation is more or less a timestamp, the milliseconds since the epoch, and since I launched the cluster on my laptop from a script, they all launched at pretty much the same time, the only difference being the last millisecond. The heartbeats are slightly different too, but that's okay.
They don't have to be exact even if they're launched at the same time. If you'll notice, node A has a value that's two less than the other two. It could be that node A went through some stop-the-world garbage collection pause for two seconds and wasn't able to update that heartbeat because it was paused, but ultimately the values lining up really doesn't matter, as long as each is an incrementing value.
The gossiper has a timer task that kicks off and starts a new round, and what we do is select between one and three peers to gossip with. We'll always pick a live peer if there are any in the cluster, we'll probabilistically pick a seed node to talk with, and we'll probabilistically try to contact an unavailable node, one that we've previously labeled as down.
A
Well,
then
try
to
reconnect
with-
and
we
just
probabilistically
do
that,
rather
than
always
doing
it,
because
if
the,
if
any
for
the
seeds,
we
don't
just
want
to
pummel
that,
with
with
gossip
traffic
and
for
the
unavailable
nodes,
you
don't
want
to
flood
the
network
with
packets
that
are
just
going
to
be
dropped
on
the
floor.
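That per-round selection might look roughly like this. The probabilities here are illustrative guesses at the idea, not Cassandra's exact weighting.

```python
# Rough sketch of per-round gossip peer selection. The probabilities are
# illustrative only; Cassandra's actual weighting differs in detail.
import random

def choose_gossip_targets(live, unreachable, seeds, rng=random):
    targets = []
    if live:
        # always gossip with a random live peer when one exists
        targets.append(rng.choice(live))
    # only sometimes bother a seed, so seeds aren't pummeled with traffic
    if seeds and rng.random() < len(seeds) / (len(live) + len(seeds) + 1):
        targets.append(rng.choice(seeds))
    # only sometimes retry a node we currently believe is down, so we don't
    # flood the network with packets that will just be dropped on the floor
    if unreachable and rng.random() < len(unreachable) / (len(live) + 1):
        targets.append(rng.choice(unreachable))
    return targets
```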
So let's dive into the messaging. Cassandra's gossip messaging is very similar to the TCP three-way handshake. In the three-way handshake you have a SYN, an ACK, and a SYN-ACK; inside of gossip we have a SYN, an ACK, and an ACK2. Now, why it's called ACK2 rather than SYN-ACK is an excellent question that I don't know the answer to.
If you'll notice, there are three messages per round. With a broadcast protocol we could just ship out one message, be done, and let all that information eventually percolate out through the cluster, but having three messages for each run of gossip allows us to add a degree of anti-entropy into the gossip protocol.
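A hypothetical sketch of that exchange (not Cassandra's actual wire format): the SYN carries only digests, the ACK returns newer state plus requests for missing state, and the ACK2 answers those requests. State here is simplified to a map of endpoint to (generation, version, data).

```python
# Hypothetical sketch of the SYN / ACK / ACK2 exchange, not Cassandra's
# actual wire format. State is simplified to: endpoint -> (gen, ver, data).

def make_syn(my_state):
    # SYN carries only digests (endpoint, generation, version), no data
    return [(ep, gen, ver) for ep, (gen, ver, _) in my_state.items()]

def handle_syn(my_state, digests):
    # Build the ACK: newer state I hold, plus requests for what I'm missing
    newer, requests, seen = {}, [], set()
    for ep, gen, ver in digests:
        seen.add(ep)
        mine = my_state.get(ep)
        if mine is None or (gen, ver) > (mine[0], mine[1]):
            requests.append(ep)        # sender is ahead: ask for full state
        elif (gen, ver) < (mine[0], mine[1]):
            newer[ep] = mine           # I am ahead: ship my newer state
    for ep, st in my_state.items():
        if ep not in seen:
            newer[ep] = st             # sender never heard of this endpoint
    return newer, requests

def handle_ack(my_state, newer, requests):
    my_state.update(newer)             # absorb the peer's newer state
    # ACK2: answer the peer's requests with full state for those endpoints
    return {ep: my_state[ep] for ep in requests if ep in my_state}
```

After the receiver applies the ACK2, both sides hold the newest state either of them knew; that exchange of "what I have versus what you have" is the anti-entropy.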
So, if you remember, inside of those endpoint states there's this generation and heartbeat. These are all integers, and I mentioned that they're all incrementing values. Let's talk about how, inside of a run of gossip, we actually reconcile who is out of date and who has a more updated value. Cassandra's gossip anti-entropy is based upon the van Renesse paper "Efficient Reconciliation and Flow Control for Anti-Entropy Protocols": a big, long title, but it's essentially about how to make data converge fast in a gossiping kind of system.
A
It's
also
nicknamed
the
scuttlebutt
paper,
as
you
can
see
in
my
poor
handwriting
that
I
didn't
erase
so
the
appstate
reconciliation
has
essentially
three
levels
of
precedence:
the
generation
and
the
and
the
and
comparing
the
app
states.
If
you
remember
those
individual
metadata
points
based
upon
that
version,
that
I
mentioned.
So,
let's
take
a
look
at
an
example
of
how
we
can
reconcile
the
data,
so
in
this
example,
I've
got
a
four
node
cluster.
A
You
know
it's
a
b
c
and
d,
and
this
is
a
round
of
gossip
between
nodes,
a
and
b
hope
this
lisa
semi
legible
over
there,
but
them
so
first
they're,
going
to
compare
the
a
metadata
about
a
the
generations
are
the
same
one
two
three
four.
The
heartbeats,
however,
are
different
a
since
it
is
the
owner
of
that
data
thinks
think
said:
it's
heartbeat
version
is
nine.
Ninety
four
B
thinks
that
the
heartbeat
is
nine
nine
zero.
Obviously,
nine
nine
four
is
more
current.
So, at the end of this run of gossip, B will update its notion of A's heartbeat to the value of 994. For node B, you see that the generations are again the same, and the heartbeats are again different: A thinks it's 10, B says it's 17. More interestingly, though, the status differs between the two.
A currently thinks that B is bootstrapping, and you'll see that I put the number 1 in braces just to indicate the version that it's at. B, however, has now entered the normal state, has updated that status to normal, and has also incremented the version to 2. How we actually compare these application states is strictly based upon that version number; we don't try to compare the values to see which is bigger. It's always based upon the version number.
A
So
in
this
case,
two
is
bigger
than
one
at
the
end
of
this
round
of
gossip,
a
will
think
that
B's
status
is
now
a
normal
version
of
two.
Now,
if
the
version
on
B
was
say
seven
and
a
still
knew
about
version,
one
of
that
value,
we
actually
don't
care
about
any
of
the
intermediary
state.
So
two,
three
four
five
and
six:
a
won't
care
about.
We
just
care
about
the
latest
value.
A
It
doesn't
done
I'll
care
about
any
of
the
intermediary
values,
so
looking
at
nodes
see
be
totally,
doesn't
even
know
that
that
note
exists.
So
in
this
case,
B
will
just
take
anything
beat
B
will
take
anything
that
a
has
to
say
about
C
and
at
this
and
after
this
round
of
gossip
B
will
now
know
that
C
exists
and
is
part
of
the
cluster.
A
Know
D:
the
generation
is
different,
a
thinks
that
the
the
generation
is
2
2
to
2
and
B
says
that
it's
3
3,
3
3
in
this
case
I
know
D
has
has
a
bounced.
It
was
a
restarted
and
now
has
an
incremented,
a
generation
value,
which
means
a
will.
Take
anything
that
that
B
says
about
D,
because
D
was
bouncing.
It
has
a
new
set
of
a
potentially
new
set
of
metadata
points,
so
at
the
end
of
this
run
of
gossip
able
to
take
everything
that
B
has
about
node
D.
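The precedence rules from this example can be condensed into one small sketch: generation wins outright, and otherwise each metadata point is settled purely by its version number. This is illustrative Python, not Cassandra's implementation.

```python
# Sketch of the reconciliation precedence just described. A state here is
# (generation, {name: (value, version)}). Illustrative only.

def reconcile(local, remote):
    lgen, lstates = local
    rgen, rstates = remote
    if rgen != lgen:
        # a restarted node has a higher generation: take that side wholesale
        return remote if rgen > lgen else local
    merged = dict(lstates)
    for name, (value, version) in rstates.items():
        # highest version wins outright; values are never compared,
        # and intermediate versions are simply skipped
        if name not in merged or version > merged[name][1]:
            merged[name] = (value, version)
    return (lgen, merged)

local  = (1234, {"STATUS": ("bootstrapping", 1)})
remote = (1234, {"STATUS": ("normal", 7)})
print(reconcile(local, remote))  # version 7 wins; versions 2..6 never mattered
```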
The metadata passed back and forth in the ACK and ACK2 is going to be pretty much constant across all the rounds of gossip, across all the nodes in the cluster, so gossip itself won't necessarily cause any network spikes. It's always going to be a pretty flat, constant rate of network traffic. The reason I bring this up is that sometimes I've heard complaints about things like a "gossip storm" inside of Cassandra,
A
Where
we're
you
know
certain
Network
events
trigger,
and
it
just
happens
to
get
blamed
upon
gossip-
that's
usually
almost
never
the
case.
Gossip
is
a
very
constant
trickle
in
the
background
of
your
cluster.
Of
course,
if
you
get
into
clusters
of
you
know,
10,000
nodes,
the
packets
pass
back
and
forth
are
inherently
going
to
be
larger,
but
then
again,
you've
got
a
cluster
of
10,000
nodes.
A
The
reason
why
you
have
a
cluster
that
large,
because
you
have
that
much
data
and
you're,
probably
already
maxing
out
the
network
bandwidth
anyways,
so
gossip
really
isn't
going
to
be
the
cause
of
that.
However,
when
gossip
does
choose
to
mark
a
node
up
that
was
previously
down.
Well,
it
could
be
happening.
Is
that
if,
if
if
there
was
one
node
down
in
the
cluster,
and
then
it
comes
back
up
and
all
the
node
seeds
said
it's
up-
there-
probably
gonna
start
streaming
hints
and
things
like
that
over
to
that
node.
that was down, and that's more likely the cause of what is frequently called a gossip storm. Gossip didn't necessarily cause it, but it has a practical effect inside of Cassandra: seeing that a node is now up means that we're going to start streaming data over to that peer node. So I want to spend some time talking about the practical implications of gossip, continuing on the previous thought. Some questions that would be really good to answer: who's in the cluster?
When does a node stop sending another node traffic? Good question. When is one peer preferred over another? Closely aligned with the last question. And the last thing we'll talk about: when does a node actually leave the cluster, and what happens when it doesn't leave the cluster when it's supposed to? So, cluster membership. We kind of talked briefly about this in my discussion of the reconciliation during gossip messaging.
A
So
when
and
when
a
node
and
node
starts
up,
it
basically
needs
someone
to
start
gossiping
with
right.
It's
the
whole
point
of
gossip
is
to
have
someone
to
talk
with
well,
if
you're
starting
up-
and
you
need
to
know
the
address
of
someone
to
talk
to.
But
how
do
you
know
what
that
is
until
you
actually
gossip
with
someone
to
find
out
what
the
other
addresses
are
that
you
should
guess
be
gossiping
with
well,
if
you've
ever
dealt
with
the
Cassandra
yamo
file,
there's
a
seed
provider.
A
The
most
common
example
is
a
simple
seed
provider
which
you
just
give
it
a
hard-coded
list
of
IP
addresses.
There's
a
couple
of
other
seed
provider,
implementations
that
essentially
I'll
provide
a
list
of
host
names
or
IP
addresses,
so
that
new
node
that
comes
up
gets
this
list
of
a
seed
addresses
and
starts
gossiping
with
with
with
one
of
them
randomly
and
through
that,
it's
going
to
learn
about
all
the
other
peers
that
are
in
the
cluster
and
then
because
it
knows
about
all
the
other
peers
in
the
cluster.
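For reference, the relevant cassandra.yaml stanza looks roughly like this (the addresses are placeholders):

```yaml
seed_provider:
  # hands new nodes a hard-coded list of peers to bootstrap gossip with
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2,10.0.0.3"
```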
So, up and down. The measurement of up and down is specific to a node itself. We never, never, never broadcast to any other node that I think some other peer is up or down; it's always local to me. How we actually determine it is based upon the heartbeat that I mentioned earlier, and a node's updates about another peer's current heartbeat value don't necessarily need to be communicated directly; they can come in indirectly.
Say node A can't talk to node C, for whatever networking or partition problem might exist, but B is still getting all those heartbeat updates from C. Because A knows that B is up and legit, and because A is also gossiping with B, A will get the heartbeat updates about C even though A can't talk to C. Therefore, A will still think that C is up, even though all the network traffic between A and C could be getting completely dropped on the floor: packets timing out, no ACKs, no SYNs.
A
So
for
the
general
purposes
of
this
discussion,
the
only
thing
inside
of
Cassandra
that
can
market
peer,
node
is
down
is
the
failure
detector
and
that's
based
upon
other
receiving
the
those
heartbeat
updates.
Almost
nothing
else
will
mark
it
down.
Conversely,
only
Gosper
can
mark
a
note.
The
gossip
are
the
primary
gossip
cost.
Sorry,
that's
the
only
component
that
can
mark
a
notice
up,
there's
a
couple
minor
details,
but
nothing
that
should
interest
anybody
else,
besides
our
coders
inside
this
thing.
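As a toy illustration of that locally scoped, heartbeat-driven judgment: each node watches how recently a peer's heartbeat last advanced (whether the update arrived directly or indirectly) and marks the peer down, locally only, when it goes stale. Cassandra's real failure detector is adaptive (a phi-accrual style detector) rather than a fixed timeout; this sketch only shows the idea.

```python
# Toy heartbeat-driven failure detector: purely local verdicts, based on
# when a peer's heartbeat last advanced. Cassandra's real detector is
# adaptive (phi accrual), not a fixed timeout; this only shows the idea.
import time

class SimpleFailureDetector:
    def __init__(self, timeout_secs=10.0):
        self.timeout = timeout_secs
        self.last_seen = {}   # peer -> (last heartbeat value, when it changed)

    def report(self, peer, heartbeat):
        # called whenever gossip, directly or indirectly, delivers a heartbeat
        known = self.last_seen.get(peer)
        if known is None or heartbeat > known[0]:
            self.last_seen[peer] = (heartbeat, time.monotonic())

    def is_alive(self, peer):
        known = self.last_seen.get(peer)
        return known is not None and time.monotonic() - known[1] < self.timeout
```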
So what does this notion of up and down really affect? How does it affect your cluster? It does not affect writes, because we always want to send writes to the target node they need to go to. Whoever owns a certain range for a key, we always try to shove the write over there, and if we fail to get an acknowledgment or a receipt from that node, we'll store the write as a hint; pretty common in Cassandra.
Now, probably because you haven't been getting those heartbeat updates, the node is down, or there's some network partition, and probably the socket connections are dead or nothing is being ACKed, or whatnot. So basically, all we do when a node is marked down is terminate the sockets and close any in-memory sessions, just to clear up space. That's great for the heartbeat, but what really happens if that peer is running really slowly and everything is timing out? Do we ever mark it as down?
A
The
reason
why
we
reset
the
scores
every
10
minutes
is
that
that
allows
us
nodes
that
maybe
we're
having
some
long
GC
period
for
whatever
unfortunate
reason
to
actually
get
a
rearrange
correctly,
where
it
should
be
in
a
list
so
that,
if
it,
if
it
normally
is
a
fast,
faster,
responding
node
yet
had
some
heavy
I/o
or
some
heavy
a
garbage
collection
going
on.
It
can
then
be
re
ranked.
So let's switch a little bit and talk about how nodes actually leave a cluster. There are a couple of different mechanisms. One is to use the nodetool decommission feature; for that, you actually need to log into the node that you want to exit the cluster and run nodetool decommission. What happens is that the node is going to change its status to leaving and then broadcast that out.
A
It's
then
going
to
find
the
peers,
who
should
now
own
the
ranges
that
it
has
and
stream
the
data
over
and
then
send
any
hints
over
as
well.
That
should
be
played
to
any
peer
nodes
who
might
be
unavailable
at
that
time.
Then,
after
all,
that
activity
is
done,
it's
going
to
change
its
status
to
left
and
it's
going
to
set
an
expiration
time.
A
The
next
way
of
removing
a
node
is
a
is
removed,
node
I,
once
again
it's
another
node
tool
command.
However
node
remove
node
is
when
the
Anoat
that
you
want
to
to
kick
out
of
the
cluster
is
no
longer
available
for
you
to
log
in
and
run
decommission
on.
So
you
had
to
go
to
some
other
random
node
in
the
cluster
and
execute
this
so
that
that
initiator
is
going
to
set
the
status
to
removing
for
that
peer
of
who's
gone
and
then
every
node
is
responsible
for
rebalancing
the
cluster.
A
So
they
need
to
figure
out
what
what
nodes
now
over,
which
token
ranges
and
then
stream
some
data
over
to
them.
They'll
each
all
then
delete
any
local
hints
they
had
for
that.
Node
that's
now
gone
and
then
finally
notify
the
the
the
coordinator
or
initiator
that
they're
done
with
all
their
actions.
At
the
end
of
all
this
activity,
that
coordinator
is
going
to
set
the
status
removed
and
once
again
send
another
expiration
time
so
that
everyone
can
drop
that
nodes
information
after
three
days
replace
node.
We perform what's called a shadow gossip round, which is basically kind of like a mini gossip round. What we want to do is probe the cluster to see what nodes are out there; since we're trying to replace somebody, let's see if it actually was in there, and find out the tokens that it actually owned. So it's called the shadow gossip round.
It's really just a SYN and an ACK, without any ACK2, because clearly the node that's trying to replace somebody doesn't know anything about the rest of the cluster; it's just trying to grab information from the live nodes. So we take the tokens and the host ID, then we check that the owner actually is dead and isn't still actively gossiping in the cluster, and then we just stream the data from any live nodes to ourselves.
So that's what happens when everything goes right, when you want to kick a node out of the cluster. What happens when things don't quite go so right, and you remove a node, but it kind of hangs around in your cluster for days or weeks, past that three-day period? A lot of you I see nodding your heads, kind of knowing what's coming: we assassinate that damn node to get rid of it.
So there's a JMX command called unsafeAssassinateEndpoint. The naming of "unsafe" is a little unfortunate, but it should be used with caution. It basically forces a change to a peer. Similar to how the removenode command works, you invoke it on some other node, because the original one is gone, and give it the IP address of that previous node. What we do is force the generation to be updated, and we force the status to, I think, dead.
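In nodetool terms, the three removal paths just described look roughly like this; the host ID and IP are placeholders, and newer Cassandra versions also expose the assassination operation as a nodetool command rather than raw JMX.

```shell
# Clean exit: run ON the node that is leaving; it streams its ranges
# away, sends its hints, then announces status "left".
nodetool decommission

# Dead node you can no longer log into: run from any OTHER node.
# The host ID comes from `nodetool status`.
nodetool removenode <host-id>

# Last resort for a node that lingers in gossip past the expiration:
# the unsafeAssassinateEndpoint JMX operation on the Gossiper bean
# (exposed as `nodetool assassinate <ip>` in newer Cassandra versions).
```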
Epidemic broadcast protocols provide a resilient and efficient mechanism for data dissemination. Basically, I tell you, you tell a friend, your friend tells another friend; at the end of the day it's just rumor mongering, sharing information all around in kind of a decentralized manner. Inside of Cassandra, we use it for peer discovery and metadata propagation. That way, everyone eventually comes to kind of an agreement, even though it's never quiescent, because that heartbeat is always updating.