Description
Speaker: Aaron Morton, Co-Founder
Apache Cassandra makes it possible to write code on a laptop and deploy to multi-region clusters with a few configuration changes. But what does it take to create repeatable, scalable, reliable, and observable clusters?
In this talk Aaron Morton, Co-Founder at The Last Pickle and Apache Cassandra Committer, will discuss the tools and techniques they use. From environment planning to implementation with tools such as Chef, Sensu, Graphite, Riemann, and Logstash, this is a discussion of the full-stack ecosystem for successful projects.
A: My name is Aaron Morton, and today I want to talk to you about how to take Cassandra off your laptop and into the data center, and the types of mistakes that people make. Those mistakes mean they come and give me money, which I like, but sometimes I'd like to do some more enjoyable work.
A: A little bit about myself before I start: I run a small consultancy called The Last Pickle. I've been using Cassandra for about five years, I'm a committer on the project, and we have staff around the world and we just help people use Cassandra.
A: So what we're going to do is look at the decisions we make at the design phase, the development phase, and the deployment phase that mean that when we get to our cluster, it works. It always works on your laptop; I saw someone today with a sticker that said, "It worked on my laptop." We want to get it into the data center, and then we want it to still work in the data center after three months, after six months, even after 12 months.
A: There have been a lot of other sessions today and yesterday and the day before around how to actually achieve functionality. This is about how to achieve performance and scale. So the first thing we want to look at is avoiding unnecessary reads: we're going to do what we call a no-look write. Say we've got a table that looks like this, where we're just going to track the days that a user visits our website.
A: If you were doing this in an old-fashioned database, you might have a model like this: read from it, and if it doesn't exist, do an insert. That's a reasonable thing there, because you've got a primary key constraint. But in Cassandra our primary keys don't have constraints on them; you can insert again and again and again with the same value. So don't bother doing the read, you're just wasting time and energy. Just insert it, and then let compaction take care of dealing with those overwrites.
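(A minimal sketch of a no-look write with the Java driver; the keyspace, table, and values are illustrative, not from the talk:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Assumes a hypothetical table:
    //   CREATE TABLE tracking.user_visits (
    //       user_id text, visit_day text, PRIMARY KEY (user_id, visit_day));
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("tracking");

    // No read-before-write: INSERT is an upsert in Cassandra, so writing the
    // same (user_id, visit_day) twice is a harmless overwrite that compaction
    // will eventually collapse into one copy on disk.
    session.execute(
        "INSERT INTO user_visits (user_id, visit_day) VALUES (?, ?)",
        "user42", "2015-07-04");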
A: Okay, I always use the phrase "limit in space or time." So if we look at limiting something by time: imagine we've got that table now, and we're tracking every time the user visits. We record some piece of data, say a hundred KB in size, and we keep it forever, so this table will grow. The partitions in this table will grow with no bound: as long as your site keeps running, the partition keeps getting bigger.
A: That means we have a bigger index inside that partition, one that you don't even see or know about. It means that when it goes through compaction, it goes slower; big partitions, anything above 32 MB, have to go through a slower compaction process. And it means that when you do a repair, you are potentially copying around data that you don't need to, because we repair token ranges, not individual rows.
A: If there's nothing wrong with this partition except that it's next to a partition that does have an inconsistency in it, we will copy this guy around. So if you've got a partition that's 500 MB in size, we will just copy that to all the other nodes and then wait for compaction to do its thing again. So a better approach is to bucket.
A: So we add another column, and this time we're going to bucket by day, and we put that into the partition key. Now every partition is only going to get written to for one day, and that means it's not going to get too big; we have a fair idea of how big it's going to get. If you did the developer training, they went through a whole set of algorithms for working out how many bytes it's going to be.
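(A sketch of what that bucketed schema could look like, reusing the session from the earlier sketch; all names are illustrative:)

    // The day bucket joins user_id in the partition key, so each partition
    // holds at most one day of visits and its size stays predictable.
    session.execute(
        "CREATE TABLE IF NOT EXISTS tracking.user_visits_by_day (" +
        "  user_id   text," +
        "  visit_day text," +      // e.g. '2015-07-04'
        "  visit_ts  timeuuid," +
        "  payload   blob," +
        "  PRIMARY KEY ((user_id, visit_day), visit_ts))");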
A: Another issue we want to look at around tables is avoiding a mixed workload. If you think about the log-structured merge-tree storage engine, we can end up with fragmentation of your partition, because we flush to disk and the rows get fragmented; the job of compaction is to bring those back together, but compaction can only do so much. If you have a bad data model, you will have things spread out.
A: Say we've got the user's password and the last time they visited. The password gets updated infrequently and read frequently; the last time they visited gets updated frequently and, let's just go with it, read infrequently. Two different workloads like that will create a table that's very fragmented, and when we go to do the read, the read path will have to look at all the fragments, even if you're just getting the password and that's only in one fragment.
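(One way to separate those workloads, sketched with made-up names: give each access pattern its own table, so the read-mostly data stays compact while the write-heavy data fragments on its own:)

    // Read-often, written-rarely: stays in very few fragments on disk.
    session.execute(
        "CREATE TABLE IF NOT EXISTS users.credentials (" +
        "  user_id text PRIMARY KEY," +
        "  password_hash text)");

    // Written on every visit: fragments heavily, but nobody reading the
    // password has to walk these fragments anymore.
    session.execute(
        "CREATE TABLE IF NOT EXISTS users.last_visit (" +
        "  user_id text PRIMARY KEY," +
        "  last_seen timestamp)");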
A: Leveled compaction can be a friend if you want to do a lot of low-latency reads, and it will be your enemy if you're trying to put too many writes through, because it uses about twice the disk I/O, and when it gets behind, it gets really behind and things get out of control. So LCS is good, but if you've got a high write throughput, consider the date-tiered compaction strategy, and we add it just by adding a property onto the table. It's pretty simple. There are other properties for compaction, of course, and for the leveled compaction strategy.
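(Setting the strategy really is just a table property; a sketch against the hypothetical table from the earlier example:)

    // Switch a write-heavy, time-ordered table to date-tiered compaction.
    session.execute(
        "ALTER TABLE tracking.user_visits_by_day WITH compaction = " +
        "{'class': 'DateTieredCompactionStrategy'}");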
A: So we want a data model that is going to scale up the throughput when you scale your cluster. If you launch your site and you have three nodes and you're very, very successful, and you get some money and you scale up to 30 nodes, you're going to have a lot more capacity in your cluster, but you also want to get more throughput out of it. So say we had a data model for hotel pricing, and we have the check-in day as the partition key, and then the hotel name and the pricing blob.
A: Now all of the check-in information for checking in on the Fourth of July is in one partition, and that one partition is on three nodes. When you have a three-node cluster it is on three nodes, and when you have a 30-node cluster it is still on three nodes. You will not scale the throughput for reading it: if you have a lot of people who want to check into a hotel on July 4th, you're not going to be able to successfully serve all those requests.
A: So we work out what cities we want to look at and put the city into the partition key, so that now all the hotels in Santa Clara are on one set of nodes and all the hotels in San Jose are on a different set of nodes. On the small-scale cluster, the three-node cluster, they're all on the same nodes; when we move up to a 30-node cluster, they're spread out, and we can increase our throughput as we scale the cluster. And to do that, we're going to want to use asynchronous requests.
A: What this allows us to do is think in terms of paragraphs instead of sentences. If I want to go and get the hotel pricing information for six hotels, don't make six sequential requests; make six concurrent requests using the asynchronous features of the drivers, and then wait on those six futures. It will be a little bit more than the network round trip for one request, but it will be significantly less than six network round trips. So back to our hotel model; this is our table.
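(A sketch of the six-concurrent-requests pattern with the Java driver; the table, the city, and the hotels list are illustrative and assumed to exist:)

    import java.util.ArrayList;
    import java.util.List;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;

    // Fire all six reads at once; each executeAsync call returns immediately.
    List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
    for (String hotel : hotels) {   // e.g. six hotel names in one city
        futures.add(session.executeAsync(
            "SELECT * FROM travel.hotel_pricing " +
            "WHERE city = ? AND checkin_day = ? AND hotel = ?",
            "santa-clara", "2015-07-04", hotel));
    }

    // Wait on the futures together: the total wait is roughly the slowest
    // single request, not the sum of six round trips.
    for (ResultSetFuture f : futures) {
        ResultSet rows = f.getUninterruptibly();
        // ... use rows ...
    }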
A: Make sure, when you're designing your data model, that you document what sort of consistency levels you expect people to be using, because often the people making the data model and the developers are the same people to begin with, and then some new developers come along and they don't really understand why, or they just have to copy what's in the code. If you write it down with your data model, you can explain why you think you can use eventual consistency in this area.
A: The idea of the smoke test is to find problems as early in the process as we can. So for our hotel data we may have a little script that inserts hotels and hotel prices and the cities and their relationships to each other, and then all I do is write a couple of select statements, put some comments in the script, and say: run these in parallel, make this one finish before that one, and I can test my logic.
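(A rough sketch of such a smoke test, reusing the session and hypothetical hotel table from the earlier sketches; all names and values are illustrative:)

    import com.datastax.driver.core.ResultSet;

    // Load one known row, then check that the query the application will
    // actually run brings it back as expected.
    session.execute(
        "INSERT INTO travel.hotel_pricing (city, checkin_day, hotel, price) " +
        "VALUES ('santa-clara', '2015-07-04', 'Hotel Foo', 129)");

    ResultSet rs = session.execute(
        "SELECT price FROM travel.hotel_pricing WHERE city = 'santa-clara' " +
        "AND checkin_day = '2015-07-04' AND hotel = 'Hotel Foo'");
    if (rs.one().getInt("price") != 129) {
        throw new AssertionError("smoke test failed: unexpected price");
    }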
A: The first one is, we want to understand how much we're reading from Cassandra. It may be that you're reading cells, columns, that have 20 megabytes in them, and the thing to do would be to not put 20 megabytes in one column; or it may be that you're reading tens of thousands of rows. The important part is to know what you're reading and how much you're reading back. So again, in the native binary protocol, at the base level, there is support for pagination, whether you see it or not.
A: It's there and it's always enabled. The driver sets this by default to five thousand rows, so if you do a select against a partition and it has five thousand rows in it or more, you will get multiple pages coming back. This is called the fetch size, and five thousand might be a bit much if you've got a bunch of data in there.
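(A sketch of turning the fetch size down on a per-statement basis with the Java driver, against the hypothetical table from earlier:)

    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    // The default is 5000 rows per page; for wide rows that can be a lot
    // of bytes, so page through 100 rows at a time instead.
    Statement stmt = new SimpleStatement(
        "SELECT * FROM tracking.user_visits_by_day " +
        "WHERE user_id = 'user42' AND visit_day = '2015-07-04'");
    stmt.setFetchSize(100);
    ResultSet rs = session.execute(stmt);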
A: If you've got some wide rows, some large rows, that could be quite a significant amount of data, so you might want to turn that down to 100 or 200, whatever. Also, in the background the driver is going to fetch again, and by default it does it synchronously: you exhaust the iterator, and it goes back and gets the next page. Or you can make it do it eagerly.
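(A sketch of the eager variant, reusing the statement above; the look-ahead threshold of 20 rows is an arbitrary choice:)

    import com.datastax.driver.core.Row;

    ResultSet rs = session.execute(stmt);
    for (Row row : rs) {
        // If we're close to exhausting the rows already in memory and more
        // pages exist, start fetching the next page in the background now
        // rather than blocking when the iterator runs dry.
        if (rs.getAvailableWithoutFetching() < 20 && !rs.isFullyFetched()) {
            rs.fetchMoreResults();   // asynchronous prefetch
        }
        // ... use row ...
    }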
A: With the eager option, once you start reading from the iterator, the driver goes out in the background, gets the next page, and brings it back. Again, we're going to use the appropriate consistency level, because reducing to the lowest consistency level improves performance and improves throughput, and wherever we can, we want to use token-aware asynchronous requests at consistency level ONE. We touched on this before in the design phase.
A: What this means is your client is paying attention to the tokens that are assigned in the cluster, and when it's going to write or read a value, it works out which nodes are actually replicas for it and directs the request to one of those nodes. If you're using consistency level ONE, that node can answer the query using local disk only. If it's a write, it will still send it to all of the other nodes in the cluster that are replicas.
A: And if it's a read, it may do some reading in the background to check for consistency and repair it, but you've taken network latency out of it: you're taking one network hop out of the picture, and it can really reduce your latency. You set this up using what are now lots and lots of policies and properties on the driver. You create your cluster, and in this example here we create a DC-aware round-robin load balancer.
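(A minimal sketch of that setup with the Java driver; the contact point and data-center name are placeholders:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    // Token awareness wrapped around DC-aware round-robin: each request is
    // sent straight to a replica in the local data center when possible.
    Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withLoadBalancingPolicy(
            new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
        .build();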
A: If we want to do asynchronous requests, we go to our session object and we call executeAsync, and it gives us back a result set future that we can listen on using Guava, or we can call getUninterruptibly on it if we want to. Pretty simple stuff; it's just going to give us the rows back. Now, if you're doing this, you really should avoid doing a denial-of-service attack on your own cluster, because your client is probably running at the speed of memory.
A: You may overload your cluster. So if you want to send, say, a thousand writes, maybe only have 100 of those in flight at any time. Again, the number is not as important as the fact that you think about it and make sure you haven't set things up so that when your site gets really busy, you have no protections in place and the number of writes in flight on the cluster goes sky high.
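(One simple way to put that protection in place, sketched with a plain semaphore sized to the 100-in-flight figure from the talk; the input collection and table are hypothetical:)

    import java.util.concurrent.Semaphore;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    // At most 100 writes in flight: acquire a permit before sending,
    // release it when the cluster answers (or fails).
    final Semaphore inFlight = new Semaphore(100);
    for (final String[] visit : visitsToWrite) {   // hypothetical input
        inFlight.acquireUninterruptibly();
        ResultSetFuture f = session.executeAsync(
            "INSERT INTO tracking.user_visits (user_id, visit_day) VALUES (?, ?)",
            visit[0], visit[1]);
        Futures.addCallback(f, new FutureCallback<ResultSet>() {
            public void onSuccess(ResultSet rs) { inFlight.release(); }
            public void onFailure(Throwable t)  { inFlight.release(); /* log, maybe retry */ }
        });
    }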
A: I think that monitoring and alerting are part of development: if the developers have access to monitoring early on, they can see how the system is running, or not running, and most developers should be writing metrics, the kind that you want to see on a graph alongside the metrics you get from Cassandra, so you can understand latency going through your stack, that kind of thing. And you should just use whatever works. OpsCenter is getting better and better; Riemann is very good but very complicated.
A: When you're looking at aggregations, remember the metrics are going to be coming in off one node, but if you've got 150-plus nodes, you have to think about what's happening at the cluster level and what's happening at the node level. So we generally want to have some cluster-wide aggregations; we want to know throughput across the whole cluster. If you've got a smallish cluster, you may be able to show the value for every node on the same chart, and it's often useful to have a filter that pulls out the top three and the bottom three.
A: Most of these rates are one-minute rates, so the throughputs for the more recent seconds have a greater weighting than the older seconds in that minute. Sometimes we just have values that are counts, and you're going to want to run a derivative on those to get the delta over time. And then we have lots of latency metrics, and they come out with a min and a max and a median and the standard deviation, things like that; I tend to just grab the 75th, 95th, and 99th percentiles, and these things report in microseconds.
A: Whichever process you use to get your metrics out, they have a naming scheme that is reasonably consistent. You can get your metrics via JMX; you could use Jolokia or MX4J to get them over HTTP; you could have collectd pulling things off; or, what I think is the preferred method, you can use the metrics YAML configuration file and configure Cassandra to push metrics out to the reporting systems, which is a much better approach. You can configure it to push to multiple reporting systems.
A: You can configure it to push at different intervals to different reporting systems, and you can have regular expressions that filter the metrics that are shipped; it's a really useful thing. So the metrics have a naming scheme that starts with org.apache.cassandra.metrics, o.a.c.m, and then we get into the different areas. We have one here for the cluster throughput; this is an aggregate of the throughput on this node, and it's the write Latency metric's one-minute rate. The "latency" bit is a little confusing: this is the throughput, not the latency.
A: We'll see the actual latency in a few minutes, but that's where the throughput hides, in the ClientRequest metrics. Then we have the throughput for the node. Now, if we've got a schema that has RF 3, and the throughput to the cluster is, let's say, a hundred, the local throughput should be around three hundred, because we're going to do three writes for every request that comes in.
A: You can see the throughput at the individual column-family level too: it's org.apache.cassandra.metrics.ColumnFamily, then your keyspace name, then your table name, and then WriteLatency again, the one-minute rate. And we can see the request latency. The timer for this starts when the request hits the coordinator from your client and ends when we send the response back, so it includes all of the wait time, all of the internal network traffic, all of the disk access, all of the queue times; everything's in there.
A: It is, I think, the most important number to understand; everything else comes after it. You may have metrics telling you that your data model is bad, but if your SLA is being met, then you probably shouldn't spend too much time trying to fix your data model; there are probably more important things to do.
A: So we have that request latency, but that one is for all the requests coming through this particular node. It can then be broken down by individual tables: you go o.a.c.m.ColumnFamily, your keyspace name, your table name, and down here it's called CoordinatorWriteLatency and CoordinatorReadLatency. This is really important, because you might look at that one top-level number to begin with and see that the 95th percentile is one second, but you don't know which table it is.
A: So you can drill down here and see that, oh, it's table foo where the 95th percentile is one second. The next level down on latency is the local latency. This is what's happening when the read thread gets the message off the queue on each node and starts pulling information off disk, and this is telling us how fast this particular part is running. So, for example, write latency might be about 50 to 150 microseconds.
A: Read latency might be around one hundred to 800 microseconds, and once you start to look at these numbers, you get a feel for what that network latency is all about, because this number is typically a lot smaller than the overall request latency. Now, most problems that people have are to do with the read path, and there are three key metrics that you can look at to understand the read path. The first one is the live scanned histogram; I've just pulled out the 95th percentile here.
A: This is how many cells, which is the internal representation of a non-primary-key column, we read per read, and it breaks that down by percentile. So you could look at this and understand whether you're doing huge reads on this table; maybe that's why the reads on it are kind of slow. And we've got the tombstone scanned histogram, which is how many tombstones we've read off disk. Remember, those tombstones get read off disk, allocated in memory, and then thrown away, so we don't want too many of those around.
A: So we have some metrics that let us know how many hints we stored: Storage TotalHints. It's a count, so you have to run a derivative on it, but then you can know how many hints you stored. Remember, hints are stored when a node was down before we started the request, or when the node timed out and didn't get back to us for a write. It's also broken down per IP address, so you're going to know that, oh look, the nodes on the other side of the WAN are the ones we're storing all the hints for.
A: But don't forget that hints are an optimization. They can be turned off, and by default, if a node is down for more than three hours, we stop storing hints for it. There's another measure here, which is timeouts: the rate of timeouts you're getting talking to other nodes. It's a good measure of your network health, and it lets you understand whether you've got problems going across the WAN, things like that.
A: On the other side of this, we have the repairs. So we've got read repair. There is read repair that happens in the background and doesn't slow anyone down, and as those background read repairs run, we can know how many times they detected a problem. We've also got read repair that's blocking: this is where you do a read at a consistency level above ONE and a mismatch has to be repaired before we can answer.
A: Occasionally there are errors. There are two types that happen in Cassandra, and the first one is the good one, UnavailableException. It says: you asked me to do this read or this write at a certain consistency level, and I can't, because there are not enough nodes available, so I haven't even tried to do it. The second one is the bad one: timeouts, where essentially we just shrug our shoulders and don't know what happened. Both of these should be tracked, alerted on, and measured if you're managing the system.
A: There are also errors that Cassandra itself hits, not many nowadays, but there can be errors, and there is an unhandled exception handler that will catch those; that's tracked here. You'll also see this if you run nodetool info: it says how many unhandled exceptions there have been. It's normally zero nowadays, but it's handy to know if there's something going on. You might also want to understand how much disk you're using, so there is the org.apache.cassandra.metrics Storage Load count, which gives you the number of bytes, and you can get that per table as well.
A: Speaking of taking up space, there is compaction, whose job is to squish down all of our overwrites and tombstones and make things take up less space. There are metrics that tell you at the global level how many pending compactions there are, there are metrics that tell you how many pending compactions there are per table, and there are metrics that tell you how many you've completed. So you might want to alert on this if your count of pending compactions gets above 50 or 100, something like that.
A: It's a queue with a number of threads at the end of it, and "pending" says how big that queue is. Often it might be zero; you might get a large number of requests come in, and it might take a split second for it to get back to zero. So you can monitor this, and it's a good indicator of how up to date the cluster is.
A: Now, if those messages sit in the queue for too long, they get dropped. It's part of the load-shedding process that Cassandra has. How long they are allowed to sit there is controlled by the timeouts you have in the cassandra.yaml file (read_request_timeout_in_ms and write_request_timeout_in_ms), and by default now it's five seconds for reads and two seconds for writes. If we are dropping messages, then we are shedding load; it means the cluster is overloaded, and we want to track that and know it is happening, because it could be an early indicator of something going wrong.
A: An important thing to remember is that we can drop messages and shed load and the system can still be functioning correctly. As long as we go back to the client and say we successfully processed your request at the consistency level you asked for, we're good; if we drop messages on one node, that's fine. All right, so now we're up to provisioning, getting a system out and running. I mentioned smoke tests earlier on, around smoke-testing your data model to check your logic.
A: You can also go and smoke-test your disks, which is a good thing to do. Al Tobey has a good write-up about how to do this, the types of numbers you'd expect, and the techniques you can use; just make sure the disks are not broken before you go and use them. You can also use the cassandra-stress tool and just run a smoke test with it.
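(For example, a stress-based smoke run might look something like this; the exact flags vary between Cassandra versions, this is the 2.1-style invocation, and the node address is a placeholder:)

    cassandra-stress write n=1000000 -node 10.0.0.1
    cassandra-stress read n=1000000 -node 10.0.0.1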
A: Just stand up your cluster and run this beforehand, and if there are any issues, you can fix them before you go to all the trouble of provisioning your application and everything else; there's a blog post on DataStax about that as well. Then we start to build runbooks. The idea of the runbook is to plan how you're going to handle some bad situation in the future, and my goal with a runbook is always to communicate like a ten-year-old.
A: "I did this because of this; on the weekend I went to the park," that type of stuff. Explain what we're doing, how we're doing it, and why. Then the way you test your runbooks is you run fire drills. The fire drill is there to see that the runbook works and that people understand it, and if you're in the fortunate position of deploying a new system, hopefully you can do this early on.
A: So here are the scenarios that you want to run through. The first one I call a short-term node failure: the node is down for less than the hinted handoff window, which is three hours, so the other nodes collect hints for it. We continue to be available for quorum requests, and at the end of it there's no action necessary, because this is what Cassandra is designed to handle all the time.
A: But you want to test it and make sure your application is happy. The next thing we want to do is break the cluster: take down multiple nodes until the cluster breaks, and hopefully understand how the application reacts in that situation. If you care about this in a runbook situation, you should run a repair when you come back. Then we can maybe lose an entire availability zone or a rack, or something like that.
A: So we could put in iptables rules or shut the nodes down, whatever you want to do for that, and we should still be available for quorum if we've designed it correctly; that would be a good test of your application as well. And when we come back, we may want to run a repair to make sure all that data gets back. We could rely on hints, but if you've got a high throughput, it might be faster to run repairs.
A: The next type of failure I call a medium-term failure. That's where we're down for longer than the hinted handoff window, so we can't rely on hints anymore, but we're down for less than gc_grace_seconds, which is important; we'll see that on the next one. This might be the case where you've disabled hints because you don't need them, or you had a failure overnight and you woke up, or you didn't even bother waking up and you're going to fix it eight hours later. It's a fairly common scenario.
A: You should bring it back and run repair, because it will have missed a significant number of writes, and hinted handoff will no longer be storing and replaying those writes. And then we've got long-term failure. This is where our node is down for more than gc_grace_seconds, which means that any tombstones for the deletes and TTLs that were created may now have been purged off disk; the node that was down may miss the deletions, and data will come back to life. You should never bring this node back into the cluster.
A: Now, with all that practice in place, you can understand what happens in a rolling upgrade: it's just repeated short-term failures. So you might want to make one of the fire drills an upgrade. Sometimes, if you're provisioning, it's good to provision the revision before the revision you want to run: if you want to run 2.1.9, provision 2.1.8, put the application in, run some load tests through it, and then do an upgrade to 2.1.9.
A: And if you're going to scale out, you're not going to have any impact; you're going to scale out and you'll be available the whole time. Again, if you really want to get into this, you can provision your initial system with not all of your nodes: if you're going to go to six nodes, initially provision five of them, put the application in, run it, and then, while it's running, add another one, and get some confidence that that's going to work.
A: So the question is about using different disk layouts, or RAID and things. I've worked with enterprises where they say, the box is a RAID 10 and it has 20 disks in it and that's the only box we have, and that's fine. The biggest concern you have is, if a disk fails, how long does it take for the disk to get replaced? Does the gardening happen once a month or once a week, and how long will it take to get replaced?
A: In terms of performance, JBOD is pretty good and RAID 0 is pretty good. The problem with RAID 0 is that if you lose a disk, you lose all of the data, so generally nowadays people go for JBOD, though it can be a bit of a pain with fragmentation. So I would go with the best hardware-accelerated RAID you have, and if you don't have any, I would just use JBOD.
A: To identify hotspots in your data model, you want to look at the local write throughput, and you want to find the top three and the bottom three per node: for your table, across all nodes, get a line for each node's throughput, and if you see one node that's getting more writes, that's a hot spot; the same goes for reads. Then there's understanding whether you've got fat rows.
A: You can use nodetool cfhistograms and nodetool cfstats, and they will tell you what the biggest partition is; if that's a lot bigger than your average, then you've got a hot spot somewhere, though finding it in the data model can be hard. Okay, thank you. There's a question at the back there.
A: The question is about bringing back a node that's been down longer than gc_grace, and whether to use join_ring false. Yes, join_ring false is what we used to do, but now there's an option called replace node, which takes either the IP or the host UUID of the node you're replacing. The node joins the ring in a joining mode: it accepts writes from the other nodes, but it doesn't accept any reads; the other nodes don't send any to it.
D: My question is about repairing nodes. What I have sometimes observed is that if a node goes down, the coordinator that is storing its hints sometimes also goes down, or gets very busy. In those situations, what's the recommendation? Should I just bring down both the nodes and repair both of them? What's your take on it?

A: Sorry?
D: What I'm saying is that when a node is down, the coordinator that would be storing its hints sometimes also gets very busy with very write-heavy traffic, and also goes down or becomes very slow. So in such situations, just to bring the cluster back to a healthy state, should I kill both nodes and bring both back, or what should the approach be?
A: So the question is about when hinted handoff goes crazy. In the situation where the node goes down and then the nodes that are storing hints for it go down as well, you can bring them all back at any time; it doesn't matter which order you bring them back in. And if you're in a situation where, once you lose a node, you can't handle the throughput, it sounds like your cluster needs to grow. Remember, you're doing N+1: you should be able to handle all of your throughput and latency on N, and then have the +1.