From YouTube: Community Webinar | Large nodes with Cassandra
Description
Starting with version 1.2, Cassandra has made it easier to store more data on a single node. With off-heap data structures, virtual nodes, and improved JBOD support, we can now run nodes with several terabytes of data.
In this talk Aaron Morton, Co-Founder and Principal Consultant at The Last Pickle, will walk through running fat nodes in a Cassandra cluster. He'll review the features that support it and discuss the trade-offs that come from storing 1TB+ per node.
A: Hello, everyone, and welcome to this week's Cassandra community webinar. I'm delighted to welcome back Aaron Morton, the co-founder and principal consultant of The Last Pickle. He is also a committer on the Apache Cassandra project, and extremely well known in the Cassandra community; if you've ever been on IRC or the mailing list asking a question, odds are Aaron has answered one for you. So welcome back, Aaron. Just one piece of housekeeping: if this is your first webinar with us, we'll be taking questions at the end of the session, so please use the Q&A tab inside of WebEx and ask your question there, and we will get through as many as we can at the end. So Aaron, exciting times. I believe The Last Pickle is expanding and business is good.
B: Thanks, Christian, and good morning to everyone. As Christian said, I'm the co-founder and principal consultant at The Last Pickle, where we help customers deliver and improve Apache Cassandra based solutions. We're all DataStax MVPs; I'm a committer on Apache Cassandra, the maintainer of the Hector library, and a committer on the Apache Usergrid project, and we're based in New Zealand and America.
B: What I want to do today is talk about large nodes. Now, when we're talking about large nodes, it's important to have some sort of context for what's large; after all, we're supposedly dealing with big data. A few years ago I started saying, as a rule of thumb, don't put more than 500 gigs of data on a node. Initially this was just talking about EC2, and the reasons had to do with a little bit of what it was like to be running on EC2.
B
So
what
sort
of
throughput
could
you
get
on
the
networking
and
the
performance
of
your
discs?
It's
bound
up
in
a
bunch
of
operational
concerns,
and
these
are
the
things
that
are
going
to
talk
about
today
and
how
these
operational
concerns
have
disappeared.
A
bit
over
the
intervening
years,
but
in
general
we're
talking
about
nodes
with
over
500
600
gigs
of
data
per
node.
Nowadays
we
can
talk
about
node
in
the
one
to
three
terabytes
of
scale.
B
That's
our
framework,
but
we're
also
talking
about
nodes
that
have
more
than
1
billion
rows
per
node,
so
a
billion
rows
over
all
of
your
tables.
Now
that
could
be
a
case
where
you've
got
multiple
terabytes
of
data.
You
can
also
pretty
easily
get
to
billions
of
rows
per
know.
Just
by
having
lots
of
small
rose,
they
might
be
recording
something
like
website
hits
or
website
moles,
or
something
like
that.
You
can
pretty
easily
get
to
a
billion
rows.
B
And
over
a
billion
rows,
close
1.2,
there
are
fewer
operational
concerns.
Most
of
them
are
still
there,
but
they're
there
decreased
in
terms
of
their
impact,
but
it's
important
to
understand
what
why
we
were
concerned
about
these
things
in
the
beginning,
because
it's
good
to
have
an
understanding
about
why
changes
are
made
in
Cassandra
and
some
of
these
operational
concerns
are
still
there
and
you
could
grow
fast
enough.
You
will
still
run
into
them.
B
So
look
at
those
issues
we
had
prevalent
1.2
and
we'll
look
at
some
of
the
work
around
that
we
had
just
to
give
us
some
context
about
why
changes
were
put
in
version
1.2
and
beyond,
and
we'll
look
at
a
couple
of
the
issues
coming
up
in
2.1
that
are
going
to
improve
things.
Even
further
memory
management
was
always
a
big
isn't
was
always
a
big
concern.
There
are
some
memory
structures
in
Cassandra
that
grow
with
the
number
of
rows
and
the
size
of
data
that
you've
got
to
note.
B
The
first
one
we'll
look
at
here
is
bloom
filters,
and
these
are
probably
the
most
well-known
and
Rhys
understood
data
structure
that
we
have.
We
use
these
to
test.
If
a
roti
exists
in
a
particular
assess
table,
it
will
tell
us
either
that
the
routine
definitely
does
not
exist,
or
it
does
exist
with
a
certain
probability
that
that's
false.
We
hold
this
in
memory,
potentially
just
a
bit
set,
and
by
keeping
any
memory
we
dramatically
reduce
the
amount
of
disco
yogi
has
to
do
so.
B: Internally, our bloom filter was implemented as a two-dimensional array of longs; again, we treat it just as a bit set, this was just how it was implemented. If we look at the memory usage for this guy: along the bottom here we've got millions of rows, and we hit a billion rows at the right-hand side. The grey line indicates when the bloom filter FP chance is 0.01, which is the default when we use the size-tiered compaction strategy.
B: At that setting, it will use approximately twelve hundred megabytes of space just to hold the bloom filters. If we're using the levelled compaction strategy when your column family is created, the bloom filter FP chance defaults to 0.1, or ten percent, and you can see on the red line there that's approximately half the size that we have for the 0.01 value; so it's still about 600 megs, all stuff that we have to keep in memory.
B
We
also
have
compression
metadata
that
we
have
to
store
in
memory
now
when
we
take
your
data
and
compress
it,
we
need
a
map
that
tells
us.
Oh,
this
chunk
of
uncompressed
data
actually
starts
at
this
position
in
the
compressed
data
stream.
So
when
we
take
the
offset
for
the
start
of
your
row
from
the
index
component
of
ESS
table,
we
know
where
the
book
the
size
of
this
depends
on
the
size
of
the
uncompressed
data
and
it's
held
in
memory
as
well
again
with
implemented
as
a
two-dimensional
array
of
roms.
B
The
size
of
this
depends
on
the
amount
of
data
will
try
to
compress
and
to
a
degree,
the
compressor
that's
in
use
and
the
size
of
the
chunks
that
we're
compressing.
If
we've
got
a
terabyte
data
using
the
snappy
compressor,
we
can
expect
a
couple
of
hundred
megabytes
of
compression
metadata
again
all
suffer
has
to
be
held
in
memory
and
can't
be
released.
B
We
also
have
index
samples
so
in
our
SS
tables,
on
disk
that
the
three
most
important
files
for
each
SS
table
are
the
data
component,
the
bloom
filter
in
the
filter,
BB
and
the
index
of
primary
index.
This
is
your
row
keys
and
their
offset
into
the
data
DVD
component.
Now
in
memory
we
hold
a
sample
of
every
128
keys
by
default,
and
this
essentially
gives
us
a
skip
list
or
an
index
over
this.
B
This
was
implemented
as
an
array
of
Long's
and
an
array
of
bytes,
a
2d
array
of
bytes
for
the
routines,
probably
version
1.2.
It
was
implemented
by
holding
the
objects
that
actually
those
right
to
get
busy
realized
into
a
decorated
key,
but
in
1.2
it
look
like
this.
This
is
a
bit
easier
to
get
your
head
around
again
stuff.
We
have
to
hold
em
memory.
B: Once you get above that, there's extra work that the ParNew collection process that works on the new heap has to do, because it has to look at all of the old data (all of the data on the tenured heap, sorry) to see if any of it is pointing to objects in the new heap; and CMS is also going to take longer on the tenured heap.
B
Additionally,
if
you
put
a
large
working
set
large
amount
of
memory,
that
garbage
collection
cannot
free
you're,
going
to
see
more
frequent
and
prolonged
garbage
collection,
typically
in
this
NS
collector
concurrent
mark
sweep
on
the
tenured
heat
and
normally,
if
you're
looking
at
at
a
graph
of
your
jvm
usage,
what
you
want
to
see
is
that
it
goes
up
and
then
drop.
Suddenly.
It
looks
like
a
sawtooth
pattern.
B
Don't
we
like
to
see
that
in
a
healthy
machine
dropping
to
between
two
to
three
gigs?
If
it's
not
getting
below
three
gigs
a
lot
I
want
to
understand
why
and
it
depends
on
some
of
your
configuration
settings
and
things,
but
to
give
you
an
idea.
This
is
the
sort
of
thing
we
should
be
overseen.
Gt.
Doing
me
a
big
chunk
like
dropping
a
three
or
four
degrees
data
at
a
time,
so
they'll
give
you
some
of
the
operational
concerns.
B: When a node bootstraps, it gets assigned tokens. Those tokens specify the data that it's going to be a replica for, and it talks to the other nodes that are already replicas for that data and asks them to stream the data over to it so it can take ownership. Once it finishes the bootstrap process, it's got all of the data that the other nodes sent it, and it also has the writes that occurred from the moment it started.
B
When
you
bootstrap
it
mode
with
RF
three
and
we're
not
using
virtual
nodes
here,
this
is
premium
version
1.2
my
node
will
come
in.
It
will
have
an
initial
token
and
that
will
identify
one
pokin
range,
but
we've
got
RS
three.
So
there's
actually
three
token
ranges
that
this
knowing
is
a
replica
for
till
the
last
three
nodes,
laundry
to
those
Tobin
Rangers
to
send
it
data
to
bootstrap.
B
We
know
that
this
ending
process
is
throttled
at
25
megabytes
a
second,
so
our
maximum
center
in
is
75.
Megabytes
per
second
in
practice
is
going
to
be
less
than
that,
but
that's
our
maximum.
If
we've
got
a
one
gig
networking
in
place,
we've
got
125
megabytes
per
second,
so
there's
quite
a
lot
of
headroom
on
our
bootstrapping
node,
and
we
really
like
to
saturate
that
guy.
We
can't
control
the
time
for
sale,
but
we
can
control
how
long
it
takes
to
recover
from
failure.
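For reference, the 25 megabytes per second he mentions maps to one cassandra.yaml setting, expressed in megabits. A minimal sketch with the 1.2-era default:

```yaml
# cassandra.yaml: outbound streaming throttle, per sending node.
# 200 megabits/s is roughly 25 MB/s, the figure quoted in the talk.
stream_throughput_outbound_megabits_per_sec: 200
```

On recent versions it can also be changed at runtime with `nodetool setstreamthroughput <Mb/s>` on the sending nodes, which is handy if you only want to open the throttle up for the duration of a bootstrap.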
B
So
when
we
have
to
replace
it
mode,
we
want
Mexico
absolutely
as
fast
as
possible,
and
we
want
that
process
to
scale
as
we
grow
the
cluster.
You
want
to
get
more
value
as
about
costume
now,
if
you
don't
always
meet
the
bootstrap.
Often
we
do
a
protest
that
internally
recall
a
lift
and
shift,
and
this
might
be
that
we're
upgrading
the
new
hardware
inside
of
82
or
an
enterprise
data
center
or
moving
into
new
networking
infrastructure,
or
something
like
that.
We
don't
need
to
use
a
bootstrap
process.
B
Cassandra
is
pretty
flexible
here
we
can
just
shut
the
node
down
cleanly
copy
all
of
its
data
in
config
over
to
a
new
node
and
started
up
and
Cassandra
will
just
see
that
I
peas
have
changed,
handle
it
all
and
no
concerns.
So
in
this
case
we're
just
talking
about
that
transfer
speed
through
the
data
center
using
I.
Think
or
something
like
that.
B
So
if
we
get
50
megabytes
per
second
in
82,
that's
probably
about
what
I'd
expect
and
it's
going
to
take
us
half
an
hour
roughly
to
move
100
gigs
if
you've
got
500
gigs
multiply
that
by
five,
if
you've
got
over
500
gigs
on
the
player
by,
however
many
and
you
start
to
see
that
it
can
take
some
time.
This
is
a
copy.
B
Disk
management
is
one
of
the
hardest
things
I
think
in
in
deploying
and
Cassandra
cluster
disks,
don't
like
having
more
than
about
seventy
five.
Eighty
percent
of
their
space
used
performance
degrees.
When
you
get
above
that
on
a
spinning
disk,
we
want
to
store
more
than
500
gigabytes,
we're
going
to
need
multiple
terabytes
of
data
on
each
data
space
on
each
node.
We
could
build
a
single
volume
or
we
could
use
multiple
volumes
to
do
that.
B: So there's sort of a negative feedback loop here: build a node with RAID 0, put a lot of data on it, congratulate ourselves that we've got a lot of space, fill it up, lose a disk (because we're using RAID 0), and then go back to the beginning and have to do a bootstrap, and that can take a long time. Another option is to use RAID 10; of course, this doubles the raw capacity requirements.
B
Typically,
we
might
see
this
in
an
enterprise
data
center,
where
there's
a
standard,
build
for
machines
and
we'll
come
in
with
a
rate
10
great
for
operators.
They
really
comfortable
just
replacing
disks
in
a
hardware
level,
but
it
increases
the
costs.
You
can
use
multiple
dateable.
You
can
mount
each
desk
individually
and
tell
Cassandra
about
those
through
the
data
files
directory
Yemen
setting
now
again
we're
talking
in
the
context
of
pre
version
1.2
in
that
environment,
Cassandra
was
not
intelligent
about
how
its
distributed
and
load
amongst
those
multiple
volumes.
B
It
would
just
choose
the
one
with
the
most
free
space,
and
so
you
could
end
up
with
multiple
right
threads,
trying
to
write
SS
table
as
quickly
as
they
could
onto
the
same
volume.
Also,
if
you
had
a
single
failure
in
a
data
volume,
it
would
shut
down
the
whole
mode.
We
still
have
all
the
remaining
data
edges
that
the
noise
would
have
an
exception
and
shut
down
the
repair
process.
B
Is
you
know
that
the
repair
process
causes
problems
for
people,
but
we
know
it's
important.
The
background
here
is
that
when
we
do
deletes,
we
do
a
soft
early
and
write
a
tombstone,
and
we
want
to
make
sure
that
that
tombstone
is
fully
replicated
before
we
purge
it
off
desk
through
the
compaction
process.
Also,
repair
is
the
way
to
ensure
on
this
consistency
across
all
of
your
nodes.
The
way
it
works
is
that
we
calculated
in
court
of
Merkel
tree
and
then
we
use
that
to
compare
differences
to
build
the
miracle
tree.
B
We
have
to
read
all
of
your
data
in
your
particular
table.
Technically
we
do
it
by
reading
ranges
of
data
at
the
time,
but
mature.
So
we
have
to
read
all
the
data
in
a
particular
table.
You
can
see
this
in
no
tool.
Compaction
status
called
a
validation
compaction.
That's
because
this
process
runs
through
the
same
infrastructure
as
compaction,
its
modeled
by
the
compaction
throughput
megabytes
per
second,
which
the
animal
setting
which
defaults
to
16,
and
so
you
can
probably
guess
what
I'm
going
to
say.
Is
this
process.
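The throttle he is describing is shared between regular and validation compactions; a sketch of the relevant knobs, assuming a 1.2-era install:

```yaml
# cassandra.yaml: throttle shared by compaction and by repair's
# validation compactions (the Merkle tree builds). Default 16 MB/s.
compaction_throughput_mb_per_sec: 16
```

At runtime, `nodetool setcompactionthroughput 0` removes the throttle until restart, and `nodetool compactionstats` is where the validation compaction he mentions shows up.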
B: It grows in time as the amount of data on the node grows. If you've got 10 gigs of data on the node, your repair only has to read 10 gigs of data. If you've got 1.2 terabytes, we have to read 1.2 terabytes and calculate a hash, so it's a CPU-intensive operation, depending on how your machine is set up; and first thing, we've got to get all this data off disk.
B
The
second
part
of
repair
is
that
afterwards
exchanged
the
Merkel
tree
and
detected
differences.
We
stream
those
differences
using
the
same
process
that
we
use
for
bootstrap
the
streaming
infrastructure
and
again
this
has
trouble
so
the
same
reason:
the
bootstrap
it
struggled
and
we
don't
repair
individual
rows.
We
repaired
ranges
of
rows.
That's
where
we
detect
the
differences.
If
you've
got
very
big
roads,
we
could
end
up
streaming
a
new
copy
of
a
very
big
road
to
a
no
just
because
another
node
in
that
token
range
that
was
checked
as
out.
B
Another
road
was
out
of
sync.
If
you've
got
billions
of
small
rose,
you
can
end
up
streaming
billions
of
small
roads
because
one
of
them
and
data
sync
now
compaction
is
a
fact
of
life
in
Cassandra.
It's
a
fact
of
life
in
any
sort
of
log
structured,
merge,
storage
engine
like
we
have.
We
have
the
great
advantage
called
writing
new
things
to
desk
writing.
New
files
to
disc
every
toilet.
Flush
takes
out
a
lot
of
locks
in
the
storage
infrastructure,
but
it
requires
a
compaction
process.
B
Otherwise,
reaper
forms
would
just
fall
off
a
cliff
over
time,
as
we
had
lots
and
lots
of
new
file
to
look
at
so
what
compaction
does
is
it
looks
at
a
particular
set
of
files
on
disk
and
it
writes
the
same
truth
that
you
find
in
those
source
files
into
some
new
files,
and
it's
discard
the
information
that
you've
no
longer
required.
It
might
be
that
you've
done
an
overwrite
and
the
previous
value
is
no
longer
required
and
that
one
goes
into
the
new
files.
They
have
two
strategies
for
this.
B
The
original
one
is
called
the
slidecage
compaction
strategy,
and
this
code
is
root.
The
SS
tables
by
size
files
in
the
same
bucket
that
during
the
process
or
50,
then
within
fifty
percent
of
the
medium
of
the
size
of
the
files
in
that
bucket,
and
it
assumes
no
reduction
in
size
to
the
output
so
even
going
to
compact
53
450mm
files,
it
assumes
it
needs
200
needs
of
free
space.
So
in
theory
we
need
fifty
percent
free
space
on
the
desk.
In
practice,
we've
seen
this
run
with
the
less
than
fifty
percent
free
space.
B
Although
it's
not
recommended
really,
you
should
use
fifty
percent
free
space
as
a
soft
limit
on
your
desk.
Now
this
doesn't
sound
as
bad
as
it
is
because
we
know
that
if
we
get
above
75
percent,
we're
going
to
see
the
throughput
on
our
disk
reduced
quite
a
lot
on
spinning
disk
I'm,
not
sure
on
the
impact
on
SSDs.
B
The
other
strategy
we
have
is
called
level
compaction
strategy,
and
this
is
expired
by
leveldb
from
the
google
in
there
chromium
project.
This
group's
SS,
table
together
by
a
level
and
data
moves
up
in
the
higher
level
for
more
often
is
compacted
not
based
on
size,
just
based
on
some
other
heuristics
inside
each
level.
Your
robe
is
guaranteed
to
only
have
one
fragment.
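As a concrete sketch of opting a table into levelled compaction (the keyspace, table, and size value here are illustrative, not from the talk):

```sql
-- Size-tiered is the default; switch one table to levelled compaction.
-- sstable_size_in_mb sets the fixed size of SSTables within a level.
ALTER TABLE my_keyspace.user_events
  WITH compaction = { 'class': 'LeveledCompactionStrategy',
                      'sstable_size_in_mb': 160 };
```

The early default of 5 MB per SSTable was widely considered too small; later versions moved the default to 160 MB.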
B
This
has
great
result
that
this
can
have
a
great
impact
on
reducing
the
read
latency
level
compaction.
Also,
there's
a
really
good
job
in
a
highly
mixed
workloads
when
you've
got
a
lot
of
overrides
and
deletes.
But
to
do
this
it
requires
a
lot
of
disk
I/o,
approximately
twice
the
displayer
and
my
feeling
is.
It
requires
approximately
twenty-five
percent
disk
free
space.
B: One of the first things we can do to manage memory is reduce the bloom filter size: change the bloom_filter_fp_chance from 0.01 to 0.1 on some column families, and we know that's going to reduce the size of the bloom filters that we have to hold in memory. Now, this is probably also going to increase the read latency, because we know why the bloom filters are there: they reduce the amount of wasted disk I/O where we go and look for a row in a particular SSTable and it doesn't exist.
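The change he describes is a per-table property; a minimal CQL sketch (names are placeholders):

```sql
-- Trade bloom filter memory for some extra disk reads:
-- 0.01 is the size-tiered default, 0.1 the levelled default.
ALTER TABLE my_keyspace.user_events
  WITH bloom_filter_fp_chance = 0.1;
```

Existing SSTables keep their old filters until they are rewritten, for example by compaction or by running `nodetool upgradesstables`.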
B
We
can
play
around
with
the
size
of
the
compression
metadata
by
adjusting
the
chunk
length.
We've
never
really
been
a
fan
of
this.
This
can
increase
the
read
latency,
because
now
we've
got
to
decompress
more
data
to
find
the
piece
that
we're
interested
in
this
is
a
more
typical
thing
to
do.
We
can
reduce
the
size
of
our
index
samples
by
increasing
the
in
depth
interval
in
the
yellow
file,
and
we
typically
typically
kick
this
up
to
512
up
from
the
default
128.
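Both knobs, sketched with assumed names; the compression option is per table in CQL, while the sample interval is cluster-wide in the 1.2-era cassandra.yaml:

```sql
-- Bigger chunks mean less compression metadata held in memory,
-- at the cost of decompressing more data per read.
ALTER TABLE my_keyspace.user_events
  WITH compression = { 'sstable_compression': 'SnappyCompressor',
                       'chunk_length_kb': 256 };
```

```yaml
# cassandra.yaml: keep every Nth entry of the primary index in memory.
# Default 128; the talk suggests 512 for large nodes.
index_interval: 512
```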
B: Now, if all that doesn't work (and you'd probably actually do this in conjunction; while you're making those changes you would increase the heap) you can increase the JVM heap up to 12 gigs, sometimes run it at 16. I would say this should be seen as a temporary measure, and the goal should be to get back to running an eight-gig heap. If you're doing this, you can increase the new size of the heap to something reasonable, around a thousand to twelve hundred megabytes.
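Those sizes map to two variables in conf/cassandra-env.sh, which are otherwise computed from system memory at startup; a sketch of pinning them as suggested:

```bash
# conf/cassandra-env.sh: set both together or neither.
MAX_HEAP_SIZE="12G"    # temporary relief; the goal is to get back to 8G
HEAP_NEWSIZE="1200M"   # young generation, the 1000-1200 MB he suggests
```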
B
So
we
had
to
copy
a
lot
less
than
that
nodes
down
for
a
lot
less
remember
if
you
do
that
to
include
the
flag
in
there
to
delete
files
on
the
destination
that
has
no
longer
exists
on
the
source
node,
this
benjamin.
If
you
can
use
raid
0
and
over
provisions,
I
mean
just
you
know,
have
a
little
bit
more
than
what
you
expect,
not
a
huge
number
and
if
you're
in
82,
you
don't
need
to
do
this
because
Amazon's
or
any
provision
thousands
of
nodes
that
you
can
get
one
within
a
few
minutes.
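A sketch of that lift-and-shift flow with rsync (paths are package defaults and the hostname is a placeholder; the delete flag he mentions is rsync's --delete):

```bash
# Pass 1: pre-seed the new node while the old one is still serving.
rsync -avP /var/lib/cassandra/ newnode:/var/lib/cassandra/

# Cut over: flush memtables and stop Cassandra cleanly on the old node.
nodetool drain
sudo service cassandra stop

# Pass 2: copy only the delta; --delete removes files on the destination
# that compaction has since removed on the source.
rsync -avP --delete /var/lib/cassandra/ newnode:/var/lib/cassandra/
```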
B: Now, repair really is something I encourage everyone to use, unless it's taking several days to complete because you've got so much data. If you really need to, you can elect to only use it when data is deleted, in which case I'd recommend that the consistency level is not ONE; it should be QUORUM, to ensure that your data is written to at least two nodes.
B
You
can
also
do
sort
of
more
frequent,
smaller
repairs.
You
don't
have
to
repair
a
whole
piece
piece.
You
can
run
a
repair
that
table
level
or
you
can
run
the
repair
and
individual
token
range
and
if
you've
got
a
very
big
table,
this
was
on
the
jmx
interface
for
a
while.
It
got
moved
on
to
the
node
tool.
Repair
function,
motul
repair
until
I
can't
remember
exactly
what
those
who
have
got
moves
on
there,
but
you
should
be
able
to
use
it
now.
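The smaller repairs he mentions, as nodetool invocations (keyspace and table names are placeholders; the token-range form is the one that moved from JMX into nodetool, around version 2.0 if memory serves):

```bash
nodetool repair -pr my_keyspace                    # this node's primary range only
nodetool repair my_keyspace my_big_table           # one table at a time
nodetool repair -st <start> -et <end> my_keyspace  # one token range
```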
B: Now, compaction. You'll want to over-provision the disk capacity when using size-tiered compaction. Typically, on a modern EC2 node (an m1.xlarge, which used to be the standard build) you have 1.7 terabytes of disk and we'd put 500 gigs of data on it, so in that case we've sort of over-provisioned. If you're running low on space, a little hack you can do that can help is to adjust the min compaction threshold and the max compaction threshold and drop those down.
B
The
number
of
files
they're
going
to
compact
becomes
more
aggressive
in
the
sensor
that
runs
more
frequently,
but
instead
of
compacting
400
Meg
file.
That
will
only
ever
compact
two
and
over
the
only
needs
200
megs,
if
you
are
really
in
a
bind
that
can
allow
compaction
to
make
incremental
improvements
in
philippines
more
space.
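That tweak can be made live, per column family; a sketch with placeholder names:

```bash
# Compact as soon as 2 similarly sized SSTables exist, and never more
# than 4 at once: smaller compactions that need less free headroom.
nodetool setcompactionthreshold my_keyspace my_big_table 2 4
```

This is a runtime setting; to survive a restart, set min_threshold and max_threshold in the table's compaction options as well.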
B
Mobile
compaction
is
a
great
thing
to
use
if
you've
got
a
lot
of
over
rights
and
because
psyche
compaction
doesn't
handle
loads
very
well
or
if
you
you
care,
a
lot
about
latency
and
that
reflect
use
it
where
appropriate.
As
I
said,
it
takes
approximately
twice
the
disk
I/o
if
you're
on
spinning
disk
I'd
use
it
sparingly
on
just
to
get
your
column
families,
but
neither
if
you
are
nesting
you
can
go
crazy.
B
So
a
little
bit
of
background
about
why
some
changes
may
have
happened.
One
point
two
things
we'd
have
to
work
around
if
we're
still
on
one
point
before
1.2,
hopefully
now
convince
you
to
be
using
at
least
1.2
or
2
point
0
and
give
a
bit
of
understanding
about
why
changes
are
going
on
so
version,
1.2
lose
the
bloom
filters
and
the
compression
metadata
off
the
JVM
heap
version.
2.0
move
the
index
samples
off
the
JVM
heap.
B
These
still
take
up
memory,
then
sitting
out
there
still,
but
the
garbage
collector
doesn't
care
about
them
and
we've
reduced
the
size
of
the
working
set.
So
now
our
CMS,
which
before
we
have
to
go
along
and
couldn't
free
up
enough
space,
perhaps
because
the
bloom
filters
were
sitting
there
now
has
loads
of
space.
So
we
can
get
a
much
better
sawtooth
pattern.
B
We
have
lowered
lower
pauses,
you
haven't
in
our
process.
Virtual
nodes
were
added
in
version
1.2,
and
one
of
the
reasons
they're
added
was
to
improve
the
performance
of
the
bootstrap
process.
So
when
you
get
up
to
having
30
or
40
nodes
in
the
cluster,
you
can
get
some
value
for
having
all
those
nodes.
Vinos
deserve.
Having
one
token
range
to
note
have
256
by
default.
So
each
night
is
a
replica.
Each
node
shares
replicas
with
so
many
other
node
at
you.
Eventually,
all
moved
in
the
cluster
share
data
with
another
node.
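Enabling vnodes is one line in cassandra.yaml for a new node (changing it on a node that already has data is not a casual operation):

```yaml
# cassandra.yaml: number of token ranges this node owns.
# Left unset (or 1), the node behaves the classic single-token way.
num_tokens: 256
```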
B
So
now,
when
we
bootstrap-
and
you
know
it
in,
it-
has
256
token
ranges,
replication
factor
that
it
needs
data
for
and
it
goes
and
talks
to
lots
of
other
nodes
in
the
cluster.
So
for
bootstrapping
in
this
environment
and
we've
got
ten
modes
adding
another
one.
All
ten
loads
can
contribute
a
small
amount
of
data
and
becomes
a
lot
easier
to
saturate
incoming
mode.
B
We
also
have
jboard
support
just
a
box
of
tips.
This
really
did
improve
the
way
that
handle
multiple
data
volumes,
see
the
mountable
up
individually
list
them
in
data
files
directory
just
like
before.
But
now
when
we
go
to
write
to
one
a
lot
more
intelligent,
it
will
write
to
the
volume
that
has
the
most
space
that
isn't
currently
being
written
to.
So
we
don't
get
a
thundering
herd
going
to
a
volume
that
suddenly
got
more
the
space
because
compaction
bring
something
up
or
something
like
that.
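The layout itself is the same data_file_directories list as before; the 1.2 change is in how writes are scheduled across the entries. A sketch with assumed mount points:

```yaml
# cassandra.yaml: one entry per individually mounted disk (JBOD).
data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data
    - /mnt/disk3/cassandra/data
    - /mnt/disk4/cassandra/data
```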
B: So the 'ignore' setting for the disk failure policy makes it work like pre-version 1.2, where the exception is handled and the server shuts down. The 'stop' setting says: okay, when you get an I/O exception, handle the exception, mark that that data volume should no longer be used for reads or writes, make that information available via JMX (including via the JMX push notification interface), and then put the node into a suspended state: it disables Thrift and the binary API, it disables gossip, and the process keeps running.
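The behaviours he walks through correspond to the disk_failure_policy values in the 1.2-era cassandra.yaml:

```yaml
# cassandra.yaml: what to do when a data disk throws I/O errors.
#   ignore      -> behave as before 1.2
#   stop        -> stop gossip and client APIs, keep the JVM up for JMX
#   best_effort -> blacklist the bad volume, keep serving from the rest
disk_failure_policy: best_effort
```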
B: 'Best effort' will isolate that volume and no longer read from or write to it, log the information, make it available via JMX and the JMX push notification, and keep running. So if you've got four disks, you've suddenly lost a quarter of the data on this node and it's going to keep on running. If you're using CL QUORUM there are no problems there: the quorum process for reads and writes will still detect it.
B
Loss
of
data
is
returning
data
that
doesn't
match
and
we'll
repair
that
if
using
CL
1
you're
going
to
get
some
stale
data
in
the
best
effort,
it's
really
good.
You
can
then,
when
repair
and
repair
that
data
you
could
replace
the
disk
and
run
repair
if
you
wanted
to,
and
the
process
that
you
go
through,
the
best
effort
is
roughly
the
same
process
which
stop
you
can
stop
happens.
You
can
make
a
decision.
B: So that's a look at what happens when you take Cassandra beyond a billion rows, beyond 500 megs (500 gigs, sorry) of data per node. Hopefully this has also given you some background understanding about why changes happen in Cassandra (why the bloom filters were taken off the JVM heap, and things like that), so when you see some new features come through you can understand why they're there. So I'd like to hand over to Christian now for any questions.
A: Okay, great. So just a reminder to please post your questions in the Q&A tab inside of WebEx; we're getting a lot of good questions coming in. On your screen right now you'll see our upcoming webinars: March 6th, Patricia Gorla, also a member of The Last Pickle, and then on April 3rd we have Cassandra at Lithium. Lithium is a social interaction platform for large enterprises.
B: I can handle that. So yeah, if you're using RAID 0 you're using multiple disks, and one of the reasons we have RAID 0 is that it gives us a nice big data volume; the other reason is the increased performance from striping the writes across multiple disks. Personally, if I were using an m1.xlarge, or any Amazon instance with spinning disks, I would still use RAID 0 to get the best performance, I mean the highest amount of disk ops, out of that volume.
B: If I was using, say, one of the new i2 instances that use SSDs, I'd consider going to JBOD. I don't understand precisely how they would fail, one disk failing and not the other, so you probably want to do some research on that, but I would consider using JBOD just because, with it, if you have a single disk failure you can keep the node working. One of the downsides of JBOD is that you don't have one huge volume.
B: Would that node need much more? I would probably say no. I'm going to assume that your roll-up process is not a latency-sensitive query; by latency-sensitive I mean you're in an API request, you're doing a page refresh or whatever it is. We're going to assume it's a background process. Also, if you're doing time series you're probably not mutating your data, so I'm guessing you're not doing overwrites or deletions; it's probably an append-only data model, normally, and that's the sort of data model that works well with size-tiered compaction.
B: Right. So SSTables on disk are immutable. Once we've read all the data in a particular SSTable and calculated the Merkle tree (which is a hash tree, so the intermediate nodes are a hash of the hashes of the nodes below them, and at the leaf level they're a hash of a range of rows), once we've calculated that, it will never change.
B
Is
we
calculate
that
again
and
again
and
again
you
could
have
a
hundred
gig
SS
table
of
disk,
and
every
time
you
run,
the
repair
will
go
and
calculate
a
hash
on
that
guy,
which
is
a
waste.
So
what
that
ticket
then
2.1
does
is
take
advantage
of
their
and
store
those
those
miracle
trees.
Then
there's
a
bunch
of
clever
logic
in
there
about
understanding
how
what
excess
tables
have
had
been
dropped
and
which
ones
have
been
created
and
that
impacts
on
compassion
of
it.
B: So again, the bootstrap process, even though it's better now. And, you know, there are some physical limits in Cassandra, like just how high a long will count or something like that, but if you've got five terabytes on a node, what will you do when that node fails? How will you build a new node? How long will it take you to get five terabytes back onto it?
B: There are probably not a lot of physical restrictions, like okay, you're going to overflow this long so you can't have this much data. It comes down to things like: how long is it going to take to stream data to the node, how long is it going to take to repair the node, how long is it going to take to do a backup or get stuff off that node, that sort of stuff. It's mostly operational management things; sometimes low-level things like that can kick in and become a concern.
B
So,
even
though
say,
you've
got
five
terabytes
I've.
We
got
like
five
billion
plus
x
56
billion
rows
on
that
mode.
Bloom
filters
at
a
billion
if
we
use
inside
T
compaction.
Looking
at
1.2
gigs
of
data,
so
now
we're
looking
at
six
or
seven
gigs
off
heap
data
for
the
bloom
filters
that
can
take
it
while
to
load
and
now
we're
looking
at
yeah
10,
maybe
20
gigs
of
data
in
a
memory
to
hold
the
JVM
and
all
the
associated
off
heat
information.
And
so
now
you
know
you're
getting
bigger
dis,
operational
concerns.
B: Also, given the choice at this scale, if I had to put, say, two terabytes of disk on something, I'd rather have a couple of disks; I'd rather have two one-terabyte disks or something like that, because now I can use RAID 0 and get twice the disk I/O, or even using JBOD I can write to one disk and write to another disk, so I'm splitting my writes and effectively using the disk I/O that they both provide. So, all things being equal.
B: It's mostly because I started using Cassandra before they were there, and there are some concerns there. Secondary indexes are good in some situations. You might have a query in your data model that only gets used by people on an internal portal, or CRM-type people; it doesn't happen very often and we just need some support for it. That's a good use case for secondary indexes.
B
The
if
you've
got
a
query.
That
is
something
that's
part
of
a
hot
code,
part
I'm
like
okay,
this
happens.
Thirty
percent
of
time
we
do
a
page
refresh
this
happens.
This
gets
called
all
the
time.
Then
I
think
it's
best
to
model
that
as
a
first-class
entity
in
your
data
model
secondary
indexes,
we
need
to
a
query.
We
have
to
go
and
ask
a
lot
of
nodes.
B
We
don't
know
exactly
what
node
has
your
data,
so
they
have
some
reduced
performance
there
and
they
have
some
reduced
availability
because
they
have
to
ask
so
many
nodes
when
we
do
a
bootstrap
process
and
we
will
rebuild
the
secondary
mixes
and
when
we
do
a
streaming
process,
we
will
rebuild
the
secondary
indexes.
As
well,
I
believe,
internally,
secondary
indexes,
are
just
hidden
tables.
B: If we're mostly just writing data and doing batch-type reads, I think you could put more data per node. If you've got operational concerns about how quickly you can replace those nodes, say you're in advertising retargeting or something like that (those guys normally have really big throughputs), you want to be able to replace those nodes pretty quickly. So it's kind of a balance between all those issues.
B: One of the factors in there is the number of column families: we will flush to disk more frequently, and when we flush to disk more frequently there's the extra I/O of flushing, it creates more compaction, and that uses more disk I/O. So I don't think it fixes those problems. Now, we can have multiple tens of column families; I've seen some systems where they're at 400-plus column families, and it's just really painful to do anything.
B: If they don't, then they're just going-forward things. So, where we are now: I'd say we're at 1.2.15, and that line is essentially stable; unless there's a bad problem it's not going to get many updates. 2.0 is at .5 or .6 now, so that's pretty much ready for prime time, and once we get 2.1 into general release, then the chances of anything more going into 2.0...
A: So Aaron, this one is probably a broader, consulting-type question; it's from Yarn. If there's a short answer, great; if not, maybe just go one-on-one with Yarn afterwards. "I have a database that will contain around 50 to 75 billion rows in the future, which is spread over two tables. How stable would it be, and does it make sense to use a structure like that?"
B: Yeah, that doesn't sound any alarm bells for me. We've talked about what happens when you have lots and lots of rows, so the best thing you can do is jump on AWS, get a decent node like an i2 node, and just fill up the data and see what it looks like; you can then look at memory usage and disk usage and things like that. But wanting to store that much in two tables sounds fine, and you'll probably end up with a few nodes in your cluster.
A: Okay, great. So the next question I can take; it's a docs question from one of our attendees. DataStax has done a great job documenting the process for installing Cassandra on EC2; however, there's very little documentation on the errors. For example, I'm getting a "no user data available" error and cannot find any doc to resolve it.
B: Yeah, so we used to also use the m1.xlarges because, for the cost, they were the best. There's another machine (I'm just trying to grab the name here) called the m2.4xlarge, the memory one, and that has 68 gigs of RAM, two 800-gig disks, and lots of cores. So that can be a good situation where you've got a large working set, because you've got 60-odd gigs of memory and you can have a good page cache. Now, the new i2 instances really are good.
B: They've got a lot of cores, and we like having CPU cores to do things like sustain a high write throughput; you can generally get three to four thousand writes per second per core on a node. We also like them for compaction, because a compaction is going to sit there and run and take up one core, so it's good for that. They've got a lot of memory (there wasn't always a huge amount of memory on those m1.xlarges), they're in the 30-to-60-gig range, and they've got SSDs.
B: Again, it comes down to what you do when the node fails and how you're going to replace it. So you can put as much data on the node as you have the capacity for, but what do you do when that node fails, or you want to upgrade it and move to another node? I just think you need to look at it from that point of view.
B
You
can't
you
can't
separate
the
components
of
an
SS
table,
but
what
you
can
do
is-
and
this
is
added
back
around
version,
1
point
0
or
1
point
1,
when
SSDs
we're
still
super
expensive.
The
the
layout
on
disk
is
that
the
directory
for
the
King
spades,
another
directory
for
each
column,
family
and
so
what
it
used
to
be
was
we'd
say:
oh
okay,
you
got
some
some
850
and
some
hard
drives.
Ok,
Maldives
up
now.
B
What
we'll
do
is
put
in
some
symlinks
in
place
so
that
the
column
family,
that
you're
really
sensitive
on
performance
on,
can
go
on
SSD
and
the
Collins
family
that
you're
really
sensitive
on
the
not
so
tentative
one
can
go
in
hard
drive
and
I.
Think
by
the
time
you
make
any
sort
of
change
such
as
okay.
Can
we
split
the
SSD
components?
But
these
things
we
don't
read
much
like
the
bloom
filters
over
here
by
the
time
that
came
in
and
that
was
bedded
in
you'd,
probably
just
beyond
SSDs
Caribbean.
A: Okay, thank you very much, and Aaron, thank you very much again for today's presentation; looking forward to Patricia's next week. Yes, next week. On the screen right now: if you are interested in presenting at the Cassandra Summit, the call for papers is open, and we will also begin selling tickets very soon, so make sure to come along, it is a great event. And then also, if you want to continue your Cassandra learning, you can take the course at the link on your screen, DataStax Academy; that is free training. So...