From YouTube: Apache Cassandra: noSQL, Yes to Scale
Description
In this presentation, SriSatish Ambati is going to talk about The Apache Cassandra Project, a highly scalable second-generation distributed database.
He'll cover:
- Use cases
- Why Cassandra
- Brisk and Hadoop
- FUD: Consistency
- Facebook and Cassandra
- Community, Code, and tools
A
Without further ado, let's look at what Apache Cassandra has to offer. NoSQL is really more about knowing your queries and your applications, and obviously there has already been a lot of good introduction to NoSQL. If you were to take one slide away from my talk, it would be this: you have to know your queries up front, and you have to know them very well. At least eighty percent of your application uses mostly the same queries, so with SQL you are paying a cost for versatility. SQL is great at that.
A
It is very versatile. It gives you a lot of flexibility to change your queries on the fly, at runtime, after you've built your application. But it turns out that most applications, when they scale, end up using only a very small set of that functionality most of the time: the 80/20 rule. As a result, how your application performs really depends on a small set of queries.
A
So imagine you redesign your application, or redesign your schema, so that those queries are answered really well. That's the crux of the problem that most NoSQL stores are trying to help with, and I imagine most applications will end up fitting into that space. There are some applications which would not, and those are not the ones we'll talk about. With that, I'll give a brief overview of the topics. These are just talking points; feel free to stop me and ask questions as we go.
A
It's going to be a run-through: some use cases, why Cassandra, and how Cassandra should be the reason for your NoSQL adventures. We'll talk a little about the use cases, and the talk is peppered with lots of different ones. There are more use cases in my mind, so if you have questions I'll probably answer them with a use case. Try and see whether it fits your kind of use case.
A
There are some obvious discussions whenever Cassandra comes up. Eventual consistency pops up, so we'll talk about that: why and how, which use cases fit and which use cases do not. And obviously the number one reason for adopting Cassandra will be the community, the vibrant community around that code. The code also speaks for itself; it is very new code, so we'll look at that. And there are a number of tools that have been built in the last few months as well.
A
It's a fast-moving project, so the community around Cassandra is definitely rich, and that's one of the reasons I got excited about it. One central use case is users. This is the day and age of millions of users, or the same users producing millions of clicks. You're trying to funnel your users: see what a user did before reaching the shopping cart, where they went, what they tweeted, and try to connect the dots around your user. That's one common use case.
A
We see that when people move to Cassandra, they are looking to funnel and shape their users. Netflix is a common example, a big marquee example for us. They are in production, and for all the movies you're watching, all your user data for Netflix is on Cassandra, running in real time. So a couple of concepts are introduced here.
A
First, your key: you need a key. Most of these NoSQL stores have originated from key-value stores, so being able to build a good key is actually a big part of the answer to your problems. You are representing your data, and your data is also part of your key. So here you key by customer. It's a very read-heavy workload.
A
Your plays on different devices are stored in columns as they happen, so it's a long, growing row of columns. Cassandra can support up to 2 billion columns, and if you're not using lots of columns, then you're probably not using Cassandra's strength. At the same time, you don't want one big fat row, so that's the other thing to look at. Another option is to key by customer and movie. If a bunch of customers are stopping a movie at a particular spot, you want to know about it.
A
That probably points to a problem in that video file. So again, you want to see how the customers are watching. That's the other side: a write-heavy operation, where you're writing all the time. High-speed writes are another common strength that Cassandra brings to the table, and this is one of those write-heavy operations.
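The keying idea above (one wide row per customer, plus a second row keyed by customer and movie) can be sketched in plain Python. This is only an illustration under stated assumptions: the dictionaries stand in for column families, and every name here (`plays_by_customer`, `stops_by_customer_movie`, `record_play`, `plays_for`) is hypothetical, not Netflix's schema or a Cassandra API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two column families:
# row key -> {column name: value}.
plays_by_customer = defaultdict(dict)
stops_by_customer_movie = defaultdict(dict)


def record_play(customer_id, movie_id, device, timestamp, stopped_at_sec):
    """Write one play event into both query-shaped rows."""
    # Row keyed by customer: one column per play, so the row grows wide.
    plays_by_customer[customer_id][(timestamp, device)] = movie_id
    # Row keyed by (customer, movie): where the customer stopped watching.
    stops_by_customer_movie[(customer_id, movie_id)][timestamp] = stopped_at_sec


def plays_for(customer_id):
    """Read path: all plays for one customer, ordered by column name."""
    row = plays_by_customer[customer_id]
    return [(name, row[name]) for name in sorted(row)]


record_play("cust-1", "movie-9", "tv", 1000, 1312)
record_play("cust-1", "movie-9", "phone", 2000, 2710)
record_play("cust-1", "movie-4", "tv", 1500, 95)
```

Each write lands in both rows, so each of the two questions ("what did this customer play?" and "where did this customer stop this movie?") is answered by reading a single row.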
A
Another common use case is time series. We see a lot of customers trying to gather data around their devices, where they want periodic readings, or even around their own applications: they want to know how their production applications are doing and what the performance of the different stacks is. Twitter is a common example. Rainbird is based on Cassandra and collects statistics on different pieces.
A
There are other common examples of time-series-style data. We see a lot of young startups, who are not up there among the big names, using it this way. Cloudkick was one of the earliest time series use cases as well. Then metrics: it turns out that the metrics you gather around data you've already collected become a much larger data set than the actual data. Basically you're trying to get the top 10, or you want to shape your users based on different time series and the footprints they have left, all the data surrounding your data.
A
The stats tables end up growing fast and furious, and that's where you want to attack the problem with something that scales horizontally, without worrying about filling up your disk or about how to partition it. Partitioning comes naturally to time series data. So that's another common use case people have when they're trying to shape their end-user experience.
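One common way to make partitioning "come naturally" for time series is to put a time bucket into the row key, so that no single row grows without bound. The sketch below assumes a plain dictionary as the store and a key format of `<metric>:<bucket>`; both, along with the helper names, are illustrative assumptions rather than anything shown in the talk.

```python
from datetime import datetime, timezone


def row_key(metric, ts, bucket="day"):
    """Build a row key '<metric>:<bucket start>' for a Unix timestamp (UTC)."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    fmt = {"hour": "%Y%m%d%H", "day": "%Y%m%d"}[bucket]
    return f"{metric}:{t.strftime(fmt)}"


def write_point(store, metric, ts, value, bucket="day"):
    """Append one reading as a column (timestamp -> value) in its bucket row."""
    store.setdefault(row_key(metric, ts, bucket), {})[ts] = value


def read_range(store, metric, start_ts, end_ts, bucket="day"):
    """Collect readings in [start_ts, end_ts] by visiting each bucket row."""
    step = {"hour": 3600, "day": 86400}[bucket]
    points = []
    t = start_ts - (start_ts % step)  # align to the start of the first bucket
    while t <= end_ts:
        row = store.get(row_key(metric, t, bucket), {})
        points.extend(
            (ts, v) for ts, v in sorted(row.items()) if start_ts <= ts <= end_ts
        )
        t += step
    return points
```

Writes for the current bucket all go to one row, while older buckets become read-only, and the bucket size bounds how wide any one row can get.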
A
Cassandra inherits a lot of its distribution model from Dynamo, Amazon's large-scale store, and takes its schema from Bigtable. In the distribution model, every node is a peer. It's peer-to-peer, and peer-to-peer has historically proven to be more resilient at scale. If you look at DNS, it's a very peer-to-peer store of IP information, and it has scaled for decades. So being peer-to-peer, with no central point of anything, makes Cassandra very resilient to a lot of things.
A
This is a quote one of our customers made: basically, when an ops guy picks which NoSQL store to use, he eventually gravitates towards Cassandra. That's been the case for most of our customers. They spend less time on the operations of running a distributed data store. When you're spending more and more time sharding your MySQL store, that's when you've arrived at big data, and that's when you're ready for Cassandra.
A
A little on structure. Cassandra is laid out in a ring-like structure, and each one of those points represents a node. Every node has a commit log, which is the classic database concept of a write-ahead log: a sequential, append-only log. When you send a write, depending on the consistency level you've applied to it, you have, for example, three nodes participating in that write. The write comes in and finds the key.
A
The coordinator node finds the nodes that are participating, and the write is logged and stored on one while it is fired off to the other two at the same time. Now, sequential writes are very fast, and what that leaves you with is a very simple write model that also does replication. Replication happens while you're doing the write. It's not postponed until after the fact, after you've built your system up and have your data on it. All your data is being replicated as part of the write.
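The write path described above (find the replicas for a key on the ring, append to each replica's commit log, count acknowledgements) can be sketched as a toy model. Everything here is an assumption for illustration: the `Node` and `Ring` classes, the MD5-based token, and the sequential loop over replicas are not Cassandra's implementation, which sends to replicas in parallel.

```python
import bisect
import hashlib


class Node:
    def __init__(self, name):
        self.name = name
        self.commit_log = []  # append-only, sequential
        self.memtable = {}    # in-memory view of recent writes

    def write(self, key, column, value, ts):
        self.commit_log.append((key, column, value, ts))  # log first
        self.memtable.setdefault(key, {})[column] = (value, ts)
        return True  # acknowledgement


class Ring:
    def __init__(self, names, replication_factor=3):
        # replication_factor is assumed to be <= the number of nodes.
        self.rf = replication_factor
        self.nodes = {n: Node(n) for n in names}
        self.tokens = sorted((self._token(n), n) for n in names)

    @staticmethod
    def _token(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def replicas_for(self, key):
        """Walk clockwise from the key's token and take rf distinct nodes."""
        start = bisect.bisect(self.tokens, (self._token(key), ""))
        picks = [self.tokens[(start + i) % len(self.tokens)][1] for i in range(self.rf)]
        return [self.nodes[n] for n in picks]

    def write(self, key, column, value, ts, required_acks):
        """Replicate as part of the write; succeed once enough replicas ack."""
        acks = sum(node.write(key, column, value, ts) for node in self.replicas_for(key))
        return acks >= required_acks
```

In this sketch replication is not a separate step: the same call that accepts the write also places it on every replica, which is the point the talk makes about amortizing replication into the write.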
A
It's an amortized cost of replication on the write, so you're getting a lot of distribution out of the box. So what does that mean? Why do I care? Back in the day, only the financial services guys would have highly available, multi-data-center uptime. Now everybody can get it. Anybody who's able to have two nodes in two different availability zones will get that kind of availability, whether it's on EC2 or on your own machines.
A
You can do rack-aware placement within your own data center if you want. You get all that at the very inexpensive cost of a single write, since every write replicates. Another example here shows two different data centers, DC1 and DC2. EC2 now has multiple regions, so you can use that on EC2 as well. When you read, you don't necessarily have to cross data center boundaries. The important thing to know is that you can separate your write performance from your reads, so you can continue to answer your queries locally.
A
So you get that performance, and you don't pay for it on the reads. Is anyone here deployed on EC2, and did you get hit by the outage on April 21st? Most people were still watching Netflix, and Netflix was running on AWS, right? Netflix was using the multi-data-center, multi-region, multi-availability-zone features of Cassandra. Other Cassandra customers survived as well. So let's switch gears to fast, durable writes. The reason I got excited about Cassandra was that I ran a benchmark.
A
It was the YCSB cloud benchmark. I looked at the results and thought: maybe the numbers are wrong, maybe the digits are off, maybe the data is not in there. Coming from Coherence and Oracle and other big stacks from the J2EE world, I looked at this number and thought, this doesn't make sense. The writes are super fast. You have to run it to know it. And the data is in there. The crux of that is the commit log that we saw earlier.
A
It's append-only, so there is very little seeking, and cheap, inexpensive disks can get you fast performance. It's not SSDs that are getting you that performance. In fact, even in the cloud, ephemeral disks, which are the local disks, will get you better performance than the more expensive ones. So in some sense writes are the workhorse of Cassandra. You can do a lot of writes, maybe orders of magnitude more.
A
Single-digit-millisecond writes are common for our customers in production. Reads are fast too, and we'll double-click on the read part as well. So: writes in single-digit milliseconds, append-only. The other interesting nuance in Cassandra, which attracts a lot of attention, is the reads: you pay your repair costs while you're reading. When you're reading data, you're actually repairing the distributed system, in some sense.
A
Just like you replicated while you wrote, and amortized the cost of replication at write time, the reads pay the cost of housekeeping: they go through the system, see if there's any data that is off, and fix it. So you're paying the cost of repair while you're reading. Now, it does look like you're paying that cost up front. Is that a good thing? It may not look like it.
A
It turns out the reads are still fast. They're not super slow, and the repairs keep your data in sync most of the time. So that's the second interesting nugget from the Cassandra world: amortized repair actually pays off in the long term. Your data stays in better shape, and no one node is too far off from the others. There are also a bunch of caches, key caches and row caches, and kudos to the HBase team, which implemented off-heap caching.
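The idea of paying for repair during a read can be illustrated with a small sketch: ask the replicas for a key, pick the value with the newest timestamp, and write it back to any replica that is stale or missing it. The function name and the use of plain dictionaries as replicas are assumptions made for this example; it is not Cassandra's read-repair code.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and repair out-of-date replicas.

    replicas: list of dicts mapping key -> (value, timestamp).
    """
    answers = [(replica.get(key), replica) for replica in replicas]
    newest = max(
        (answer for answer, _ in answers if answer is not None),
        key=lambda answer: answer[1],  # compare by timestamp
        default=None,
    )
    if newest is None:
        return None  # no replica has the key
    for answer, replica in answers:
        if answer != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[0]
```

A single read of `"k"` against replicas holding `("v2", 2)`, `("v1", 1)` and nothing would return `"v2"` and leave all three replicas holding `("v2", 2)`, which is how the nodes stay close to each other over time.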
A
We also have off-heap caching in Cassandra now, JNA-based, so you get the benefit of escaping garbage collection and JVM problems. Indexes: secondary indexes are in the new Cassandra world, but for the most part think in terms of a materialized way of looking at things. Don't expect joins; joins are not there. So denormalize your schema to fit your queries. That goes back to slide one: you're paying the cost of materializing your data up front, in your schema. Next, clients and the preferred clients.
A
These days there is CQL. Thrift is the serialization format within Cassandra, and that is still what a lot of customers use. I have seen a lot of Python and PHP in our customer base, and a lot of roll-your-own clients from Scala and Clojure, but Hector leads the pack among the Java clients. Pelops is a simple one that lets you get your hands dirty quickly.
A
Use case number three: Hadoop. It turns out that as a lot of our customers started implementing Cassandra at scale, we would help them get up to speed quickly and see what was going on, and eventually they would tell us the whole story. "We want to improve the performance of reads" is a common question that would come up. We'd look at it and see, say, a Ruby client pulling data from Cassandra or pushing data into Cassandra. Then we'd look further, and that client was actually reading from Hadoop.
A
It could be a Cloudera Hadoop cluster or plain Apache Hadoop. You'd see that Hadoop was siphoning a lot of data from log files and running a bunch of jobs, and then they were storing the results in Cassandra and serving that data from Cassandra, whether to analytics or to real web apps. This turned out to be a pretty common case for our customers, and that led us to invest time in building what's called Brisk. Brisk is a truly peer-to-peer Hadoop, and we'll see where that goes.
A
Brisk is essentially Hive plus HDFS plus Cassandra. That's Cassandra's entrance into the Hadoop space, where we're trying to see how we can solve some problems there. The NameNode has been a problem in Hadoop and HDFS for a little while: it limits how many inodes you can hold. I'm expecting Todd to go deeper into this and explain more of the HDFS and Hadoop internals.
A
But anyone who has seen a Hadoop distribution installed will see that it's limited by how far the NameNode will scale. So we saw an opportunity to make this all peer-to-peer. In Brisk, we basically took HDFS and laid out the core inodes and blocks as plain tables, and any table in Cassandra scales peer-to-peer across all the nodes.
A
Here you see a picture where the nodes with the elephants inside them are Brisk nodes. You can continue to run the rest of your cluster as a Cassandra cluster. This was three months in the making and is now being adopted by several customers. It's a very good play in the BI space, and people who are trying to use low latency and batch together are working with it.
A
We'll double-click on some of the use cases there. Column families are essentially Cassandra's, or Bigtable's, way of talking about tables. You'll hear about column families more as you read through this space. Essentially we took inode and sblocks and made them real tables, and that puts them on Cassandra, peer-to-peer. For low latency you have Cassandra data center nodes, and for batch analytics you use Brisk data center nodes. What does that do for me as an application provider?
A
You're now putting logs through Hadoop into a cluster. You don't need to care which cluster; it's a Hadoop cluster. That data becomes available for you to query through Hive, or through other basic operations, so that you can make small tables that serve near-real-time, low-latency data to the rest of the world. This addresses a problem our customers were working hard on: connecting all the dots between different pieces of the NoSQL space.
A
There's the true high-scale store, and then the tail end of that: once all this data has been processed, you have created a small table, or a small set of tables, that now sit on Cassandra, and you're serving them off of Cassandra to the rest of the world. That's a typical use case that we see.
A
Of course there is the flip use case where you want to put petabytes of data in, and that happens too. Hadoop itself is a market that has taken off, and it's something we are aware of and paying attention to. All right, a pause in the talk for some fun. Let's look at the FUD surrounding the Cassandra space, and we'll also look at some real flaws as well. First, consistency. People talk about consistency in the CAP theorem, and the paper that Bruno put up brings in Nancy Lynch.
A
Nancy Lynch eventually proved the theorem. She also reviewed Leslie Lamport's Paxos paper, which is another interesting tidbit from back in the day. Anyway, consistency: you hear about the R, W, N algebra when you talk about the CAP theorem. What are R, W and N? R is the number of reads, the number of replicas that have to agree on a particular value when you read. W is the number of writes, the number of copies you make sure are written. N is the total number of replicas.
A
Given that, let's look at how this works. The rule is: if your read consistency plus your write consistency is greater than the total number of copies of the data inside your cluster, that is R + W > N, you have consistent data. Now, how does this really work? Let's look at an Oracle two-node failover replication scenario. How does a two-node Oracle setup stay consistent?
A
It reads from one node, so it asks any one of the nodes to answer: R = 1. And every time you write to one node, it makes sure the write also goes to the second node, so W = 2. The total number of copies is always going to be two, N = 2. So R + W = 3 > 2. If you made a third copy and still only wrote twice, then R + W would not be greater than N, and it would be inconsistent.
A
So if you made a backup last night and did not keep writing to that backup, it would be behind today's data. The reason Oracle replication works, even for the big Oracle setups, is that the total number of copies is two and you always made sure that R + W is greater than N. This is the simple logic behind eventual consistency. It's the simple logic by which we're saying this is not going to bite your data. Your data is not becoming inconsistent.
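The R + W > N rule can be checked directly. The sketch below states the rule as a function and, as a sanity check, brute-forces every possible read set and write set to confirm that they always share at least one replica exactly when the rule holds. The function names are made up for this example.

```python
from itertools import combinations


def overlaps(r, w, n):
    """The rule from the talk: reads see the latest write when r + w > n."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def always_intersect(r, w, n):
    """Brute force: does every read set of size r share a replica
    with every write set of size w, out of n replicas?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


# Two-node failover example: read from one node, write to both.
two_node_failover = overlaps(r=1, w=2, n=2)
# Add a third copy but keep writing to only two of them.
stale_third_copy = overlaps(r=1, w=2, n=3)
```

With R = 1, W = 2, N = 2 the rule holds; with a third copy that is not written to (N = 3) it fails, because a read can land on the one replica that missed the write.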
A
There is an inconsistency window in every complex system, and that's a different matter. The eventual consistency model has worked. DNS is the most popular eventually consistent system, and it has scaled for us for years. So we like to think about this more as tunable consistency. Tunable consistency gives you flexibility: you can program consistency for the first time. All along, we paid the cost of being always consistent, all the time, for every little application. Take, for example, the geocode of this particular site.
A
It's not going to change in an eon, so why do I have to put heavy locking around it? It's immutable data. So let me not pay the cost of consistency for it. For the first time you actually have an application paradigm that allows you to program that. Of course, yes, there is a cost with that, and that's the big pushback you get.
A
The pushback is that it's expensive to program while thinking about all of this. But that's exactly what some of our customers are gaining from: being able to say, I don't need to lock these pieces, and these pieces are fine with a consistency level of ONE, while those need a consistency level of QUORUM for high consistency. The Cassandra programming model makes all of these levels of consistency available.
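To make "tunable" concrete, the sketch below maps three consistency level names to replica counts for a given replication factor and reports whether a read/write pairing satisfies R + W > N. The level names ONE, QUORUM and ALL exist in Cassandra, and a quorum is a majority of the replicas; the helper functions themselves are hypothetical and not part of any client library.

```python
def replicas_required(level, n):
    """Number of replicas that must respond at a level, for n replicas."""
    levels = {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}
    return levels[level]


def describe(read_level, write_level, n=3):
    """Summarize a per-operation choice of read and write levels."""
    r = replicas_required(read_level, n)
    w = replicas_required(write_level, n)
    return {"r": r, "w": w, "n": n, "reads_see_latest_write": r + w > n}
```

For example, with three replicas, `describe("QUORUM", "QUORUM")` gives R = 2 and W = 2, which overlap, while `describe("ONE", "ONE")` trades that guarantee for lower latency. Immutable data such as the geocode example can use the cheaper setting; data that must be read back immediately can use the stricter one.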
A
If you want a very highly available system, you write lots of copies and make sure you read from any one of them, or check that any two of them agree on the data, and you're fine. You can have pretty highly available data with that. That's the last piece there. And there is a ton of FUD, especially because N is usually confused with the total number of nodes, which is different.
A
Another common question I get asked is: why is Facebook not using Cassandra anymore? I just typed "Facebook" and "Cassandra" on Quora, and you see a dozen questions all pretty much asking the same thing: which application is using it and which is not. I actually spoke to the team that wrote Cassandra at Facebook recently, happened to connect with them, and asked why. It turns out they only recently removed it.
A
They took the application off Cassandra only a couple of months ago. That was Inbox Search: Inbox Search was the original application running on it, and it did scale for them. The crux is that it scaled for Facebook from 100 million to 500 million users. That's a true story; that's not made up.
A
Your context may not be Facebook's context, whether or not you are running into problems at that scale. It did scale for them, they did use it, and we were all using it as part of that. And the average NoSQL deployment is nowhere near that size. It's very small; the typical one we see is actually 12 nodes.
A
That also gives a hint of another use case, which is search. Cassandra is used in conjunction with Solr as Solandra, an interesting project baked in our labs; it's on Jake's GitHub account. It essentially lets you do Solr with its store on Cassandra. You can store in Cassandra and get large-scale indexes, so you get the same kind of search interface that you had in Solr and Lucene, and use that on top of Cassandra.
A
So that's another common use case we see. Next: eventual consistency is harder to program. That's actually true, but it's also a flexibility you never had before. The C in ACID was always decided for us; whether it was MySQL or Oracle, we were always paying for it.
A
But if you put yourself in the shoes of a person who's sharding MySQL for the first time, you understand that you're reinventing eventual consistency while you're doing that. Some of our customers chose Cassandra because they were filling up shards faster than they could add new ones. The crux of the argument I make is that the average customer has mostly immutable data. If you're using Hadoop, you wrote the log file last night, right? That data is already immutable.
A
It's not changing, so you have mostly immutable data. And complex systems at scale are only partially consistent anyway: if you had a big GC pause on one of the nodes, that node was behind on data. That was true for WebLogic, right? So it was true of the previous generation of stacks as well. Other miscellaneous myths around Cassandra: that you probably get partial rows, or that you have data loss.
A
No. We have a commit log, it's per row, and it's sequential and append-only. If you lose a disk, you can still recover from that with the commit log. We actually have a test for it, and some of our customers have tested it too: they kill the disk and make sure everything is fine.
A
So that node was out of action, fine: one out of six nodes was down and the remaining five kept performing. That's essentially what you're paying for. When you're in the thick of it and you need to scale, that's when you will thank Cassandra. Three more reasons for using Cassandra before I run out of time.
A
One is tools. A bunch of AMIs have come out; the DataStax AMI is the preferred one. OpsCenter is a DataStax tool, again, which allows you to look at your data, look at your cluster, and perform operations on it. It's kind of a JConsole with a very clean presentation of JMX. Every little nuance around Cassandra has been exposed through JMX, so you can actually look at all these metrics.
A
I worked on JBoss when it was pre-1.0, and when I saw the JMX richness of JBoss I was like, wow. Cassandra has made me say wow again, because it tracks everything through JMX. So it's a very well-instrumented project. On metrics, AppDynamics has a pretty cool tool around Cassandra and other apps as well, so definitely check that out.
A
Another big reason that attracts people to Cassandra is that it's beautiful code. It's new code, and it's actually a lot smaller than you think. When I started looking at it, it was about 75,000 lines, and the 0.8 version was close to 90K as of last night. It uses the latest Java, so most people here from Java land will be able to read every piece of it. It uses mostly concurrent collections, such as skip lists, at the core of the architecture.
A
You see very interesting collections. If you're an avid reader of code, you will love Cassandra. It uses annotations, Bloom filters, Merkle trees, lots of interesting data structures. These are really hard problems, guys. Distributed counters are a very hard problem. People have done that before, but you get to invent them again, and it's happening as the product heads toward 1.0. It uses non-blocking I/O and it's staged, which you can see when you run nodetool tpstats.
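Of the data structures mentioned, the Bloom filter is the quickest to show. It answers "definitely not present" or "possibly present" for a key using a fixed number of bits, which lets a store skip work for keys it cannot contain. The class below is a minimal sketch with assumed parameters (1024 bits, 3 hash positions derived from SHA-256); it is not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        """Derive num_hashes bit positions for a key."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> p) & 1 for p in self._positions(key))
```

A key that was added always reports `True`; a key that was never added usually reports `False`, with a false-positive rate that depends on the number of bits, the number of hashes and how many keys were added.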
A
You see the stages there. The current focus is around counters and CQL, which is trying to make it easy for people to get in. That's one of the biggest critiques of Cassandra, and a somewhat valid one so far: it's very hard to get into using it from an end-application standpoint. A lot of gaps exist there; they are being fixed, and we're trying to make a simple client so you can actually use it. The flip side is true for Mongo, where the client is dead simple.
A
It's simple to use; you can use it right away. So, credit where credit is due. Cassandra is now focusing on making that dead simple for end users, and there's a lot of operational smoothing and hardening happening before 1.0, with more of that as we go forward. The community behind Cassandra, as I mentioned, is the biggest reason to choose and learn Cassandra and use it as your NoSQL store. It's very robust. The #cassandra IRC channel is very active.
A
It's active 24x7; I've not seen it empty at any time of the day. Jonathan Ellis, founder of DataStax, is one email away, and usually his email arrives much faster than my own boss's email. These are really passionate engineers, a young set of engineers behind this project: DataStax, as well as other people who work on it. It's a bunch of engineers from independent consultancies, startups such as Reddit, other startups from San Francisco and the rest of the world, and large companies such as Rackspace, Twitter and Netflix. All of them are behind it.
A
It's not a project that's going away. Come join the effort; that's the one thing I would say before I leave. Here are other trends that we see: job trends and downloads. Job trends for Cassandra are up. Another use case is "first NoSQL, then scale": your first NoSQL store may not be the one you end up using for scaling.
A
It's common: it's easy to get in there, and then, when you really need to scale and you're pulling all the operational pieces together by yourself, rolling your own, that's when you move to Cassandra. Job trend two is Cassandra and HBase. HBase is definitely on the rise, and so is Cassandra, so that's good. There's also a job trend that I did not put up, which is the total for all of NoSQL.
A
We want a standard, right? We want to standardize all these, and then we just create yet another query language. That does not necessarily work, but there are different flavors of NoSQL, and you'll hear about the rest of them. There is healthy learning happening within the ecosystem. We learned a lot from HBase, and HBase is picking up pieces from Cassandra. All these things are not static; they're changing as we speak.
A
If you google any of these, some of those blog posts about 0.6 are no longer valid for the 0.8 version of Cassandra, and some of the statements about pieces of HBase or NameNodes will not be true in the future. Things are changing as we speak. These slides are current, but they are dated too. I see a future where all of these will be robust enough to become your database of choice. So, in summary, Cassandra is a high-scale, peer-to-peer distributed database. Questions?
A
The question is about Apache CouchDB. Are you asking about the difference between Apache Cassandra and Apache CouchDB in terms of how Apache treats them? I'm not the Apache spokesperson for this, but I've seen real fairness on their part so far. I don't see a reason one would be preferred over the other. In the Apache way they're mostly peers, so there's no reason to be worried about losing or gaining from an Apache standpoint.
A
Historically, Apache has put community over code, and they've stuck to that. There's a community around CouchDB, and it's going to be there for a long time, so that's probably not the reason to go either way. The main difference is what you said: one is very distributed and peer-to-peer, and the other gets you up and running with NoSQL quickly, and that's good. Both play their roles very well. I think that's more of a good panel question.
A
It would be great to have the other speakers finish, and then we could get back to the same question, if that makes sense. The question is how these different players work. I think Cassandra is clearly designed to make sure your data gets written durably and written across lots of nodes, so you don't need to worry about availability, and it's partitioned. It's clearly focused on making availability and partitioning the key focus.
A
So
if
you're,
if
you're
a
small
website,
you
don't
have
to
be
a
netflix
or
a
big
large
company
to
be
using
Cassandra
if
you're
a
small
and
we
have
a
lot
of
small
startups,
actually
using
it
If
you're
on
EC2
and
you
don't
want
to
be
tied
to
the
East
region:
even
today,
there
was
an
EC2
outage.
So
if
you're
worried
about
uptime
on
your
EC2
setup,
you
should
be
thinking
about
Cassandra.
So
it's
for
when
you
can't
fit
things
in
memory
on
one
box.
As
John
said,
A
6
billion
rows
can
actually
fit
in
memory
on
one
box
these
days,
but
once
you
start
mining
that
data
and
creating
data
around
it,
that's
when
data
falls
off
one
node.
So
if
you're
doing
multi-node
data
stores,
and
you
can't
pay
the
big
bucks
for
Exadata
and
the
big
systems,
then
your
choice
is
having
to
partition
it,
having
to
run
on
multiple
nodes,
and
that's
the
context
behind
NoSQL,
essentially.
A
So
the
question
is:
at
what
point
do
I
move
from
Mongo
to
Cassandra,
right?
It's
a
much
longer,
more
nuanced
question,
and
I
apologize
for
chopping
it.
Mongo
is
dead
simple
to
program.
I
mean,
even
I
love
that:
it's
an
interface
that's
very
simple
and
speaks
like
Java.
You
write
it
and
it's
there
right
d,
quite
the
deep,
it's
JSON-style
tables,
so
everything
is
JSON.
As
a
JavaScript
developer
you
understand
that;
you
can
see
what
your
app
looks
like.
A
So
it's
a
very
good
programming
world
for
end-users.
What
happens
is
when
it
actually
falls
to
a
larger
than
one
node
set
up
when
you
start
putting
many
nodes
on
it.
When
your
data
scales,
remember
the
whole
reason
we
were
coming
to
NoSQL:
throwing
away
all
the
goodies
that
Oracle
built
for
us,
or
IBM
built
for
us;
I
mean
all
the
big
databases,
right?
That's
the
database
theory
of
10
or
20
years.
We
threw
all
of
that
off,
as
well
as
the
caching
vendors,
because
we
want
scale,
right?
A
That's
when
people
move
from
using
a
traditional
MySQL
into
Cassandra.
So
the
use
cases
we
see
today
are
where
suddenly
your
app
becomes
more
popular
than
you
expect,
and
then
you
see
a
higher
number
of
users,
a
higher
number
of
reads
and
writes,
and
write
speeds
obviously
speak
very
well
for
Cassandra.
So
language
is
not
the
reason
people
are
moving
to
Cassandra.
A
The
main
reason
they're
moving
is
really
because
of
scale,
and
being
able
to
kill
your
app
mercilessly.
I
mean,
I
actually
tell
my
customers,
if
they
are
worried
about
restarting
their
Cassandra,
just
kill
it.
It's
designed
to
replay
everything
that
was
in
memory
and
not
yet
flushed
from
the
commit
logs:
every
time
it
comes
up,
it
replays
everything.
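To make the "just kill it" point concrete, here is a minimal sketch of the commit-log idea: every write is appended to a log on disk before it is applied in memory, and on startup the log is replayed. This is a toy under stated assumptions, not Cassandra's implementation; the class `ToyStore`, the JSON-lines log format, and the file name are all hypothetical.

```python
import json
import os
import tempfile


class ToyStore:
    """Hypothetical key-value store with a replayable commit log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # rebuild in-memory state on every startup
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Re-apply every logged write, oldest first."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def write(self, key, value):
        # Append to the log and force it to disk before touching memory.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.memtable[key] = value


def demo():
    path = os.path.join(tempfile.mkdtemp(), "commit.log")
    store = ToyStore(path)
    store.write("user:1", "alice")
    store.write("user:2", "bob")
    del store  # simulate a hard kill: the in-memory table is gone
    recovered = ToyStore(path)  # startup replays the log
    return recovered.memtable
```

Calling `demo()` returns the same two keys after the simulated kill, because nothing was acknowledged before it reached the log.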
We
actually
believe
that
the
apps
that
live
long
enough
are
designed
so
they're
fault
tolerant
A
that
way.
So:
fault
tolerance,
when
you're
getting
to
data
sets
that
don't
fit
on
a
small
number
of
nodes,
and
replication.
When
you
do
multi-data-center
replication,
too.
And
all
these
features
are
coming
into
those
databases.
So
I
will
let
those
authors
speak
for
themselves.
Actually,
those
are
the
features
that
they
are
also
seeing
customers
want;
the
market
wants
the
same
thing.
They
have
the
same
problems.
They
want
availability.
They
want
some
level
of
consistency.
A
They
want
partitioning,
right?
So
the
market
wants
all
of
them,
actually,
and
they
want
them
all
to
be
easy
to
use.
That's
what
the
average
designer
wants,
and
so
all
these
NoSQL
stores
are
slowly
heading
towards
that.
So
anyway,
I
hope
I
answered
that.
There
are
some
numbers
around
it
too,
but
really
when
you
try
to
get
beyond
a
few
nodes,
you
quickly
find
yourself
at
the
mercy
of
mongos
scale
yeah.
So
the
question
is:
how
does
Cassandra
handle
backups?
A
So
it's
not
something
you
need
to
worry
about.
Our
customers
take
incremental
backups,
and
there's
a
global
snapshot.
So
you
can
take
a
global
snapshot,
and
you
can
take
node
snapshots
and
store
these.
You
have
commit
logs
and
data
files;
they
are
usually
run
on
different
LUNs
as
well,
so
they
get
good
performance.
So
you
take
snapshots,
and
then
you
take
incremental
backups.
Our
customers
take
copies
of
these
files
and
put
them
on
S3,
for
example,
and
then
take
incremental
deltas
on
them
on
their
local.
A
So
it's
a
regular
file
system
backup.
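Since the backup described here is plain file copying, it can be sketched in a few lines: one full copy of the data directory, then later runs that copy only files the backup does not have yet. This is an illustrative toy, assuming data files are never modified after they are written; the function names, the `.db` file names, and the local directory standing in for S3 are all hypothetical.

```python
import os
import shutil
import tempfile


def full_snapshot(data_dir, backup_dir):
    """Copy every file in data_dir to backup_dir; return the names copied."""
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(backup_dir, name))
            copied.append(name)
    return copied


def incremental_backup(data_dir, backup_dir):
    """Copy only files that are not in the backup yet (the delta)."""
    os.makedirs(backup_dir, exist_ok=True)
    already = set(os.listdir(backup_dir))
    copied = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src) and name not in already:
            shutil.copy2(src, os.path.join(backup_dir, name))
            copied.append(name)
    return copied


def demo():
    root = tempfile.mkdtemp()
    data_dir = os.path.join(root, "data")
    backup_dir = os.path.join(root, "backup")  # stands in for remote storage
    os.makedirs(data_dir)
    for name in ("table-1.db", "table-2.db"):
        with open(os.path.join(data_dir, name), "w", encoding="utf-8") as f:
            f.write("rows")
    first = full_snapshot(data_dir, backup_dir)
    with open(os.path.join(data_dir, "table-3.db"), "w", encoding="utf-8") as f:
        f.write("more rows")
    delta = incremental_backup(data_dir, backup_dir)
    return first, delta
```

In `demo()`, the first pass copies both existing files and the second pass copies only the file written afterwards.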
The
actual
app
does
not
have
any
state,
so
all
the
state
is
in
the
YAML
files
or
the
database
and
log
files.
It's
all
file-based,
and
we
don't
do
anything
fancy
on
the
file
itself.
It's
really
raw
data,
hex
data,
so
there's
nothing
that
stops
you
from
taking
that
snapshot
and
creating
a
new
cluster.
So,
yeah.
And
one
of
the
features
of
OpsCenter
is
that
you
can
actually
press
a
button
and
do
the
same
thing:
get
a
snapshot.