From YouTube: C* Summit 2013: Jonathan Ellis Keynote
Description
Speaker: Jonathan Ellis, Apache Cassandra Chair and DataStax CTO
Slides: http://www.slideshare.net/jbellis/cassandra-summit-2013-keynote
Keynote for Cassandra Summit 2013
So we're coming up on five years since Cassandra was open sourced, and we've made a lot of progress in those five years. On the technical side, we've tried to focus on three core values that we can lead the industry in and be proud of: massive scalability, high performance, and bulletproof reliability, and we've gotten more attention on our successes here over the past year.

Like Billy said, one of the factors in that was a paper published at the Very Large Data Bases (VLDB) conference by some researchers at the University of Toronto. What they did was benchmark Cassandra against some leading NoSQL databases to see how the performance stacked up. This graph shows one of several workloads that they studied — reads, writes, and sequential scans, all three in a single workload — and it's fairly representative of what they found, in the sense that Cassandra led the field in performance, and that was consistent across all the workloads.
One of the things that's especially interesting about this one, though, is that you can see at the bottom, kind of hugging the x-axis, the line for MySQL. They also threw MySQL into this benchmark as kind of a control point, a known quantity, and as they grew the cluster size they did shard MySQL to make it a fair comparison. MySQL almost kept up with Cassandra in the read-dominated workloads, but when you started to throw in the different kinds of operations, MySQL kind of fell apart and wasn't able to keep up. The fundamental reason behind that is that Cassandra's log-structured storage engine is able to mitigate the contention between the reads and the writes to some degree, whereas the B-tree-based storage engine of things like MySQL was not.
The one thing that I missed in the University of Toronto's benchmark is that it didn't cover MongoDB. 10gen's marketing team has done a great job, and people often ask me: I've heard of NoSQL, I've heard of MongoDB — how does Cassandra compare with MongoDB in performance? So DataStax commissioned a company named End Point to redo basically the same benchmark that the University of Toronto did, but add MongoDB to the mix, and on this graph you can see Cassandra, HBase, and MongoDB together.

It's not a linear increase, because if you look at the x-axis we're actually doubling the cluster size for each data point rather than increasing it by a linear amount — that's why the graph curves up like this. But you can see that we're consistent with the VLDB benchmark in that Cassandra's about four times faster than HBase: we're looking at about 34,000 ops per second here for the 32-node cluster with Cassandra, about 8,400 with HBase, and about 1,800 with MongoDB.
This was actually consistent across all the different workloads that we looked at, not just the ones that included writes, like this one — which was a little surprising to those of us who are familiar with the architectures. We kind of expected MongoDB to do poorly on the workloads that involved writes because of lock contention, but it actually did about this well on the read-dominated workloads too. So again, Cassandra's really proving to be the performance leader across the board.
Now, this past year we had Cassandra literally weather Hurricane Sandy taking out data centers, and we've had people power through Amazon Web Services outages — Cassandra's really good at dealing with large-scale failure like that. One thing that historically we haven't addressed is dealing with losing individual disks, and so we've recommended deploying Cassandra in a RAID-1 configuration. That's kind of silly if you think about it, because if you're dealing with a system that's replicating your data across a cluster, there's no reason to throw away half your performance by doing hardware-based replication on each machine as well.
So we wanted to address that, and we did in the 1.2 release early this year — in the lower right you can see Jake's comment; people are already seeing this work in production. But the biggest story for 1.2 is that we really want to make Cassandra easy to use and really bring down the learning curve.
If Cassandra has had a reputation among architects and developers, it's probably that it's powerful, robust, and performant, but has a steep learning curve — and I'd say that would have been a fair characterization up until recently.
We've been aware of this and working on it for the past couple of years. I first talked about CQL, the Cassandra Query Language, at the 2011 Cassandra Summit, so we've literally been working on this for two years, and CQL has always been the future of Cassandra — but not quite yet.
Today, CQL is the present. It's ready for prime time; we are seeing it drive that learning curve down and really deliver on the promise of making Cassandra much, much easier to use. Now, there are some key concepts you want to be aware of, whether you're a Cassandra veteran or new to Cassandra: you do need to know a little bit about the CQL worldview and how to design your application to make sense of that.
There are actually two talks that deal with this, so you'll get a chance to see at least one of them: the next talk, on data modeling, and The State of CQL. I highly recommend seeing one of those two if you're not one of the few people who's running CQL in production already.
The first question I get when I talk to an audience like this one, that is familiar with Cassandra, is: is it backwards compatible? Do I have to do some kind of dump and load to get my data so that CQL can make sense of it? I've written a couple of blog posts now that go into the technical details about how this works; the short answer is yes, it's fully backwards compatible.
CQL is not just a subset of SQL; it also adds to it in ways that make life easier for Cassandra developers. One of those additions is collections. Suppose I had a table of users and I wanted to add email addresses to it, with multiple email addresses per user. In the relational world I'd create a table for that and join it to my users, but we don't do joins in Cassandra, for performance reasons. So what we're going to do instead is use the new collections to say: I want a set of email addresses to be in this column. We also have maps and lists available to use, and they're just really convenient for this kind of denormalization.
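As a sketch of what that looks like in CQL (table and column names here are mine, not from the slides):

```sql
-- A set of email addresses lives right in the users table,
-- instead of in a separate joined table.
ALTER TABLE users ADD emails set<text>;

UPDATE users SET emails = emails + {'jonathan@example.com'}
WHERE user_id = 'jbellis';

-- Maps and lists work the same way:
ALTER TABLE users ADD top_scores list<int>;
```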
We've also added a data dictionary in Cassandra 1.2 to expose what Cassandra knows about itself and its cluster to clients and applications. This is replacing — or rather, exposing to CQL — the old describe cluster, describe ring, and describe schema commands that you'd have in the Thrift API. In the system keyspace there's a bunch of tables holding different information. The table named local holds information about the node you've connected to; the table peers holds information about the rest of the cluster — what has been gossiped to me, what do I know about the other nodes in the cluster; and the schema tables hold the schema for your application and the metadata about that.
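So in 1.2 you can ask a node about itself and its peers with ordinary CQL — the exact column sets vary a little by version, so treat this as a sketch:

```sql
-- The node I'm connected to
SELECT cluster_name, data_center, rack FROM system.local;

-- What this node knows about the rest of the cluster, via gossip
SELECT peer, data_center, rack, tokens FROM system.peers;

-- The schema tables
SELECT keyspace_name FROM system.schema_keyspaces;
```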
There's an authenticator configuration in cassandra.yaml that defaults to AllowAllAuthenticator — the default on both the authenticator and the authorizer is to allow everything, same as it's always been. Optionally, you can opt in to requiring authentication. The password authenticator that Cassandra ships with stores encrypted passwords directly in a Cassandra system table; in DataStax Enterprise we also ship a Kerberos authenticator that allows you to integrate with external authentication systems. The way the password authenticator works is that it allows you to run CREATE USER and ALTER USER statements.
The first time you start your Cassandra cluster with the password authenticator enabled, it will create a hard-coded user named cassandra, with password cassandra — so, not very secure. The first thing you should do is create a new superuser, which will then be able to create further users for you, and then we recommend changing the password on the cassandra user or dropping it entirely.
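A minimal hardening sequence might look like this (user names and passwords are placeholders):

```sql
-- Logged in as the default cassandra/cassandra account:
CREATE USER dba WITH PASSWORD 'longrandompassword' SUPERUSER;

-- Then, as the new superuser, neutralize the well-known default:
ALTER USER cassandra WITH PASSWORD 'different-long-random-password';
-- ...or remove it outright:
DROP USER cassandra;
```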
Authorization allows you to grant different roles and privileges to users across your tables and across your keyspaces. This is not new in the database world, but it is new to NoSQL, so Cassandra is really leading the industry in bringing enterprise-level features to the applications that need them.
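A sketch of what grants look like in CQL (keyspace, table, and user names are illustrative):

```sql
GRANT SELECT ON KEYSPACE analytics TO reporting;
GRANT MODIFY ON analytics.events TO ingest;
REVOKE SELECT ON analytics.events FROM reporting;
```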
And finally, what we've been doing concurrently with this is working on a really first-class driver story for client applications. Part of 1.2 was a new native protocol that allows us to do things the old Thrift-based protocol couldn't. The Thrift protocol, of course, is RPC based: the client calls a method that contacts the server and then gets back a response.
What we wanted was an asynchronous protocol that allows the server to notify the client of things like a new node joining the cluster, or a new table being added to the schema — we want to be able to push those to the client asynchronously, and the native protocol allows us to do this. It's also lighter weight, so it's going to be higher performance than the chattier protocol would be on the wire.
The first production-ready driver based on the native protocol is the Java driver; it went generally available a few weeks ago, and there are already applications in production on it. The .NET driver is in beta and will be GA shortly. We're also working on Python, PHP, and Ruby drivers, and working with the community on Perl, Erlang, Clojure, and Haskell drivers — so you're going to see CQL native drivers for whatever your weapon of choice is coming soon.
Another thing we added to 1.2 that's actually not CQL specific, but is well integrated with CQL, is tracing.
The way you use tracing from CQL is you just say TRACING ON in the CQL shell, and then whatever commands you issue until you turn it off will be traced. What we're trying to do here is bring Cassandra troubleshooting out of the dark ages a little bit. You've really needed to be something of a Cassandra expert to understand why an operation is slower than you think it should be, and we want to expose that to the average developer and give some more insight into what's going on under the hood. On this slide I'm doing a simple insert, and you can see how the trace goes from the coordinator node, in blue, to the replica node, in red, and back — a fairly simple operation.
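In the shell, that whole flow is just the following (the statement being traced is illustrative):

```sql
TRACING ON;

INSERT INTO users (user_id, name) VALUES ('jbellis', 'Jonathan Ellis');
-- ...trace output follows, listing each step on the coordinator
-- and replica along with the elapsed microseconds...

TRACING OFF;
```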
So I want to look at something a little more interesting. A thing that people often do with Cassandra, and that is tempting because of how the data model works, is using Cassandra as a queue — a durable queue in particular. So I'll create my queues table, and I'll say I want to cluster on the creation time of the queue events, so Cassandra is going to order the rows in this queue by that creation time — that's what that does for me. Then I'll pull off events and delete them as I consume them.
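A sketch of that queue schema — clustering on the creation time so Cassandra keeps the rows ordered (names and types here are mine):

```sql
CREATE TABLE queues (
    id         text,       -- which queue
    created_at timestamp,  -- clustering column: rows sort by this
    payload    text,
    PRIMARY KEY (id, created_at)
);

-- Produce:
INSERT INTO queues (id, created_at, payload)
VALUES ('myq', '2013-06-11 09:00:00', 'event 1');

-- Consume: read the oldest event, then delete it
DELETE FROM queues WHERE id = 'myq' AND created_at = '2013-06-11 09:00:00';
```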
So what happens when we've done this a bunch of times? I've enqueued ten thousand events, and I've pulled all but one of them off, and now I'm going to trace the SELECT to get the last one. All I have to do is say SELECT ... WHERE ... ORDER BY ... LIMIT 1. Since I'm clustering by that creation time, it should be a constant-time operation to grab that next event, right?
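That is, something like:

```sql
TRACING ON;

SELECT * FROM queues
WHERE id = 'myq'
ORDER BY created_at
LIMIT 1;
```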
But look what happens in the trace. In the green, halfway down, we can see that it goes from taking small numbers of microseconds to 35,000 microseconds — 35 milliseconds — and the trace event gives you a hint as to why. It says: to read that one row you asked for, I had to merge 20,000 cells that I had to skip because they weren't live anymore. They were tombstoned.
This is the dark side of a log-structured storage engine. The deletes are super fast, because I'm not actually overwriting anything in place — I'm just appending a new deletion record, called a tombstone, that says this row doesn't exist anymore. But when I go and do the SELECT, I actually have to scan past all those deleted events and check them against the tombstones and say: yeah, this one's deleted; okay, this one's deleted too...
...until I get to the one that's not deleted, and then I can give it back to you. So that's something to be careful with. Now, this can be worked around: you can give Cassandra a hint, for instance, that says WHERE id equals my queue AND created_at is greater than some time that I've already read off the queue, and that lets me skip past those tombstones with the index. This is just an example of how tracing can shed some light on what's going on.
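The workaround looks like this — the consumer remembers the position of the last event it processed and starts each scan there (the timestamp literal is a placeholder):

```sql
SELECT * FROM queues
WHERE id = 'myq'
  AND created_at > '2013-06-11 09:00:00'  -- position already consumed
ORDER BY created_at
LIMIT 1;
```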
I talked about virtual nodes at last year's summit, and we also have a talk on that later in the program this year. What I want to talk about today is the improvements we made to allow Cassandra to deal with an order of magnitude more data per node in the cluster than it's been able to previously.
You can typically allocate up to about eight gigabytes of memory on the heap before garbage collection pause times start to become a problem — even with a system like Cassandra, where we've worked really hard to avoid fragmenting the old generation and causing stop-the-world pauses. You'll still have young-generation pauses, and those start to get longer as your total heap size grows as well. So eight gigabytes is kind of our rule of thumb for where we're comfortable.
Modern machines, of course, have much more memory than that, and Cassandra can make use of that as page cache for the files on disk that we're accessing. But that eight gigabytes becomes a problem, because we keep a bunch of metadata about the on-disk data in memory, and that metadata is proportional to how much data we have in total. There are actually four places where we keep metadata in memory on the heap in 1.1, and for 1.2 we've moved those off heap. Of those four things, three grow proportionally to the total data size; I'll go through them.
The first of those is the bloom filter. When I go to read a row from Cassandra, first I check the partition key against the bloom filter of each data file and ask: do you even recognize this partition? Do you have any data for it? If the bloom filter says no, then we're done — we move on. If the bloom filter says probably, then we go to the next step, which is to check the key cache. If we already have the index entry for this partition in the cache, we can skip to the last step. If we have a cache miss, though, then we go to the partition summary and do a binary search against that to find out approximately where on disk that index entry is going to be, and then we can fetch it from disk — so this could be the first time that we hit disk in this operation.
After we have that index entry, we go to the compression offset map. Compression is enabled by default starting in 1.1, because even though you burn some CPU going through this compression offset map, having compression enabled makes your page cache so much more effective that, on balance, it's almost always a win. You can disable it,
but it's generally better to keep it on. Then, once we know where the compressed block is that has the data we're looking for, we can finally fetch it from disk. Of the things I've drawn above the memory line in this slide, only the partition key cache has a fixed size; the others grow as the data set grows. So we moved the bloom filter off heap in 1.2, and that's going to be about one to two gigabytes per billion partitions.
In the extreme case you can have one partition per row, of course, so you can easily have billions of these entries on a single machine. It's tunable, depending on whether you want to trade memory for performance, but you're looking at a range of one to two gigabytes per billion partitions. The next largest one is the compression offsets, where you're looking at about one to three gigabytes per terabyte compressed. This depends on your tuning parameters — you can make the compression blocks larger.
It also depends on how much compression you're getting from your particular data set: the more your data compresses down, the more compressed blocks you're going to have, and the larger the compression offsets table is going to be. The last one, the partition summary, is not off heap until 2.0 — we've got the code written, it's in the 2.0 branch, but it wasn't stable in time for 1.2.
So we left that out until 2.0, but we did make some optimizations in 1.2.5 to cut down the size of that partition summary, by using raw longs and so forth inside the JVM instead of boxed numbers. Also in 1.2.5, we improved compaction throttling substantially. We've had compaction throttling since 0.8, I think, and that allows you to set a target rate for how much I/O you want the compaction system to be able to use.
The idea is you set your compaction throttle in megabytes per second, and that leaves plenty of I/O for your reads and writes to continue happening without being affected much. The problem is that until 1.2.5, we could only throttle on partition boundaries. So what you tended to get was an I/O graph that looked like this, where it would go full speed ahead for a partition's worth of data, then get to the end of that and say: oh, I've been going way faster than I'm supposed to. Now you get smooth limiting, no matter what part of the data we're reading.

We're also pleased to announce that we'll be shipping DataStax Enterprise 3.1 at the end of this month, with Cassandra 1.2. One of the things about CQL that I'm not going to go into today is that it helps Hadoop and Solr a lot, by giving the data a more regular structure than you had with the old Thrift-based API. DataStax Enterprise 3.1 will also include the most recent release of Solr, with all the bug fixes and improvements that entails.
We're also going to release, at the same time, a beta of DataStax DevCenter, which is basically an IDE for CQL — for browsing your data, querying it, prototyping, and tracing things in a visual manner. That will be in beta at the end of this month and generally available, we expect, in Q3.
So I want to switch gears a little bit now and talk about 2.0 — what's coming next. We're planning to release 2.0 at the end of July, and part of the reason we're calling it 2.0 is that we're going to do some housecleaning and get rid of some of the clutter that we've accumulated over the last five years. This is my favorite part of 2.0.
Among the things we're getting rid of: since we introduced vnodes, you shouldn't be dealing with manual token management anymore. This is what you had to do in kind of the bad old days, in 1.1 and earlier, before we had vnodes: when you added a machine to the cluster, we had a system that would try to guess the best token for that node to join the cluster with, and it would try to bisect the range of the most heavily loaded other node in the cluster.
A
This
didn't
this
never
worked
as
well
in
practice
as
it
sounded
on
paper,
and
so
the
best
practice
was
to
manually
assign
tokens
to
nodes
that
that
you
added
to
the
cluster,
of
course
with
vnodes.
You
don't
need
to
worry
about
that.
So
for
2.0.
What
our
message
is?
You
should
probably
upgrade
to
vnodes
if
you
haven't
already,
but
if
you
don't,
then
you
should
continue
specifying
tokens
manually,
we're
getting
rid
of
this
old
token
range.
Bisection,
stuff,
we've
also
removed
super
columns
and
I,
I
hasten
to
add,
we've
removed
them
internally.
We've tried in the past to maintain compatibility, basically since 0.6, possibly earlier. In practice, we've basically been able to deliver network compatibility with just the previous major release.
So if you wanted to do a rolling upgrade with part of your cluster on 1.0 and part of it on 1.1, you really needed to stick to just those two major releases; to upgrade to 1.2 without downtime, you needed to be on 1.1. In 2.0 we're kind of making that official and saying we're going to get rid of the legacy compatibility and clean that out. Now for what's new in 2.0 — a little more exciting.
I'm going to talk about two of these today: the compare-and-set operation, which is a kind of lightweight transaction, and I'll also talk a little bit about triggers.
The problem we're trying to solve with compare-and-set is that eventual consistency is not a silver bullet. Even though Cassandra offers tunable consistency, where you can opt for a strong version of eventual consistency, even that is not enough when there are operations that you need to make happen one at a time — sequentially — because eventual consistency is always fully concurrent. The classic example is two users racing to create the same account: each checks whether the account exists, each is told no, it doesn't exist, so they both issue create commands, and one of them walks away thinking he created the account when it actually got scribbled over by the second guy microseconds later. What we've typically tried to do in this case is wrap the operation with a lock, such as with ZooKeeper, and this doesn't work — or rather, there are corner cases in which it gets you into trouble. I'll illustrate one of those.
Suppose I've got a client that's doing the locking, and he's going to create the jbellis account. He tells the coordinator: insert this record. The coordinator relays it to the replica and says, insert this record — and then the replica goes down. We don't know whether it got to write the record or not. So then the coordinator will say, okay:
well, he went down, and I'm not sure if he wrote it or not, so I'm going to store a hint that says when he comes back up, I'll replay that insert to make sure he got it — and then I'll tell the client that I timed out, I didn't hear back from him, I'm not sure. Whenever you get a timeout from Cassandra on a write, this is what happened.
The coordinator has written a hint to make sure that the operation does happen, but in the meantime we're not sure whether you're going to be able to read it yet. So the client here is in a dilemma. It cannot release the lock until it knows that that record has been written successfully, because until it has, I could still have another client come in and ask: hey, has it been written yet? The replica says no — before the hint gets replayed — and then that second guy goes and tries to create the account anyway.
You can try to wrap some band-aids around this and say, well, let's turn off hints for some classes of operations — you can try to put some duct tape around this — but it doesn't work, especially as you generalize this to more than one replica.
So what we do in 2.0 to provide a serial, or immediate, consistency option is use a consensus protocol called Paxos. If you're familiar with two-phase commit, you can think of it as similar to two-phase commit, but with a couple of key changes. First, all operations are quorum based, so it can tolerate losing replicas: as long as you still have a majority, you can make progress.
Another is that during the prepare phase, each replica will send the leader information on unfinished transactions that were in progress and that a previous leader did not finish — so you've got kind of a built-in failover for the Paxos leader. The best resource to learn more about this is a paper called "Paxos Made Simple."
A
It's
it's
a
very
elegant
protocol.
I
don't
have
time
to
go
into
it
in
more
detail
today,
but
the
the
short
version
is
that
a
cast
operation
in
cassandra
will
do
three
round
trips
and
the
each
of
those
round.
Trips
is
making
changes
to
the
paxo
state
that
is
written
to
the
commit
log
and
so
you're
doing
three
times
as
many
round
trips.
A
We
also
have
a
we're,
also
introducing
consistency,
level,
dot
serial
that
lets
a
reader
come
and
say,
hey.
I
want
to
know
I
want
to
buy
into
this
immediate
consistency
world
and
get
information
on
any
updates
that
have
been
accepted,
but
not
yet
committed,
and
that's
what
that
lets
you
do
so.
The
the
bottom
line
is
that
this
is
really
useful
for
the
one
percent
of
your
application
that
really
needs
that
that
serial
consistency
world,
but
because
of
the
performance
implications,
it's
you
know
you
don't
want
to
build
your
whole
application
around.
Christos is giving a talk later today on how eventual consistency works — great background on that aspect of Cassandra. The way this works from a client perspective is that we've added an IF keyword to CQL. So the "create the jbellis record if it doesn't exist" query that I've been talking about would look like the one at the top: I just say IF NOT EXISTS and add that to my INSERT. I can also perform a CAS operation against a row that does exist; then I just add the predicate for that CAS operation to the IF, as in the bottom query here. And I've just shown a single column here, but it absolutely can affect multiple columns.
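So the two forms look roughly like this (column names are illustrative):

```sql
-- Top query: create the account only if nobody beat us to it
INSERT INTO users (user_id, name)
VALUES ('jbellis', 'Jonathan Ellis')
IF NOT EXISTS;

-- Bottom query: compare-and-set against an existing row
UPDATE users SET email = 'jonathan@example.com'
WHERE user_id = 'jbellis'
IF name = 'Jonathan Ellis';
```

Either statement comes back with a column telling you whether the condition held and the write was applied.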
So if you're wondering whether this means that Cassandra can now be appropriate for those applications, the answer is yes — that's one of the reasons we wanted to add it. The other, of course, is that we don't want you to have to fall back to broken locking, or to using some other database for the parts of your application that need immediate consistency.
Moving on to triggers: this is what the syntax for creating a trigger looks like in 2.0. We had a late-breaking change this morning, actually — 2.0 is still in flux — so it actually says USING class name instead of EXECUTE class name, but this is the idea. When you see a class name on the slide, you're probably thinking, oh, this is low-level Java code — and that's absolutely what's going on here.
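With the USING form, creating a trigger looks roughly like this (the trigger, table, and class names are placeholders):

```sql
CREATE TRIGGER audit_trigger ON events
USING 'com.example.AuditTrigger';
```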
We're giving you maximum power — and maximum pain-in-the-ass factor. You implement this interface called ITrigger, and you're dealing with ColumnFamily objects and RowMutation objects, which are internal classes. So really, what we're doing with the 2.0 version of triggers is putting up a big red sign that says: hey, this is experimental. We're doing kind of an iterative design process on this, and we want to get some feedback on what this is useful for.
We'll see what usage patterns emerge from this and iterate on that for 2.1, but in the meantime, when you absolutely need to push some calculations or notifications onto the server, this does let you do that. I will be available for follow-up discussion or questions in three places today.