Description
Speaker: Jonathan Ellis, Apache Cassandra Project Chair and CTO/Co-Founder at DataStax
Slides: http://www.slideshare.net/planetcassandra/c-summit-eu-2013-keynote-by-jonathan-ellis
Keynote Presentation on Cassandra 2.0 & 2.1 by Jonathan Ellis at Cassandra Summit EU 2013
Hey, I'm Patrick McFadin. A lot of you know me already, and you probably hear me talk all the time anyway, so I apologize for that. Last time I talked to you I was not yet, but now I am, chief evangelist, which is really cool, because what I like to talk about all the time is Cassandra. The evangelist team is here today: it's me and Al Tobey. Some of you may know Al from his talks.
He's probably one of the best performance people when it comes to Cassandra. He's around today, so come find us if you want to talk to us; we're here to help you. It's all about community. This year we've seen so much community growth, and that is really exciting. We did our summit in San Francisco with over a thousand people there (I think a lot of you were there), and now we're doing this in Europe.
We're just going to keep the love going, so today we're going to be doing a lot of talks, but we're doing some announcements as well. One of the announcements is the DataStax Academy for Apache Cassandra. Yesterday I said I was going to put the link out here; I apologize for the wall of text. This for me is really exciting, because this is where we're going to really start seeing people use Cassandra effectively, I think. We have online training, and it's free.
We have a lot of content on there. I was personally involved in this, and there were several of us in the community who were involved in creating this great content. Please sign up for it; we really want to get a lot of people on there. A hundred thousand registrations by 2014 is pretty ambitious.
A couple of other things, a little housekeeping: tonight after the talks we're going to be doing a reception, but we're also going to be doing lightning talks. If you haven't seen the lightning talks, at the summit in San Francisco they were so much fun, and we have a great lineup for tonight too. So stick around; don't just take off after five o'clock. Plus, you don't want to get on the tube at this time anyway, so stick around and have a beer. The other thing we're going to be doing tonight is our MVP announcements.
We didn't do that earlier this year, so we're going to do it today, and it's really exciting, because we really like to recognize people in the community, and that's what the MVPs are about. I ran through my list and almost caught us up on time, so I think I'm done talking. You're here to listen to Jonathan talk; he's going to be talking about Cassandra 2.0. Of course, Jonathan is the project chair for Apache Cassandra, CTO at DataStax, and pretty much the man when it comes to Cassandra.
Now, the one problem with the study that the guys in Toronto did was that it didn't include MongoDB. MongoDB is kind of playing in a different part of NoSQL; they're more in the rapid-development part than the scalable-systems part. But a lot of people out there think: I've heard of NoSQL, I've heard of MongoDB, and I want to know how it compares to Cassandra performance-wise.
So we actually repeated it: DataStax hired a company called End Point, which works on these kinds of things, to repeat the Toronto study with HBase and with MongoDB, and again we see Cassandra about four times faster than HBase, which is the line kind of in the middle there. It's a reproducible result; that's exactly what the Toronto guys saw, so we felt good that we were doing it right. Both of these benchmarks were done with an open-source benchmarking
suite called YCSB, which is available on GitHub, so we felt good that we were getting a reproducible result. MongoDB, you can see, is roughly 20 times slower than Cassandra once you start throwing a lot of data at it, and that's actually an important caveat, because if you look at the MongoDB site, all the numbers that they give
are from data sets that fit a hundred percent in memory. When you get into the real world, people don't want to have to buy enough RAM to fit their entire data set in memory. You want to fit your hot data in memory, but there's no sense in spending extra money on keeping your cold data in RAM as well. So I actually told the End Point guys: make sure you benchmark
a data set where the hot data fits in memory, but with more cold data than will actually fit in memory. That's why this graph looks very different from the ones you might see where someone benchmarked a toy data set: hey, the database is fast! Well, it's fast until you get outside of memory, and then things really start going through the floor.
These are some experiences that people have had with Cassandra in the field, speaking to the reliability and availability story. In the bottom center here you have Nathan Milford from Outbrain, who lost a data center when Hurricane Sandy hit the east coast of the United States; Cassandra powered through. Cassandra also deals well with small outages, not just really big ones, so that's Jake Luciani in the lower right talking about Cassandra dealing with individual disk failures and managing that as well.
So what we've been focusing on this year has been a new, fourth core value: really making Cassandra easy to use, and bringing that power to people who are interested in using Cassandra and need a scalable database, but don't want to live on the bleeding edge. They don't necessarily want to have to dig into the codebase to figure out
what's going on. So ease of use and developer productivity are really what we want to focus on, and what we've been focusing on this year. That comes in two forms. One is on the right here: the Cassandra Query Language. Up until Cassandra 1.2, in January this year, the main way you accessed Cassandra was with an RPC protocol called Thrift, which was kind of clunky and kind of painful. So we introduced the Cassandra Query Language, which is basically a subset of SQL.
That's part of ease of use, but another part is making Cassandra more forgiving, making it harder to do the wrong thing, and I'll talk about some examples of how we've been working on that. As part of bringing out CQL, the Cassandra Query Language, we've also introduced a new native protocol, one that's native to CQL.
It gives you improved performance, and it gives you asynchronicity and push notifications out of the box. We work on that kind of in parallel with the server piece, so the Java and .NET drivers are both production-ready, the Python driver is in beta, and we're working on others. There's a talk later today on the Ruby driver for CQL, and the author of that driver will be talking about its low-level details. So if you're looking to get your hands dirty with some pretty low-level code, I would definitely recommend checking that out.
One of the important things we've introduced is tracing. We actually support it in the old Thrift protocol as well as CQL, but it's so easy to use with CQL that I want to call it out here. For those of you who have been using Cassandra for a while: this isn't new in 2.0, it was actually introduced in 1.2, but a lot of people don't know about it yet, so I want to keep mentioning it, because it's such a powerful tool when you're trying to figure out
what's going on under the hood with my Cassandra cluster. What this does is something like EXPLAIN ANALYZE from the PostgreSQL world: it shows you what's going on when Cassandra executes my query and how long each of those steps takes. Here I just have a simple insert statement, nothing fancy, but (I've color-coded this) you can see the blue at the top.
This is the node that the client talks to and gives the request to. Cassandra then has to send that insert off to some other machines; you can see that lower down. Those other nodes acknowledge the write back to the coordinator, which is what the client is talking to, in blue again, and then it hands that acknowledgement back to the client. In a distributed system, obviously, you can't just tail your log to see
what's going on; you need to assemble pieces from all the different machines that participate in the request, and that's what this does for you, in a very nicely packaged, easy-to-use format. Now, this is one way you can use tracing: just from the CQL shell you can turn tracing on and run your query. You can also trace programmatically, and you can say: I want to trace 0.1 percent of my requests and save those out. All the traces are saved to a system table.
Another feature that we introduced earlier this year that a lot of people don't know about yet is authentication and authorization. You can enable authentication, and Cassandra will make sure that you have the proper credentials to access the cluster. Out of the box in open-source Cassandra we have the password authenticator, which stores your users and your hashed passwords natively inside the Cassandra cluster. In DataStax Enterprise we also offer the Kerberos authenticator, which gives you access to enterprise single sign-on.
B
So,
as
John
mentioned
earlier,
datastax
enterprise
is
now
free
for
startups,
so
definitely
something
to
keep
in
mind
there,
and
so
this
is
what
that
looks
like
in
c
ql
that
I
can
create
users.
I
can.
I
can
change
their
passwords
as
the
superuser.
I
can
drop
them.
This
is
actually
something
you
want
to
do
something
like
this.
If
you
enable
authentication
by
default,
Cassandra
just
still
lets
everyone
just
connect
to
the
question:
do
whatever
they
want.
B
If
you
enable
authentication,
Cassandra
creates
a
default
super
user
whose
username
is
password
who's
in
user
name
is
Cassandra
and
who's
password
is
Cassandra.
So
so
you
know
this
is
this
is
kind
of
the
the
bootstrap
user?
So
what
you're
supposed
to
do,
then?
Is
you
either
create
a
new
super
user
and
drop
the
existing
one
or
you
change
the
password
on
the
existing
one?
That's
that's
an
important
thing
to
do.
B
We
also
support
authorization
and
once
I've
got
these
users
authenticated
splitting
out
what
data
and
they're
allowed
to
access
and
modify,
and
so
I
I
can
do
that
on
the
keyspace
level
on
the
table
level
and
read
and
modify
our
separate
permissions.
So
so
pretty
straightforward
paradigm
is
what's
what
we're
familiar
with
from
the
relational
world.
So, moving now into the Cassandra 2.0 world: one of the marquee features of Cassandra 2.0 is lightweight transactions. Here's what we've seen over the last five years as people deploy Cassandra applications. Cassandra is an eventually consistent system, or tunably consistent if you prefer, and what that means is there's no concept of locking; there's no way to say: while I'm doing this operation, nobody else is allowed to touch this data. That's great for ninety-nine percent of your application.
B
But
it
turns
out
that
that,
for
a
lot
of
op
applique
patient's,
there's
a
there's,
a
small
piece
that
will
need
some
kind
of
linearize
ability
a
way
to
separate
and
isolate
operations
from
other
things
going
on
in
the
cluster,
and
so
what
we
saw
is
that
that
people
needed
these
so
badly
that
they
were
willing
to
do
unnatural
things
like
integrate,
zookeeper,
locking
on
top
of
cassandra
to
get
that,
and
so
it
this
this
setup.
You
know
this
cuz
operational
problems
is
caused
development
complexity.
B
So
we
wanted
to
solve
this
problem
and
so
in
a
nutshell,
as
an
example
of
the
kind
of
race
condition
that
this
solves
is,
you
know,
consider
creating
user
accounts
where
I
allow
users
to
specify
their
own
username.
If
I
have
two
people,
both
asking
Cassandra
hey
does
this
does
username
JB
Ellis
exist
and
they
both
asked
that
at
the
same
time
and
cassandra
says
no,
it
doesn't
exist
yet
no,
it
doesn't
exist
yet
and
then
they
both
go
ahead
and
try
and
create
and
insert
that
row.
B
Then
one
of
them
is
going
to
stomp
on
the
other
and
so
I've
basically
introduced
application
level
corruption,
because
those
operations
were
not
isolated
and
and
Patricks
going
to
talk
in
more
detail
about
this
in
his
data.
Modeling
talk
later
on
today
in
this
auditorium,
but
you
know
so.
The
short
version
is
that
lightweight
transactions
solves
this
problem
for
you.
So
we
implement
this
using
a
consensus.
Protocol
called
paxos
apexis
has
a
reputation
for
being
difficult
to
use
difficult
to
implement.
It's
actually,
not
that
bad.
There's
a
paper
called
paxus
made
simple.
B
Most
of
the
complexity
in
paxos
comes
from
trying
to
elect
a
master
that
then
does
most
of
the
paxus
operations.
But
then,
if
the
master
fails,
you
need
to
do
a
reelection
and
we're
not
doing
that
we're
keeping
it
fully
distributed
system
with
with
no
master
and
doing
paxos
at
a
purely
peer
to
peer
level,
actually
isn't
so
bad.
So
this
gives
us
the
properties
that
that
were
used
to
with
Cassandra
that
you
know
if,
if
a
node
fails,
your
life
goes
on
as
long
as
image.
B
As
long
as
we
still
have
a
majority
of
replicas
alive,
we
can
still
make
progress.
So
that's
the
that's
the
world
we
want
to
live
and
we
don't
want
to
compromise
availability
to
deliver
this
now.
The
trade-off
is
that
we
do
compromise
on
performance,
so
in
particular,
we
need
for
round
trips
to
do
a
lightweight
transaction
versus
one
for
a
normal
update.
So
no,
roughly
speaking,
you're
going
to
have
a
quarter
of
the
operations
per
second,
then
then
you
would
with
normal
updates.
B
So
what
you
know
the
the
design
goal
here,
then,
is
to
support
that
one
percent
that
really
needs
it
and
you
know,
continue
to
build
the
rest
of
your
application,
using
eventual
consistency.
Now,
when,
when
people
hear
that
word
eventual
consistency,
you
know
they
start,
you
know
getting
scared
and
nervous,
and
what
does
eventual
mean?
Well,
eventual
means
that
you
know
it's
going
to
be
replicated
within
a
handful
of
milliseconds.
B
So
the
syntax
for
lightweight
transactions
as
you've
got.
We've
got
two
different
cases
here:
we've
got
on
insert
and
on
update
both
of
them
use
a
new
if
clause
in
c
ql
for
inserts.
It's
really
simple,
because
all
we
can
do
is
you
know,
insert
this
if
it
doesn't
already
exist
so
that
that's
super
straightforward.
So
it's
just
if
not
exists
on
the
bottom.
here we have an update, and that can actually be a little more complex, because I can specify an arbitrary number of columns in my IF clause. They all have to be in conjunction, with AND; there's no OR involved here. But I can have as many columns in my IF clause as I want, and then I can update. Here I'm updating the same column that I'm checking in the IF clause, but that's not required; I can update arbitrary columns as well.
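The check-then-insert race described above can be sketched without a cluster. Below is a toy in-memory model (the `Table` class and its method names are invented for illustration, not a driver API); the point is that the naive protocol does the existence check and the write as two separate steps, while `insert_if_not_exists` does both as one atomic step, which is the semantics `INSERT ... IF NOT EXISTS` gives you server-side.

```python
class Table:
    """Toy in-memory table standing in for a Cassandra partition."""
    def __init__(self):
        self.rows = {}

    # Naive two-step protocol: check, then write. Racy under concurrency.
    def exists(self, key):
        return key in self.rows

    def insert(self, key, value):
        self.rows[key] = value  # blindly stomps any concurrent writer

    # Lightweight-transaction style: decide and write in one atomic step.
    def insert_if_not_exists(self, key, value):
        if key in self.rows:
            return False  # "[applied]: False" in CQL terms
        self.rows[key] = value
        return True

# Interleave two clients using the naive protocol: both check first,
# then both write, and the second silently overwrites the first.
users = Table()
a_sees_free = not users.exists("jbellis")
b_sees_free = not users.exists("jbellis")
if a_sees_free:
    users.insert("jbellis", "client A")
if b_sees_free:
    users.insert("jbellis", "client B")   # stomps client A's row

# With the atomic version, only one client can win the race.
users2 = Table()
a_won = users2.insert_if_not_exists("jbellis", "client A")
b_won = users2.insert_if_not_exists("jbellis", "client B")
```

Under the same interleaving, the naive table ends up with client B's row and client A never finds out, while the atomic version rejects the second insert.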
Another new feature in 2.0 is triggers, and this is straightforward until it's not. What I mean by that is that on this slide I have CREATE TRIGGER, very straightforward, but what you see here is that I'm giving it a class name. That's the part that's less straightforward: you have to write Java code, or something that compiles down to a Java class, to implement the trigger. The reason is that
B
We
have
people
are,
and
companies
now
that
have
Cassandra
clusters
of
over
a
thousand
nodes.
So
so,
while
I
said
that
we've
been
focusing
on
ease
of
use
this
year,
we
haven't
forgotten
about
the
people
who
are
really
pushing
the
envelope
in
terms
of
scalability
and
performance,
and
when
you
have
a
thousand
no
database,
it's
actually
worth
a
bit
of
complexity
to
get
that
extra
performance
from
pushing
some
some
logic
closer
to
your
data.
So
the
trigger
allows
you
to
know
based
on
Rose
being
modified.
B
So
so
at
that
scale,
performance
does
matter
and
and
people
are
willing
to
write
some
java
code.
If
that's
what
it
takes
to
to
get
that
extra
performance.
That
said,
for
4
99
percent
of
the
people
in
this
room,
you
know
we're
putting
big
scary
quotes
up
for
triggers
and
saying
you
probably
shouldn't
use
this.
Yet
we
are
we.
We
know
for
a
fact
that
we're
making
changes
here
in
two
dot,
one,
it's
not
going
to
be
backwards
compatible.
Okay, now give me the next hundred. You can see that, since I have a compound primary key here, the logic for "give me the next hundred" starts to get slightly painful, because I have to say: give me the tweets from the next partition, or give me the tweets from the current partition that I haven't seen yet. The more components I have in my compound primary key, the more clustering columns I have, the more clauses I need in this paging syntax here.
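That manual "next hundred" logic from before cursors can be sketched with a toy model where the timeline is a sorted list of (partition key, clustering column) tuples; `next_page` is a hypothetical helper, not a driver call. Tuple comparison stands in for the "later in this partition, or in a later partition" clauses you would otherwise have to spell out in CQL.

```python
from bisect import bisect_right

def next_page(rows, last_seen, page_size):
    """Manual paging over a compound primary key.

    rows: list of (partition_key, clustering_key) tuples, sorted the way
    Cassandra would return them. last_seen: the last tuple the client
    consumed, or None for the first page. The bisect mirrors the
    two-clause query: rows later in the current partition, OR rows in
    any later partition.
    """
    if last_seen is None:
        return rows[:page_size]
    start = bisect_right(rows, last_seen)
    return rows[start:start + page_size]

# A tiny timeline: two partitions ("day1", "day2"), five tweets each.
timeline = [(p, c) for p in ("day1", "day2") for c in range(5)]
page1 = next_page(timeline, None, 3)
page2 = next_page(timeline, page1[-1], 3)  # crosses a partition boundary
```

Note how the second page has to straddle the partition boundary; with server-side cursors all of this bookkeeping disappears into the driver.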
So cursors make that simple: you just say SELECT * FROM timeline, and Cassandra handles grabbing more rows for you after you've consumed the first hundred or so. This actually preserves your state across servers, so if the Cassandra node that you're talking to goes down and you have to fail over to a different one, the cursor still lets you resume where you left off in your result set. Very cool in that respect.
We have some minor improvements to CQL as well. We added a special case for SELECT DISTINCT. We still don't support SELECT DISTINCT in the general case, but as a special case we allow you to say SELECT DISTINCT partition key. That's something we can do very efficiently: we don't need to do a sort and merge, we can just grab the partition keys from the data files on disk and give those back to you. It's a very performant operation.
So we were okay with giving that out as kind of a baby step there. CREATE TABLE IF NOT EXISTS is just syntactic sugar, so you don't need to do the dance of: hey, check if this table exists; if it doesn't exist, go ahead and create it, otherwise don't bother. Then there's SELECT ... AS: we've been introducing some functions into CQL, so it's become more important to allow you to create aliases for the results.
It left your actual data sitting on disk, so for 1.2 we removed DROP COLUMN entirely. For 2.0 we added it back the right way, which is that when I drop a column, not only does it drop the metadata, but as my data files get compacted, Cassandra will omit the column that you dropped. It's not going to go through and start scrubbing your data files and cause a big impact to your cluster, but it will evict that data lazily, as compaction happens naturally.
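That lazy eviction can be sketched as a toy merge, assuming data files are simple dicts of rows keyed by partition key; `compact` is invented for illustration, not Cassandra's actual compaction code.

```python
def compact(sstables, dropped_columns):
    """Toy compaction: merge rows by key (later files win) and lazily
    omit columns that have been dropped from the schema. Mirrors how
    2.0 evicts dropped-column data as files are rewritten, rather than
    scrubbing every file eagerly at DROP COLUMN time."""
    merged = {}
    for sstable in sstables:            # oldest first; newer values win
        for key, columns in sstable.items():
            merged.setdefault(key, {}).update(columns)
    return {
        key: {c: v for c, v in row.items() if c not in dropped_columns}
        for key, row in merged.items()
    }

# Two generations of a row; the "fax" column was dropped from the schema.
old = {"alice": {"name": "Alice", "fax": "555-1234"}}
new = {"alice": {"name": "Alice A."}}
result = compact([old, new], dropped_columns={"fax"})
```

Until this compaction runs, the dropped column's bytes are still on disk; afterwards the rewritten file simply no longer contains them.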
I mentioned that our goals for ease of use were partly on the developer productivity side, but also partly on the operational side, making Cassandra more forgiving operationally. One of the things we concentrated on for 2.0 was moving our storage engine internals off of the JVM heap. The reason that's important, and I've tried to diagram it here, is that the amount of memory you can allocate to the JVM is roughly about 8 gigabytes before you start getting really onerous garbage collection pauses.
But we take care of that, so you don't have to worry about the heap issues so much. I'm going to give a quick overview of what happens when you read some rows from Cassandra, to explain what we've done here. On a per-SSTable basis, per data file, when you ask for rows from your users table, first we're going to check a Bloom filter, which is a probabilistic set.
B
That
tells
me
with
high
confidence
if
I
don't
have
any
rose
for
the
partition
you're
asking
for.
So
if,
if
the
bloom
filter
says
there
are
no
rows
here
that
I'm
done.
If
the
bloom
filter
says,
there's
probably
some
rows
here,
then
I'll
continue
on
to
the
to
the
partition,
key
cash,
and
so
that's
going
to
say
if
I
did
have
a
if
I
do
have
a
partition
here.
This
is
going
to
tell
me
where
that
partition
begins
in
the
data
file.
If I hit in the cache, then I follow this dotted line here around to the left side and skip the next part, because I had a cache hit. If I had a cache miss, then I go to the next step, the partition summary. The partition summary is a sample of the partition keys in the data file, and what it does is let me do a binary search against it to see where on disk I should start looking for this partition. Once I've done that, I go to the first place on disk. I've got this dotted line down the middle: above the line is in memory, and we've stayed in memory so far. My first seek to disk is to check the partition index, which has all of the partition keys indexed in it, and I do a sequential scan starting at the place the summary pointed me to, to find the exact position of the partition.
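The in-memory half of that read path (Bloom filter first, then a binary search over the sampled partition summary) can be sketched like this; the bit count, hash count, and sampling interval are toy values, not Cassandra's actual defaults.

```python
import hashlib
from bisect import bisect_right

class BloomFilter:
    """Tiny Bloom filter: may say "maybe present" for an absent key,
    but never says "absent" for a key that was added."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        return all(self.field >> p & 1 for p in self._positions(key))

def summary_offset(summary, key):
    """Binary-search the sampled (key, offset) pairs to find where on
    disk to start scanning the partition index for this key."""
    keys = [k for k, _ in summary]
    i = bisect_right(keys, key) - 1
    return summary[max(i, 0)][1]

# One SSTable's worth of in-memory state:
bf = BloomFilter()
for k in ("alice", "bob", "carol"):
    bf.add(k)

# Every Nth partition key is sampled, with its disk offset.
summary = [("alice", 0), ("carol", 4096)]
```

A read first asks `bf.might_contain(key)`; only on "maybe" does it fall through to `summary_offset`, and only then does it touch disk.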
What we find is that what you give up in terms of CPU, you make up for in terms of having that much more data cached in RAM. So even in terms of performance you come out ahead: you get better performance as well as reduced disk space. That's why compression is on by default; you can disable it, but it's on by default. Once we've got the compression offsets, we can finally go and read that data from disk.
The three things that we moved into native memory are the Bloom filter, the compression offsets, and the partition summary. What's critical about these is that all of these structures grow linearly with the amount of data you have in Cassandra, so you can see how this can catch you by surprise: everything's working fine, everything's working fine, until now I have two terabytes of data instead of one, and I start hitting large garbage collection pauses as the JVM valiantly struggles to find enough space to carry on, because these structures have grown so large. Moving those into native memory basically means I don't need to worry about that so much.
As long as I have enough RAM proportional to the amount of disk I have, I'll be in good shape, and that's something most people are in good shape about. Now, the last part, which I didn't circle here, the partition key cache, is still on heap, but that's a fixed amount of space. You can tell Cassandra: I want to give the key cache one gigabyte, or half a gigabyte, and it will limit itself to that and use an LRU policy to evict the oldest entries as you pull in new data. That works very well with the Cassandra goal of keeping my hot data in memory and my cold data on disk.
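A minimal sketch of that bounded LRU behavior, using a capacity measured in entries rather than bytes for simplicity; `KeyCache` is invented for illustration, not the real implementation.

```python
from collections import OrderedDict

class KeyCache:
    """Fixed-budget partition key cache with LRU eviction: hot keys
    stay resident, cold keys fall out as new ones come in."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # partition key -> disk offset

    def get(self, key):
        if key not in self.entries:
            return None                # miss: fall back to the summary
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key, offset):
        self.entries[key] = offset
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the coldest entry

cache = KeyCache(capacity=2)
cache.put("alice", 0)
cache.put("bob", 512)
cache.get("alice")          # touch alice, so bob is now the coldest
cache.put("carol", 1024)    # over budget: bob is evicted
```

The access pattern decides who survives: touching a key before the cache fills keeps it in, exactly the hot-in-memory, cold-on-disk behavior described above.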
One other thing we improved on the operations side is a couple of improvements to compaction. One of those is that compaction is now always single-pass. In 1.2 and earlier, if you had a large enough partition, we would have to do two passes to compact it, because of some details of our data storage format. We actually changed the format to be able to do single-pass compaction and save you that performance hit when your partitions get large, which they sometimes do. The other improvement is to leveled compaction.
For those of you not super into the nitty-gritty of Cassandra operations: LCS stands for leveled compaction strategy, and STCS stands for size-tiered compaction strategy. Size-tiered compaction strategy is the default. It's more I/O-efficient, but it does a less good job of evicting old, obsolete data from your data files, and it will tend to give you less good read performance than leveled compaction.
So leveled compaction is a popular option, but it had some poor behavior under some corner-case scenarios, and that's what we're addressing here. What leveled compaction does is guarantee that in any level above level 0, only one data file will contain data for a given partition.
So if I want to read some rows from a partition, I will have at most one data file I need to hit per level, plus any data files containing that partition in level 0, because level 0 is where newly flushed data files appear before they've been organized into levels; those could potentially contain multiple entries for my partition. The problem is that it's actually relatively easy to have a write spike
B
That
Cassandra
can
accept
and
write
to
disk
faster
than
compaction
can
push
this
out
into
the
levels.
So
what
I
end
up
with
is
a
scenario
like
this,
where
now
I'm
having
to
check
hundreds
of
SS
tables
for
this
partition.
So
what
because
compaction
gets
behind
my
Reed
performance
gets
worse,
so
now
I'm
doing
more.
I
owe
on
Reed's,
which
means
there's
even
less
I
owe
to
do
compaction,
so
compaction
gets
more
behind
and
you
can
get
into
kind
of
a
vicious
cycle.
B
There,
so
what
we're
doing
in
20
now
is
when,
when
the
level
0
compaction
falls
behind
is
will
actually
do
sighs,
tiered
compaction,
which
is
which
is
significantly
faster
and
so
that
it's
not
going
to
give
me
it's
not
as
good
as
if
I
had
all
that
data
merged
into
the
levels,
because
I'm
still
doing
those
three
extra
reads
or,
however
many
extra
reads
in
my
size
tiered
component,
but
it's
a
lot
better
than
doing
hundreds
of
extra
reads.
So
this
is
not
a.
B
This
is
not
a
silver
bullet
that
makes
level
compaction
a
ton
faster
that
you
can,
just
you
know,
beat
the
hell
out
of
it
with
you
know,
without
caring
about
how
level
compaction
is
doing,
but
what
this
does
do
is.
Is
it
let's
level
compaction,
tolerate
spikes
in
the
right
load
and
say:
okay,
I'll,
just
sighs
tier
these
up
when
the
when
the
right
load
dies
down
again,
then
I'll
be
able
to
get
back
to
pushing
those
out
into
the
levels
and
catch
up.
B
So
if
you're,
you
know
a
hundred
percent
writing
as
fast
as
your
cluster
can
tolerate
it,
then
sighs
tiered
compaction
is
still
really
the
the
only
option.
But
if
you
have
the
workload
where
your
level
compaction
can
mostly
keep
up,
but
you
do
have
those
brief
spikes
in
extra
right
activity
than
this
helps
out
a
lot.
The last operational improvement in 2.0 is what we're calling rapid read protection. When you tell Cassandra to do a read, it will send out as many requests to replicas as it needs to satisfy the consistency level you're asking for. If you have three replicas and you do a quorum read, it will send out requests to two replicas; if you are reading at consistency level ONE, it will send out a single request.
What that implies is that to successfully finish that read, it needs all the replicas it sent requests to to actually respond. So if a node dies before the failure detector notices and stops sending it requests, you get this blue line here that goes down to the bottom: those are requests that are timing out. They were sent to a replica, the replica died, and the coordinator waits for those requests while nothing happens.
So what we're doing in 2.0 is introducing rapid read protection, so that the coordinator doesn't just wait until the request times out. What it will do is wait until the request is taking longer than most requests do. By default, starting in 2.0.2, which will be released probably next week, this is enabled at the 99th percentile level.
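The coordinator's behavior can be sketched as a small deterministic model; the millisecond values and the `read_latency` helper are invented for illustration, and a latency of `None` stands for a replica that died and never responds.

```python
def read_latency(replica_latencies, p99, client_timeout):
    """Model one read with rapid read protection at consistency ONE.

    The coordinator sends to replica 0; if no response has arrived by
    the p99 threshold, it sends a redundant request to replica 1 and
    takes whichever response arrives first. Returns the completion
    time, or None if the read times out at the client."""
    first, second = replica_latencies
    if first is not None and first <= p99:
        return first                    # common case: no retry needed
    candidates = []
    if first is not None:
        candidates.append(first)        # slow, but it does respond
    if second is not None:
        candidates.append(p99 + second) # retry starts at the threshold
    done = min(candidates, default=float("inf"))
    return done if done <= client_timeout else None

# Replica 0 died mid-request: without the retry this read would simply
# time out; with it, it completes shortly after the p99 threshold.
rescued = read_latency((None, 4), p99=10, client_timeout=1000)

# Replica 0 is merely slow: the redundant request still wins the race.
raced = read_latency((50, 2), p99=10, client_timeout=1000)
```

In the first call the read finishes at 14 ms instead of never; in the second, the redundant request beats the straggler.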
What we care about is finishing the request, so I'm going to make a duplicate request, a redundant request, to another replica, so I can get the request done before the client timeout expires. What you see at the top there are a bunch of different options, depending on how aggressive it is. One of the lines at the top is the 99th percentile; that's right next to no read protection at all. The line at the bottom there, the orange line, is always doing a redundant request.
You can see that that has an impact on my throughput, when I'm always doing a redundant request; the next line up from it is doing one seventy-five percent of the time. So there is an impact on throughput, but at the 99th percentile, where I'm only doing it one percent of the time, the impact on throughput is negligible, and it still saves me from having to fail some requests when I lose a node. So it seems like a good default to have.
Another value that might make sense would be taking that down to the 90th percentile, so doing extra requests ten percent of the time. The reason you might want to do that is that it's going to help my 99th percentile latency. I didn't have room here for another slide showing the impact on latency, but this also smooths out the latency curve, the latency distribution, as well. If I'm only doing redundant reads one percent of the time, that's not going to help my 99th percentile latency; it will help my 99.9th percentile latency, but not my 99th. If I bring the read protection down to be more aggressive, at the 90th percentile, then that will flow into a better latency distribution at the 99th level, which is something that a lot of people doing monitoring care about.
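Why a 90th percentile trigger helps the 99th percentile can be seen in a toy latency model: if every straggler gets a redundant request started at the p90 threshold, the tail collapses toward the threshold plus a typical retry latency. The numbers below are made up for illustration, not taken from the benchmark on the slide.

```python
def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[min(int(len(ordered) * p), len(ordered) - 1)]

def with_speculation(latencies, retry_latencies, threshold):
    """If a request exceeds the threshold, a redundant request (with its
    own latency) starts at the threshold; take whichever finishes first."""
    return [
        l if l <= threshold else min(l, threshold + r)
        for l, r in zip(latencies, retry_latencies)
    ]

# 100 requests: 95 take 5 ms, 5 stragglers take 500 ms.
base = [5] * 95 + [500] * 5
retries = [5] * 100          # a retry behaves like a typical request
p90 = percentile(base, 0.90)
improved = with_speculation(base, retries, p90)
```

With the p90 threshold (5 ms here), every straggler is cut off at threshold plus a normal retry, so the p99 drops from the straggler latency to roughly twice the median.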
Moving on to 2.1: we're looking at delivering 2.1 in January of 2014, and we've got a pretty good idea of what we're looking to deliver there. One of those things is user-defined types. We kind of started giving you access to nested data with collections in CQL already, and user-defined types take that to the next level, because I can create a type, like the address type here.
Here we have a simple example, with just one level of nesting, where I've got the addresses inside a map, and then I can pull those out with a fairly simple extension to CQL, using dot notation to access fields within the type. The main reason I've done it this way is so that it fits on my slide here; I could just say SELECT * and still get everything back, but then the wrapping gets ugly.
Another thing that we're tackling, more on the low-level operational side, is making our Bloom filters more memory-efficient, in particular when I'm doing compaction. This is something a lot of people don't know about, actually: when I'm doing compaction, I've got this data file here, and for it I have this Bloom filter.
B
So
what
we
do
now
is
we
allocate
the
bloom
filter,
pessimistically
and
say
you
know
this.
This
is
how
large
it
will
need
to
be
if
there's
no
overlap
at
all,
and
so,
if
it
turns
out
that
there
is
a
lot
of
overlap,
then
we
end
up
with
this.
This.
This
bloom
filters
that's
larger
than
its
need
than
it
needs
to
be.
So
what
we
want
to
do
is
we
want
to
size
it
appropriately.
So
what
we're
using
in
20
a
two
dot?
B
sorry, is a cardinality-estimation technique called HyperLogLog. What that does is track a sampling of the distribution of partition keys in each SSTable, and that lets us calculate, to within five or so percent accuracy, how much overlap there actually is in the partitions of those SSTables before I actually go ahead and merge them.
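A minimal sketch of the HyperLogLog idea in Python (illustrative only, not Cassandra's actual implementation): each hashed key lands in one of m registers, which remembers the longest run of leading zero bits it has seen, and rare long runs imply many distinct keys.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p              # 2**p registers
        self.m = 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                    # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)        # remaining bits
        rho = (64 - self.p) - w.bit_length() + 1  # position of the first 1 bit
        self.reg[j] = max(self.reg[j], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        e = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if e <= 2.5 * self.m and zeros:           # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e
```

Two sketches can also be merged register-by-register with max, which is what makes this cheap to keep per SSTable and combine at compaction time.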
The interesting thing about this is that we can apply this cardinality estimation to more than just bloom filters. So: I talked briefly about size-tiered compaction.
B
So, what we do right now (it's called size-tiered because of this) is we look for data files that are of about the same size and say: OK, let's merge you guys together into a larger one. And what cardinality estimation will allow us to do is say: OK, we can actually do better than just saying SSTables of about the same size probably overlap; we can actually go look at which SSTables actually do overlap, and do a smarter compaction based on that.
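A sketch of that bucketing step, with made-up sizes; real size-tiered compaction has more knobs, and the overlap check here is the 2.1-era idea being described, not shipped behavior:

```python
def size_tiered_buckets(sstables, low=0.5, high=1.5, min_threshold=4):
    """Group (size, name) pairs into buckets of roughly equal size,
    the way size-tiered compaction picks its merge candidates."""
    buckets = []
    for size, name in sorted(sstables):
        for bucket in buckets:
            avg = sum(s for s, _ in bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append((size, name))
                break
        else:
            buckets.append([(size, name)])
    return [b for b in buckets if len(b) >= min_threshold]

def estimated_overlap(card_a, card_b, card_union):
    """Keys shared by two SSTables, by inclusion-exclusion on
    (HyperLogLog-style) cardinality estimates."""
    return card_a + card_b - card_union

tables = [(100, "a"), (110, "b"), (95, "c"), (105, "d"), (1000, "e")]
print(size_tiered_buckets(tables))
```

With per-SSTable cardinality estimates, a strategy could rank candidates by `estimated_overlap` instead of trusting that similar size implies shared keys.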
B
This is the most hand-wavy of my slides here, because we've got code written for everything else that I've been talking about, but for this part about the compaction, we don't have any code written yet; we just have big ideas in our bug-tracking system. So we're planning on doing this for 2.1,
B
but don't send me hate mail if we have to push that out farther; it looks really good on paper, though. Another thing that we're working on, also kind of at the data-file level, is increasing repair efficiency. This is another place where we have an operation that grows linearly with the amount of data in the system. What I mean there is that our repair is highly network-efficient, because I'll go through your data files:
B
per replica, I'll build a hash tree (called a Merkle tree) that represents the data in that replica, and I'll compare that Merkle tree with the Merkle trees from my peers in the cluster that also replicate that range of data. So what I do is I'll compare the root of the tree, and if the roots of the trees are identical, then I know that all the data matches and everything's good. If the roots of the trees are not identical, then I'll look at the left
B
half of the tree, look at the hash at the root of that half, compare the right half of the tree, and wherever those halves don't match, I'll drill down further into that half. So at each step I cut the tree I'm evaluating in half, until I get to a leaf node that says: this is the partition, or the set of partitions,
B
that is missing on one of these replicas, and then I can send that over. So it's very network-efficient. But building this tree is not efficient: building this tree requires a sequential scan across all that data, so as I start to get into the terabyte range of data, that starts to get painful. So, you know, this is my illustration of: as that data set grows, I'm still having to sequentially scan that whole set to build my tree.
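The build-compare-drill-down loop described above can be sketched like this; the hash choice, the leaf granularity, and the power-of-two leaf count are simplifying assumptions, not Cassandra's actual repair code:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Hash each leaf (a chunk of partition data), then hash pairs
    upward; returns the levels from leaves (index 0) to root (last)."""
    level = [h(x) for x in leaves]       # leaf count assumed a power of two
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(mine, theirs):
    """Compare roots; on a mismatch, recurse into the left and right
    halves, returning the leaf indices that need to be streamed."""
    mismatched = []

    def walk(depth, idx):
        if mine[depth][idx] == theirs[depth][idx]:
            return                        # identical subtree: nothing to do
        if depth == 0:
            mismatched.append(idx)        # a leaf range that disagrees
            return
        walk(depth - 1, 2 * idx)          # left half
        walk(depth - 1, 2 * idx + 1)      # right half

    walk(len(mine) - 1, 0)
    return mismatched
```

If only one leaf differs, `diff` touches O(log n) hashes instead of re-streaming everything, which is the network efficiency being described; the expensive part is the sequential scan that `build_tree` stands in for.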
B
What I want to do is, as I've added that extra data, I want to be able to build the tree out of just that new data. I should be able to recognize that, hey, I've already repaired and made sure that all the replicas have the data on the left here; I shouldn't need to repair that again ad infinitum. So I'm just going to build the tree now for this new data.
B
So that's going to be a lot more efficient. So, our roadmap for 2.1, as I mentioned, is January, and we're looking to continue delivering ease of use, both on the improved-CQL-features side as well as on the operational side. So, how close am I to out of time? I've got two minutes. So we have time for two questions if we're fast.
B
You know, I'm in a different position than a lot of people in this room: I would be comfortable with 2.0.2 in production. For those of you who don't, you know, dig into the Cassandra code on a regular basis, I would definitely evaluate that in a staging environment. We're looking at getting it into DataStax Enterprise hopefully by the end of the year, so, yeah, we're going to be evaluating that internally as well.
B
The question was about aggregate functions in CQL and what the plan is there. I have mixed feelings about aggregate functions, because, I mean, you can special-case things and precompute them, the way Acunu does for their analytics, but for the ad hoc aggregation side of things, I'm not a hundred percent sure that's a good fit for what Cassandra is trying to do, because by its nature you're giving it a query that may involve a lot of sequential scans, a lot of sorting, a lot of merging.
B
So it's different from the problem we're solving now for online applications that need to do millions of similar requests per second, so, I have mixed feelings about it. That said, there are some Cassandra committers who are pretty gung-ho about doing this; Jake at Blue Mountain Capital is one. Their use case could use this and could get a lot of benefit out of it. So my guess is it will probably happen at some point; I don't know if it's going to happen in the 2.