Description
Speaker: Robbie Strickland, Software Development Manager at The Weather Channel
As a reformed CQL critic, I'd like to help dispel the myths around CQL and extol its awesomeness. Most criticism comes from people like me who were early Cassandra adopters and are concerned about the SQL-like syntax, the apparent lack of control, and the reliance on a defined schema. I'll pop open the hood, showing just how the various CQL constructs translate to the underlying storage layer--and in the process I hope to give novices and old-timers alike a reason to love CQL.
You may be one of the old-timers still using Thrift, and I think there's a lot to gain from moving. So we'll get the self-promotion out of the way real quick: I'm the software development manager at The Weather Channel. I work on the data services team, and my team is responsible for delivering back-end data.
Basically all the back-end data and analysis services for weather.com, the TWC mobile apps, and most of the other weather data services that you use without knowing you're getting them from us. We actually provide weather data for the vast majority of weather-related services out there: whether you use iOS 8, or Google Now, or Yahoo Weather, a lot of these other services actually come to us as well. So we do about 10 billion transactions a day, which is maybe not Twitter scale, but it's significant.
So if that's something that you're interested in, I'm actually really interested in hearing from people on that. It should come out about next year; maybe not this time next year, but a little earlier than this, hopefully. We'll see. So anyway, if you're interested in that, come see me afterwards. So there are really two target audiences for this talk. One is people like myself: long-term Cassandra users who haven't yet adopted CQL, or who don't really understand it.
They may really be comfortable with the nuts and bolts of Cassandra, but they don't really know what's going on with CQL. I actually fell into this category up until a little more than a year ago, at Cassandra Summit last year, when Sylvain (for those of you who were here and heard his talk yesterday) gave his talk on CQL. I found it to be compelling, and it caused me to take a second look, and so we have since moved.
So the other group is people who are new to Cassandra and have really never been exposed to anything but CQL. For these people it's really important to be able to understand what's going on under the covers, because it's easy to come from a relational background, see something that looks really familiar, and get sort of lulled in by that. So I want to do a brief survey.
Okay, well, that's good. So the rest of you are holdouts, so you're still using Thrift, I assume. And who falls into the second category? I presume that's everyone else: people who have not used anything but CQL. And then, by show of hands, who in the room feels like they really understand what's happening with CQL? You guys are in the right place, and so are all of your friends who didn't make it to this talk.
There is actually another talk that appears to be on the same topic later, which at first made me think: how did that happen? That must have been a logistical fail. But I think the topic is so important that I'm actually pretty happy that's the case, because I think everyone needs to hear this. So let's start with some definitions, so that we're on the same page.
First, if you're new to the game, you've certainly heard about Thrift, and this I guess will be helpful to a lot of the people in the room. So what is Thrift? Thrift is basically the old way of querying Cassandra. It's an RPC mechanism bolted onto a code generation tool that allowed us to talk to Cassandra in a sort of language-agnostic way, and then you had all sorts of client libraries that would basically create language-specific idioms.
My machine continues to go to screensaver, and I'm not sure why that's happening, I apologize; it's also advancing my slides, but that's okay. These are things that happen when you're standing in front of people that never happen when you're practicing, right? So let's see what's going on.
Okay, so by contrast, the native protocol is binary, it replaces Thrift, and it uses CQL by default. There are a lot of benefits to using the native protocol, which I'll talk about in a minute. So if you want to use the native protocol, and you want to use the new libraries, then you're essentially forced into CQL; I think that was probably not a mistake on the part of Jonathan Ellis and crew.
But so, if you want to use the native protocol, you're going to be moving towards CQL. There's also an important distinction between CQL and the storage layer, which is what all interactions prior to CQL were with. The storage rows consist of partition keys and their corresponding columns, basically exactly as they're stored on disk. If you're a Cassandra veteran, you dealt with those directly; that's all that you knew. CQL, on the other hand, actually introduces an abstraction layer on top of the storage rows, and it's really only a direct mapping to the storage rows in the very simplest of use cases. So, unlike storage rows, which have no schema, CQL actually requires you to define a schema; but the data is still stored sparsely, so your null columns are actually just simply not there on disk. So let's do a quick history lesson.
In the old days, as in a year and a half ago, we had lots of clients with a bunch of different APIs, like Hector, Astyanax, Pycassa, and Cassie, just to name a few, depending on what language you were using. There were quite a few just for Java; the list is pretty extensive. So that meant we actually had no common language for talking about our schemas or queries, and it was not uncommon to put a lot of investment into a given library only to see it abandoned.
So the other thing is that Thrift didn't support cursors, and this may not seem like a big deal, except that you've got a big data database: everything had to be materialized into memory on both the client and the server. This is a problem when you're dealing with lots and lots of rows of data, especially considering that a bad query can take down both your application and a Cassandra node. That's really bad. So that's one of its fundamental flaws.
It was also hard to add new features because of the limitations of Thrift, and when new features were added, we actually had quite a bit of lag time between when the feature was added and when it was adopted in the client libraries. So these were actually significant obstacles, and to me it's a testament to Cassandra and its capabilities that it survived through those days.
So the solution, in my opinion, and in the opinion of the committers to Cassandra, is CQL. Now we actually have a common, familiar interface for communicating about our Cassandra data. But the reaction to this was probably not exactly what the developers intended; and, you know, I already gave it away because of my glitchy little laptop here. So this was my reaction, as a matter of fact, as Jonathan Ellis said in last year's keynote.
Many of us, like myself, were left saying: who put this SQL in my NoSQL, right? At one point I even questioned whether Jonathan and crew had gone off the deep end, and whether maybe I should look elsewhere to a different database. But on the opposite front, now we have a bunch of new adopters coming out of the woodwork, and they think they can hit the ground running with what appears to be a familiar interface. Unfortunately, the reality is somewhere in between: both groups need a reality check. For veterans like myself, there's no need to panic, because CQL really is the answer to our Thrift pain (and it is painful; you don't really know how painful it is until you stop using it), and we haven't really lost anything, because the underlying storage is the same. Whereas for those new to Cassandra, which apparently is most of you in this room, CQL does improve the learning curve, which is substantial. I've brought some new people online since CQL, and it's a lot easier than it was before. But it's not SQL.
You still have to understand what you're doing, and know that you can't just index everything, right? So now that we've gotten all that out of the way, let's peel back the onion a little bit and look at what's happening under the hood. So here's a simple example: we create a table called books with title as the primary key, we insert some data, and then we do a basic select. At this point we could be looking at statements against MySQL or Oracle; it basically just looks like ANSI SQL.
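The slide contents aren't reproduced in the transcript, so here is a minimal sketch of the kind of statements being described; the column names are assumptions:

```sql
CREATE TABLE books (
  title text PRIMARY KEY,
  author text,
  year int
);

INSERT INTO books (title, author, year)
VALUES ('Patriot Games', 'Tom Clancy', 1987);

SELECT * FROM books WHERE title = 'Patriot Games';
```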
So let's see what's happening at the storage layer. The easiest way to do this is by looking at the old CLI, which is the predecessor to cqlsh and allows us to interact directly with the storage rows. This is actually functionally equivalent to what we did in the previous slide. We'll start by creating a keyspace, then a column family called books; and note that we don't actually have a schema we've defined. We simply tell Cassandra the types of our keys, column names, and column values, and then we start writing data.
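A rough sketch of the equivalent cassandra-cli session (the keyspace name and values are illustrative): note that there's no column schema, just validators for keys, column names, and values.

```
create keyspace Library;
use Library;

create column family books
  with key_validation_class = UTF8Type
  and comparator = UTF8Type
  and default_validation_class = UTF8Type;

set books['Patriot Games']['author'] = 'Tom Clancy';
set books['Patriot Games']['year'] = '1987';
```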
So, while we're looking at this, let's remind ourselves about some key data distribution principles in Cassandra. First, remember that the key is assigned to a random node using a hash algorithm, which is Murmur3 by default; and that's the one you should use, in case you're wondering. Do not use ByteOrdered, just as a little plug there. Therefore, the data is unordered.
On the other hand, the columns are ordered naturally by name. As you can see here, author comes before year lexicographically, so it's ordered first. So now let's look at a slightly more complex example. This time we have an authors table with a compound key, and a short listing of the contents of the table. The question is: what does it mean to have a multi-part key? I think this is maybe one of the most misunderstood things in CQL.
Well, when specifying a primary key in CQL, the first field is always the partition key: in this case, name. The remaining columns in the key are called clustering columns. So what are clustering columns? Well, they're simply column values that are sorted at write time, in the order specified.
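As a sketch, the table being described might look like this (the exact columns aren't shown in the transcript, so the non-key column is an assumption): name is the partition key, and year and title are clustering columns.

```sql
CREATE TABLE authors (
  name text,
  year int,
  title text,
  publisher text,
  PRIMARY KEY (name, year, title)
);
```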
So in this case the data is actually partitioned by name, then sorted by year, and then sorted by title. For the old-school people in the room, I'm going to show you what this looks like under the hood. Under the covers this uses composite column names which, if you've been using Thrift, you most likely make extensive use of; and since the column names are sorted naturally by default, this works.
We can also reverse the stored sort order by adding the WITH CLUSTERING ORDER clause. The result, as you can see here, is what we would expect, with 1993 coming before 1987; and again, this is what it looks like at the storage layer. While this may seem to be a bit of a trivial point, it can be quite important depending on the types of queries you want to run, so we'll look at it a little more closely later.
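The clause being described, applied to an authors-style table (table and column names assumed; the clause itself is standard CQL, and it can only name clustering columns, not the partition key):

```sql
CREATE TABLE authors (
  name text,
  year int,
  title text,
  PRIMARY KEY (name, year, title)
) WITH CLUSTERING ORDER BY (year DESC, title ASC);
```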
So we've essentially moved the year from a component of the column name to a component of the key. So why should you care about any of this? Well, there are actually several reasons. First, your queries have to respect the underlying storage. Cassandra does not allow the kind of ad-hoc queries in CQL that you can form in a relational system, so if you don't understand how the data is stored, at best you'll end up constantly frustrated by the error messages you receive, and at worst you'll suffer poor performance.
Second, you need to define your partition key properly, because it must be known at query time; this is something that I see as a common point of confusion with Cassandra. It must also distribute well across the cluster. So something like a timestamp would be a bad key, because it typically isn't known at query time; and a date is usually a bad key, because you'll end up doing most of your writes to only a subset of your nodes.
Third, because of its sequential, log-structured storage, Cassandra handles range queries very well. A range query simply means that you select a range of columns (columns being the keyword there) for a given key, in the order they're stored. And lastly, you have to carefully order your clustering columns, because this will affect the storage order of your data, and therefore they determine the kinds of queries you can perform.
So let's look at some examples of various queries using our authors table, with name as the partition key, and year and title as clustering columns; we'll also reverse the sort order on year. This is what our data looks like in CQL rows, so pretty straightforward. We'll start with a basic query by key. For all of these queries we're going to assume a consistency level of QUORUM with a replication factor of three. For this simple select, you can see that the query hits the coordinator node, which owns a replica for our key, and then it grabs the row from another replica node to satisfy the quorum; thus we need a total of two nodes to satisfy this query. At the storage layer, what happens is that the query quickly finds the partition key using the hash, and then it scans the columns in order to produce the result. So our simple query by key is actually a range query under the hood, and that's really important to understand.
So what happens when we actually do a range query, such as the one here? Well, in this case we've added a query predicate limiting the year to greater than or equal to 1990. Again we see it requires only two nodes to satisfy the query. Then at the storage layer, again we find the partition key (that's going to be the first step in most of these queries) and then scan until the value is less than 1990. The important point here is that Cassandra only had to scan the columns it needed.
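The predicate being described might look like this, assuming the authors table sketched earlier with name as partition key and year as the first clustering column:

```sql
SELECT * FROM authors
WHERE name = 'Tom Clancy' AND year >= 1990;
```

Because year is a clustering column, this is a contiguous slice of one partition on disk, which is why only the matching columns get scanned.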
So what did we learn so far? We saw that if you want a fast query, it needs to be sequential at the storage layer, because of Cassandra's log-structured storage; and, as a corollary, you'll want to always try to query by key and clustering columns, because this results in a sequential lookup. We also noted that multi-key queries often result in lookups on many nodes. And lastly, you should always write in your intended read order. So those are some things that I hope you've gleaned from that so far.
So now that we've covered some of the basic stuff, let's move on to some more advanced concepts, starting with collections. As a reminder, Cassandra supports three collection types: first, sets, which are unordered (or, more accurately, naturally ordered) collections of items; then lists, which are ordered and allow duplicates, just like Java lists and Java sets and those sorts of things; and maps, which are key-value pairs. So if you've been wondering where your dynamic columns went from your Thrift model, this is a good place to start.
It works in some of those use cases, but we do have some caveats: the first being that we can only store 64,000 items in a collection, with each item having a max size of 64K. The second is that querying a collection always returns the entire collection; that's pretty important, especially if you have a lot of data. So the net result is that these are good for relatively small, bounded data sets. So let's look at an example of a set here.
We have a table of authors, with each author having a set of books. You can add and remove items from a set using an update statement as well; I'm only showing the insert here, but it does have update semantics. At the storage layer, this is actually stored as a two-part composite column name: the first part is the name of the set, and the second is actually the set item itself.
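A sketch of the set example being described (table layout and values are assumptions; the set syntax is standard CQL):

```sql
CREATE TABLE authors (
  name text PRIMARY KEY,
  books set<text>
);

INSERT INTO authors (name, books)
VALUES ('Tom Clancy', {'Patriot Games', 'Without Remorse'});

-- The update semantics mentioned above:
UPDATE authors SET books = books + {'Red Storm Rising'}
WHERE name = 'Tom Clancy';
```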
So with this approach it sort of naturally disallows duplicates, because if you try to insert a duplicate, it simply causes an overwrite of the item. Lists actually look very similar at the CQL level, except that the curly braces are replaced by brackets, and the update syntax supports append and prepend operations to allow you to order things. You can also see that when we query the list, the list of books is in the order we specified, and not in the natural order.
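The list version might look like this (same illustrative table, brackets instead of braces, plus the append and prepend operations mentioned above):

```sql
CREATE TABLE authors (
  name text PRIMARY KEY,
  books list<text>
);

INSERT INTO authors (name, books)
VALUES ('Tom Clancy', ['Without Remorse', 'Patriot Games']);

-- Append:
UPDATE authors SET books = books + ['Red Storm Rising']
WHERE name = 'Tom Clancy';

-- Prepend:
UPDATE authors SET books = ['The Hunt for Red October'] + books
WHERE name = 'Tom Clancy';
```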
You can see that Without Remorse obviously would come after Patriot Games if we were doing it in natural order. So, unlike the set, in the list the actual list item is stored in the column value, since we want to allow for duplicates and we don't want natural sorting. Instead, each item in the list includes an ordering ID, and this ID gives us a way to exploit the natural ordering of column names while maintaining the order we asked for. Now, on to maps. You'll notice that the sort order on a map is actually based on the natural ordering of the key: Patriot Games comes ahead of Without Remorse. The map structure bears a strong resemblance to the list structure, except that the ordering ID is replaced by a map key. This prevents duplicate keys, as a duplicate will result in an overwrite, and it allows keys to be ordered.
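A map sketch consistent with the description (the value type is an assumption; keys sort naturally, which is why Patriot Games comes ahead of Without Remorse):

```sql
CREATE TABLE authors (
  name text PRIMARY KEY,
  books map<text, int>   -- title -> year (illustrative)
);

INSERT INTO authors (name, books)
VALUES ('Tom Clancy', {'Patriot Games': 1987, 'Without Remorse': 1993});
```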
So at the beginning of the presentation I said you can't just index everything, so let's take a look at what actually happens when you create a secondary index. Here's a simple example where we're indexing the publisher column on the authors table. Before diving into the details, let's remind ourselves of some important points regarding indices. First, they allow query by value, in some limited circumstances; some limited circumstances, okay? They are distributed based on the row key of the index table.
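The index from the slide, sketched in CQL (the publisher value in the query is made up for illustration):

```sql
CREATE INDEX ON authors (publisher);

-- Which then permits this query-by-value:
SELECT * FROM authors WHERE publisher = 'Putnam';
```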
This is really important, so we'll examine this concept in a little more detail in a minute. Updates are atomic, so you know that when you insert an indexed field, the index is updated as well. And indices should be low cardinality, or they won't scale to large cluster sizes; if you're going to use an index, that's probably the most important thing to know.
For example, birth year could make a reasonable index, but a UUID does not. But note, as a sort of corollary or opposite to that, we don't want extremely low cardinality either, like a boolean or a gender value. So let's look at how indexes are actually distributed across the cluster. Here we have three nodes in our cluster with three keys, each key having two replicas, using our publisher index.
As an example, you can see that the key of the index is actually the value of the indexed column, which is logical; but indices are not distributed the same way as normal tables, and that's maybe one of the most important things. If you're ever planning to use indexes, pay attention to these two slides. You'll notice that the index only contains values for the keys stored on that node.
So let's make it worse. The index key distribution strategy has implications, right? If you're going to query by value using an index, then theoretically the value we're querying could exist on any node in the cluster. We don't know where it exists; that's why we created an index. So its corresponding rows in the index table could live anywhere in the cluster.
So what the coordinator has to do, then (does everybody know what I mean when I say coordinator? It's the node you ask, when you send the request), the coordinator in this case has to ask enough nodes to cover the entire token range. I'm not a PowerPoint guy, and I had to use PowerPoint for this. So, sorry: it has to cover the entire token range, which will involve many nodes, right?
So in many ways, querying an index has a similar, or even worse, effect as using the IN clause. As a result, indexes should be used very sparingly, and they should not be a crutch for poor data modeling. Modeling data in Cassandra is different: it requires you to think about things, and it requires you to write the data the way you want to read it. I've seen a lot of cases where indexing is just a crutch for people who didn't think about things in advance.
So if you started nodding off during that discussion (I can see a few of you; it's riveting, right? This is great, it's a lot of fun), it's time to tune back in, because it's time to talk about deletes. Deletes like this, right? You know that kind of delete, or the TTL; you're familiar with those. But what about nulls? Who in here knew that inserting nulls causes a delete? One... two guys... three guys.
So, since writes are immutable, you can't actually delete data; you have to add a marker that says, hey, ignore this column. That marker is called a tombstone, and you get one for each column you delete. And in order to make sure it doesn't hand you deleted data, Cassandra has to read those tombstones and remove the affected records from your result set.
Because of this, deleting data in any significant volume is an anti-pattern; the rule is: do not delete data in volume on Cassandra. There are very unusual cases where it can be okay, but if you don't already know what those are, then don't do it. So, to drive this point home, let's look at the difference between columns that are simply missing and deleted columns. As previously discussed, data is stored sparsely, so missing columns simply are not written. When you query using CQL, missing columns appear as nulls, but in reality they just aren't there at all, as we can see in the storage layer representation below. If we turn on tracing to see what's actually happening, we can verify that we just read one column to produce the result. If you haven't used tracing, this is actually a fantastic way to find out where you have issues with your queries. So Cassandra in this case didn't need to reconcile any tombstones. This is really important: if you ever do deletes, you should monitor this and trace your queries to find out what's happening.
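In cqlsh, the tracing being described looks like this (the query itself is illustrative):

```sql
TRACING ON;

SELECT * FROM authors WHERE name = 'Tom Clancy';
-- The trace output lists each step taken, including how many live
-- and tombstoned cells were read to produce the result.

TRACING OFF;
```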
On the other hand, what if we actually specify null values for our ISBN and publisher in this model? Well, as we just discussed, this is going to result in two deletes. So the question comes: why can't Cassandra just not write those values? Right, I specified nulls; why can't it do that? Well, the reason is that inserts and updates are basically the same in Cassandra.
The CQL syntax puts some dressing over that, but if you've used the Thrift API, you know that everything's just a mutation; there's really no difference between inserts and updates. (It's giving away my secrets.) So what's happening is that they're both just writes to the commit log. Cassandra doesn't read before it writes (that would be catastrophic, right?), so it has no way to know whether there were previously values written for those columns. So it must write tombstones, just to make sure. We can actually verify this by tracing our query.
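The null insert being described, sketched against an illustrative table (the column names follow the transcript; the table layout is an assumption):

```sql
-- Each explicit null below is written as a tombstone,
-- even if the cell never held a value.
INSERT INTO authors (name, isbn, publisher)
VALUES ('Tom Clancy', null, null);
```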
For this reason, it also means you need to be careful when using prepared statements; if you listened to the part about deletes, this is where you can get in trouble: prepared statements with a bunch of columns you may not need. This is a common usage pattern in, like, JDBC-type work, right? You create one prepared statement and you just populate the columns you need, because it doesn't matter if you write nulls. Here, it does matter. So in that case it's better to prepare multiple statements that actually cover the permutations that you need.
So let's talk a little bit about why queries fail; maybe a little bit more slowly than my slides are going, I suspect. Yeah, wow, look at that. So I suspect that one of the reasons you decided to come to this talk is because you wanted to get the name of the author who wrote Patriot Games; that's why I came. But you asked Cassandra to tell you, and you got this, right: PRIMARY KEY part title cannot be restricted (preceding part year is either not restricted or by a non-EQ relation).
So, for the love of Oracle, what does that mean, right? To answer this, let's look at some of the reasons why queries fail. First, if you don't provide the full partition key, in most cases you're going to get a nasty response. Cassandra needs the key to know exactly which one of your hundred nodes owns the data, and where it lives on disk.
Second, you might be trying to query by value, as you would in a relational database, but you actually haven't created an index on that value. Without an index, querying by value would result in a full table scan, and if you have terabytes of data, this is a bad idea, right? So Cassandra simply doesn't let you do it. And lastly, the other possibility is that you're confused about how clustering columns work; but don't worry, I'm going to try to resolve your misunderstanding.
So let's go back to an earlier example, where our authors table had a compound partition key: name and year. If you recall, this resulted in a storage layer representation that looks like this, with the two parts of our key acting as the row key. So what happens if we try to do a select specifying only the author's name? Well, as we should expect, it fails with this horrible error message, and that's because we only supplied half of the partition key.
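A sketch of the compound partition key case (names are illustrative; the double parentheses make both name and year part of the partition key):

```sql
CREATE TABLE authors (
  name text,
  year int,
  title text,
  PRIMARY KEY ((name, year))
);

-- Fails: only half of the partition key is supplied.
SELECT * FROM authors WHERE name = 'Tom Clancy';

-- Works: the full partition key is known.
SELECT * FROM authors WHERE name = 'Tom Clancy' AND year = 1987;
```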
Attempting a range query on year gives us the same message, which is a bit misleading, because it implies that maybe we could just add the name part and get what we want, right? But if we do this, we see that, no, in fact we can only use equals comparisons on the partition key parts. This is because, if you recall, the keys are randomly distributed, making range queries impossible, unless you use the ByteOrderedPartitioner, which doesn't exist as far as you're concerned. And don't get too excited about the token function, which is referenced in this error message, because it doesn't solve this use case. The token function allows you to get the hash token for a given key, and it's useful for paging through your entire table; it does not permit range queries on partition keys. So don't get excited. As we've discussed, Cassandra doesn't allow us to query by value without an index.
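What token() does do, assuming a table whose partition key is (name, year) as in the example above (values illustrative): it pages through the table in token order, rather than range-querying the key values themselves.

```sql
SELECT * FROM authors
WHERE token(name, year) > token('Tom Clancy', 1987)
LIMIT 100;
```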
So if we do, as in this example where we're asking for authors with a specific ISBN, we get an error indicating we haven't specified an indexed column. The solution is, in fact, not to create an index; it's to create a sane data model. So perhaps the greatest source of query confusion is actually with clustering columns.
This is probably because many people don't understand how they're stored, which we did look at, so hopefully you do now. (Giving my things away again.) They don't realize that Cassandra limits you, and this is important, so if you don't get anything else from what I've said, get this: Cassandra limits you to, mostly, queries that can return contiguous data without filtering, okay? Once you understand this, queries involving clustering columns make a whole lot more sense. So let's go back to our authors table, except this time year and title are clustering columns.
So year is no longer part of the partition key, and let's remember what this looks like at the storage layer: we have our clustering columns translated to composite column names, right? So let's look at some possible queries. First we'll do a select using the partition key, because now we know we need it, right? Everybody knows that now. And then we'll filter on year. This works, because it's a simple range query.
Okay, and so this is what it looks like again at the storage layer. It's a small range, but in this case it's still a range of columns in order; and specifying a range of years also works, for the same reason, right? Because under the hood it's just starting at one year and going to the next, in order, on disk. So we already know that Cassandra won't tell us who wrote Patriot Games, unfortunately, because we have to tell it the author's name. But what about a query based on name and title, like this?
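Sketched against the authors table with PRIMARY KEY (name, year, title) (values illustrative), the two cases just described look like this:

```sql
-- Works: a contiguous slice of one partition, in stored order.
SELECT * FROM authors
WHERE name = 'Tom Clancy' AND year >= 1987 AND year <= 1993;

-- Fails: title is the second clustering column,
-- and year has been skipped.
SELECT * FROM authors
WHERE name = 'Tom Clancy' AND title = 'Patriot Games';
```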
In this example I have Patriot Games only listed once, but there's absolutely no reason why you couldn't have it many times in your data, and so the only way to find it would be to search all of the rows in order to locate Patriot Games, because it's the second component of the column name. There could theoretically be millions of columns in this row, so Cassandra won't let you do this.
So now, here's an interesting example: we do a select without providing the partition key, and instead we ask for a specific year, which is the first clustering column. Based on what you've already learned, you'd think that Cassandra would simply disallow this. Well, sort of: the error it returns tells us that this is probably a bad idea, but we can do it anyway if we use the ALLOW FILTERING clause. So let's just do it, right? Sure enough, it works.
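The shape of that query (year value illustrative): without the clause, Cassandra rejects it; with the clause, it accepts the scan.

```sql
SELECT * FROM authors WHERE year = 1993 ALLOW FILTERING;
```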
This has the effect of scanning only until you hit a column or columns that match your criteria. However, it actually doesn't protect you from the worst-case scenario, in which you have no records matching the criteria, or your limit can't be satisfied; that would actually result in a full table scan. I'm going to finish up here so we can get some questions. So the bottom line is that Cassandra wants you to query sequential ranges without filtering, and that's because it's designed to handle huge amounts of data, where random seeks and table scans are a bad idea. So think about your queries in that way. So, to sum up, so we can get some questions, let's remember two things: one, Thrift is dead, so switch to CQL as soon as possible; but two, make sure you understand your data models and queries, because it's not SQL. So thank you very much.

Thank you, Robbie.