From YouTube: Apache Cassandra and Python
Description
Jeremiah Jordan
Using Apache Cassandra from Python is easy to do. This talk will cover setting up and using a local development instance of Cassandra from Python. It will cover using the low-level Thrift interface, as well as using the higher-level pycassa library.
What am I not going to talk about? I'm not going to talk about setting up and maintaining a production instance of Apache Cassandra. That's a whole other talk, so I'm just going to talk about using it locally from Python. If you want, you can get a copy of the slides at this address, or PyCon will have them; they'll be linked from the PyCon website later.
All right, so what is Apache Cassandra? Here's the description from the Cassandra wiki on their website: Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It brings together the distributed systems technology from Dynamo and the data model from Google's BigTable. So it's eventually consistent like Dynamo, and it has a column-family-based data model, with keys, columns, and values, like BigTable.
A
Alright,
so
many
things
its
column
based
key
value
store
basically
looks
like
a
multi-level
dictionary
and
you
know
dynamo
from
Amazon
BigTable
from
Google
and
then
the
other
nice
thing
about
its
schema
optional.
So
if
you
want
to
tell
Cassandra
how
your
data
Stipe
tanned,
how
your
data
stored
in
the
tables
you
can
and
there's
some
extra
stuff
you
can
get
out
of
that.
But
if
you
just
want
everything
to
be
bytes
and
you
can
insert
whatever
you
want,
you
can
do
that
too.
So here's the basic structure of how data is laid out in Cassandra. You have a keyspace at the highest level, which is kind of like a schema in a normal database. Inside the keyspace you have column families, and inside column families you have rows, which have keys, and then column names and values. So here's an example set up with some more realistic names for things: you have your application data keyspace with a column family, UserInfo, and it's got keys in it.
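As a rough mental model, here is that layout sketched as nested Python dictionaries (the keyspace, column family, row keys, and columns are all made up for the illustration):

```python
# Illustration only: Cassandra's layout viewed as nested dictionaries.
app_data = {                                # keyspace (like a schema)
    'UserInfo': {                           # column family
        'jsmith': {                         # row key
            'email': 'jsmith@example.com',  # column name -> column value
            'state': 'IL',
        },
        'jdoe': {
            'email': 'jdoe@example.com',
        },
    },
}
```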
The keys are not sorted; they're in a random order, for the most part, and that's the recommended way of using Cassandra. Basically, the ordering of the keys is how Cassandra decides to store data across a cluster of machines. It's a distributed system, and the easiest way to do that is to have a random ordering of keys, so that stuff gets evenly distributed across your cluster.
A
If
you
use
one
of
the
ordered
methods
of
storing
keys,
then
you
have
to
be
careful
to
put
putting
data
hotspots
on
your
cluster
because
the
the
data
is
put
in
an
order.
So
if
everything
is
grouped
together,
you're
gonna
have
one
server,
that's
getting
hit
by
every
single
one
of
those
requests
for
the
data.
A
Alright
and
then
your
column
names,
column
names
are
sorted
and
column
names.
Art
can
be
typed
so
like
in
the
first
column,
the
first
column
family.
Here,
the
user
info
column,
family,
the
column,
names
are
strings
and
so
they're
sorted
in
string,
sort
order
and
the
second
column
family,
the
column
names
are
time-based
unique
identifiers,
and
so
those
are
sorted
in
time
order.
So time one, time two, time three are going to be sorted in time order, and we'll get into how that's useful later. Then every column has a value associated with it as well, and you can also tell Cassandra, or not tell Cassandra, the types for the values; we'll talk a little more about that. If you type the values, then you can use some of the extra indexing features Cassandra has. Every column-name/value pair also has a timestamp associated with it, and that timestamp is used to do conflict resolution. So if multiple people write to a given key with the same column name, the one with the higher timestamp wins. The timestamp is client-provided, so when you're using Cassandra, if you have multiple clients writing to the system, make sure their clocks are in sync if they have any chance of writing to the same spot.
A
Really
it's
the
at
the
lowest
level.
It's
an
ordered
dictionary.
So
your
your
columns,
you
can
go
through
them
in
sorted
order,
all
right.
So
now
we
know
a
little
bit
about
it.
Where
do
you
get
it?
You
get
it
from
Cassandra,
Apache,
org
or
if
you
want
there's
a
company
out
there
called
data
stacks
that
provides
Debian
and
Red
Hat
packaging
for
Cassandra,
so
once
you've
downloaded
it
extracted
installed
it
whatever.
For a development instance, you're going to want to change the Cassandra configuration files somewhat, away from the default locations. So in the YAML file you're going to change where the data is stored and change where the logs are stored. The other thing you're going to want to change on a development instance is that you probably don't want it using all your RAM; by default it's going to take up half the RAM on the system. You probably don't want that for a dev instance you're running unit tests on. All right.
So the first thing you need to do is connect and set up some keyspaces and column families for your code to stick stuff into. There's a command-line interface tool that ships with the distribution. So you start up the command-line tool, connect to it, and create your keyspace. When you create a keyspace, you also have something called a placement strategy, which basically says how Cassandra is going to store that data across the cluster, so where it's going to put each key.
A
Is
it's
going
to
take
your
ring
of
machines
and
chop
them
up
into
equal
space
chunks
and
stick
the
data
across
those
different
locations?
There's
other
different
things.
You
can
do.
There's
Network
aware
strategies
that
can
know
about
data
centers
and
we'll
make
sure
data
goes
into.
You
know.
One
copy
goes
into
datacenter
1
and
2
and
a
datacenter
2.
However,
you
want
to
set
it
up
and
that's
where
the
strategy
options
come
in
there
we're
for
the
simple
strategy.
A
Basically,
you
just
taught
how
many
machines
you
want
a
piece
of
data
on
for
the
more
complex
strategies
you
can
tell
it.
One
piece
of
data
over
here
over
there
3
in
the
third
place.
However,
you
want
to
set
it
up.
So
then,
after
you
create
your
key
space,
you
create
your
column.
Families
inside
that
key
space.
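The talk does this with the cassandra-cli tool that ships with the distribution; as a rough Python equivalent, here is a sketch of the same schema created with pycassa's SystemManager (the keyspace name, column family names, and replication factor are assumptions for the example):

```python
from pycassa.system_manager import (SystemManager, SIMPLE_STRATEGY,
                                    UTF8_TYPE, TIME_UUID_TYPE)

sys_mgr = SystemManager('localhost:9160')

# Simple placement strategy: just say how many replicas each row gets.
sys_mgr.create_keyspace('AppData', SIMPLE_STRATEGY,
                        {'replication_factor': '1'})

# Column names in UserInfo are UTF-8 strings, sorted in string order.
sys_mgr.create_column_family('AppData', 'UserInfo',
                             comparator_type=UTF8_TYPE)

# Column names in UserActivity are time-based UUIDs, sorted in time order.
sys_mgr.create_column_family('AppData', 'UserActivity',
                             comparator_type=TIME_UUID_TYPE)
sys_mgr.close()
```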
Like I said, when you create a column family, you have to tell Cassandra what kind of column names you're going to stick in there. If you just want to write arbitrary columns, there's a bytes type, so you can stick whatever you want in as a column name. Or you can tell it the columns are strings, and then they'll be sorted in string order; you can tell it they're UTF-8 strings and it'll take care of that. Or you can do things like make it a time-based UUID, so that you can sort things in time order.
A
You
can
make
stuff
make
them.
Integers
floats
whatever
you
want.
Basically,
a
column
name
is
really
just
another
spot.
You
can
stick
data,
unlike
most
databases.
So
then,
once
you've
got
your
schema
set
up,
you
want
to
connect
to
the
system,
so
you're
gonna
go.
You
want
to
client
Union
there's
the
Cassandra
wiki
has
links
to
most
of
the
up.
To
date,
clients
for
connecting
to
it
in
a
variety
of
programming
languages,
the
ways
to
connect
to
it
from
Python,
so
Cassandra
is
built
with
thrift
as
its
main
interface.
Thrift is another Apache project that basically gives you remote procedure call interfaces in a variety of languages. There's a Thrift compiler: you give it some IDL, and it spits out code to talk to a server that implements a Thrift interface. Thrift has code generators for probably 20 different languages, at least. But you don't really want to use the Thrift interface directly; it's not a good idea, so you want to get a native client. For a native client, pycassa is the one I'm going to talk about today.
A
There's
also
telophase
for
your
doing.
If
you
have
a
twisted
app,
you
don't
want
to
use
telophase
and
then
there's
a
new
client
called
Cassandra
db-api
2,
which
basically
implements
the
python
db-api
2o.
On
top
of
there's
a
new
CQ
l,
query
language,
that's
being
developed
as
a
new
interface
for
talking
to
cassandra
and
that
db-api
compliant
interface
uses
the
the
sequel,
CQ
l
interface,
the
CQ
l
interface
doesn't
have
all
of
the
functionality
of
the
thrift
interface
yet,
but
it's
getting
there,
but
that
exists
as
well.
A
If
you
have
something
that's
already
using
a
db-api
2o
interface.
So
this
is
why
you
don't
want
to
use
thrift.
Basically,
the
the
IDL
generated
code
has
a
whole
bunch
of
extra
objects
and
stuff
like
that
in
it.
So
you
get
a
lot
of
code
generated.
You
want
to
use
Picasa,
that's
doing
the
same
thing
connecting
and
inserting
a
value
see
it's.
You
know,
half
a
third
of
the
lines
of
code,
so
Picasa
has
very
good
documentation
up
on
github
and
then
there's
also
an
example.
There's also an example application implemented called Twissandra, which is basically a Twitter clone using Django and pycassa. It's a good thing to go look at. So let's go through a little bit of using pycassa. Connecting is very simple: you create a ConnectionPool object, you tell it what keyspace you're going to connect to, and you give it a list of the servers that are in your cluster, and pycassa will deal with it if there are any errors or anything like that.
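A minimal connection sketch (the keyspace name and server address are assumptions carried over from the earlier example):

```python
import pycassa

# One pool per keyspace; give it the list of servers in your cluster.
pool = pycassa.ConnectionPool('AppData', server_list=['localhost:9160'])
```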
A
Cassandra
is
actually
very
good
about
scaling
up
pretty
linearly
when
you
add
more
nodes,
if
you
add
more
processes
to
go
across
those
different
nodes,
Netflix
actually
published
a
pretty
good
benchmark,
very
extensive
benchmark
where
they
had
millions
of
clients
talking
to
hundreds
of
Cassandra
nodes
across
multiple
AWS
regions,
and
there
was
a
very
linear
scale
up
in
their
benchmark
is
I
was
surprised
when
I
saw
it.
It
was
pretty
impressive.
A
If
you
would
go
check
out
Netflix's
tech
blog,
they
have
a
lot
of
articles
about
them
using
Cassandra
they're
using
it
from
Java,
but
it's
a
good
at
least
what
you
can
do
with
Cassandra.
It's
pretty
interesting.
So
now,
once
you've
created
your
connection,
pool
you're
going
to
create
a
column
family
object,
you
know
tell
it
what
connection
pool
to
use
what
Collin
family
you
want
it
to
talk
to.
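A sketch of that step, continuing from the connection above (the column family, row key, and columns are made up for the example):

```python
# Point a ColumnFamily at the pool, then insert a row of columns.
user_info = pycassa.ColumnFamily(pool, 'UserInfo')
user_info.insert('jsmith', {'email': 'jsmith@example.com', 'state': 'IL'})
```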
Reads are also very simple. You say get and give it a key, and you can optionally pass in a list of columns; there are actually some other options to get that I'll get into later, for fancier things you can do. Delete is the same thing: give it a key and, optionally, a list of columns. All right.
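Reads and deletes look like this (same hypothetical row as above):

```python
row = user_info.get('jsmith')                       # whole row as a dict
email = user_info.get('jsmith', columns=['email'])  # just some columns
user_info.remove('jsmith', columns=['state'])       # delete one column
user_info.remove('jsmith')                          # delete the whole row
```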
And then there's the batch interface, which is probably what most people are going to want to use a lot. You do a batch insert, and basically you just give it a multi-level dictionary, a key to another dictionary of columns and values, and it'll insert all of that in a batch. Then there's also a streaming interface for batching. If you create a batch object from the column family, you can optionally give it a queue size; if you specify a queue size, then every so many function calls, in this case every 10, it'll batch those together and send them off to the server. Or you can not specify one, and it only sends when you call send. Once you've created this batch object, you just do your inserts and your removes just like on the regular object, and then either when the queue size is hit or when you call send, it's going to send all those off. You can actually batch up inserts and removes in the same batch and it works fine.
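A sketch of both batching styles described above, with made-up keys and columns:

```python
# One-shot batch: a dict of row keys to column dicts.
user_info.batch_insert({
    'jsmith': {'email': 'jsmith@example.com'},
    'jdoe':   {'email': 'jdoe@example.com'},
})

# Streaming batch: flushes automatically every queue_size operations.
b = user_info.batch(queue_size=10)
b.insert('alice', {'email': 'alice@example.com'})
b.remove('jdoe', ['state'])   # inserts and removes can be mixed
b.send()                      # flush anything still queued
```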
So this is good when you've got a stream of data coming in: it's always more efficient to do things in batches, but you can treat it just like you're doing one at a time. Your code doesn't have to know that internally it batches stuff up. The other thing you can do with batches is do batches across multiple column families, and this is nice because, basically, your insert is going to succeed or fail atomically as an operation across multiple column families.
So it's nice for doing things like inserting into one column family and also into a second column family, maybe as an index or a denormalized query, or some other way of doing it. To do that batch across multiple column families, you create a Mutator object, and then for every operation you specify the column family object for the column family you want that batch operation to happen on, and it basically works the same way.
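A sketch of the cross-column-family batch (the second column family, UsersByState, is a hypothetical denormalized index; user_info and pool come from the earlier snippets):

```python
from pycassa.batch import Mutator

users_by_state = pycassa.ColumnFamily(pool, 'UsersByState')

m = Mutator(pool)
m.insert(user_info, 'jsmith', {'state': 'IL'})
m.insert(users_by_state, 'IL', {'jsmith': ''})  # denormalized index row
m.send()                                        # sent as one batch
```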
A
Like
I
said,
columns
have
types
so
and
they're
stored
in
a
sorted
order,
so
you
can
do
what's
called
column
slicing,
and
so,
when
you
say
get
you
can
specify
start
and
finish
values
for
doing
a
slice,
and
it
will
return
you
all
of
the
columns
which
have
a
column
name.
That's
you
know
sorts
between
those
two
values
and
which
is
so.
A
You
know
which
you
can
do
for
addresses,
but
which
is
also
nice
for,
if
you
do
things
like,
if
you
do
the
time
you
you
IDs,
you
can
say
create
a
start
time
say
ten
minutes
ago
and
then
ask
for
all
of
the
things
that
have
been
inserted
in
the
last
ten
minutes.
You
know
all
of
the
activity
in
the
last
ten
minutes
for
something.
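A slicing sketch, assuming the UserActivity column family with a TimeUUID comparator set up earlier; pycassa accepts datetime bounds for TimeUUID slices:

```python
from datetime import datetime, timedelta

user_activity = pycassa.ColumnFamily(pool, 'UserActivity')
start = datetime.utcnow() - timedelta(minutes=10)

# All columns whose TimeUUID name falls in the last ten minutes.
recent = user_activity.get('jsmith', column_start=start,
                           column_finish=datetime.utcnow())
```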
A
You
can
actually
just
stick
an
integer
into
that
dictionary
and
Picasso
will
go.
Oh,
he
told
me
that
age
is
an
integer,
so
I'm
gonna,
you
know,
convert
32
into
a
byte
stream
and
insert
those
bytes
in
to
Cassandra,
or
he
told
me
height,
is
a
float
so
I'm
gonna
take
that
float
and
convert
it
to
an
I,
Triple
E,
float
representation
and
store
that
into
Cassandra
and
then
it'll
do
the
same
thing.
When
you
do
the
get
back
out.
So to set these up, basically, you create these objects using the special types from the pycassa types library, to say: the key is going to be a UTF-8 string; in this case, the email address is ASCII, age is an integer, height is a float, joined is a date type. Then you create a ColumnFamilyMap object, just like a ColumnFamily object, but you also pass in that object you created that says what all the column names map to in terms of their types.
Then, to write something with this interface, you instantiate the user object and just fill in all the attributes on it: the key is john, the email is john at gmail, age 32, height 6.1, they joined on datetime now. And then, when you call insert, pycassa is going to take all those things and convert them.
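A sketch of that typed-object setup, following the example from the slides (the class, the 'Users' column family name, and the values are illustrative; pool comes from the earlier snippets):

```python
from datetime import datetime
from pycassa.types import (UTF8Type, AsciiType, IntegerType, FloatType,
                           DateType)
from pycassa.columnfamilymap import ColumnFamilyMap

class User(object):
    key = UTF8Type()
    email = AsciiType()
    age = IntegerType()
    height = FloatType()
    joined = DateType()

users = ColumnFamilyMap(User, pool, 'Users')

john = User()
john.key = 'john'
john.email = 'john@gmail.com'
john.age = 32
john.height = 6.1
john.joined = datetime.utcnow()
users.insert(john)   # each attribute is packed to its declared type
```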
All right, and so then the next thing. Like I said at the beginning, all column-value pairs have timestamps associated with them. You can get at those timestamps by saying include_timestamp equals true on get. Then, on inserting, you can provide your own timestamp instead of pycassa just using now, if you want to specify what timestamp something gets stored with; you just add an extra parameter to insert to do that.
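A sketch of both timestamp features (column names and values are made up; pycassa timestamps are microseconds since the epoch):

```python
import time

# Returns {column_name: (value, timestamp)} instead of {column_name: value}.
cols = user_info.get('jsmith', include_timestamp=True)

# Supply your own timestamp instead of letting pycassa use "now".
user_info.insert('jsmith', {'email': 'new@example.com'},
                 timestamp=int(time.time() * 1e6))
```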
And then there's also something called consistency level in Cassandra.
So when you have a multi-machine cluster, the consistency level is how many machines you want Cassandra to check the data on before it returns you an answer. So if you're storing data across three machines, when you insert, you can say: I want you to return success to me when one of those machines has said it got the value. Or you can say: I want a quorum, which means a majority of those machines have gotten the value before success is returned to me. Basically, you can pick how consistent you want to be.
If you just care that it's fast, you can use ONE, and then you'll just get whatever value that machine has on it. So you can pick how consistent you want your data to be and how fault-tolerant you want your data to be.
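A sketch of setting the consistency level per call (it can also be set as a default on the ColumnFamily object; the row and columns are the same made-up ones as before):

```python
from pycassa import ConsistencyLevel

# QUORUM: wait for a majority of replicas to acknowledge the write.
user_info.insert('jsmith', {'state': 'IL'},
                 write_consistency_level=ConsistencyLevel.QUORUM)

# ONE: return as soon as a single replica answers the read.
row = user_info.get('jsmith',
                    read_consistency_level=ConsistencyLevel.ONE)
```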
All right, so, indexing. Cassandra has some native indexing built into it, and you can also roll your own indexing.
You have to have told Cassandra what type that column is going to be, and then basically it's going to build a column family in the background that's keyed by the column values, so that you can search on that index and it'll give you back the rows that go with it. You can also do filtering when you query.
You always have to have at least one equality operation, but after it's matched that equality, you can do things like greater-than, less-than, and equal-to on other columns in your query, and I'll show that real quick. But this isn't recommended for really high-cardinality values. Cassandra has a maximum of 2 billion columns in a row, so I mean really high cardinality.
A
It's
not
going
to
be
very
performant
using
the
the
native
indices
and
then
the
native
indices.
Also
slow
down
writes
a
little
bit
because
the
server
always
has
to
do
a
read
before
right
before
it
can
insert
your
data
so
to
make
sure
it
doesn't
have
to
update
an
older
value
so
to
use
it
to
add
an
index
I'm
just
going
to
go
to
in
the
command
line.
Do
an
update,
column,
family
I'm,
going
to
say,
update
it,
so
that
state
is
a
utf-8
and
it
has
an
index
on
it.
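The talk does this with an update column family command in the CLI; a rough Python equivalent is SystemManager.create_index (the keyspace and column family names are the assumed ones from the earlier snippets):

```python
from pycassa.system_manager import SystemManager, UTF8_TYPE, KEYS_INDEX

sys_mgr = SystemManager('localhost:9160')
# Declare 'state' as a UTF-8 value and put a KEYS index on it.
sys_mgr.create_index('AppData', 'Users', 'state', UTF8_TYPE,
                     index_type=KEYS_INDEX)
sys_mgr.close()
```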
The only index type right now is KEYS; there are other index types they're planning to add, but that's your only choice right now. And so, basically, once you've added that index, in pycassa you're going to create an index expression. So here I'm going to search for everyone who lives in Illinois and whose age is greater than 20.
Once I've created those two index expressions, I create an index clause out of them, and you put them in the clause in the order you want them checked. Then you can use get_indexed_slices, passing in that clause, and it's going to return you an iterator that pages through the values returned from the database to give you back all of the indexed values. It will actually default to paging batches of data back from the server; I think the default page size is a thousand items, and you can specify what you want the paging size to be.
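A sketch of that index query (the column names match the earlier hypothetical Users schema):

```python
import pycassa
from pycassa.index import create_index_expression, create_index_clause, GT

users_cf = pycassa.ColumnFamily(pool, 'Users')

state_expr = create_index_expression('state', 'IL')   # equality (required)
age_expr = create_index_expression('age', 20, GT)     # age > 20
clause = create_index_clause([state_expr, age_expr], count=100)

# Yields (key, columns) pairs, paged back from the server.
matches = list(users_cf.get_indexed_slices(clause))
```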
A
You
also
specify
a
maximum
count
of
things
to
be
returned.
The
other
thing
you
you
can
roll
your
own
indices.
Basically,
you
know,
like
I,
said
before
use
the
batching
interfaces
to
write
data
to
two
different
column,
families,
so
that
you
don't
have
to
do
the
read
before
right.
If
you
know
the
things
new,
the
other
thing
you
can
do,
if
you
do
it
yourself,
do
you
normalize
your
queries
so
that
when
you
read
something
from
the
index
row
that
has
your
data
in
it?
A
A
A
Yes, so, can the key be a composite value? Yes. Cassandra does have, and I haven't talked about it, a more advanced feature that's new in Cassandra 1.0, where you can tell Cassandra that a key or a column name is a composite value. Basically, the composite types let you say: this key is going to be a string and an integer, or this column is going to have two strings, an integer, and a date in it.
Basically, it's going to concatenate all that stuff together before it stores it to the database. And then, for the indexing: if you actually have composite values, the indexing can know about the composite values, so you can do some interesting sorting things using composite values. It'll sort by the first part of the composite, then the second part, then the third part. So, yeah.
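Here is a hedged sketch of what that might look like in pycassa, assuming its CompositeType support and made-up names (the Scores column family and its columns are purely illustrative):

```python
from pycassa.types import CompositeType, UTF8Type, IntegerType
from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_column_family(
    'AppData', 'Scores',
    comparator_type=CompositeType(UTF8Type(), IntegerType()))
sys_mgr.close()

scores = pycassa.ColumnFamily(pool, 'Scores')
# Column names are (string, integer) tuples, sorted component by component.
scores.insert('game1', {('alice', 1): 'first try'})
```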
So she's asking: how does Cassandra store the data in the key-value system? Does it store it to RAM first, or to disk first? So, it depends on what consistency level you tell Cassandra when you write data. If you use a consistency level of ANY, it's just going to write the data in RAM before it replies back to you, but it will then eventually propagate it out to the other machines.
If you use a consistency level of ONE, it means that data is in the commit log on disk on one of the nodes. So basically, Cassandra has a commit log, where it writes data really fast at the end of the commit log, in order, and then eventually, as it collects data up, it writes stuff out to SSTables on disk, which are the sorted tables it uses for actually indexing stuff.
A
The
so
writes
are
really
fast
because,
as
long
as
you
keep
your
commit
log
on
a
separate
disk
from
your
random
access
to
your
tables,
the
commit
log
is
always
written
in
sequential
order.
So
your
hard
disk
is
always
just
your
read
head
doesn't
have
your
write
head
doesn't
have
to
move;
it's
always
just
writing
to
the
end
of
this
file,
so
that
makes
the
writes
fast
and
then
yeah.
So
then,
with
your
other
consistency
level,
it's
how
many
machines
is
it
in
the
commit
log
of
before
it
replies
back
to
you.
A
B
A
Yes, that's the consistency level. Sorry, the replication: you specify replication on a per-keyspace basis, so you say this keyspace gets this many replicas on these servers. But then your consistency level specifies how many of those replicas data is written to, or read from, before you get an answer to your query.
I mean, it can be used for pretty much anything, anything that fits this data model, where you're coming in with keys and you want either ranges of values or anything like that. And because of the ordering of the columns, there are interesting things you can do with having data based on keys. But the other thing with Cassandra, and most other NoSQL key-value stores, is that basically you want to store your data so, pretty much...