From YouTube: C* Summit 2013: Cassandra Internals
Description
Speaker: Aaron Morton, Apache Cassandra Committer
Slides: http://www.slideshare.net/aaronmorton/apachecon-nafeb2013
My name is Aaron Morton, and today I'm going to talk about the Cassandra codebase. Obviously it's quite a complex codebase; it solves complicated problems, so we're not going to dive into it in great detail. What I'm going to do instead is pull out some of the more important classes and point to some of the places where the exceptions that you might get as a client happen, or where timeouts are raised. A little bit about me before we get started.
I'm a freelance Cassandra consultant, I'm based in Wellington down in New Zealand, and I'm a committer on the Apache project. We're going to start by talking about the architecture and then we're going to dive into the codebase. Architecture-wise, we can draw a pretty simple diagram: there are some clients, and they connect to APIs. Those APIs talk to some software components that we can call cluster aware. They in turn talk to software components that are cluster unaware, which read and write from disk.
When we put this stack together in a cluster, as you'd expect, the cluster-aware components talk to each other. The cluster-unaware components just continue to do their thing, reading and writing from disk, without worrying about the other nodes. But we can put some better labels on this diagram. Cassandra is a Dynamo-based system, so let's call our cluster-aware components the Dynamo layer, and let's call our cluster-unaware components a database layer.
Cassandra is just a fast log-structured storage engine with this Dynamo layer put on top of it.
The database components take things from memory and put them on disk, and take things from disk and put them in memory. The Dynamo layer collects that memory from the various machines and brings it together on the coordinator. That's where we take care of eventual consistency and all those guarantees; the result is made available to the API and it gets sent back to your client.
So we can expand our overview. Today we're going to talk about an API layer, a Dynamo layer and then a database layer, and we'll break the API layer up into two parts: transports and services.
We use thrift as a transport. If you do nothing, your API call starts in the CustomTThreadPoolServer; in these slides, when I name a class I'm talking about the Java package in the codebase, so this one is org.apache.cassandra.thrift.CustomTThreadPoolServer. This guy uses a thread pool that has one thread per client connection. That thread stays with the client connection for the life of the connection; at the end of it, it cleans up and then goes on to process another client.
You might also find in the codebase a CustomTNonblockingServer; that's an older, deprecated server. What you will find, though, is the CustomTHsHaServer. Whichever you choose, you set the server type using the rpc_server_type setting in the yaml configuration. The default, sync, is the CustomTThreadPoolServer; hsha chooses the CustomTHsHaServer. The difference is that hsha uses a thread pool where each thread processes one request from a client and then goes on to look for another request to process.
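The difference between the two threading models can be sketched outside of Cassandra. Here is a minimal Python illustration (not Cassandra code, which is Java): serve_sync pins one worker to each connection for its whole lifetime, while serve_hsha shares a small pool whose threads each take one request at a time.

```python
# Illustrative Python sketch of the two thrift threading models (not
# Cassandra code): "sync" pins a worker to a connection for its lifetime,
# "hsha" shares a small pool that handles one request at a time.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

def serve_sync(connections):
    # One worker per connection; it handles every request on that connection.
    results = []
    def handle_connection(requests):
        return ["done:" + r for r in requests]
    with ThreadPoolExecutor(max_workers=len(connections)) as pool:
        for batch in pool.map(handle_connection, connections):
            results.extend(batch)
    return results

def serve_hsha(connections, workers=2):
    # A small shared pool; each thread pulls one request at a time.
    queue = Queue()
    for requests in connections:
        for r in requests:
            queue.put(r)
    def drain(_):
        out = []
        while True:
            try:
                out.append("done:" + queue.get_nowait())
            except Empty:
                return out
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for out in pool.map(drain, range(workers)):
            results.extend(out)
    return results
```

Both produce the same results; hsha just needs far fewer threads when there are many mostly-idle connections.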
It's a bit more efficient in terms of the number of threads that you use. Now, that's thrift, and it's pretty simple, because a lot of that code is generated thrift code, and we don't want to talk about that today. The native binary transport is new in Cassandra 1.2; in fact, it's labeled beta. It uses Netty under the hood and it's disabled by default; you enable it using start_native_transport in the yaml. You can run both thrift and native binary: they use two different ports and they happily coexist.
You can dig into this guy a little bit further. Start in our Server class in the transport package; in the run function we build the execution handler. This is our thread pool: a thread processes one request from a client and goes on to find another request to process, from a different client perhaps. To do that it uses NIO server sockets under the covers, and it sets up this thing in Netty that we call the server pipeline.
This has a series of stages that your request goes through: frame encoding and decoding, frame compression and decompression, and message encoding and decoding, up to a stage called the dispatcher, which is where your request actually runs. Our dispatcher is the Dispatcher class inside the transport Message class. It's got a pretty simple messageReceived function on it. This gets called with a message: it gets our server connection and validates that we expect to receive this message for our current state. If we're already authenticated, we don't expect to be given credentials.
We take the request and we execute it. That gives us a response that we want to send back to the client. Before we do that, though, we get that server connection again and check to see if we want to make a state transition: do we now move this server connection into an authenticated mode, because we've actually got the credentials and everything is good?
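That validate-then-transition flow is a small state machine. Here is an illustrative Python sketch; the state and message names are assumptions, not the real ServerConnection API.

```python
# Illustrative sketch of the connection state checks; state and message
# names are assumptions, not the real ServerConnection API.
class ServerConnection:
    UNINITIALIZED, AUTHENTICATION, READY = "uninitialized", "authentication", "ready"
    EXPECTED = {
        UNINITIALIZED: {"STARTUP"},
        AUTHENTICATION: {"CREDENTIALS"},
        READY: {"QUERY", "PREPARE", "EXECUTE"},
    }

    def __init__(self):
        self.state = self.UNINITIALIZED

    def validate(self, msg_type):
        # Reject messages we do not expect in the current state.
        if msg_type not in self.EXPECTED[self.state]:
            raise ValueError("unexpected %s in state %s" % (msg_type, self.state))

    def transition(self, msg_type, credentials_ok=True):
        # Decide whether this message moves the connection to a new state.
        if self.state == self.UNINITIALIZED and msg_type == "STARTUP":
            self.state = self.AUTHENTICATION
        elif (self.state == self.AUTHENTICATION and msg_type == "CREDENTIALS"
              and credentials_ok):
            self.state = self.READY
```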
The messages in that native binary transport are described in a human-readable document that's in the source tree. You can go and have a look and get a really good idea of what this protocol is talking about. On to the services layer now. JMX: you should all be familiar with JMX, and you need to know how to use it to admin your servers. The management beans that are exposed are spread around the codebase. They all implement interfaces that end in MBean, so they're pretty easy to find. The interfaces themselves all have pretty good documentation on them, so you can go and look that up and find out what they're doing. They're implemented in a really sensible way as well.
There's an interface called StorageProxyMBean; it's implemented on the StorageProxy class and it's exposed in JMX under org.apache.cassandra as StorageProxy. Pretty easy to track those down. On to thrift at the service layer: this is our RPC interface, and it's implemented on the org.apache.cassandra.thrift.CassandraServer class.
It implements the entire thrift interface and does the sorts of things you would expect it to: it has access control and input validation, and it maps from thrift types to our internal types and back again, because we don't use thrift internally. The thrift interface itself is in an IDL (interface definition language) file. It's not a very human-readable document, but you can go and have a look; there are a few comments in there, and you can get a bit of an idea of all the functions that are available on the thrift interface.
Let's take a look at probably one of the more popular calls on that: get_slice. This is where we want to get some named columns, or a slice range of columns, from one particular row. This code example is from 2.0. The first thing it does is set up request tracing, if that's enabled on this request. Then it gets the client state; we have client state for both thrift and native binary connections.
On that client state we have the keyspace that you're working in, and we have your username if you've authenticated; we use that to do our access control. Because this is a pretty common function, we've got to go through a few abstractions here to complete our read. We go over to multigetSliceInternal. This is where we validate your thrift request (did you ask us to return minus one rows, something like that?), and then we create an internal representation of your request in an object called a ReadCommand.
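For illustration, the shape of those internal read representations might look like this. The class names SliceByNamesReadCommand and SliceFromReadCommand do appear in the codebase, but the fields shown here are a simplified guess, sketched in Python rather than Java.

```python
# Simplified sketches of the internal read representations. The class
# names follow the Cassandra codebase; these fields are an illustrative guess.
from dataclasses import dataclass
from typing import List

@dataclass
class SliceByNamesReadCommand:
    keyspace: str
    key: bytes
    column_family: str
    column_names: List[bytes]

@dataclass
class SliceFromReadCommand:
    keyspace: str
    key: bytes
    column_family: str
    start: bytes
    finish: bytes
    reversed: bool = False
    count: int = 100
```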
To get the data that it's going to return, it hands the ReadCommand over to the readColumnFamilies function, which takes the ReadCommand and returns our internal representation of its result, just called a ColumnFamily. It does that by handing the ReadCommand down to a class called the StorageProxy. We'll see this guy later; the StorageProxy lives in the Dynamo layer. This is the end of your API call.
That's if you came in through thrift. Over to CQL 3 now: the equivalent of the CassandraServer is pretty much the QueryProcessor. This guy prepares and executes CQL 3 statements, and it's used by both thrift and native binary. If you call execute_cql3_query, you come into the CassandraServer and jump over here for the execution. It has the access control and input validation that you'd expect, and it returns transport messages instead of thrift messages.
So when this is called from thrift, there's a little bit of extra work to thrift-ify that result. CQL 3 itself is described in an ANTLR grammar. Again, this is not a human-readable document (it's designed for the ANTLR compiler), but you can have a look in there and get a pretty good idea of the nature and the structure of CQL 3. The statements in that language are processed in a two-phase approach. First we create these things called parsed statements, which are created by the ANTLR compiler.
A parsed statement is essentially a representation of what you sent to the server. It has some extra things on there for tracking bound terms if you're using prepared statements, and it has a prepare function on it, which then creates something we can execute, called a CQLStatement. These guys have access control and input validation, because we need to check that you're allowed to delete or insert or read, and they have an execute function for when we want to run them.
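The two-phase parse-then-prepare flow can be sketched like this. These are hypothetical Python stand-ins for the Java classes; the names and method shapes are assumptions.

```python
# Hypothetical Python stand-ins for the two-phase flow: a parsed statement
# prepares itself into an executable statement with access control,
# validation and an execute function. Names and shapes are assumptions.
class ParsedSelect:
    def __init__(self, table, columns):
        self.table, self.columns = table, columns

    def prepare(self):
        # Phase two: turn the parsed form into something executable.
        return SelectStatement(self.table, self.columns)

class SelectStatement:
    def __init__(self, table, columns):
        self.table, self.columns = table, columns

    def check_access(self, user_permissions):
        if ("SELECT", self.table) not in user_permissions:
            raise PermissionError("no SELECT permission on " + self.table)

    def execute(self, consistency, variables=()):
        # A real implementation builds read commands; we just echo the plan.
        return {"table": self.table, "columns": self.columns,
                "consistency": consistency, "vars": tuple(variables)}
```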
If you have a look at the parameters there, you can get a feel for this: the consistency level is a parameter, and the query state and the variables that we want to run with (if this is a prepared statement) are all parameters too. Pass those guys in and away we go. Obviously the most popular and most complicated CQL statement is the SELECT statement.
It has an inner class which implements the parsed statement, with a lot of logic in there to understand what you're actually saying in the select clause, and it creates a SelectStatement for us which, in its execute implementation, creates a ReadCommand. These are the same objects, the same classes, that the CassandraServer uses, and it passes these down to the StorageProxy, and this is the same StorageProxy that the CassandraServer uses. So your API call comes into Cassandra, through either the CassandraServer or the QueryProcessor, and ends up at the StorageProxy in the Dynamo layer.
The locator package contains the partitioner, the replication strategy and the snitch, and the stream package contains the logic that we have for streaming files between nodes during repair and bootstrap.
Let's dig into the service package, though. StorageProxy: we saw this guy before. This guy's the workhorse of the Dynamo layer. It's the entry point for all of our cluster-wide storage operations; reading and writing all go through here. It does the sorts of things you'd expect: it selects endpoints.
It checks that consistency-level nodes are available, and it sends messages to stages; these are the stages in the staged event-driven architecture that Cassandra uses, and we'll see more about those shortly. It sits around and waits for a response in a very efficient way, it handles hints if we got a timeout or something like that, and then it returns the result back to the API layer so we can go back to your client. There's another important class in here.
That's the StorageService. This guy's a wrapper for a lot of functionality that's implemented elsewhere: everything to do with tracking the ring state, starting and stopping ring membership, and mapping between nodes and tokens and tokens and nodes, so that we know how to find your replicas for reads and writes. So the StorageProxy comes along and uses the StorageService to work things out.
Say we're going to do a read, and we're going to send the request off to three other nodes to do this read. Hopefully we'll get back a bunch of results, two or three of them, and we need a way to resolve any differences. That's handled by objects that implement the IResponseResolver interface. It's a pretty simple interface: we call preprocess when we get a message from another node, and once we've got enough of those, we call resolve.
Resolve will either give us the response we want to send back to the client or throw an exception. The RowDigestResolver is used the first time we run the read. This is where we ask one node to give us the data that we want to return to the user, and the other nodes to give us a digest of the data that they would send. In the resolve implementation it creates a digest of the data response and compares that to the other digests. If they don't match, it throws a DigestMismatchException.
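The digest-read idea is easy to sketch. Here is a hedged Python illustration, assuming rows are simple column maps and using MD5 as a stand-in for the real digest.

```python
# Hedged sketch of a digest read: one replica returns the row, the others
# return a hash of it; the resolver compares. Rows here are simple
# {column: (value, timestamp)} maps and MD5 stands in for the real digest.
import hashlib

class DigestMismatchException(Exception):
    pass

def digest(row):
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def resolve_digest_read(data_response, digest_responses):
    expected = digest(data_response)
    if any(d != expected for d in digest_responses):
        # This triggers the second, full-data round of the read.
        raise DigestMismatchException("replica digests disagree")
    return data_response
```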
If they do match, it returns the data and we send that back to you. If they don't match, the StorageProxy will run the read a second time, and this time it asks all of the nodes that are involved to return their data, so it uses the RowDataResolver. This guy is a little bit more complicated. It works out the final result that we would send back to the client, using the timestamps to get the correct value, and then it works out the deltas between all of the replicas that participated in this request.
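That timestamp reconciliation and delta calculation could be sketched like so, assuming each replica response is a map of column name to (value, timestamp).

```python
# Sketch of last-write-wins reconciliation plus read-repair deltas,
# assuming each replica response is a {column: (value, timestamp)} map.
def reconcile(replica_rows):
    # Keep the value with the highest timestamp for each column.
    merged = {}
    for row in replica_rows:
        for col, (value, ts) in row.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    return merged

def repair_mutations(replica_rows, merged):
    # For each replica, the columns it is missing or holds stale.
    return [
        {col: vt for col, vt in merged.items() if row.get(col) != vt}
        for row in replica_rows
    ]
```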
It creates mutations to fix those deltas. It sends them one-way to the remote replicas (it doesn't wait for a response), and it returns that correct result that it calculated; we hand that back to the client without waiting for the remote replicas to fix their inconsistency. If you're doing a range scan, we have to use the RangeSliceResponseResolver, which is a bit more complicated again. In a range scan your replicas can return different numbers of rows, so this guy has to handle the fact that one node could return five rows and another node could return ten.
That's great on the read path; obviously we don't use that on the write path. We do, however, have a commonality between the two. It's called the response handler, or a callback. This is an object that implements the IAsyncCallback interface, which comes from the net package, and it's a way for the messaging package to deliver a message to an object that understands your request. This object is in charge of making sure that your request gets completed. It's a simple interface: we call response and we hand over the message.
It's implemented in a class called ReadCallback, and there are equivalents for writes, as you'd expect. The get function on here is handy: we call this when we want to get the response for your request. It waits on a condition, and if that condition is not set before the RPC timeout occurs, we finish waiting and we raise an exception that says read timeout. This read timeout gets turned into a TimedOutException that gets thrown back to you, the client.
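That blocking get can be sketched with a condition variable. This is an illustrative Python version; the real ReadCallback is Java and uses its own condition class.

```python
# Illustrative Python version of the blocking get(): wait on a condition
# until block_for responses arrive, or raise a read timeout (the real
# Java class signals a TimedOutException back to the client).
import threading

class ReadTimeout(Exception):
    pass

class ReadCallback:
    def __init__(self, block_for):
        self.block_for = block_for
        self.responses = []
        self.condition = threading.Condition()

    def response(self, message):
        # Called by the messaging layer; sets the condition when we have enough.
        with self.condition:
            self.responses.append(message)
            if len(self.responses) >= self.block_for:
                self.condition.notify_all()

    def get(self, rpc_timeout=0.05):
        # Called by the coordinator; blocks until enough responses or timeout.
        with self.condition:
            self.condition.wait_for(
                lambda: len(self.responses) >= self.block_for, timeout=rpc_timeout)
            if len(self.responses) < self.block_for:
                raise ReadTimeout("fewer than %d replicas responded" % self.block_for)
            return list(self.responses)
```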
The condition is set once we've received enough responses; that happens in our implementation of response on the IAsyncCallback interface. So what we're waiting for is consistency-level nodes to come back. If you're doing a QUORUM read at RF 3, we're waiting for two nodes to respond, and for the data response from the one node that we asked to give us data. When we get those, the condition is set, we pass through, and then we use the resolver to get the result we want to send back to the client.
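The "how many responses do we block for" arithmetic is simple; here is a sketch covering the three common consistency levels.

```python
# Sketch of how many replica responses a read blocks for, for the common
# consistency levels; replication_factor is the RF of the keyspace.
def block_for(consistency, replication_factor):
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return replication_factor // 2 + 1
    if consistency == "ALL":
        return replication_factor
    raise ValueError("unknown consistency level: " + consistency)
```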
This all comes together in the StorageProxy in a function called fetchRows. This guy gets a list of ReadCommands and iterates through them a couple of times. In the first pass it goes through and, for each one, it creates a list of sorted endpoints that we want to send this read to. This takes into consideration proximity and the consistency level.
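A sketch of that first pass, assuming the snitch gives each endpoint a proximity score (lower meaning closer); the names here are illustrative, not the fetchRows internals.

```python
# Sketch of the first pass: sort live replicas by snitch proximity (lower
# score means closer) and keep as many as the consistency level needs.
class UnavailableError(Exception):
    pass

def choose_endpoints(live_endpoints, proximity, block_for):
    ranked = sorted(live_endpoints, key=lambda e: proximity[e])
    if len(ranked) < block_for:
        raise UnavailableError("not enough live replicas for this read")
    return ranked[:block_for]
```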
It constructs a RowDigestResolver, because this is the first time we're doing this read, and it uses that and some information about the consistency level and the endpoints to construct the ReadCallback. It uses the ReadCommand to create messages, so we can send those ReadCommands to the other replicas, and it passes those messages and the ReadCallback to the MessagingService.
That goes through the sendRR function (send request-response). The MessagingService is now going to send those messages to the remote replicas, and when it gets a response it's going to deliver it into our ReadCallback object; we saw that with the IAsyncCallback interface and the implementation of get. After we've done this for all of these ReadCommands, we iterate through all the ReadCallbacks that we created and we call get. Remember that get is blocking, and it can throw an exception.
A DigestMismatchException means we run the read again, and a read timeout means we have to throw a TimedOutException back to you. Staying in the Dynamo layer, the net package has both a protocol and a transport, and we're just going to talk about the protocol. It talks about verbs: request and response verbs. The MessagingService doesn't know how to process the verbs, though; it delivers them to objects that implement IVerbHandler.
Just having an object isn't enough; we need a thread, so we map those message verbs onto stages, onto the thread-pool stages. This comes together in the MessagingService receive function. We take our message and wrap it in a MessageDeliveryTask; this is just a Runnable. We take the message type and we use that to get the thread pool we want this Runnable to go on, we drop the Runnable into the thread pool, and we go back and look for the next set of bytes to come off the socket.
This is running on the thread that's just reading from a socket for one particular node. In the MessageDeliveryTask we look at the time that this object was constructed, and if that was longer ago than the RPC timeout, and it's a droppable message type, we drop the message. So when you see dropped messages in the logs, they come from here.
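The drop-if-stale check can be sketched like this; it's illustrative Python, and the verb names and timeout value are assumptions.

```python
# Illustrative sketch of the delivery step: wrap the message in a task,
# drop stale droppable messages, otherwise hand off to the verb handler.
# The verb names and the timeout value are assumptions.
import time

DROPPABLE_VERBS = {"READ", "MUTATION"}
RPC_TIMEOUT = 0.05  # seconds

class MessageDeliveryTask:
    def __init__(self, verb, payload, handler):
        self.verb, self.payload, self.handler = verb, payload, handler
        self.created = time.monotonic()

    def run(self, now=None):
        now = time.monotonic() if now is None else now
        if self.verb in DROPPABLE_VERBS and now - self.created > RPC_TIMEOUT:
            return "dropped"  # these show up as dropped messages in the logs
        return self.handler(self.payload)
```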
If everything works, we get the IVerbHandler object, we call it, and we pass over the message; we're now processing your read in the read stage on the remote node. Quickly now, on to the database layer. First, the concurrent package: it just maps stage names to thread pool executors. These are the things you see in nodetool tpstats. In the database package, the thing that you think of as your keyspace is implemented in an object called Table. We have one instance of these for each keyspace, and we call open to get that instance.
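The open() pattern is a per-keyspace instance cache; a minimal sketch (Python stand-in for the Java class, with illustrative fields):

```python
# Minimal sketch of the open() pattern: one cached instance per keyspace.
class Table:
    _instances = {}

    def __init__(self, name):
        self.name = name
        self.column_family_stores = {}

    @classmethod
    def open(cls, name):
        if name not in cls._instances:
            cls._instances[name] = cls(name)
        return cls._instances[name]
```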
The Table has a function called getColumnFamilyStore, because the thing that you think of as your column family, or table, is implemented in a class called ColumnFamilyStore. This has the top-level functions that we use for doing reads and writes: getRow and apply. Over in our ColumnFamilyStore we have a top-level function to do reads, getColumnFamily, and we have an apply function. I'll skip down here, because I'm running a little late: we have ISortedColumns.
This is the guy that contains all your column objects. We have three implementations. ArrayBackedSortedColumns is thread-unsafe and used on the read path. AtomicSortedColumns has isolated updates and is used by the memtable. And TreeMapBackedSortedColumns is a thread-safe structure that we use in multiple places. I have to skip forward a little bit.
So, in summary: we come in through the API in a couple of different ways, but we always come down to creating ReadCommands or mutation commands and dropping those into the StorageProxy in the Dynamo layer. That uses response resolvers, async callback objects and the MessagingService to transport the ReadCommand into the correct thread pool, on the current node or on a remote node. From there it goes through the Table and the ColumnFamilyStore, and down to the SSTables and filters, which we didn't look at, which is where all the reads happen.