Description
Link to blog referenced in video: http://www.planetcassandra.org/blog/this-week-in-cassandra-3-0-storage-engine-deep-dive-3112016/
Jon: Alright, here we are with another This Week in Cassandra at Planet Cassandra. I'm Jon Haddad. Today we have Tyler Hobbs of DataStax, a committer to open source Cassandra. We also have Aaron Morton of The Last Pickle, also a committer, I believe. Yeah, an old school committer, pretty exciting. You know, that's great. We have some good stuff here today: we're going to be talking about the new Cassandra releases, and we're also going to be taking a really big look at the new storage engine in 3.0.
Jon: So first, let's talk a little bit about what's happened this week. The big news is we're looking at two Cassandra releases, 3.0.4 and 3.4. 3.0.4 is a bug-fix release, so it's going to have a bunch of stuff fixed in the 3.0 line: good stuff if you're already running 3.0 in production. And in 3.4 we've got our new features on our tick-tock release cycle: if the last number is odd, then it's bug fixes; if it's even, then we're looking at new features. So 3.4 is a feature release.
Tyler: Yeah, so I think by far the biggest new feature in 3.4 is the SASI indexes. Those were contributed by, well, one of the main people was Pavel, a developer at Apple and a Cassandra PMC member. These are, you know, a huge upgrade from what we have in Cassandra, in older versions of Cassandra, for secondary indexes.
Tyler: So basically, you know, in previous versions of Cassandra, a secondary index works as kind of a hidden second table that, for each indexed value, stores a partition containing the primary keys of every row in the indexed table that matches that index value. So it's basically just, you know, a primary key that we then use to go do another lookup on the base table, and that comes with, you know, a lot of inefficiencies, and it's very limited in terms of what operations it can support.
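(To make that structure concrete, here is a rough CQL sketch of the hidden table Tyler is describing. The table and column names are hypothetical, and the real index table is internal rather than user-visible.)

    -- Base table
    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        country text
    );

    -- Conceptual shape of the hidden index table for an index on country:
    -- one partition per indexed value, holding the base table's primary keys
    CREATE TABLE users_country_idx (
        country text,      -- the indexed value
        user_id uuid,      -- primary key of a matching base row
        PRIMARY KEY (country, user_id)
    );

A query on the index reads the partition for the looked-up value, then does a second lookup on the base table for each matching primary key.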
Jon: Operators, mmhmm. Yeah, so the interesting contrast to the existing secondary index implementation is that with SASI, effectively, you have an index written per SSTable. Whenever an SSTable is written to disk, you also get a B+ tree written, which is a really efficient storage format for doing database lookups, for range lookups. It's optimized for disk seeks. It normally can get problematic if you have a lot of updates and deletes; that's why you see issues with relational databases under insert-, delete-, and update-heavy workloads. But in our case, because Cassandra has immutable data files, you can actually have perfect B+ trees written to disk, and then they're never touched again. So from a performance standpoint, they don't slow down over time; you just generate a new one during the compaction process, whenever you write a new SSTable. And they support prefixes: there's a LIKE clause that you can add to your queries.
Jon: So, you know, the things that people are kind of used to in relational databases, where they want to do range queries and they're not a hundred percent sure of the queries they're going to do ahead of time: this will support a lot more flexibility. And it's, you know, it's really cool. I've played with this a bunch and I wrote a blog post about it; it's on rustyrazorblade.com. It's very, very cool stuff. It was really fun to actually use a LIKE clause and see results come back.
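(A minimal CQL sketch of the kind of query being described, assuming a hypothetical users table; the index class is the SASI implementation that ships in Cassandra 3.4.)

    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        name text
    );

    -- SASI is created as a custom index
    CREATE CUSTOM INDEX users_name_idx ON users (name)
    USING 'org.apache.cassandra.index.sasi.SASIIndex';

    -- Prefix matching with LIKE, which the older secondary indexes can't do
    SELECT * FROM users WHERE name LIKE 'jon%';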
Jon: The only thing that you have to keep in mind is that because they're B+ trees, they're memory-mapped files. You're definitely going to want to run this on systems where you have more free RAM available. It's not so much of an issue if you're somewhere like Amazon, where you can fire up a machine with, let's say, 60 gigs of RAM; you know, it's okay if 20 gigs of that is indexes. Totally fine. Yeah.
Tyler: I think the other thing to keep in mind is they still have some of the same caveats as the existing secondary indexes, right? They still don't make sense for indexing all types of data. You don't want to index, you know, an email address or some other unique value with these, because the fan-out when you query still has to happen: it still has to touch essentially every node in the cluster to build a query response.
Jon: Yep, yeah, the scatter-gather aspect is definitely always going to be problematic, and I think in this case, for your example, like emails, you would probably want to use the materialized view feature that got introduced in 3.0. And so, yeah, the interesting thing here is that materialized views, while they may be a little bit slower at write time, are going to give you a huge performance boost at read time.
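(A sketch of the materialized view approach for the email example, using hypothetical table and column names; this is the CREATE MATERIALIZED VIEW syntax introduced in 3.0.)

    -- Denormalize users by email so the lookup is a single-partition read
    CREATE MATERIALIZED VIEW users_by_email AS
        SELECT * FROM users
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, user_id);

    SELECT * FROM users_by_email WHERE email = 'jon@example.com';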
Jon: If it's something like email, where you can get a super fast lookup, it's going to be better than secondary indexes. So it's cool. I think we've got multiple tools that can solve different problems in different ways. Each one has certain trade-offs, but I think, as we get used to them, we're going to see some really good recommendations and advice come out, you know.
Aaron: This is critical to ever explaining Cassandra to anyone. Like, every time I've explained the write path and the immutable files and how the read path merges those things together, the ability to say, hey, let's insert a row, flush to disk, insert into the row again, flush to disk, look, I've got two copies there... You can see now that this data is on disk in multiple places, and you can see what the read path does. That's always been really important.
Aaron: I used this a couple of weeks ago, the new sstabledump, when I was looking into what happens when we drop a column in CQL, and it just works. It does a really good job of outputting things, and it's really useful if you're trying to understand what's happening. I wouldn't use it as a way to back up or export your data.
Aaron: I think there are much better ways to do that. But as a "what's going on here" tool... I've been doing this for a few years now, and a couple of times we've had to get someone's SSTable, pull it out, convert it to JSON, take out that tiny little piece that somehow made things crash, back in the day, and then put the SSTable back. So every now and again they're useful for that. I don't think that's so much the case now; it's mostly a learning tool, and for that it's invaluable, yeah.
Tyler: And I guess it's good to point out that sstabledump doesn't have a sort of inverse operation for loading it back in, right? We don't have a json2sstable equivalent anymore. But yeah, I agree with Aaron, it can be really instructive just for looking at how the data is stored on disk, you know, from a teaching perspective, but also for doing support and operations.
Tyler: That really can help you understand things. So somebody recently on the mailing list wondered why a partition was so large even though it had relatively little data in it, and just by dumping the SSTable they were able to see that it had, whatever, 10,000 tombstones that they didn't know existed. So being able to see that sort of information can be really helpful in debugging different types of problems. Yep.
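(For reference, sstabledump is run against an SSTable data file and writes the JSON representation to stdout; the path below is just an illustrative example of the 3.x on-disk layout, not a fixed location.)

    sstabledump /var/lib/cassandra/data/my_ks/my_table-*/ma-1-big-Data.db > partition.json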
Jon: Very good stuff. So we were talking a little bit about, you know, we've got 3.4 out. Looking forward, since I don't really get the opportunity to have committers, you know, in the room very often, or in the virtual room: Tyler, would you tell me a little bit about some of the stuff that you're working on for future versions of Cassandra? What do you got for me?
Tyler: The future is bright for Cassandra. So, off the top of my head, you know, one of the things that is in progress right now is non-frozen UDTs. We're looking at being able to store those split across multiple cells. Right now we essentially force them to be serialized into a single cell with the frozen keyword, so we're making that optional.
C
Now
so
that
they'll
be
stored
across
multiple
cells,
and
that
allows
you
to
update
each
field
in
aedt
separately
or
individually,
and
it
can
also
allow
you
to
optimize
the
read
path.
If
you're
only
selecting
a
single
field
from
the
UDP,
you
don't
have
to
deserialize
the
entire
thing
mm-hm,
so
it's
kind
of
a
fun
one
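(A sketch of what that looks like from CQL once non-frozen UDTs land; the type and table here are hypothetical. With frozen<address>, the whole value is one cell and has to be rewritten as a unit; without frozen, individual fields become addressable.)

    CREATE TYPE address (
        street text,
        city text
    );

    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        home address           -- non-frozen: one cell per field
    );

    -- Update a single field without rewriting the whole UDT value
    UPDATE users SET home.city = 'Austin'
    WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;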
Tyler: I'd say by far the biggest change that I'm working on right now is part of the work to switch Cassandra from the SEDA model it uses right now, the staged event-driven architecture, to a thread-per-core model.
Tyler: So this will have really big performance implications for Cassandra. Really, what we're looking to do is eliminate a lot of the overhead that we see from context switches. Right now Cassandra uses tons of different threads split across a lot of thread pools, so it's constantly doing context switches, which is expensive in terms of CPU caches.
Tyler: Things like that can really help out the throughput performance of Cassandra. So it's a massive undertaking, and we're looking to do it bit by bit. Right now we're focusing just on the read and write paths, but I think by 3.6 or 3.8 we might start seeing some of the first parts of this be released into the wild, and we'll see how it works in real life. Nice, yeah.
Jon: Yeah, that's a big project, yeah. It sounds like a lot of rewriting, but yeah, you definitely see the overhead of context switches and locking everywhere. If you ever, like, follow a program with strace, you can just see it's mutexes all over the place, and getting rid of that, making it more optimized, will definitely improve performance overall. So I'm actually very excited for that.
Jon: So we've got 3.4, a pretty cool release, and we've got some really interesting stuff coming in the future. Let's back it up a little bit and talk about the 3.0 storage engine. So, Aaron, you just wrote a blog post on the 3.0 storage engine; you can find the link in the This Week in Cassandra blog post.
Jon: It's very detailed. This is something... Aaron, I love your attention to detail in stuff like this. I saw a talk that you gave maybe three years ago, when I was first learning Cassandra, on the write path and the read path and just how Cassandra works, and it's good that you didn't get lazy and give me something less than that, so I appreciate that. So, yeah, I don't know, what can you tell us about the new format? Like, what are some reasons why this thing exists?
Aaron: Well, the extensibility that's been added into the platform is huge, and it had all been done with some nasty hacks on the existing storage engine. The biggest one, I think, was that the existing storage engine had no concept of a CQL row. Rows were something that was kind of hacked on top of the internal storage engine row, which came to be known as a partition. So just that, as a basic, fundamental thing, to say: hey, we now know what the data model is.
Aaron: Let's efficiently store that in the storage engine. It's led to a bunch of improvements that, again, we're probably not going to see all of the impact of for a while. There's a really great post that Sylvain did when this first came out that explained the impact on how it can reduce the on-disk size, and there's a lot of stuff in that blog post I'd point to around understanding things like: hey, every cell that we put on disk has a timestamp.
Aaron: What about if we record that relative to an epoch for that SSTable, rather than the UNIX epoch, for every cell in it? So if you've got a time series data model, for example, when we flush to disk we're going to record something that says, right, the lowest timestamp we have here is twelve o'clock.
Aaron: So if I want to record the timestamp that's 12:01, all I need to record is that it's 60 seconds, or 60,000 milliseconds, whatever, higher than that twelve o'clock timestamp. All that type of stuff, combined with variable-length int encoding, means that on disk it's a lot more efficient. And if you have a look at that post, and there are plenty of links in there to go to the code, you can really get a feel for the idea of it.
Aaron: Now we know what's actually going onto disk. And if you look around at some of the old examples of how to explain CQL 3, it used to be: all right, here's CQL 3, now I'd better explain how it's stored in the storage engine, and that's really complicated, because, look, there's this cell here in the storage engine that doesn't have a name and doesn't have a value, but that's important, like, just trust us on that, yeah? And we were talking earlier, Tyler, and we were saying: we don't repeat all of your clustering keys anymore. The values of those used to be repeated for every non-primary-key cell in that row, and that doesn't happen anymore. All that information is stored once, it's so much more efficient on disk, and it really sets the groundwork for the next couple of years. It's really exciting, yeah.
Tyler: One of the cool things is that, you know, because we had so much redundant information in older Cassandra versions, compression made a really big difference; it would take care of a lot of those issues, or at least, you know, mitigate them. But yeah, if you look at that blog post by Sylvain that Aaron mentioned, you can see the new storage format is so efficient that in a lot of cases it's smaller than the compressed SSTables from the previous version, even without compression.
Jon: Well, one of the things is it's nice to just have the schema encoded separately, right? Like, that used to be just a separate ticket, right? It was like, no, no, no! The fact that your column, like, the name of your field, can dramatically increase the size of your SSTables is just totally ridiculous, absolutely nuts, and I mean that alone is a huge win. And then you talk about not repeating certain values, like TTLs or timestamps, or encoding them as a delta from the first timestamp that you saw for that particular row. I mean, the savings are huge. I think I remember seeing, like, for certain-size tables, you can see like a tenfold reduction, right? Like, that's absolutely crazy, to be able to see an optimization like that. Like, when do you ever get something that gets ten times better? Never. So...
Tyler: I was going to say, I mean, you know, it depends on the workload, but especially if you're storing a lot of small values, especially, like, you know, a single row per partition, or even a wide-row kind of format: if you don't have large values that you're storing, you'll see a massive reduction in size on disk with the 3.0 format, yep.
Aaron: Yeah, I think all the way down to the individual cell storage. You know, whereas previously we'd have the timestamp in every cell, if all the cells in the row have the same timestamp, we just have it at the row level, and that's that simple understanding of knowing that these things are all collated, that all these things are together.
Aaron: It saves space and saves reading off disk. Down to things like, I can't remember if this was already in the 2.0 engine, but we know when different data types are fixed-width and when they're variable-width, and booleans are just encoded as a byte, and then if there are three booleans together, it's just three bytes in a row. And we know, when we go to read that off disk, what actual columns are in that row and what order they're in, and so fixed-width things can be read very efficiently.
Aaron: Obviously. And then the complex cells, which are things like UDTs and collection types: previously these were frozen, as Tyler was talking about earlier on, and now they're non-frozen and they're so much more extensible, and they'll support the type of feature that Tyler was talking about with non-freezing UDTs. So you take your idea of: here's my column that's defined in my table, and it's really a list or a UDT or whatever it is. When it gets to the storage engine, it now explodes into a bill-of-materials type approach, where it's like: okay, here's my column, it's a cell, yet my cell is made up of other cells. Each of those cells is then encoded as a cell inside of that one cell, and it is then individually addressable. That type of feature wasn't around previously. Tyler, you were talking earlier about a really interesting point, which was dealing with dense versus sparse tables and optimizations for those, yeah.
Tyler: One of the kind of cool things that Sylvain designed into the new 3.0 storage format is that it will switch to using a different storage format based on whether the set of columns that are actually used is sparse or dense. So, for example, if you, you know, only have ten columns in your table and you pretty much always write to all of them, we're going to use this dense format that's optimized for that.
Tyler: On the other hand, if you have, say, a thousand columns defined for your table and each row normally only has two or three columns actually set, we switch to a different format that's specifically optimized for that as well. So, sort of on both ends of the use cases, we've got a more efficient format than we had in 2.0. So it's kind of nice to just optimize for these different ways that people use Cassandra.
Aaron: Yeah, that was really interesting as I was reading through the code on that: it's in how the cells are encoded. If they're mostly there, then we encode it one way; if they're mostly missing, then we'll go a different approach. And similarly, the surprising bit there: if your number of clustering keys is above, I believe it's 32, then they're encoded in a different way than if it's less than 32; they get put into chunks of 32 clustering keys and encoded. So, really, a lot of attention to detail, dealing with things that nowadays we might laugh at, but later on we might be like: wow, that's really great, it handles the 128 keys in my clustering index. That's great.
Tyler: You know, there are a lot of features in there that we don't really utilize now, but they set a nice foundation for a lot of features that we're interested in building soon. So hopefully this format will, you know, help us to kind of grow and support new features efficiently, without resorting so much to the old hacks that we kind of had to build on top of things like the 2.0 storage engine.
Aaron: There's a really, like, macro-level thing in here, which is that the entire on-disk format is now pluggable. So previously there was one way we wrote to disk, and that was it. Now there's an interface, well, a factory and an interface, a whole mechanism in there, to say: hey, let's try a different on-disk format. Like, whoop-dee-doo, let's go and encode this stuff as Parquet or whatever, you know, as an experiment. And that's a nice, big, top-level extensibility point that probably didn't exist before. And I think all of the extra knowledge in there will lead us to be able to improve performance over time. You know, we know what cells are encoded into each SSTable, right? That's some useful information we could go and do things with. And the extensibility around the encoding of cells on the disk will mean that, I think, for the UDTs and all those types of things, we could probably see some more activity in that field, and more complex predicates pushed down to the disk, absolutely.
Tyler: That's a good point. Like, you know, the non-frozen UDTs, those are so much easier to do with the new storage engine. There was a ticket that got talked about on here a couple weeks ago about optimizing the number of disk seeks that we do based on SSTable metadata, if you have, like, a LIMIT 1 on your query; that's something that would have been really insane to do with the old storage engine but is easier now. Doing per-partition limits is something else.
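(For illustration, a sketch of the per-partition limit idea against a hypothetical time series table, using the PER PARTITION LIMIT syntax that later shipped in the 3.x line.)

    CREATE TABLE sensor_readings (
        sensor_id uuid,
        reading_time timestamp,
        value double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC);

    -- The latest reading for every sensor, one row per partition
    SELECT * FROM sensor_readings PER PARTITION LIMIT 1;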
Jon: Yes! All right, guys, well, this has been pretty fun. I'm pretty sure that people are going to want to hear more about the SSTable format and learn more about it. I definitely recommend, as I said before, reading Aaron's blog; it's awesome. It's linked in the blog post that accompanied this video on Planet Cassandra, so definitely check that out and get into it. There's lots of really good stuff coming up, and stuff that has just been released, so it's been a pretty fun week.
Aaron: I was going to say, I'm in San Francisco, and The Last Pickle is going on the road next week. So Nate is going to be talking at the Cassandra Day in Atlanta on Thursday next week, and then I'm going to be talking in San Francisco at the San Francisco meetup on Monday week about CQL 3 and the 3.0 storage engine, and then talking down at the South Bay meetup about how to back up Cassandra. Great.