From YouTube: Cassandra Summit Europe 2014 Keynote
Description
Speakers: Billy Bosworth, CEO at DataStax; Jonathan Ellis, CTO at DataStax & Apache Cassandra Chair; Zohar Melamed, Director and Technical Fellow at Credit Suisse; Jim Anning, Head of Data & Analytics at British Gas Connected Homes.
Narrator: Electronic banking is safer, entertainment is streamed in the blink of an eye, and social media feels a lot less crowded. It's a new world where online applications are deployed with maximum velocity and cost-efficiency. Transforming data into real-time insights that fuel innovation, becoming an Internet Enterprise, is a modern-day imperative that creates dynamic interaction and makes every customer feel like they're your only customer. Join the world's most innovative companies: become an Internet Enterprise with DataStax.
Billy Bosworth: Good morning, everybody, and welcome to our European Cassandra Summit of 2014. It is really, really great to have everybody here today, and it's a pleasure to be here. We come from the Bay Area in California, not just for the warmth and the sunshine of London, but also because there is really something special that's starting to happen here in terms of the Cassandra community and the momentum that's being created by everything that you all are doing, with the fascinating things that we see from the use cases from customers of all shapes and sizes.
And if you think about just where we are today versus just a few years ago, it's pretty amazing. We know of well over six thousand active community members right now here in this area, and we estimate there are probably four to five times that many who are not yet part of our lists and our meetup RSVPs. This represents an explosion from where we were just a few years ago.
It's the ecosystem: getting things to grow and expand quickly takes talent, and the talent that we have here in the room today represents the catalyst that can help grow this industry. The talent getting together right now really represents the power of creation, if you think about it, because you have the talent and the stuff to come together and then explode out there into an amazing new world of technology and an amazing new world of applications. This is real.
This isn't theoretical. I get to hear about so many things that are happening and how industries are changing, and you're going to hear a little bit later about how actual lives are being changed by what you are doing. You're creating this new world, and it's fantastic and it's exciting; it's probably the most exciting time that I can remember. But we owe this moment to a lot of different people like yourselves, who have been here through varying stages of our growth.
So it would be a little American of me not to take a moment and thank some people who actually got us here, and I'd like to do that on a timeline, looking at it in terms of groups of people who've been working with Cassandra for years. If we think about our first group of people, these are our old-timers, represented by the lovely and talented Mr. Jonathan Ellis here on the screen, the Apache Cassandra chairman. This group represents pioneers, and if you're in this group, we owe you a tremendous debt, because you had the foresight and the fortitude at a time when anything was up for grabs. Technologies were springing up all over the place, but you dedicated yourselves to making Cassandra something really special.
And that brings us to a second group of people, represented by folks like Emily, who joined us a couple of years ago. If you have more than a couple of years of experience with Cassandra, you're a card-carrying veteran, because this industry moves so quickly. What's great about this group is that you took the technology from the shadows and brought it into the light, where it could be seen and start to be propagated and used for all sorts of new things.
That probably means something very different to you than it means to me, but I was a college football player (you would say "university" here), and in the United States college athletics are a very big deal. I was fortunate enough to have a scholarship and I played American football, and my friend group looked something like this. These are the types of folks I hang out with.
Then something happened that would actually shape the whole destiny of my career for the next 20 years, and it was this. On the one hand, if you're old enough to remember this piece of hotness, this is an IBM ES/9000, and it was all the rage back in nineteen ninety-two. It was an innovative and powerful piece of hardware that ran a database called IMS, and IMS had been around for over 20 years and was the established standard.
This is where I was told my future was: as a computer science major, put your time, talent, and energy into this world, because this is where all the mission-critical stuff happened, and that was not going to change. But there was a problem. There was a new technology emerging, something called a SPARCstation; we used to call them pizza boxes. It ran something that represented a shift from the way we thought about working with our data, and it wasn't so much the SPARCstation or UNIX that infatuated me.
It was this technology called Oracle, because in 1992 Oracle released version 6, and in version 6 some really groundbreaking things started to emerge, called referential integrity, and functions and procedures and triggers. Suddenly we started seeing the world a little bit differently in how we related to our data. But make no mistake about it: I lived at that time in what's called the Midwest of the United States, and the way technology waves occur in the United States is very coastal, so the West Coast is often the birthplace of much of the technology.
They gave us a new way to look at how our data related. So I, as a person, would be in a database, and I would be part of an ER diagram. I would have attributes about myself, date of birth and address; I might shop at a store, and that store would have attributes, and there would be products in the store, and we all related together.
This is old news to most of you in the room, but some of you in the room are probably so young you're not even sure what an ER diagram is; go look it up on Wikipedia, it's pretty interesting stuff. But something is changing. There's a fundamental change that's occurring right now, and there are a lot of reasons for it, but I think, at the core, if you think about it, the old way of scaling things up, of trying to fully distribute something while at the same time making sure that it stays radically connected, is challenging. But this is the essence of what Cassandra is, and if you start to peel away the layers and get behind all the smoke and mirrors of NoSQL and what it is all about, the deeper you probe into a technology like Cassandra, this is what you find at the core.
So when I think about 20/60/20, I think about it in terms of a bell curve. You could take about twenty percent of your workloads that I think are going to be statically very well suited to relational systems for the next 15 to 20 years. We probably won't touch them. There are systems that I wrote in 1993 and '94 that are still in place today, and they won't change for another five years. That's fine!
A really interesting company that we work with today is called Amara Health Analytics, and they do something special around the condition called sepsis. Sepsis is a blood disorder that occurs when an infection in your body gets so severe that your body starts shutting down its organs, to the point where you will eventually die from it. Unfortunately, I have some first-hand experience with this, with a family member. He's okay, he made it through, but it was a horrible experience for us, because I still remember the doctor walking in and saying: we've induced a coma, we're going to leave him in the coma for three or four days, we're going to pump him full of antibiotics, and we're going to hope that we can relieve the symptoms of the sepsis, but his body has to do the work. And I remember at the time, this is about six or seven years ago, asking: is there really nothing more
B
we can do? And the doctor looked at me and said: the problem with sepsis is you can't treat it until you know you have it, and you don't know you have it until you have it, and once you reach that point it gets too difficult to treat in any effective way. But Amara Health started with this problem and said, we have a better way of doing this. They began with the notion of, obviously, a patient: a patient walks into a hospital.
They have all the normal information that goes nicely into a relational database, but immediately they said: we need different types of data. We need the data from the healthcare providers. We need those nurses' notes, those doctors' notes, those things that create semi-structured or unstructured data. That was the first disqualifier from the old world, because it didn't fit very well in that type of model. But then there's the next thing they do.
When you start to get sick, they start plugging sensors on all over your body, and those start feeding information. And what Amara Health said is: if we can look at a patient not in discrete points of time, but instead actually walk through that patient's entire stay through time, and analyze it through time, what we can do is get to a correlation point that a human could not; they can't compute it fast enough.
There are too many variables. We can alert the caregiver that a sepsis condition is likely to occur in the next X hours or days, and now they can intervene early. But this notion of being able to handle that kind of data flow, and dealing with data through time instead of in points that are static and discrete, changed the requirements for their system. And that's why I said at the beginning: this isn't just industry-changing or career-changing.
This is going to alter the very nature of our lives in terms of how we do things with our data. So that brings us to this middle 60. The middle 60 is what I call the jump ball, and that means there are systems out there today, traditional systems, that have run on traditional applications for a long time.
But the challenge for many of your companies here today is: how do I compete in an era where everything is changing so quickly around me, and where the disruption is occurring largely from the internet companies? They were the progenitors of most of this technology. They got to start from scratch.
They didn't have the legacy applications to deal with behind the scenes, and so they are disrupting nearly every industry. You're going to hear in just a few minutes about very traditional industries being radically disrupted by this approach. And so the real challenge is: how do you take a traditional company and find a way to deliver the potency and the power and the data nimbleness of an Internet company? That is a non-trivial problem.
It requires new and different ways of thinking. It requires new and different technologies to make it happen. But it is absolutely possible, and it's happening around us now at an accelerated pace, and this is what you're going to hear a lot of over the next few days. One example is a grocery store that we work with. They do a little tour with us, where we go out and talk to different companies and people, trying to learn, and they took out their old loyalty card, which at the time was a little barcode you would scan to check out.
You would scan it again; that's that whole thing about point-in-time and discrete data. And then he tosses it aside and says, that's not how we do it anymore, pulls out his phone and says: this is your new concierge. You're doing your shopping now; you keep your list, you keep your offers, you keep all of the things you want in your hand, and when you walk into the store it even goes one step further (thank goodness for people like me).
It actually navigates you through the store to the shelf, to pick up your things along an optimal path, so that you don't waste any time at the shopping location. That's fantastic! That's a really old industry, grocery shopping, that's being upset by this technology and radically changed at the same time. And so this is really the challenge for you, and you have this power, as I said, to create this new world. You are going to write our future. I don't know exactly what the future is going to look like.
Make this happen. And with that, I'd actually like to move to hearing from a couple of others who are in this with you and who are doing some really cool things in some new ways. I'm going to start with, as I said, a very old industry, the financial industry. Would you please join me in welcoming to the stage Mr. Zohar Melamed. [Inaudible greeting.] How are you? So, Zohar?
Zohar Melamed: Well, as it says, my name is Zohar. I work for Credit Suisse, a big global financial institution; I work in the investment banking division, and I'm the CTO for the equity derivatives business globally. My speciality within the organization is risk management technology for the equities business globally. I think, interestingly, what you just said a couple of minutes ago really chimed with me. That is my official job description, you know, that's what's on the blurb, on the bio, but if you really ask what I do...
What I actually do, you know, over the last six or seven years, is be a bridge between the Internet (the scale, the dynamism, you know, the economics) and the internal reality of, you know, a large multinational IT shop, bringing that kind of technology to bear on the problems that we solve, through the difficulties of doing it in a large organization. You know, that's the reality of what we do.
Billy Bosworth: Being in a bank is fascinating to me, because, a, it's a complex institution with a lot of different, complex offerings and services, but, at the same time, we often see the financial industry leading the way with some new technologies. Yet you also have to manage an incredible amount of risk, because you can't afford to make the same mistakes that, say, a grocery store could make. Yours are going to be a lot more visible and a lot more painful.
Zohar Melamed: I think it was probably inconceivable that we would even consider that, not just technically, in terms of, you know, what the comfort zone was for the engineers, but certainly organizationally. You know, it was a very, very big ask. But I think there are two factors there. One is, when you look at the technology that is there: there are some problems that we've historically encountered for years, you know, building these complex systems, which this technology really puts to bed for us and really gives us a leg up.
And I think the other thing: when we sat down and looked at it, we said, well, you know, we want to build a system for the next 20 years, not for the previous 20 years. And when you look at it and think about it that way, whatever organizational standards you happen to have in place at that point in time are really just a reference point, and you have to look at them and say: is that really the way to go? Where will I be in five years' time?

Billy Bosworth: [Inaudible question.]
Zohar Melamed: No, there is no polite disruption. I think it always represents a challenge, for the organization as much as for the engineering team, and I think inevitably, in a large organization, innovation and disruption of this kind happens bottom-up. It's very, very rare for a traditional organization, a large corporation, to have this gentle disruption coming down from the top, where someone makes that decision and it's communicated to the troops and people go and follow it. And the challenge for a large organization is: how do you integrate that?
C
How
do
you
make
that
happen
within
a
very,
very
tight
within
a
very
highly
regulated
and
audited
and
controlled
environment?
You
can't
take
any
risks,
as
you
quite
rightly
said,
it's
not
a
grocery
shop
and
we
have
to
work
within
a
very,
very
tight
framework
of
controls
and
at
the
same
time
you
don't
want
to
stop
innovation
things.
You
know
if
there's
one
thing
that's
guaranteed
is
change,
nothing
else
is,
is
guaranteed
and
so
I
think
as
an
organization
from
a
technology
perspective.
Over these five years we've been in this transition to a grassroots innovation model, where the innovation model is much more federated, and we look at people experimenting and innovating, and we'd like to give people more space to do that, and at the same time move away from, you know, some illusion of control that you have in a large organization to a reality of visibility: really seeing what people are doing, whether it is working, and being able, very, very quickly, to try it out and say, okay, this is good for us.
Billy Bosworth: You hit on something there that, I can tell you, is so incredibly common, something we're finding in more and more organizations: that illusion of control versus the reality of visibility. If you were to pass on some of the wisdom that you've learned going through this for the last three to five years, give it to us on two levels: first, maybe, from the business perspective, because what we're seeing now is that this is as much about business problems as it is technology problems, and then maybe from the technology side.
Zohar Melamed: It's really just about understanding, when you take the risk, what kind of risk you're about to take, and that there is a community that you work within. I think historically people felt much more comfortable going with a product that has, you know, a vendor behind it. It's the same today, but I think we like to see that together with a large community, a large open-source community, and I think that's played extremely well for us on the technical side.
Billy Bosworth: That's great. We've had the privilege of working with Zohar and his team for a while now, and those nuggets that we get from people who have lived through this are becoming more and more common. This next individual is somebody who confronted that same problem at British Gas, and they took a federated approach; boy, did they really take a federated approach. So please join me in welcoming to the stage Mr. Jim Anning.
B
So
Jim
had
a
great
opportunity
last
night,
with
Jim
at
our
executive
track
to
go
through
a
full
presentation.
I
will
tell
you
guys
in
advance
I
told
him.
I
was
going
to
throw
them
under
the
bus
if
you
can
track
him
down
later
in
the
day.
To
he's
got
some
amazing
slides
of
some
stuff
that
they
do.
Is
data
scientists
in
their
world
but
before
I
get
there
and
get
you
harassed
all
day.
Give
us
a
little
bit
of
background
on
who
you
are
and
what
you
do.
Yes,.
Jim Anning: I'm Jim Anning. I head up data and analytics at this amazing organization called Connected Homes, which was set up by British Gas two years ago. Career-wise, it's been pretty eclectic. I spent the first half of my career doing some very traditional corporate stuff around technology change. Then, in a mid-career crisis, I went off, rented a house in Spain, taught myself to code, came back to the UK, and founded a startup; three and a half years later it bombed out. I'm going to look back on it as the longest and most expensive training course of my life, an education.
But during that time I got really interested in data and data visualization and what we could do with that. Then I met the guys who were setting up Connected Homes, and, you know, boy have they got some interesting data. So I've been there for the last 18 months and it's been a blast.

Billy Bosworth: [Inaudible question.]
Jim Anning: I mean, British Gas is, I think, about 185 years old, so, you know, in terms of an established company, it looks like what you would think. But I think maybe three years ago they were starting to see this new category appearing, the connected home, and clearly there are correlations between being an energy company and, you know, being in people's homes and being a company that people trust to come into the home.
They have 10,000 expert engineers in British Gas who visit 50,000 homes a day, so they already had that presence in the home. So a Connected Homes proposition kind of made sense. But what they did was amazing, really. They set up this organization called Connected Homes and made it a very, very separate thing: separate offices, attracting, you know, different talent; it's about a 30/70 split between people who came from British Gas and people who came from outside. And they've just created this amazing environment where we have the freedom to innovate.
That's Hive, Hive Active Heating. So tonight, I've got no idea what time I'll get home, but I know that, you know, about 15 to 20 minutes before I get home, on the train, I can switch my heating on; my house will be warm, I'll have hot water, and I won't have spent a lot of money heating an empty home. And I think what we're finding is, as people get this technology, they quickly come to just expect it.
You know, why wouldn't you want to be able to control things in your home from your phone? I think that experience has now kind of led us to think about, you know, what all the other things are, and there's a really exciting future with that product. One of the other things we do is the Smart Energy Report: we take data from smart meters, and then we use that data to try to help people get a kind of real, in-depth understanding of how they use their energy.
Billy Bosworth: What you're doing, and what Zohar is doing, are so representative of this Internet Enterprise concept we talked about, of being the conduit for making that happen. You took it to an extreme and started a whole new sub-company around it. What challenges did you face along the way, in the very brief journey thus far with Connected Homes?
Jim Anning: I mean, I think, you know, because of what they did in setting up a separate thing, we haven't had to face some of the challenges that I guess other people in the room have. So, you know, we have product guys sat next to technology guys; we don't refer to, you know, "the business" as a separate thing. So I think that's really, really helped.
Clearly we're doing something new. Clearly there aren't, like, hundreds and hundreds of thousands of people out there that have done this. So, you know, we have had a bit of a roller coaster in terms of the technology, but, you know, we're getting there, we're resolving it. I'm really lucky to have a, you know, really smart team of people working on those things. So, yeah, I think, you know, we've got to work through the challenges; we're iterating fast, it's okay to fail, and we learn fast.
I mean, in terms of technology, if you take this Internet of Things thing, and, you know, depending on who you talk to, people are saying that by 2020 there might be 50 billion connected devices, or there might be, you know, 200 billion connected devices. So taking that, and then adding to it, you know, the volume of data, adding to it the kind of data science you want to do, we kind of get this sweet spot in the middle of those, which for us is Cassandra and Spark.
So we're kind of really excited to be using that technology. In terms of what's next: if you think about it from a customer's point of view, it's kind of two things: what's happening in my home, and put me in control. So, you know, building a product that says, you know, "Hey Billy, your central heating broke down yesterday" is futile, because you're going to be, "Yeah, I know, I'm cold and had a cold shower." But building a product where we can say, you know, "Hey Billy:"
"It looks like your central heating boiler is going to break down next week. We've got an engineer in your area next Thursday. Are you around? Can we come round and fix it?" That's a really great customer experience. And then, from the energy perspective, you know, right now we're in the kind of game of "last month, this is how you spent", you know.

Billy Bosworth: [Inaudible question.]
Jim Anning: I think, for me, it's working with the right people. When you're innovating, when you're trying to do things fast, you know, you need to have a team around you who are, you know, small, with smart technical chops, but people that aren't afraid to experiment, people that learn fast, and people that are okay working in that environment where, you know, sometimes you're going to fail. And I've just been so lucky to be able to get that team around me, and lots of them, I think, are in the audience today.
Billy Bosworth: As I said, we're getting to watch this innovation happen in industry after industry after industry, and this concept of taking those traditional enterprises and allowing them to move and act with the speed and the power of the Internet companies is what is really shaking up everything in the marketplace. And, speaking of those pioneers we talked about: Mr. Ellis has entered the building.
B
He
had
some
travel
problems
and
he
arrived
just
about
an
hour
ago,
I
think
so
he
is
going
to
be
bright-eyed
and
bushy-tailed,
but
Jonathan
and
I
have
a
very
great
relationship
for
years
ago
about
three
and
a
half
years
ago,
he
asked
me
to
join
the
company
as
CEO
we
were.
We
were
pretty
tiny,
but
that's
a
lot
of
trust
for
an
engineer.
Jonathan Ellis: Feng Qu at eBay summarized this in three points: Cassandra gives you a peer-to-peer architecture (unlike master-slave, there's no need for failover; when things go wrong, there's no downtime), linear scalability, and multi-data-center support. And I think these three reasons are at the core of why Cassandra is the best choice for building Internet of Things, web, and mobile applications today. And if I were to simplify that even further, I would start with that first one: Cassandra's peer-to-peer architecture is where everything else comes from.
I want to dig into that just a little bit. In a Cassandra cluster, a client can ask any Cassandra node to route its request to the nodes that have the data. So, in this case, I have three replicas that have the data I'm interested in, and as the coordinator sends out requests to those replicas, it tracks how quickly they're responding. So it has an idea of which nodes are busiest at a given point in time, and it can send the request to the replica that is least busy.
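That least-busy routing can be sketched as a latency tracker that keeps an exponentially weighted moving average per replica and picks the minimum. This is a toy model of the idea, not Cassandra's actual dynamic snitch implementation:

```python
class LatencyAwareRouter:
    """Toy model of a coordinator picking the least-busy replica."""

    def __init__(self, replicas, alpha=0.3):
        self.alpha = alpha  # weight given to the newest latency sample
        # Exponentially weighted moving average of latency per replica (ms)
        self.ewma = {r: 0.0 for r in replicas}

    def record(self, replica, latency_ms):
        # Blend the new sample into the running average
        self.ewma[replica] = (self.alpha * latency_ms
                              + (1 - self.alpha) * self.ewma[replica])

    def pick(self):
        # Route to the replica with the lowest observed latency
        return min(self.ewma, key=self.ewma.get)


router = LatencyAwareRouter(["node1", "node2", "node3"])
router.record("node1", 2.0)
router.record("node2", 9.0)   # node2 is busy (e.g. a GC pause)
router.record("node3", 3.0)
print(router.pick())  # node1
```

In practice the coordinator also decays stale samples over time, so a node that has recovered starts receiving traffic again.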
Halfway through this, I killed one of the nodes, and this is what happens without rapid read protection: there's about a 10-second window until the failure detector realizes that, yeah, that node is down and it's not coming back, and then requests start getting redirected to other replicas. With rapid read protection, there is no window of downtime. We do take a brief performance hit, because we had all those, you know, fifteen thousand or so requests that were in flight, which now need to be retried all at once.
E
So
you
do
take
that
performance
it
as
that
set
of
requests,
gets
redirected
to
the
to
the
other
replicas,
but
it
Rick
can
see
that
the
the
cluster
recovers
quickly
and
there's
no
period
where
no
progress
is
being
made
now.
One
other
thing
that's
interesting
about
this
graph-
is
that
this
green
line
representing
the
the
rapid
read
protection,
not
is
not
just
better
when
things
go
badly.
Wrong
have
I
killed
that
node,
but
I'm
peaking
at
a
higher
throughput,
not
not
by
a
whole
lot,
but
by
a
little
bit.
E
So
what
what's
going
on
there
is
that
rapid
read
protection
is
able
to
retry
not
just
request
to
the
the
node
that
failed,
but
whenever
a
request
is
slow
because
there
was
no,
there
is
a
GC
pause
on
the
replica,
or
maybe
that
replica
just
happened
to
have
a
load
spike.
When
my
request
to
it
I
can
we
try
those
requests
as
well
to
another
replica,
so
what
this
gives
us
is
a
system
that
gives
you
very
reliable
performance.
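The mechanism amounts to speculative retry: wait on the preferred replica only up to a percentile-based threshold, then fire the same read at another replica. A simplified, self-contained model of that logic (simulated latencies, not the real read path):

```python
import statistics

def read_with_speculative_retry(replicas, latency_ms, history):
    """Toy model of rapid read protection.

    replicas:   replica names in preference order
    latency_ms: per-replica response time for this request; None = node down
    history:    recent latency samples, used to set the retry threshold
    """
    # Retry anything slower than the 99th percentile of recent reads
    threshold = statistics.quantiles(history, n=100)[98]
    primary = replicas[0]
    t = latency_ms[primary]
    if t is not None and t <= threshold:
        return primary, t          # fast path: no extra request needed
    # Primary is slow or dead: speculatively resend to the next replica
    for backup in replicas[1:]:
        bt = latency_ms[backup]
        if bt is not None:
            # We waited `threshold` before giving up on the primary
            return backup, threshold + bt
    raise RuntimeError("no replica answered")

history = [1.0] * 99 + [2.0]   # recent reads: mostly around 1 ms
# node1 never answers; without speculation we would wait ~10 s for the
# failure detector, with it we only lose the ~2 ms retry threshold
replica, total = read_with_speculative_retry(
    ["node1", "node2"], {"node1": None, "node2": 1.2}, history)
print(replica, round(total, 2))
```

The trade-off is a few duplicate requests in exchange for bounding the read tail at roughly the retry threshold plus one healthy replica's latency.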
So what I have here is a production Cassandra cluster we're looking at now (this isn't just my tests in a lab), over a week of data. The bottom line, the purple line, is my median response time, my average response time; that's fluctuating between 700 and 900 microseconds. And you can see that this application has a periodic component to it: when the users are awake and active, there's more load on the system. So that's a normal part of this application.
The top line here, the blue line, is my 99th percentile response time; these are the slowest one percent of my requests. Those are still coming in at about thirteen hundred microseconds, so within about fifty percent of the average. So this is what Cassandra's peer-to-peer architecture gives you: since there's no bottleneck of a master replica, there's no single point of failure, and I can spread those requests out to whichever replica is responding fastest and deliver that consistent performance.
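Both curves are just percentile summaries of the raw per-request latencies; with made-up numbers in microseconds, they can be computed like this:

```python
import random
import statistics

random.seed(42)
# Simulated read latencies: most around 800 us, plus a small slow tail
samples = [random.gauss(800, 60) for _ in range(9_900)]
samples += [random.uniform(1_100, 1_400) for _ in range(100)]

median = statistics.median(samples)             # the purple line
p99 = statistics.quantiles(samples, n=100)[98]  # the blue line

print(f"median = {median:.0f} us, p99 = {p99:.0f} us")
```

A tight gap between the two numbers is the "reliable performance" claim made concrete: the tail stays close to the typical case.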
E
So
what
I
want
to
talk
about
today?
I
want
to
talk
about
three
aspects
of
Cassandra
21
I
want
to
talk
about
performance.
I
want
to
talk
about
data
modeling
and
I
want
to
talk
about
operations,
and
I
also
want
to
spend
a
few
minutes
talking
about
Cassandra
30
and
what
to
expect
next
year,
when
we
deliver
that
so,
first
of
all,
in
just
in
terms
of
generic
read
performance
in
two
dot,
one
we've
we've
improved
quite
a
bit
over
20,
almost
almost
a
hundred
percent.
The green line in the middle here is CQL for Cassandra 2.0; the orange line at the top is 2.1. I'll give a little more color to this: most of the advantage that you're seeing here comes from being able to group requests together between the Cassandra client and server. We introduced a CQL native protocol that's optimized for CQL queries, and when you're using that native protocol, you can do asynchronous queries.
So when your application needs to do 10 queries to load a page, it doesn't need to do a query, wait for the response, do a query, wait for the response: it can do all those requests in parallel. To get this performance benefit, you need to be taking advantage of that in the driver; you need to be able to take advantage of prepared statements in the driver. So I'm not saying that this is a magic upgrade, that you move to 2.1 and everything magically gets faster.
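The win from firing the queries concurrently instead of one at a time is easy to demonstrate with a toy stand-in for the driver (each fake query just sleeps 20 ms to mimic network round-trip time; this shows the fan-out pattern, not the actual DataStax driver API):

```python
import concurrent.futures
import time

def fake_query(i):
    """Stand-in for one CQL query: ~20 ms of simulated network wait."""
    time.sleep(0.02)
    return f"row-{i}"

# Sequential: query, wait, query, wait... roughly 10 x 20 ms
start = time.perf_counter()
sequential = [fake_query(i) for i in range(10)]
t_seq = time.perf_counter() - start

# Asynchronous: issue all 10 requests, then collect the responses
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(fake_query, range(10)))
t_par = time.perf_counter() - start

print(f"sequential: {t_seq * 1000:.0f} ms, parallel: {t_par * 1000:.0f} ms")
```

With the real native-protocol drivers, the equivalent is preparing a statement once and issuing the executions asynchronously (for example, `session.prepare(...)` followed by `session.execute_async(...)` in the Python driver).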
E
You
do
need
to
be
taking
advantage
of
those
advanced
features,
but
when
you
do
then,
then
this
is
a
realistic
expectation.
I
think.
The
other
thing
I'd
point
out
here
is
the
blue
and
red
lines
at
the
bottom.
Here.
Those
are
the
thrift
performance
numbers,
so
those
those
have
barely
changed
at
all.
So
if
you
are
one
of
the
you
know
old
guard
of
Cassandra
users
who
started
out
using
thrift,
then
I
would
say
that
that
the
evidence
in
favor
of
switching
to
c
ql
is
starting
to
get
hard
to
ignore.
E
One
of
the
one
other
thing
about
about
this
graph
is
that
this
is.
This
is
very
much
a
best-case
scenario.
This
is
a
scenario
where
my
entire
data
set
fits
in
memory,
so
I'm
posting
two
hundred
thousand
requests
per
second
I'm,
not
doing
disk
I/o
everything's
in
the
page
cache.
So
let's
look
at
a
scenario
now
where
everything
doesn't
fit
in
the
page
cache.
So
what
we're
looking
at
here
is
what
happened
during
a
compaction
in
Cassandra
20
when
that
doesn't
fit
in
the
page
cache.
So, if you've been around Cassandra for a while, you know that we spend some effort making sure that compaction won't eat up all your I/O, so we can throttle the compaction down and say: okay, compaction, you only get 16 megabytes per second of disk I/O, or whatever. That works fine; we're keeping the performance between eighty and ninety thousand requests per second while the compaction is going on. But then the compaction finishes.
E
All of a sudden, we now say: okay, Cassandra, start serving reads from that new data file that you just created, and delete the old ones that are hot in the page cache. And that's where the performance hits a wall: when I'm now serving reads from that cold data file that is uncached, and then, as we pull it into the cache, the performance gradually ramps up. We deliberately ran this experiment on a spinning-disk machine to give you a dramatic difference.
E
Having it on SSD does mitigate this to some degree, but the effect is still there. So in 2.1, what we do is, as we create that new data file, we start serving reads from it right away. As soon as we start compacting some partitions' worth of data out, we can start serving reads from those, and so there's no big drop.
E
Write performance looks similar in some respects to the read performance graph I showed earlier, in the sense that 2.1 is almost a hundred percent faster than 2.0, but there's an important difference, which is that the orange line of 2.1 here is mostly around 200,000 per second, but there are dips down to, you know, 120 or so, and what's going on there is we're actually maxing out our commit log disks. So Cassandra is able to spread its data files,
E
its SSTables, across multiple disks, but it's limited to a single commit log volume. We're addressing this for 3.0 in two ways. One is by introducing commit log compression: if I'm writing less data, then it's easier for the commit log to keep up. That's pretty simple. The other is striping the commit log across multiple disks, so we've got a two-pronged attack on that problem. In the meantime, it is something to be aware of. Now, this is a bit of an extreme case.
E
Finally, we've introduced a new compaction strategy. We've had a couple of kind-of-generic compaction strategies that you can throw at more or less any workload. We have size-tiered, which combines SSTables based on how large they are and is better suited for write-dominated workloads, and we have leveled compaction, which tries to guarantee that most reads will be served from a single SSTable. Leveled compaction, because of that guarantee it tries to provide, is much more aggressive about doing compaction and has much higher write amplification.
E
So especially if you're trying to throw a high ingest volume at leveled compaction, it might not be able to keep up. There are certainly scenarios, especially in time series data, where people want to have guarantees on their read performance but also need to have a high ingest rate, and neither size-tiered nor leveled compaction is a really good fit for that.
E
So what we're working on is introducing compaction strategies that can take advantage of knowledge about the workload to specialize how they approach compaction and give you the best of both worlds. Date-tiered compaction is by an engineer named Björn from Spotify, and this is their approach to the time series problem.
E
The fundamental tuning parameter for date-tiered compaction is called base_time_seconds, and it says how many seconds' worth of SSTables you want to aggressively compact together: however many SSTables there are that span that window, it will compact all of them together. And it does respect the minimum compaction threshold, which defaults to four.
E
Then, when I get to the next window up, it's like leveled compaction, where you go up exponentially: you have ten SSTables' worth of data, then a hundred, then a thousand. Date-tiered compaction works on a similar principle in terms of these windows we've defined: I have my initial base_time_seconds, and then I multiply that out by my minimum threshold for the next window. So if my first window is one minute and I have a minimum threshold of four, my next one will be four minutes.
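The window growth described above can be sketched numerically. `dtcs_windows` is a hypothetical helper for illustration, not code from Cassandra's DateTieredCompactionStrategy:

```python
# Each date-tiered window is min_threshold times the size of the
# previous one, starting from base_time_seconds (illustrative only).
def dtcs_windows(base_time_seconds, min_threshold, count):
    windows, size = [], base_time_seconds
    for _ in range(count):
        windows.append(size)
        size *= min_threshold
    return windows

# A one-minute base window with the default min_threshold of 4:
assert dtcs_windows(60, 4, 4) == [60, 240, 960, 3840]
```

So a one-minute base window yields windows of one minute, four minutes, sixteen minutes, and so on, exactly mirroring leveled compaction's exponential tiers but keyed on time rather than size.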
E
So if you have data that matches these assumptions, then you can get competitive performance on the ingest side, while also getting the guarantee that any scan within that base_time_seconds window will only hit a single SSTable, or at most min_threshold SSTables. Now, notice that this is actually available in 2.0.11 as well as 2.1. Our compaction APIs are stable enough that we were actually able to introduce this without making any changes to AbstractCompactionStrategy or to size-tiered or leveled compaction.
E
If you do want to experiment, there are a couple of things you should be aware of. One is that the default base_time_seconds is an hour, which in retrospect is probably way too large for almost any place you'd want to use it. If you think about it, an hour is 3600 seconds; if I'm writing a hundred thousand rows per second, that's 360 million rows that I'm asking it to compact very, very aggressively.
E
So the trade-off is that base_time_seconds defines the window in which I'm going to get that single-SSTable read performance, and in most cases where you have an aggressive ingest rate, a window of one or five or so minutes is probably going to be adequate. The other thing to be aware of is that date-tiered compaction primarily optimizes a query that looks like this second bullet: a SELECT with a WHERE clause where I give it a time bound.
E
You know, I'm using a column named time here, but this doesn't actually need to be a timestamp or a date type. It could be a timeuuid; it could even be sequential integers that follow a time-based pattern. The important thing is that I need to give it some kind of bounds between which to scan, and that will let the Cassandra query planner optimize the query and take advantage of the date tiering that we've done. In particular, this third bullet: intuitively, as a human,
E
you would expect that if you're asking Cassandra to give you the most recent rows, that implicitly defines a window; that first base_time_seconds window is the window it should be looking at. But the Cassandra query planner, you know, predates date-tiered compaction strategy by about a year, so, you know, we weren't...
E
Lightweight transactions, which you're familiar with, we introduced in 2.0; they give you a way to prevent race conditions by giving your inserts or updates a condition. So if I have two users trying to create the Patrick McFadden account, I can give this condition, IF NOT EXISTS, and Cassandra will make sure that only one of those gets applied. Simple, but, you know, single-statement transactions have a certain limited usefulness.
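The compare-and-set behavior of IF NOT EXISTS can be modeled in a few lines. This is a toy in-memory sketch; `accounts` and `insert_if_not_exists` are illustrative names, not driver APIs:

```python
# Toy model of INSERT ... IF NOT EXISTS: of two racing writers,
# exactly one observes [applied] = True.
accounts = {}

def insert_if_not_exists(table, key, row):
    """Apply the insert only if the key is absent; report [applied]."""
    if key in table:
        return False   # [applied] = False: the other writer won the race
    table[key] = row
    return True        # [applied] = True

first = insert_if_not_exists(accounts, "pmcfadin", {"name": "Patrick"})
second = insert_if_not_exists(accounts, "pmcfadin", {"name": "Impostor"})

assert first and not second
assert accounts["pmcfadin"]["name"] == "Patrick"
```

In real Cassandra the same check-then-write is made atomic across replicas via Paxos, rather than by a single shared dict.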
E
We can extend lightweight transactions across multiple statements within a batch, but to demonstrate that I want to introduce another new feature. This was introduced in 2.0.6, kind of midway into the 2.0 release cycle, and it's static columns. A static column is a column that's shared across all the rows in a partition. So in this case, I have a bills table where each partition represents the bills for a given user.
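The "one value shared by every row" idea can be sketched with a plain dict model of a partition; this is an illustration, not Cassandra's storage engine:

```python
# A dict-based model of one partition with a static column.
class Partition:
    def __init__(self):
        self.static = {}   # static columns: one copy for the whole partition
        self.rows = {}     # ordinary clustered rows

def read_row(partition, key):
    """Reads see the static columns merged into every row."""
    return {**partition.static, **partition.rows[key]}

bills = Partition()
bills.static["balance"] = 0
bills.rows["bill-1"] = {"amount": 25}
bills.rows["bill-2"] = {"amount": 10}

bills.static["balance"] = 35                        # one write...
assert read_row(bills, "bill-1")["balance"] == 35   # ...visible on every row
assert read_row(bills, "bill-2")["balance"] == 35
```

Because the static value is stored once per partition, updating it is one write no matter how many bills the user has.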
E
Now, a multi-statement lightweight transaction is implicit when you have a batch where multiple modifications are being made to the same partition. So since the update that has the lightweight transaction condition is affecting partition user1, and the insert is also affecting partition user1, they're implicitly part of the same transaction, so they'll be atomic.
E
You cannot have transactions across multiple partitions, just within a partition. One more example: DataStax has a customer who needed an event log that provided a specific guarantee. They needed multiple producers and multiple observers on the log, but for business-logic reasons elsewhere in the stack, bad things would happen if events were inserted out of order. Now, the normal way to do anything time-based with Cassandra is: well, just throw timeuuids at it, right? That way no coordination is needed; everyone generates their own
E
timeuuids, and everything works fine. The problem with that is that, since different machines have clocks that may be off by a millisecond or two, it's very easy, in fact it's almost guaranteed, that after an observer sees one event, another event could be inserted before it, which is bad in this scenario.
E
So what we did was create a log that uses a static sequence ID to order the events. We initialize each log by setting its sequence to zero, and then we insert events by incrementing the sequence and, in the same transaction batch, inserting the new event. This makes sure that we can never insert an event with an ID earlier than one that's already been inserted.
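A toy version of that ordered log makes the guarantee easy to see: the static sequence number is bumped with a condition, and the event insert rides in the same single-partition batch. `append_event` is an illustrative sketch, not Cassandra code:

```python
# Model of the ordered event log built on a static sequence column.
log = {"seq": 0, "events": {}}

def append_event(log, expected_seq, event):
    if log["seq"] != expected_seq:            # the IF condition fails...
        return False                          # ...so nothing in the batch applies
    log["seq"] = expected_seq + 1             # conditional UPDATE of the sequence
    log["events"][expected_seq + 1] = event   # INSERT in the same batch
    return True

assert append_event(log, 0, "created")
assert append_event(log, 1, "paid")
# A producer holding a stale sequence number cannot insert behind us:
assert not append_event(log, 0, "stale event")
assert sorted(log["events"]) == [1, 2]
```

Either the whole batch applies (sequence bump plus event) or none of it does, so an observer can never see an event appear earlier than one it has already read.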
E
So these two concepts, static columns and multi-row lightweight transactions, unlock some powerful patterns in this kind of data modeling. You don't need to reach for a ZooKeeper lock or a Redis lock or whatever; you can do a lot of these linearizable operations using lightweight transactions directly in Cassandra.
E
The other thing we focused on in 2.1 is letting you denormalize your data, so that the data structure in the object model in your application matches your table schema, and you can map very easily, one to one, from that object model to a row in the Cassandra table. The key new concept here is user-defined types. I can create a type; here I've created an address type at the top
E
that has several primitive fields and a collection of phone numbers, and I can now take that type and use it in my table definition. So I have an addresses map that includes this new address type as the map value, and yes, you can nest user-defined types. So if I have a map of addresses in Java, I can very easily turn that into a single Cassandra row.
E
Now I'll call your attention to where that type is used in the CREATE TABLE: there's a keyword, frozen, and what that means is that this type is serialized as a blob. So I cannot update just the city in my address, for instance; I need to replace the entire address with a new one. 3.0 will support unfrozen user-defined types.
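The "serialized as a blob" behavior of frozen types can be modeled like this; `freeze` is a hypothetical stand-in that serializes the whole value as one unit, not Cassandra's actual encoding:

```python
import json

def freeze(udt):
    """Model of a frozen UDT: the whole value serialized as one blob."""
    return json.dumps(udt, sort_keys=True)

row = {"addresses": {"home": freeze({"street": "123 Main", "city": "Austin"})}}

# There is no way to set just the city; write a complete new address:
row["addresses"]["home"] = freeze({"street": "123 Main", "city": "London"})

assert json.loads(row["addresses"]["home"])["city"] == "London"
```

The update replaces the entire serialized address, which is exactly why a field-level update is impossible until unfrozen types arrive.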
E
You can have a map of maps, a map of maps of lists, a map of lists of sets; all of that is possible now. You still need the frozen keyword; again, nesting collections is only allowed when the nested piece is frozen. In 2.1 we've also introduced collection indexing. So here I have a songs table with a set of tags associated with each song, and I create an index on those tags.
E
So the pain point with counters in 2.0 and earlier comes from a design decision that we made early on, which was that, in order to provide the best increment performance, we were going to write just the increment to the commit log. So when I increment my counter by one for the first time, I write that increment of 1 to the commit log, and my counter value is also one. I increment it again: my counter value is two, but I write another 1 to the commit log. I decrement by 2.
E
Now my counter value is 0, but I write minus 2 to the commit log. The implication this has is that when I go and flush this data from memtable to disk, I need to update the commit log and mark it to say: don't replay these increments, because they're already in a file on disk. Now, the problem is that you can't mark the commit log and write the new SSTable atomically.
E
So what can happen is there's a narrow window where, after I write the file and before I mark the commit log, if I lose power or kill -9 the process, then the next time Cassandra starts up, it's going to replay those increments that it has already counted once, so you can double-count those increments.
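The double-count can be reproduced in a few lines of toy code; this models the replay logic only, not Cassandra's actual counter implementation:

```python
# Why logging only the deltas can over-count after a badly timed crash.
commit_log = []
counter = 0

def increment(delta):
    global counter
    commit_log.append(delta)   # only the delta goes to the commit log
    counter += delta

increment(1)
increment(1)
assert counter == 2

# Flush: the counter state (2) is written out to an SSTable...
sstable_value = counter
# ...and the crash happens HERE, before the commit log is marked clean.

# On restart, the logged increments are replayed on top of the flush:
recovered = sstable_value + sum(commit_log)
assert recovered == 4   # double-counted; the true value is 2
```

The 2.1 fix reads the current value before applying an increment, so a replayed operation no longer silently adds on top of data that was already flushed.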
E
So the challenge for us in fixing this was twofold. First, we needed to fix it without going backwards on performance, and second, we needed to fix it in a way that still allows mixed-version clusters during upgrade. We didn't want to force people to upgrade their entire cluster to 2.1 at once.
E
We needed to be able to start with a 2.0 cluster, upgrade one node to 2.1, upgrade another to 2.1, and so on, and meanwhile it's translating back and forth between these two types of counter operations. That's why it was a tough problem to solve, but we were able to do that in 2.1.
E
The way we were able to match performance while doing this (we have to read the existing counter value now, so we've introduced overhead that wasn't there before) is that we promote hot counters to effectively atomic longs inside the JVM, and that gives us enough performance back to make up for having to read that value. So we benchmarked two scenarios, because they do have different performance characteristics.
E
One is a low-contention scenario, which means that the counters I'm updating are spread across many partitions. We have 2.0 in blue and 2.1 in orange, and you can see that 2.1 is, you know, five or so percent faster at the high end but, more importantly, is much more consistent.
E
You know, there's a lot of pain that comes from repairs taking too long, from not being sure whether my repair finished correctly, and so on, and the root of the problem is building the Merkle tree that repair uses to synchronize. The way repair works is that it takes the data for the table, builds a hash tree out of it, and exchanges the trees across replicas. So, just like rsync, the first step is to check the hashes before we actually start sending any data around, so this is very network-efficient.
E
The tree is a handful of megabytes, maybe up to a hundred or so megabytes for a very large data set; compared to terabytes of data, transferring this tree around is a small thing to do. The problem is that to build that tree, I need to scan the entire data set. So if I have a terabyte of data, building that tree takes me eight hours; if I have two terabytes of data, it's going to take me sixteen hours.
E
So as I add more data to the system, I still need to scan everything to build that Merkle tree. What I want to do is scan just the new data. Once I've repaired a set of SSTables, they stay repaired, barring hardware failure, because SSTables are immutable: we never update SSTables in place; we write new SSTables out with the new information. So those new SSTables are what we should be repairing now.
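The cost difference between full and incremental repair reduces to which SSTables get rehashed. The sizes below are made-up illustrative numbers, and the model ignores the actual Merkle-tree construction:

```python
# Full repair hashes every SSTable; incremental repair skips the ones
# already marked repaired, so the work tracks new data, not total data.
sstables = [
    {"name": "old-1", "size_gb": 500, "repaired": True},
    {"name": "old-2", "size_gb": 500, "repaired": True},
    {"name": "new-1", "size_gb": 20,  "repaired": False},  # newly flushed
]

full_scan = sum(t["size_gb"] for t in sstables)
incremental_scan = sum(t["size_gb"] for t in sstables if not t["repaired"])

assert full_scan == 1020       # grows with the whole data set
assert incremental_scan == 20  # tracks only the new, unrepaired data
```

That is why the repaired flag on each SSTable matters: it is what lets the scan cost stay proportional to the write rate instead of the total data size.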
E
So if you're going to upgrade from 2.0 and start using incremental repairs, there are a couple of steps you need to take to make sure it goes well. Full repairs are still the default; to opt in to incremental, you pass the -inc flag. But you can't just jump in and start doing that.
E
It will say: oh, all those SSTables that didn't participate in this repair, I'm going to put them in the unrepaired arena, and now they're going to start over getting compacted from level zero in your leveled compaction strategy. That's not what you want to happen. So the recommended way to upgrade is this: first, run upgradesstables to get everything onto the 2.1 SSTable format, since the 2.0 format doesn't know how to record repair metadata.
E
Then you'll use the new sstablerepairedset tool to mark those SSTables as repaired, and then you'll run a full repair. So you're lying to it, and then you're turning the lie into the truth by marking them repaired before you actually repair them. The reason you do it in that order is that new SSTables are being flushed while this is going on.
E
Finally, we've added support for Windows, and by "added support" I don't mean to imply that Windows is production-ready in 2.1. We're projecting that it will be production-ready in 3.0, but we've hit the big roadblocks in 2.1 and we're ready to get some wider testing. So the way I view Windows and 2.1 is that 2.1 is kind of an extended beta for Windows support. And yes, I know this is the old Windows logo; the new one's ugly.
E
So for 3.0, I wanted to hit some of the highlights of what's coming up. Again, we've got a mix of improvements in data modeling, in operations, and in performance. Just to start on the operational side: we've had hinted handoff since, you know, ridiculously early versions of Cassandra, and since 1.0
E
it takes at least four writes to insert and deliver a hint. Here's where I get that number from. When I first write the hint, it gets sent to the commit log, then it goes into the memtable, and then it eventually gets written out as an SSTable; commit log plus SSTable, that's two writes. Then, when it gets replayed, I write a tombstone marking it deleted: to the commit log, that's write number three, then to the memtable and on to an SSTable, write number four.
E
What tends to happen is that this hint design works fine when there are momentary hiccups in your cluster due to garbage collection pauses or whatever, but when you lose a node for several hours and you're constantly churning through hints, or you lose multiple nodes and you're writing hints for multiple targets, that's where this starts to make things a little bit fragile, because you're chewing up a ton of I/O compacting those hints back and forth. So for 3.0 we're introducing a specialized hint storage format.
E
That's what the format does, and it's basically as simple as you can imagine. We have a file where we append hints, basically like a commit log: we're just constantly appending, and we split that up by target replica, so each machine that we have hints for gets a different file. We actually break these up into files of several megabytes apiece so that we can do slightly finer-grained hint delivery, but when we deliver the contents of a file, we just delete it.
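The append-then-delete design can be modeled with per-target buffers. This is a toy model of the 3.0 approach described above, not Cassandra's implementation:

```python
from collections import defaultdict

# Append-only buffers keyed by target replica; delivery drops the
# whole file, so there are no tombstones and nothing to compact.
hint_files = defaultdict(list)

def store_hint(target, mutation):
    hint_files[target].append(mutation)   # pure append, like a commit log

def deliver(target, send):
    for mutation in hint_files[target]:
        send(mutation)
    del hint_files[target]                # delivered file is simply deleted

delivered = []
store_hint("10.0.0.2", "mutation-1")
store_hint("10.0.0.2", "mutation-2")
deliver("10.0.0.2", delivered.append)

assert delivered == ["mutation-1", "mutation-2"]
assert "10.0.0.2" not in hint_files
```

Compare this with the old scheme, where every stored and replayed hint cost commit-log, memtable, and SSTable writes plus tombstones and compaction.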
E
That's it; it's free. This makes hints a much lower impact on your cluster, and I think that's going to make a lot of operators very happy. Next up, we're introducing JSON support. Let me put a big red flag on this and clarify what I mean by that: I do not mean that we're introducing support for schemaless documents. If you have one document that has a user ID that's an int and another document whose user ID is a string, that's bad; don't let anyone tell you otherwise.
E
That's not a feature. What we want to let you do is interact more natively with all the other services that speak JSON, because JSON is a great interchange format. It's a terrible schema definition language, but it's a great interchange format. So what we do is leverage the existing schema that you have. I have my address type and my users table from earlier, and given a JSON document that corresponds to the id, the name, and the addresses fields,
E
Cassandra will map that to a CQL row for you. It's basically syntactic sugar that saves you having to deserialize that JSON message and then re-serialize it out of CQL, but it's very useful syntactic sugar. We're also introducing user-defined functions. I'm not going to go into detail here, because Robert Stupp has an entire talk later today about the work he's done on this, but just briefly: any language that can interact with the Java
E
scripting API, JavaScript and so on: you can write a user-defined function in it, and you can use those in a SELECT query. That's the first basic set of functionality. We also anticipate being able to use them in indexes, so I can create an index using the value of the function result rather than the column value itself, and then query by using the function in the WHERE clause as well.
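Coming back to the JSON support described a moment ago, the schema-driven mapping can be sketched as follows. `USERS_SCHEMA` and `json_to_row` are hypothetical illustrations, not Cassandra's code path:

```python
import json

# Map a JSON document onto a fixed schema, rejecting unknown fields
# and type mismatches instead of letting documents silently diverge.
USERS_SCHEMA = {"id": int, "name": str, "addresses": dict}

def json_to_row(doc):
    row = {}
    for field, value in json.loads(doc).items():
        expected = USERS_SCHEMA.get(field)
        if expected is None:
            raise ValueError("unknown column %r" % field)
        if not isinstance(value, expected):
            raise TypeError("%r should be %s" % (field, expected.__name__))
        row[field] = value
    return row

row = json_to_row('{"id": 1, "name": "jbellis", "addresses": {}}')
assert row["name"] == "jbellis"

# A type mismatch is an error, not a silently divergent document:
try:
    json_to_row('{"id": "not-an-int", "name": "x", "addresses": {}}')
except TypeError:
    pass
else:
    raise AssertionError("mismatch should have been rejected")
```

This captures the point of the red flag: the JSON is convenience syntax over an existing schema, not a schemaless document store.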
E
Finally, speaking of indexes, we're introducing global indexes in 3.0. The indexes that Cassandra has had since 0.7 are local indexes. I create my songs table, I create an index on albums, I insert some data; then, when I go and query that index and say, give me all the songs on the album Tres Hombres, I need to ask every node in the cluster:
E
hey, do you have any songs for Tres Hombres? That's because each node indexes the data that it owns. So if I have ten nodes in my cluster, I need to ask ten nodes; if I have a hundred nodes in my cluster, I need to ask a hundred nodes. You can see that this scales poorly. It's a good match when you're doing a Spark job, and it's useful for that, but for an OLTP application, this is why we tell people that best practice
E
is that the index can be convenient if it's a low-volume query, but if it's a core piece of your application that you're serving a hundred thousand times per second, then this isn't the way to go. You need to denormalize that into another table, so you can just grab the data you need from a single partition. That's what global indexes do: they automate moving that data into a single partition, taking all the data corresponding to a given index value and putting it in a single partition.
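The fan-out difference between the two index styles can be stated in a tiny toy model (illustrative functions, not driver code):

```python
# A local index query consults every node, because each node indexes
# only its own data; a global index reads one partition on one node.
def nodes_consulted_local_index(cluster_size):
    return cluster_size        # ask every node: "do you have this value?"

def nodes_consulted_global_index(cluster_size):
    return 1                   # the indexed data lives in one partition

assert nodes_consulted_local_index(10) == 10
assert nodes_consulted_local_index(100) == 100
assert nodes_consulted_global_index(100) == 1
```

So local index cost grows linearly with cluster size while the global index query stays constant, which is the denormalization that global indexes automate.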
B
Thank you, Jonathan. All right, I want to take a moment just to thank our sponsors, who make this event possible. Putting on something like this is never trivial, never easy, but they have all been amazingly supportive of everything we're doing, and it's great to have them here today. I'm sure you'll see lots of different opportunities to interact with them, which will be fantastic.
B
We want to really be great hosts for you today, so if you ever get disoriented or confused about where you should be, just head right out to the back there and stop and talk to anybody at the tables. Take advantage of our Meet the Experts room; I can tell you that at our US summit that was one of the most popular and heavily attended rooms we had during the whole event.
B
Go there and basically ask anything; it's a chance to really get into some good conversations. And finally, as I said, you're here today to learn, to share, to grow, and to become a catalyst for this amazing future that you're about to write for us. So thank you so much for being here. We hope you have a fantastic day.