Description
Speaker: Dave Gardner, Senior Engineer at Hailo
Slides: http://www.slideshare.net/planetcassandra/no-whistling-required-cabs-cassandra-and-hailo-by-dave-gardner
Hailo has leveraged Cassandra to build one of the most successful startups in European history. This presentation looks at how Hailo grew from a simple MySQL-backed infrastructure to a resilient Cassandra-backed system running in three data centres globally. Topics covered include: the process of migration, experience running multi-DC on AWS, common data modeling patterns and security implications for achieving PCI compliance.
I'm excited, and a little bit nervous, to be in the United States today: my first trip stateside and my first Cassandra Summit. Today I'm going to be talking to you about Cassandra at Hailo. When I proposed this talk, that's about as far as I took it, really. I just thought: I'll come and talk about my use case, I'll figure out the details later. And then, when it came to the point where I was actually writing my slides, I was having some difficulty formalizing it, because Cassandra is something that we use without having to put a lot of energy into it, and we don't really think about it too much. From a Cassandra perspective that's absolutely fantastic, but from a talk perspective it made my life slightly hard. And it hasn't always been this way.

I'm from the UK, from London, and I started using Cassandra back in 2010. Back then we were using version 0.6, and I think it's fair to say that in those days Cassandra was not something that you could just use. That led me to start the meetup group in London; it's the longest-running Cassandra user group in the world. That's my claim to fame, and I founded it. The motivation really was to try and find some people who were using this database in 2010 and who could pretty much tell me how to use it. Back then a common theme of the user group was war stories: people would come and tell the tale of how they got burned and how it blew up on them, and we learned, and we went forward.

So fast forward to today, 2013, and it's quite impressive really to think how far Cassandra has come in that time. Now we're on version 1.2, and I think Jonathan's talk yesterday really brought home to me just how many new features are coming. That leaves me nowadays with a database that I haven't really had to think about in the same way that I would have done in, say, 2010, and that got me thinking about my talk, because back in 2010 a good Cassandra talk would usually involve some pain, some real effort to get the thing working, whereas nowadays we don't really have that. So what am I going to talk about today?
I'll cover the development perspective, but also operations and management, which I have some contact with, increasingly less so; it was quite a valuable experience for me to go and talk to those people. But before I get started on Cassandra, I'm going to tell you a little bit about Hailo. Hailo is the taxi app: an app that runs on your iPhone or Android, and at the press of a button you can hail a licensed taxi to come and pick you up. So this is in London, and all you have to do is hit a button.

So that's Hailo in a nutshell: we're making it really easy to get a cab. To give you some context on the sort of technology platform that we operate: Hailo has come a long way since November 2011, when we launched in London. It's now the world's highest-rated taxi app; we've got over 10,000 five-star reviews, we've got over half a million registered passengers, and a hail is accepted around the world once every four seconds.
So we've come a long way, and you can see that in the cities we operate in: we now operate in 10 cities globally, from Tokyo to Toronto. We've made a lot of progress, and that's not the end of the story either. Hailo is a company that's growing, and we really have global ambitions right now.
So with that in mind, we'll start to look at Cassandra and how we ended up with it. When Hailo launched in November 2011, we didn't use Cassandra. At that point Hailo was a platform built by quite a small number of people, a team of three or four backend engineers. We had a couple of web applications based on PHP and MySQL, we had a kind of Java backend to do most of the heavy lifting, and we were resilient within a single availability zone in AWS.

So why did we end up using Cassandra? What was the motivation behind adopting it? Well, before launch the focus of Hailo was all about features: we needed to get the platform ready to deliver the core experience of Hailo in London. Once we launched, we had a slightly different focus. We knew we wanted to expand globally, and we knew we wanted to become a utility, so we wanted this really resilient, reliable system. We wanted Hailo to always work: if you wanted to get a taxi, we wanted to be able to get you a taxi. We didn't want any downtime; we didn't want any periods where we were having difficulty. That desire for greater resilience seemed to be a good fit with Cassandra's design and its high-availability characteristics. The international expansion plans seemed to be a good fit for Cassandra's global, multiple data centre replication. And then we had expected growth: we were going to invest in marketing, we had plans for global expansion, so we wanted a database that didn't get in the way of those plans.

The path to adoption of Cassandra was largely a unilateral decision; it was really developer-led. This was back in the days when we were running out of a boat on the Thames; we had quite a small office, we were all in the same room, and fundamentally the development team decided to adopt it.
The way we went about bringing it into our architecture was that we took the PHP/MySQL web apps and broke down the functionality they provided into independent services that each did one job well, and those services used Cassandra as the data store. Then slowly we started to hollow out the functionality of those web apps and replace it with these services.
This is an example of entity storage: this is where we store our customer details in Cassandra. The row key here is a 64-bit integer, and we're using a kind of Snowflake-style globally unique number generation. The column names, things like created timestamp and email, are the Cassandra column names, and the values are the actual property values. This is an oft-used pattern and quite a straightforward way of using Cassandra: we have one row per record, and the rows get distributed around the globe, which is quite nice.

What you don't want to do is read the whole record, change one thing, and then write the whole record back, because if you do that you've got the potential for a race condition, with one write overwriting another. By just changing the one column you avoid that: you're basically saying, set this piece of information.
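As a rough illustration, here is what that single-column update might look like with the Astyanax client mentioned later in this talk. The column family name, field names and the CustomerStore wrapper are hypothetical, not Hailo's actual schema; keyspace is an Astyanax Keyspace handle (see the connection sketch later).

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.LongSerializer;
import com.netflix.astyanax.serializers.StringSerializer;

public final class CustomerStore {
    // One row per customer: 64-bit Snowflake-style id -> columns named after properties.
    private static final ColumnFamily<Long, String> CF_CUSTOMERS =
            ColumnFamily.newColumnFamily("customers", LongSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public CustomerStore(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    // Mutate just the one column; no read-modify-write, so no race condition.
    public void setEmail(long customerId, String email) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_CUSTOMERS, customerId).putColumn("email", email, null); // null = no TTL
        m.execute();
    }

    // Fetching the whole entity is a single-row read.
    public ColumnList<String> fetch(long customerId) throws ConnectionException {
        return keyspace.prepareQuery(CF_CUSTOMERS).getKey(customerId).execute().getResult();
    }
}
```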
This gives you an idea of the kind of read and write workload we're doing. This is just one of our entities, the customer entity, and you can see the rates are quite low: we're peaking at about fifty reads a second, and the write rate is really low. That gives you an indication that we're not using Cassandra because we have a big data problem, or even a really high volume problem; we're using Cassandra for other reasons.
So, when you take a journey, we'll be sending you messages, potentially SMS messages and emails, and we keep a record of those for things like customer service, and so that customers can request that we send them a copy of the receipt and things like that. What we're doing here is storing all of that information under one row. The date is the row key, so 2013-06-01 is the row key, and within each day we're storing all of the emails and messages sent under that one row. The column name here is a TimeUUID, that's a type 1 globally unique identifier, and baked into it is the concept of when it was generated. Cassandra can understand these, so Cassandra is able to order the columns by time. What you end up with is one row that contains all of this stuff sent, ordered by time.
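A minimal sketch of that layout, again with Astyanax; the column family name and payload are hypothetical, while TimeUUIDUtils and RangeBuilder are helpers from Astyanax's documented API:

```java
import java.util.UUID;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.serializers.TimeUUIDSerializer;
import com.netflix.astyanax.util.RangeBuilder;
import com.netflix.astyanax.util.TimeUUIDUtils;

public final class CommsLog {
    // Row key is the day (e.g. "20130601"); column names are TimeUUIDs,
    // so Cassandra keeps everything sent that day ordered by time.
    private static final ColumnFamily<String, UUID> CF_COMMS = ColumnFamily.newColumnFamily(
            "communications", StringSerializer.get(), TimeUUIDSerializer.get());

    private final Keyspace keyspace;

    public CommsLog(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    public void record(String day, String message) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_COMMS, day).putColumn(TimeUUIDUtils.getUniqueTimeUUIDinMicros(), message, null);
        m.execute();
    }

    // The hundred most recent messages for a day, newest first.
    public ColumnList<UUID> recent(String day) throws ConnectionException {
        return keyspace.prepareQuery(CF_COMMS)
                .getKey(day)
                .withColumnRange(new RangeBuilder().setReversed(true).setLimit(100).build())
                .execute()
                .getResult();
    }
}
```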
Or, instead of storing everything for the day under one row, you can store everything ever sent to a given customer under one row, forever. That works out for us because the volumes aren't that high: we're not expecting to send millions and millions of messages to people, so the rows stay quite small. And that leads into the key considerations for time series.

This is an example of the kind of read and write workload we're doing for this particular communications case. With this one we've got a slightly higher write rate, which is the green line, than read rate, so it's flipped around, and you can see it follows our pattern of use: rush hour, busy evenings and so on. We do have higher volume streams, though. This is our stats event stream, which is time series data as well; it's our highest volume data, and with it we're peaking at around five thousand write operations per second. The read rates, you can see, are really sporadic. This feeds a kind of reporting system that we use, and you can see that on the Friday there's some general traffic, people requesting stuff out of our platform, and then at the weekend it kind of disappears; it goes down to nothing. So that shows our highest volume case.

The key consideration for time series really is to choose the row key carefully, because what you don't want to do is pour all of the records into one row. With our use case it works out, because we denormalize on write: generally one record will update three or four different rows, and we're storing a sensible number of records per row.
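To sketch that denormalize-on-write idea: one incoming record fans out into several rows, one per query we know we'll need. The column families and fields below are hypothetical, purely illustrative of the pattern:

```java
import java.util.UUID;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.serializers.TimeUUIDSerializer;
import com.netflix.astyanax.util.TimeUUIDUtils;

public final class EventWriter {
    private static final ColumnFamily<String, UUID> CF_BY_DAY = ColumnFamily.newColumnFamily(
            "events_by_day", StringSerializer.get(), TimeUUIDSerializer.get());
    private static final ColumnFamily<String, UUID> CF_BY_CUSTOMER = ColumnFamily.newColumnFamily(
            "events_by_customer", StringSerializer.get(), TimeUUIDSerializer.get());
    private static final ColumnFamily<String, UUID> CF_BY_CITY = ColumnFamily.newColumnFamily(
            "events_by_city", StringSerializer.get(), TimeUUIDSerializer.get());

    private final Keyspace keyspace;

    public EventWriter(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    // One record updates three rows in a single batch, so each read path
    // later is a cheap single-row, time-ordered slice.
    public void write(String day, String customerId, String city, String payload)
            throws ConnectionException {
        UUID when = TimeUUIDUtils.getUniqueTimeUUIDinMicros();
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_BY_DAY, day).putColumn(when, payload, null);
        m.withRow(CF_BY_CUSTOMER, customerId).putColumn(when, payload, null);
        m.withRow(CF_BY_CITY, city).putColumn(when, payload, null);
        m.execute();
    }
}
```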
The client libraries we're using at Hailo: we're using the Astyanax Java client, which is the Netflix open source project; we're using phpcassa for PHP; and we're using gossie for Go. We're not using CQL at the moment; we're using the older-style Thrift-based RPC clients, and for us that seems to suit what we're doing right now. We might move to CQL in the future.
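For completeness, here's roughly what bootstrapping one of those Thrift-based clients looks like, following the Astyanax getting-started documentation; the cluster, keyspace and seed values are placeholders:

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public final class CassandraClient {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("HailoCluster")                 // placeholder name
                .forKeyspace("main")                        // placeholder keyspace
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                        .setPort(9160)                      // Thrift port
                        .setMaxConnsPerHost(3)
                        .setSeeds("127.0.0.1:9160"))        // placeholder seed
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();
    }
}
```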
The other half of our use case for Cassandra is analytics. One of the things we lost when we migrated data to Cassandra was the ability to conduct the sorts of queries where you'd say: count these, calculate the sum of this, or take an average with a GROUP BY clause. That's something we did a lot of in the PHP/MySQL web apps; we'd have analytics to be able to count how many jobs a driver had done, or how many jobs a customer had done. The migration to Cassandra meant we lost that ability, because Cassandra doesn't have the ability to say SELECT COUNT(*), or SUM, or the rest.

So we use a product called Acunu Analytics, which gives us that sort of facility back. We predefine query templates, which basically means we have to know ahead of time how we want to query the data, and then Acunu will write all the data to Cassandra, denormalizing it massively on write, such that we can query it in real time. We quite like it because the integration is very straightforward for us: it just gives us this facility without our having to really work at it, so we find it very helpful. This is an example of what we can do with this tool but wouldn't be able to do with Cassandra.
This is a query language specific to Acunu called AQL, and you can see that if you've ever done any SQL you'll recognize what we're doing, and if you've ever used Cassandra you'll recognize that you wouldn't be able to do this with raw Cassandra; the facility just doesn't exist. We can do things like grouping by different time periods, and one of the features we're just starting to explore is the dashboard side of it.

This is relatively new, and we're using it to give us some operational insight into how Hailo is running. This is an example of plotting customer demand over time, and one of the newest dashboard features is the geographic heat maps. Most of Hailo's data is geographic in nature: we have customer demand at a specific location and driver supply at a specific location.
On to the lessons learned from the development perspective. This is the main one, really: people joining our team, joining our company, generally come with a background of SQL experience. Generally people will have ten years' experience of MySQL or something; Cassandra experience is unlikely. I don't think anyone has joined our company with prior Cassandra experience. So that's a challenge, something we have to work around and mitigate, and I think that leads on to the second point, which is that some people can shoot themselves in the foot, and we have to guard against that.

So, some of the lessons learned on the development front. One of the things I think is really important is to have an advocate in the team, and I've kind of taken that role in our company, but perhaps I haven't been quite as on it as I should have been. I think it's important to try and get everyone on board. We've got people joining the company continually, so you need to sell them the dream; you need to tell them.

I think CQL can encourage a SQL mindset. It's good in one way, because you can get people on board quicker: the way they interact with the database is closer to how they used to use SQL. But at the same time it's kind of dangerous, because I think it's important with Cassandra to understand the underlying storage engine: to understand the colocation of data, and to understand how to play to its strengths, denormalizing and things like that.
I think the overall feeling from the team is that Cassandra has allowed a very small team of operations people to achieve things they wouldn't have considered before it existed. The main point here is the global active-active replication, and the fact that we can do that with a really small team.

This is Hailo; this is where we operate at the moment, where we've got offices on the ground and people doing stuff. We're all the way from Tokyo and Osaka, obviously London, where we started, and Dublin; we're down in Madrid and Barcelona in Spain; and then we're over in the US and Canada: Toronto, Montreal, Washington, Boston, and of course New York, where we've just launched. So there are a lot of places, and you can get a feel for why the replication story of Cassandra was so important for us.
This is what our setup looks like. We run two clusters of Cassandra in production, and each cluster has six nodes in each region; we're fully on AWS. We're in ap-southeast-1, the Singapore region, which services Tokyo traffic and the Japan market; we're in us-east-1, Virginia, which covers off North America; and we're in eu-west-1, which covers off Europe. Basically, we've separated our clusters into two: we've got a stats cluster and an operational cluster.
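The multi-DC layout boils down to keyspace replication settings. As a hedged sketch, assuming a recent Astyanax version's createKeyspace helper: the DC names echo the regions above but actually depend on the configured snitch, and the replication factors are illustrative, not Hailo's real numbers.

```java
import com.google.common.collect.ImmutableMap;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;

public final class KeyspaceSetup {
    // NetworkTopologyStrategy places replicas per data centre, which is what
    // lets each region serve LOCAL_QUORUM reads and writes on its own.
    public static void create(Keyspace keyspace) throws ConnectionException {
        keyspace.createKeyspace(ImmutableMap.<String, Object>builder()
                .put("strategy_class", "NetworkTopologyStrategy")
                .put("strategy_options", ImmutableMap.<String, Object>builder()
                        .put("eu-west", "3")        // illustrative RF per DC
                        .put("us-east", "3")
                        .put("ap-southeast", "3")
                        .build())
                .build());
    }
}
```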
We've done that because the operational cluster is the thing that's needed to get a taxi: if this cluster stops working, our app stops working. The stats cluster is less important; it's not on the critical path. So we've split out the use cases and the workloads, really to isolate the potentially huge volumes of data being ingested into the stats cluster from the operational side of things. There will be a third data centre for the stats cluster too; we just haven't quite got around to doing that migration yet.
In each of our regions we're operating in an Amazon Virtual Private Cloud, and we're using OpenVPN links between the VPCs to connect them. We're using m1.large machines at the moment with provisioned IOPS, so we're paying Amazon to give us guaranteed levels of I/O. That means we're running on EBS, which is, I guess, quite unusual; most people aren't running on EBS. On the operational cluster we're looking at about 100 GB per node at the moment, and on the stats cluster about 600 GB; this is with compression.
The way we do backups is reasonably caveman: we take SSTable snapshots. We were then uploading these to S3, but we found that was saturating all our network bandwidth and causing issues, so now we just take EBS snapshots of the SSTable snapshots, and that's instant. This is one of the reasons we use EBS: that ability to take snapshots quickly. We're not using any clever tools, like the Netflix tools, that would allow us to do smarter things.
We chose that approach because the operations guys chose it, because it's quite uncomplicated, apparently; I don't know much about it really, and the tests that we did suggested it added about a one percent I/O performance overhead, so it's quite manageable. We use OpsCenter, which is the DataStax tool; we use the free version, and the ops guys are quite enthusiastic about it as a tool.
We feel that it gives new staff an easy way in to getting to grips with what Cassandra is and how it operates, just through the ability to have these simple screens of data that tell you what's going on. So it's a quick win: it's a free thing you can install and use multi-DC.

Multi-DC is one of the main motivations for our adoption of Cassandra, and it's really the big success story in how we think of Cassandra, in that when we launched our Singapore region, all we had to do was bring up some machines and sort of type a few things in, and it was online and active: zero downtime. And it's been rock solid; we haven't had any problems at all with it.
We read and write at the LOCAL_QUORUM consistency level, and in order to make that work we run repairs on a schedule that go around the nodes, just to make sure that all the data is in sync. If you went to Jason Brown's talk yesterday, he was talking about repair making sure that any inconsistencies are dealt with; we do that on a rolling basis around the cluster each night. We've also recently started to use compression.
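As a small sketch of what reading and writing at LOCAL_QUORUM looks like in Astyanax (shown per-operation here, though it can equally be set as a configuration-wide default; the column family is the hypothetical one from earlier):

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.LongSerializer;
import com.netflix.astyanax.serializers.StringSerializer;

public final class LocalQuorumExample {
    private static final ColumnFamily<Long, String> CF_CUSTOMERS =
            ColumnFamily.newColumnFamily("customers", LongSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public LocalQuorumExample(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    // A write acknowledges once a quorum of replicas in the local DC accept,
    // so a slow or severed cross-DC link never blocks the operational path.
    public void write(long customerId, String column, String value) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch()
                .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
        m.withRow(CF_CUSTOMERS, customerId).putColumn(column, value, null);
        m.execute();
    }

    // Reads likewise wait only on local replicas; the nightly rolling repairs
    // reconcile whatever drifts between data centres.
    public ColumnList<String> read(long customerId) throws ConnectionException {
        return keyspace.prepareQuery(CF_CUSTOMERS)
                .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                .getKey(customerId)
                .execute()
                .getResult();
    }
}
```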
We were at the point where our stats cluster was running at about one and a half terabytes per node, and obviously with Cassandra you need about fifty percent headroom to be able to do a major compaction. So we needed about three terabytes per node, which I think was actually more than we had available, and at that point we didn't really want to add more nodes; I'm going to talk about that a little bit later from the management perspective.

We didn't want to scale this out, really for cost reasons, so we thought we'd try out compression, just to see what happened. We just turned it on; it was very straightforward, it just worked, and it gave us enormous savings. We're down to about six hundred gigabytes per node now. It's very easy to accomplish, and we just ran nodetool upgradesstables to apply the compression to all the historical SSTables.
So, lessons learned operationally. I think the main one is that Cassandra doesn't demand a lot of your attention; it certainly doesn't demand a lot of ours. I don't know if anyone here has used OpsCenter a lot, but this is a view in OpsCenter of our stats cluster, and as an operations person you kind of want to see the circles all the same size, and you don't want to see lots of streaming going on. This cluster shows circles that are very much not the same size and quite a lot of streaming going on, but it still works; it's still carrying on, soldiering on fine. It was only when I took this screenshot for the presentation that I realized our cluster was quite so out of balance. So the point is, thinking of where we are as a company now, we probably need to invest a little bit more up front in Cassandra, because Cassandra has been very good to us.
So finally, the management perspective, and this is the one that was the biggest eye-opener for me when I was preparing this presentation; it's not something that, as a developer, I'd really thought about before. This is a quote from our VP of Operations, who was saying that the days of the quick and dirty are over, and his point really was that, technically, our management believe Cassandra is a perfectly fine solution.
The first main area is really about a concern that we're putting a lot of data into a datastore and we can't get it back out; that's the perception from management. Now, I don't necessarily think that's true. There are ways of querying Cassandra in an ad hoc fashion: if you're running DSE, you can just run Hive, and if you want to bake it yourself, you can run Hadoop and Hive. But the perception amongst management is that we, as a company, can't; that we've chosen this technology that basically means we can't query our data. And I guess, as developers, we focused on the operational side. We're a taxi app; you need to get a taxi; we're going to focus on that use case. We haven't necessarily considered the management use case of "I want to get some data out".

What that's meant is that, back when we were all in one room on our boat on the Thames, management could pretty much go up to any member of the team and ask how many of something we'd done, and they'd be able to type a quick SQL query and answer the question. With Cassandra they can't do that: the number of people who are able to accomplish that diminishes.
There is a caveat to this, which is that our relational data is actually at the point now where you can't really do that either. We've still got a lot of data in relational data stores that we're slowly migrating, and that data has reached the point where you can quite easily lock tables and cause problems for production by running queries against it; the sorts of things we would have been able to do a while ago.
The second question is: is there business value in storing this data? That leads back to my point earlier about the stats cluster, where we got to the point of one and a half terabytes per node in a 12-node cluster. It's costing us money to store that, and the question is: are we getting business value from storing that data, or are we doing it for no reason? I think, coming from a development background, there's a danger that you do it because you can. Cassandra gives you a tool that will just do it: you can keep adding nodes, you can keep storing as much data as you want, and that's fantastic. But the question from management is: is there a reason, a business reason, for doing so?
Another interesting point is singing from the same hymn sheet; I don't know whether Americans understand that turn of phrase. Basically, one of our very senior engineers, one of the founding engineers, wasn't 100% sold on Cassandra. He was unsure that the advantages we were going to get from it outweighed the disadvantages, and I think we just proceeded anyway, really, without getting buy-in from him. Then, when business concerns would surface, that kind of lack of consistency within the development team would potentially exacerbate the problem. What we should probably have done, and this is probably on me, is made more of an effort up front to get all the people in the development team on board. I don't think that would have been that hard if I'd actually put the energy into doing so; it's just that I didn't. And then, finally: provide solutions.
We should have invested time and effort up front in providing, fundamentally, an ad hoc query interface to Cassandra. I think that would have headed off a lot of the perception from management that this thing's not queryable, and I don't think it would have been that hard to do. We should have provided those solutions probably from day one, and what we could then have done is turn the graph from earlier into something that looks a bit more like this, where we're saying that pretty much everyone who can query SQL would be able to query Cassandra. So that's something we'll be looking to do. So, to wrap up.
At Hailo, we really like Cassandra. We like the solid design principles that it's founded on and the fact that it's designed to be distributed from day one; I think that's a really important point. We like the high-availability characteristics and the easy multi-DC setup; they're kind of the two killer features for us. We don't have an enormous volume of data, and we don't have an enormous volume of read and write requests particularly, but what we do have is a need to run on three continents, and a need to run something that is going to be very resilient and reliable.

And then there's the simplicity of operation. I think it's easy to overlook that, but Cassandra for us has been very, very easy to operate. We haven't really had to put any energy in at all, perhaps to our detriment, and that's the long-term cost you're paying every week: you put a database in, you've got to maintain it, you've got to operate it for years. That simplicity, the fact that all the nodes are the same and there aren't many moving parts, makes life a lot easier for the ops team. For successful adoption, I think it boils down to not many things.
It boils down to having someone internally who's going to sell the dream and get everyone on board; get the developer who isn't one hundred percent sure and convince them up front. Get everyone to learn the fundamentals: when people join the team, have a way of teaching them about Cassandra before you throw them in at the deep end; you know, stop them shooting themselves in the foot.

Invest in tools; that's something we should have done, and I think it's an important point for Cassandra adoption. Developers are used to being able to just execute queries when they're building their software, to get stuff out, to see how it's running, to debug it, and with Cassandra it would have been useful for us, I think, to have those tools up front: a kind of Hadoop integration to be able to do batch analytic queries. And then finally, keep management in the loop; make sure you explain the trade-offs of the decisions you're making. If you're adopting a NoSQL store, it's not all going to be positives. Every decision you make is going to have trade-offs, so make those clear up front: say, we're making this decision for the right reasons, but these are the things we're going to be giving up.
You know, potentially we're giving up a more widely used technology that people have experience with, in terms of SQL, to get these other things. So, the future for Hailo: we're going to continue to invest in Cassandra as we expand globally; we've got big plans to launch in hundreds of cities. And next year we're going to hire some people, I think, to look at Cassandra specifically. We're probably at the point in our business now where we need someone who is an expert within our company: as we start to rely on Cassandra more and more, as it becomes our primary data store, we're going to want to have those skills in house, so we'll probably start to recruit that person soon. We're also going to focus on expanding our reporting facilities.
This comes down to the batch analytics side of things, really: giving people those tools to be able to answer those questions quickly and easily. And then finally, in terms of the business, Hailo has aspirations to extend our network, a network of a million consumer installs with this kind of virtual wallet, beyond cabs. Cabs are step one; we're going to move into other areas, and we're going to continue to hire the best engineers in London, NYC and Asia.
So the question was about the analytics side. I think the main thing we're thinking is that we need to be able to execute arbitrary queries against the Cassandra datastore, and that will probably take the form of Hadoop. We're already using Acunu, which gives us the kind of real-time, business-focused analytics for things like in-app stats: when a driver looks at their stats within our driver app, that's already powered this way; we've got that sorted.
Yeah, that's a good point; a nice loaded question there. So the question was about how you would arrange the topology, and the way we would do it is to have another data centre. Where we've currently got three, one each in Europe, America and Asia, we'd probably have another one in London, and we'd replicate probably just one replica to that data centre. So we'd have a complete copy of all the data in one data centre, and that would be used solely for reporting and analytics.
So, a good question about backups. I think there is some nervousness amongst management about Cassandra, and I think that's one of the main reasons we're running on EBS rather than ephemeral storage: there's that potentially irrational fear that if we stopped all of our nodes in AWS and then brought them back up, we'd have no data left if we were on ephemeral, which is a slightly absurd view, but that's kind of the one driving force behind EBS. In terms of the backups, I guess it's mainly disaster recovery: what if we introduce corruption, what if we accidentally delete all of our records? We wanted to keep historical snapshots in time, so that we could go back a week, a month, and with EBS we can do that.
As for using the backups to restore a new cluster to do reporting on: we've done two exercises, in kind of a year of Hailo, where we've said, right, let's test this, let's see if we can actually recover data, and both of those exercises were positive, i.e. we were actually able to fire up a brand new cluster from the backups. But it was a very time-consuming process to go through, which is why we've only done it twice, and I guess there's a sort of hope that if we did it today it would work, but we don't actually know that. So I guess it would be nice to have something automatic, perhaps, where you could press a button and verify that it's all working.
It's not like we're storing everything under one row. Every now and again you find someone who's got a data model that stores everything under three rows, and they have a 1,000-node cluster and wonder why it's not very well balanced; we don't really have that. Effectively, all we really need to do is finish the compression, make sure that's fully done, and then run repair to make sure all these inconsistencies are dealt with. All the data will be streamed from the nodes where it needs to come from, to make sure all the nodes have all the data they need, and theoretically, once we've done that, it should look a bit more like our other cluster, which is pretty well balanced. So basically: just run repair, pretty much.