Description
Speaker: Rick Branson of Instagram
It's upsetting whenever we hear that we can't have things that we want. It'd be nice to live in a world where it was possible to have things like ACID transactions, uniqueness guarantees, and sequential counters that were globally and always available. What makes this worse is that when we're told we can't have them, people just wave their arms around in the air and shout things like "CAP theorem." In this talk, I'll walk through some of these "ponies" and demonstrate the points at which things start falling apart with practical, real-world examples.
Good afternoon, how are you guys doing? Good? All right. My name is Rick Branson, and I'm an infrastructure engineer at Instagram. That means I'm in sort of a DevOps role: I read a lot of code, but I also, you know, keep things running. I'm giving a talk today called "Why You Can't Have That Pony." The idea with this is that I feel like a lot of people talk about why you can't have things like distributed transactions, but they just kind of throw out terms like "CAP theorem," and there's not a lot of stuff out there that I feel has a really decent description of actually walking through the protocols and the way things work, and talking about where you hit these walls.
You know, there's this FoundationDB thing, and they have this paper that supposedly says the CAP theorem doesn't let you do this, or does let you do this, or whatnot. You kind of read it and you're like, okay — it says that as an ACID database, during a network partition FoundationDB must choose consistency over availability. All right, that seems fine. Then you get a little deeper and it says: this does not mean the database becomes unavailable for clients. When multiple machines or data centers hosting a FoundationDB database are unable to communicate,
some of them will be unable to execute writes. That's an interesting interpretation of what availability means. It kind of shows you that everybody's got their own definition of it, and that it differs from CAP availability when you actually read the paper — they even admit it in their white paper. But ultimately, I feel like they're just trying to draw you into this discussion, and obviously they're trying to sell their product. So, let's talk about these ponies.
I mean, these FoundationDB guys are right, and there have been many people talking about this. A lot of the NewSQL people are doing it — VoltDB and some of the others are, you know, sort of trying to spin these words into saying that they have high availability. But it kind of boils down to this: the data just kind of gets stale if you can't synchronize it. If you can't talk between two systems that are trying to synchronize data with one another, well, they're going to fall out of consistency. And so CAP is misunderstood.
A lot of people think it's this "pick two of these three" thing, when in reality — I don't know where this came from; I don't know who came up with this idea of "pick two." I actually spent a lot of time last night trying to dig up a link where somebody has something legitimate behind it other than just asserting it. But really, when it comes down to it, it's just "pick one." So this is your choice, and I think the consistency part is a little misnamed. I'd actually prefer the term consensus, because "consistency" confuses it with ACID consistency, which is something completely different.
This has to do with whether things are synchronized between multiple nodes. So CAP tells us we lose some of these attributes if we want availability. Strict resource allocation — which is things like bank accounts or flight booking. Compare-and-swap — "I want to update where…": for instance, in a SQL query, I want to set this value where this value is equal to something. You can do something like that, but you can't guarantee the result — for instance, if you increment a counter, you can't guarantee that you're going to be the only one winning that result. And of course uniqueness guarantees — global uniqueness guarantees fall into the same bucket.
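As a minimal sketch of the compare-and-swap idea (the store here is just a Python dict; the names are illustrative and not any particular database API):

    # Compare-and-swap: the update only applies if the current value still
    # equals what the caller last read. On a single node this is easy; across
    # replicas that can't talk to each other, two clients can both pass the
    # check against stale copies, which is exactly the guarantee an
    # availability-first system can't give you.
    def compare_and_swap(store, key, expected, new_value):
        if store.get(key) == expected:
            store[key] = new_value
            return True   # this caller "won"
        return False      # somebody else changed it first

    store = {"seats_left": 1}
    print(compare_and_swap(store, "seats_left", 1, 0))  # True: we got the seat
    print(compare_and_swap(store, "seats_left", 1, 0))  # False: already taken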
You have to say, for instance, if you're handing out usernames — if you're Twitter and you're handing out usernames — you want those to be unique, and an AP system simply can't guarantee that. Sorry, I'm confusing myself; this is how confusing this stuff gets. The AP system can't guarantee that you have that uniqueness, and the CP system can. However, the CP system is subject to availability problems. And there is hope — there are some new things being developed in Cassandra and some new research that's coming out — but I'll talk about that towards the end.
Let's talk a little bit about this resource allocation problem; I think it's kind of the easiest one to understand. It covers things like hotel booking, financial accounts, warehouse inventories — any time you have a restricted pool of resources that you're trying to dole out to people in limited amounts. And let's talk about it in terms of the Cassandra model, which is, you know, asynchronous replication — and this is very simplified. The Cassandra model provides availability over consistency. Yes, it is tunable, but in order to really exploit the availability properties, you have to use a lower consistency level.
So say we have a client. We get a balance, and it comes back as a hundred dollars. We have two replicas, each one storing the balance. One client says, "I want to deduct $75," and the first replica says okay. At the same time, before it has replicated that change to the other replica, the other replica gives that old balance to the second client, so it tries to do the same deduction — and the result is that both of these end up with a negative balance. Now, yes, I realize that banking is a sort of tired example for this stuff, because this happens in real life — most everybody in this room has probably experienced an overdraft fee because of some consistency problem that exists within a banking system — but this is purely for illustration purposes.
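A minimal sketch of that lost-update race, using two plain Python dicts as stand-ins for asynchronously replicated copies (purely illustrative; this is not Cassandra code):

    # Two replicas that replicate asynchronously: a write lands on one replica
    # first, and the other still serves the old balance until replication
    # catches up.
    replica_a = {"balance": 100}
    replica_b = {"balance": 100}

    # Client 1 reads from replica A and deducts $75.
    seen_1 = replica_a["balance"]            # 100
    replica_a["balance"] = seen_1 - 75       # 25 on A; B still says 100

    # Before replication happens, client 2 reads from replica B and also deducts $75.
    seen_2 = replica_b["balance"]            # 100 -- stale
    replica_b["balance"] = seen_2 - 75       # 25 on B

    # The replicas will later converge, but $150 of deductions were approved
    # against a $100 balance: the account effectively went negative.
    print(replica_a, replica_b)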
So how can we do better? How can we fix this if we do want to choose consistency over availability? What does that look like when we start digging into it? Here's a system that's got two replicas.
It uses a protocol called two-phase commit. Two-phase commit involves a coordinator node, which is the system that serializes all of the writes — it aligns every read and write so that they're all in order and everything is hunky-dory. It kind of looks like this: a client would say "deduct $75," and the coordinator might say, "okay, give me a second, I need to conduct this transaction." It would send messages to the different replicas, and they would say, "okay, I've prepared this transaction."
So if another client came in and tried to do the same thing while this was running — the transaction from the client on the left-hand side is still pending — and the one on the right-hand side tries to send, the coordinator will block it; it will not allow that to go forward. That's the job of the coordinator. The client on the right-hand side is still pending, but now, since we got those messages back, we've agreed that yes, the client on the left-hand side should be able to commit this, and so we agree that the new value is $25. And then the coordinator can say to the other client, "okay, I failed that, because I have a consistency requirement to keep this balance above zero dollars." Pretty simple. It does require all the replicas to respond, which is unfortunate.
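A stripped-down sketch of the two-phase commit flow just described, assuming in-memory replicas and a "balance must stay non-negative" rule (illustrative only; real 2PC also logs every step to stable storage):

    # Phase 1: the coordinator asks every replica to prepare (vote).
    # Phase 2: only if *all* replicas voted yes does it tell them to commit;
    # otherwise it tells them to abort. One unreachable replica blocks everything.
    class Replica:
        def __init__(self, balance):
            self.balance = balance
            self.pending = None

        def prepare(self, amount):
            # Vote yes only if the deduction keeps the balance >= 0.
            if self.balance - amount >= 0:
                self.pending = amount
                return True
            return False

        def commit(self):
            self.balance -= self.pending
            self.pending = None

        def abort(self):
            self.pending = None

    def two_phase_commit(replicas, amount):
        if all(r.prepare(amount) for r in replicas):
            for r in replicas:
                r.commit()
            return "committed"
        for r in replicas:
            r.abort()
        return "aborted"

    replicas = [Replica(100), Replica(100)]
    print(two_phase_commit(replicas, 75))   # committed: balance is now 25 everywhere
    print(two_phase_commit(replicas, 75))   # aborted: it would go below zero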
It means that, basically, if I can't contact one of the replicas, I'm in timeout mode. It's almost like using Cassandra with a consistency level of ALL — one failure will cause you to have a problem. The coordinator is also a single point of failure and a bottleneck: whatever the coordinator is in charge of coordinating, it's the sole system through which you can both read and write data, which is a problem. If the coordinator dies, you're kind of left in this weird state.
Unlike in Cassandra — where, because operations are idempotent and they do converge, you can retry things — with this system you can't make sense of the state without sort of, you know, reading everything. You just have to say, "I don't know where I am," and you're in this weird sort of no-man's-land between transactions.
You can use a standby coordinator, for instance, and replicate the coordinator's log to it — this is, for instance, what they're working on doing with Hadoop, with the NameNode's well-known single point of failure: synchronous replication of the coordinator log. And this is taken exactly from the two-phase commit Wikipedia article: basically, you assume that you have stable storage, that no node crashes forever, that data is never lost or corrupted, and that any two nodes can communicate with each other — none of which happens in real life.
You can do better, though. There's a system called Paxos, and I'll try to explain Paxos in the simplest way possible, because I think most explanations out there are way overcomplicated. Granted, it's still consistency over availability. With this we have to have three replicas — Paxos requires it; essentially, you have to have a tiebreaker any time.
So, for instance, if this other client comes in and tries to propose something, the node will hold that until the first round finishes. It'll say, "okay, I've already promised this other node that this account is locked," at least until maybe a timeout or whatnot, and it'll queue that operation as pending. The next step: once the proposer has received the promises, it'll basically send out an accept and say "this value is $25" — that is, after I take the $75 out of the $100 balance, I have $25 left, so I send an accept out. Now, notice we've got two pending requests to both of these nodes, and A is still the leader in this case — everybody's locked, waiting for this transaction led by A to finish. And then, after this step, we acknowledge the acceptance of that transaction.
Everybody broadcasts to everybody, so that everybody knows that everybody else has accepted this — we've basically gone through that round. The value in this is that it's very fault-tolerant. Obviously, it's a lot of extra work to go through as well, and most people don't use Paxos for bank accounts — they use it for leader election and things like that, and then they assign a master and so on.
But this is, again, for demonstration purposes. And then it would simply say "fail," because once it got through that round, it would know that there wasn't seventy-five dollars in the account — there was only $25 — so it would be unable to deduct that from the balance.
It's somewhat fault-tolerant to node failure. Like I said, it's quorum-based, so it can say: if I have a majority of nodes — I've got A and B already — I can assume that, because I have a majority, this transaction can go through; it allows the round to continue. Also, this property of broadcasting the results to everybody in the round — these are technically considered the learners — means that you're also sure you're not going to get conflicting values at the end of this, and you're not dependent on one node, for instance the proposer, which could fail in the middle of this. So we basically broadcast to the rest of them so that everybody understands the value and everybody agrees on that value.
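Here is a very reduced sketch of a single Paxos round in the spirit of this walkthrough — prepare/promise, then accept by a majority. It ignores competing proposers and retries, and it is an illustration of the idea rather than a production implementation:

    # Single-decree Paxos, heavily simplified. A value is chosen once a
    # majority of acceptors accept a proposal; learners then hear about it.
    class Acceptor:
        def __init__(self):
            self.promised = 0        # highest proposal number promised
            self.accepted = None     # (number, value) accepted so far

        def prepare(self, n):
            # Promise to ignore anything lower than n; report any prior acceptance.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted)
            return ("reject", None)

        def accept(self, n, value):
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return True
            return False

    def propose(acceptors, n, value):
        # Phase 1: need promises from a majority (this is the "tiebreaker").
        replies = [a.prepare(n) for a in acceptors]
        promises = [r for r in replies if r[0] == "promise"]
        if len(promises) <= len(acceptors) // 2:
            return None
        # If any acceptor already accepted something, we must carry that value forward.
        prior = [acc for _, acc in promises if acc is not None]
        if prior:
            value = max(prior)[1]
        # Phase 2: ask for acceptance; the value is chosen once a majority accepts.
        votes = sum(a.accept(n, value) for a in acceptors)
        return value if votes > len(acceptors) // 2 else None

    acceptors = [Acceptor(), Acceptor(), Acceptor()]
    print(propose(acceptors, 1, "balance=25"))   # 'balance=25' is chosen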
It's still subject to CAP, though. At first everybody gets excited — they see that Paxos survives node failure and they're like, "oh, this is great" — but the problem is that you can still have network partitions, and you still have to choose which side of that network partition you're on. And network partitions are not always just two clean sides; they can be complicated. You can have some nodes that can reach others and some that can't, and you can sort of end up with this multifaceted, weird, diamond-shaped network partition.
So, in the simplest case, node A is trying to propose and it simply can't talk to the other ones. Anything on node A will fail at that point — and even reads will fail at that point, because in a strictly consistent system you can't guarantee that nothing's changed on the other side of that network partition. So you do lose, you know, your strict resource allocation, like we talked about — banks, hotel booking, things like that — compare-and-swap, uniqueness guarantees.
Now, that doesn't necessarily mean that going with strong consistency is the best business decision, but those are the things that you simply can't get around from a purely technical perspective. And I think if I could reduce these down to anything, it's basically the ability to pick a winner. For instance, with uniqueness guarantees, somebody has to win a username.
You can't have two people with the same username, if that's your business requirement. I'm sure there might be some intrepid souls out there that allow that to happen and then, you know, lottery out the name or decide based on some kind of heuristic who gets it if there's a conflict, but I think most people would just like one person to have a given username. So consistency is something that's important in a system like that.
Can we still get ACID, though? I didn't necessarily put ACID transactions in that list, and I think that's because there's a bit of a misconception that you can't have these ACID properties in an eventually consistent system, even with distributed transactions. You definitely can within a row — think of a Cassandra row, in the sense that it lives on a single node or a single set of replicas. You can get things like durability and atomicity through the commit log.
The commit log ensures that once something's been written to it, it'll continue to get replayed — if the process crashes, it'll get replayed — and you get durability through that. That's really your A and D. You got isolation in 1.1: now, if you do different changes within a batch in Cassandra, you'll either see those changes or you won't. Isolation being: if you add a thousand columns in one batch, you're either going to see the thousand columns or not — you're not going to see 500 of those columns get inserted. As for consistency, I mean, maybe — it's very loose.
Cassandra is obviously not like a really strict, SQL-like system where you have these really complex constraints. There is some constraint checking you could potentially add on to a system like Cassandra — things like "I want the value to be within these ranges" — as long as it's not something like a counter, where you can have incremental operations. As long as it's, say, a string — "I don't want the string to exceed 255 characters" — and as long as you're only replacing that string, for instance, you can guarantee that.
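A tiny sketch of that kind of node-local check — a constraint that can be validated from the incoming write alone, so it never needs agreement with other replicas (the limit and function names are made up for illustration):

    # "This string must be at most 255 characters" can be enforced locally on
    # each replica for replace-only writes, because the decision doesn't depend
    # on what any other node has seen. "This counter never goes negative" can't,
    # because the answer depends on increments that may only exist elsewhere.
    MAX_LEN = 255

    def validate_replace(new_value: str) -> bool:
        return len(new_value) <= MAX_LEN

    print(validate_replace("ok"))          # True
    print(validate_replace("x" * 300))     # False -- reject before writing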
So what about cross-row or distributed transactions? This is kind of the Holy Grail. It's a little more interesting, I think, than "oh great, I have a row." It's useful for things like a graph database, where you want to track both the outgoing side of the node and the inbound side of the node for query purposes — it's really good to have some of those guarantees around a distributed transaction, making sure that the system will follow through with those writes.
Consistency? Probably not, really. When we think about consistency, we think about foreign key constraints and global consistency — global uniqueness requirements and things like that — and I'm going to go ahead and say, and this is my opinion, that it's probably not really feasible. It is in some ways, but again, it would only be for something like "this string doesn't exceed 255 characters" — something that's relatively easy to provide because it doesn't require information from other nodes.
Durability doesn't really apply per se — maybe distributed durability would be writing to multiple nodes — but really, ACID is kind of a poor choice of term for a distributed transaction. We'll kind of muscle through it anyway. In 1.2 they added atomic batches. I was talking to Matt Dennis, who's with DataStax, and yes, you could have accomplished this before, but it's nice that Cassandra takes care of it for you and actually deals with the edge cases.
It kind of works like this: you have a client, and it sends a batch to one of the nodes in the Cassandra ring. That node will actually replicate that batch log to another node in the ring before going forward, and then it'll do the write — write out that batch to all the nodes that it needs to write it to.
The main point of this is that it resists coordinator death. That's the hardest part of implementing this: if the coordinator dies — but let's say you've already sent the batch from the client — how do you guarantee that it didn't just fail in the middle, leaving you with partial writes and things like that?
This is really useful if, for example, you're writing indexes of your data — you have a main column family that has all of your data in it, and then you want to query it in different ways. That's the Cassandra way to do your data modeling, and it's a lot harder if you have no way of knowing that the transaction is going to finish all the way out.
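For reference, a logged batch from the DataStax Python driver looks roughly like this; the keyspace, table, and column names are made up for illustration, and only the BEGIN BATCH / APPLY BATCH construct is the Cassandra 1.2 feature being described:

    from cassandra.cluster import Cluster

    # Hypothetical schema: a "photos" table plus an index table queried by user.
    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    # A logged batch: the coordinator writes the batch to a batch log that is
    # replicated to another node before the individual writes are applied, so
    # either all of these writes eventually happen or none of them do -- even
    # if the coordinator dies partway through.
    session.execute("""
        BEGIN BATCH
            INSERT INTO photos (photo_id, owner, caption) VALUES (1234, 42, 'sunset');
            INSERT INTO photos_by_user (owner, photo_id) VALUES (42, 1234);
        APPLY BATCH
    """)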
I was actually pointing at the wrong thing, clicking on the slide — kind of a rookie mistake; I really meant to point at isolation there. Basically, there's this paper that came out of Berkeley. There's this guy named Peter Bailis, who actually wrote the predict-consistency code, I think, in the newer version of Cassandra, which can do a prediction of exactly how eventually consistent your data is and sort of predict when you'll reach the consistent state. He's really smart.
Basically, the guarantee is that data won't appear — you can't read it — until it's live on all the replicas, and that's cross-row. So if you need to write, say, two different sides of the same data set, but you don't want any of it to appear until everything is done, this is a really valuable concept. Again, it's still a concept; the paper came out something like a month and a half ago, so people are still noodling on it.
You know, it's defined among a replica group — so is that a data center? I don't know. It'll be interesting, once this paper starts to make the rounds within the Cassandra community, to hear how people think about implementing this in the system, if at all. I'd be really excited to see it. It does still require convergence, though. Ultimately, the property of Cassandra that makes it able to survive outages and repair itself is the ability to take conflicts and always converge them reliably. You are never in a state in Cassandra where you don't know how to get back to a stable state — unlike, you know, the split-brain issues you have with master-slave systems and such. So, cool. Hopefully this didn't put you guys too much to sleep.
I had a feeling that would be the first question. So, we use Cassandra for high-write, low-read-rate data. Most of our data — our social data, like likes and comments and information about photos and things like that — is stored in Postgres, just a sharded Postgres setup. It's been that way since sort of the beginning. There have definitely been talks of moving things over, but nothing's really solidified, because you only have a certain number of hours in the day to get things done, and as the data size grows it becomes definitely non-trivial to migrate data over and to do it in a way that makes sense. We're definitely growing our usage of it, though; it's just a matter of finding the right use cases and pairing it with them.
Specifically, the types of data we use it for are things like security logs, auditing, spam fighting — basically stuff that we want to make sure we have and keep, where we don't worry about performance problems or outages, and it's very low maintenance for us. Actually, we had a machine die on us yesterday, and I kind of didn't even realize it until I looked back through my pages and saw it — everything was fine even before replacing it.
No, most of the data we have there is designed to be permanent, unless we do a permanent deletion of the data — like, for instance, if a user deletes their account, we have to go through and scrub all of that data — but the idea is to keep this data around forever. So if you have any more questions about that, I'm happy to talk about it as well.
So, quorum in Cassandra does guarantee that, for a given row, if you read and write with quorum, you'll basically have read-your-writes consistency: if you write something, you'll read it back. What it doesn't guarantee is that if you have multiple rows inside of a batch that are on different nodes — which is very likely, given the distribution of a cluster — you'll actually see those changes in an isolated fashion.
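The per-row quorum guarantee being described is just the replica-overlap condition; a minimal sketch of the arithmetic (illustrative, not driver code):

    # With N replicas, writing to W of them and reading from R of them
    # guarantees the read overlaps the write whenever R + W > N. QUORUM reads
    # and writes (a majority on both sides) always satisfy this for a single
    # row, which is the "read your writes" behavior described above.
    def read_sees_write(n_replicas, w, r):
        return r + w > n_replicas

    print(read_sees_write(3, 2, 2))  # True:  QUORUM + QUORUM on RF=3
    print(read_sees_write(3, 1, 1))  # False: ONE + ONE can miss the write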
So say you make updates to two different rows that are on completely different partitions. The innovation with HAT is that it basically allows you to only see those updates when everything is finished with that transaction. It's truly transactional: you say begin, you submit things to push to the cluster, and then you commit — and then the changes would only be seen after everything has percolated, so everybody gets all of it instead of just certain rows.
ZooKeeper is built on, basically, a protocol called ZooKeeper Atomic Broadcast, which is built on top of Paxos. The Paxos that I explained was very, very simplified — it's sort of the simplest version to implement; there are several versions that are more complicated and can avoid some latency problems. ZooKeeper, specifically, only really…
So the question is: how do you model your data when you can have, for instance, a network split? It really is extremely use-case dependent, and it depends on your business decisions too. This stuff basically has to be communicated all the way up, because if people don't understand that you have to make that fundamental trade-off, they're going to make bad decisions — they're going to ask you for things like one hundred percent perfect uptime with global consistency, which is impossible.
So it's important to educate people in your business about how these things work, and then try to come up with contingency plans for what to do when there are problems like this. A good example is when Amazon oversells their inventory. Now, I'm not sure whether that's because of eventual-consistency problems or not, but they've certainly done it before — I've been there. They send you an email and say sorry, and here's a five-dollar credit or something like that.
I mean, that's a way to solve that problem, and it becomes sort of an actuarial problem. You kind of have to create a formula: if we're down for this long, how much is it costing us, versus if we maybe oversell a few things or have to give people five-dollar gift cards? It's kind of that choice.
We use S3, and we have experimented with using Cassandra for it, but I think S3 has just been perfectly fine for us. It's actually been really kind of incredible how well it's worked for us, so there's not a lot of motivation — the motivation would be to fix things that are breaking, broken, or sub-optimal, and S3 is definitely not one of those things. But I have sort of experimented with it.
Yeah, so we use flash exclusively for our Postgres. We don't use it for Cassandra, because Cassandra is, again, high-write, low-read data. We push tens of thousands of writes — probably more now — through Cassandra, whereas we may do a few hundred reads per second. So really, the write performance is great; we just don't need flash for that — that workload is easy without it.
As far as the flash drives for Postgres go, I don't think we've done any benchmarking that would make sense to anybody else. It's worked really well for us. Basically, with a relational database, in order to get any kind of reasonably good performance out of it, you need a giant drive array, or you need a lot of RAM, or you need SSDs. So, I mean, that's it.
Nope.

I've heard a lot of suggestions about how it is possible to achieve multi-row consistency — mostly theoretical mechanisms. So do you care about that? How do you do it — do you use ZooKeeper as a transactional layer on top of Cassandra? What does Instagram do?
Basically, if you read the relationship in one direction, you rewrite it in the other direction, checking to make sure it's there — similar to how, in Cassandra, you can set a ten percent read repair for replicas — and this would actually be across different rows, not just within a partition. Does that make sense at all? Maybe — okay, maybe we can discuss this offline.
Hey, hi — first time, long time, big fan. Paxos over high-latency links, or if you're doing it multi-region — I don't know, yeah.
I mean, it can — it depends. Paxos is kind of more an idea than, like, an implementation of an idea. There's Multi-Paxos and Generalized Paxos, and then there's ZooKeeper, which is its own version of it. I would say it's generally recommended not to run ZooKeeper over unreliable WAN links. I mean, the whole idea is that you can resist small sets of network failures, but ultimately,
if the link between your two data centers goes down — which is much more likely than a failure inside of your data center — then you're in this state where part of your system is basically not very functional. It also gets weird, because you have to decide how to spread things out: if you have two data centers, how do you tie-break that, right? Some people don't have a third data center to always put something in, for instance.
Typically, though, people will set up sets of ZooKeepers in different data centers — if they're doing something like consistent configuration, they'll set up a ZooKeeper installation in each data center and basically use the local one in that data center. I think that's a good pattern.
They interleaved — they didn't guarantee atomicity — and then I did some research on that and found the problem was probably the client package we use, maybe an older version, which assigned a different timestamp within the batch. Okay, so that's a specific problem. Then I got to thinking: if we have two different data centers, or even two different JVMs, it's very possible to assign the same timestamp — then that will break atomicity, right?
It won't, because all that's happening then is that Cassandra's conflict resolution is sorting that out. All the atomicity guarantee says is that that batch of operations is applied, or it's not; the conflict resolution is actually a completely separate protocol on top of that. Basically, those writes will still happen, but then, once Cassandra actually performs the read, it'll see that one timestamp turns out to be higher than the other ones and resolve it that way. Yeah.
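A minimal sketch of that separate conflict-resolution step — last-write-wins on the column timestamp (illustrative only):

    # Each write carries a timestamp; when replicas hold different versions of
    # the same column, the one with the highest timestamp wins on read, and the
    # comparison falls back to the value itself on an exact timestamp tie, so
    # replicas still converge deterministically.
    def resolve(versions):
        # versions: list of (timestamp, value) seen on different replicas
        return max(versions)[1]

    print(resolve([(1700000000123, "alice"), (1700000000456, "bob")]))  # 'bob'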
Everything would still eventually reach the same result, though — that's the idea: every replica is eventually going to receive the same result through the anti-entropy system. It may not be now, it may not be within one millisecond, but it will be eventual.
Right, right — so you're asking whether, if one update sees a different update that was there before, given how those converge, they will still be applied in order? In a way — ultimately everybody's going to see the same set of updates; that's the idea with a Dynamo-based system.
Well, I would have argued that it should be expected — it may be surprising, but it is the way that it works. Eventually, the highest timestamp is going to win for that given column. And, like you said, there are ordering rules — the delete will get ordered before that — but again, that's only if that delete operation is there. That's why tombstones, for instance, stay around for ten days: it gives time for things to catch up eventually.
Have you ever run into anybody using — I know that sometimes in high-frequency trading systems they have special atomic clocks that you can hook into your system to get really tight bounds, you know, since clocks drift using NTP in a local data center — do you have any experience with that? No.
I think it's impossible to guarantee that timestamps always go forward without having global coordination. Obviously, something like the TrueTime API would be a way to do that effectively — the global coordination is effectively that all of those systems proceed in the same fashion. But even the TrueTime API they have in the Spanner paper, which actually talks about this exact problem in detail, can't necessarily guarantee that you'll end up —
well, it can't guarantee that you won't still have conflicting writes. You have to basically have a way to converge that data together, whether it's last-writer-wins using timestamps, using vector clocks, or using immutable data that doesn't actually have to converge. And Spanner, again, is still dependent on having open communication available — no partitions — for writes.
Actually, the stuff I was talking about earlier, the HAT paper, came out of the same research group that released this thing called PBS, which is basically a probabilistic way to determine the window of consistency based on network latency and, you know, statistics that you collect. Cassandra actually has a way to do this with nodetool now: you can ask nodetool to predict consistency, give it some parameters, and it will actually give you some kind of prediction. It's not perfect, it's not a guarantee, but it is probabilistic, and it can tell you the answer within a certain percentage of certainty.
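To give a flavor of what a PBS-style prediction does, here is a toy Monte Carlo sketch: given assumed write-propagation delays, estimate the probability that a read issued t milliseconds after a write sees the new value. This is an illustration of the idea, not the actual nodetool implementation, and the delay distribution is made up:

    import random

    # Toy model: a write reaches each replica after some random delay, and a
    # read at consistency level ONE picks one replica at random. PBS does a
    # much more careful version of this using latencies measured in the cluster.
    def p_consistent(t_after_write_ms, n_replicas=3, trials=10000):
        hits = 0
        for _ in range(trials):
            # Hypothetical propagation delays (ms) drawn from an exponential.
            delays = [random.expovariate(1 / 5.0) for _ in range(n_replicas)]
            replica = random.randrange(n_replicas)
            if delays[replica] <= t_after_write_ms:
                hits += 1   # the replica we read already has the write
        return hits / trials

    for t in (1, 5, 20):
        print(t, "ms ->", round(p_consistent(t), 3))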
One more — so, how do you feel about CAP and that probabilistic determination, versus the systems that say they have an ordered set of operations, once you take into account real disk failures and outages? Do you think probabilistic consistency on average actually does better than those systems that claim to be really consistent? When you think about disks getting corrupted and other scenarios, like bit rot, are eventually consistent — or distributedly consistent — systems better at that problem?
From that standpoint, how is the replication going to keep up with not just the most recent copy of the record, but versions of that record as it moves forward in time? Does that sort of change the whole ACID discussion about locking records, and locking a version of a record — and is there any consideration in this replication for that kind of topology?
Like, when we're talking about Cassandra, we're talking about records that are being kept at multiple nodes, right. So there's this whole other concept at play when we're talking about temporal data — versions of a record for an interval of time, right. So in that respect, a lot of the issues that arise around locking and ACID for that record come from also considering the version of the record that has to be locked internally in the engines.