Description
Speaker: Sameer Farooqui, Freelance Big Data Consultant and Trainer
Slides: http://www.slideshare.net/planetcassandra/cassandra-vsh-base
Have you wondered what actually happens when you submit a write to Cassandra? This vendor-agnostic technical talk will cover the internals of the read and write paths of Cassandra and compare them to other NoSQL stores, especially HBase, so you can pick the right database for your project. Some of the topics covered are consistency levels, memtables/memstores, SSTables/HFiles, bloom filters, block indexes, data-distribution partitioners, and optimal use cases.
Hello, everybody. My name is Sameer, and I'm a freelance trainer and consultant for Hadoop, Cassandra, and sometimes OpenStack. In the past year I've taught about 50 classes throughout the country and in India, and usually in my classes, when I talk about Hadoop, towards the end of the class I get a very common question: okay, now that we know about HBase and Hive and Pig, how does that compare to MongoDB? Why would I want to use Cassandra instead of HBase? So what I wanted to do in this presentation is talk specifically about why Cassandra is different from the other competing NoSQL architectures.
So, a really quick question: how many people here have either installed Cassandra, or done a query to retrieve data in and out of it, or are familiar with the architecture, like what an SSTable or a memtable is? Okay, cool, so about 50. What about the same question for HBase? Okay, about 10 percent.
It's really hard to know which database to choose for your solution, and every vendor claims that their benchmark is better than their competitors'.
So the way I think you should approach this problem is to first try to identify which one of these categories of databases is correct for your use case; after you've decided upon a category, then selectively choose a specific database. I'll try to tell you, in a vendor-agnostic way, what I tell my clients about how to do this. Three to four years ago it was the wild, wild west of NoSQL databases, and it was really hard to tell how to pick one.
There were no books, no YouTube videos, and no benchmarks. But these days I think there are clear leaders emerging in each one of these categories, and there are databases that are dying in these categories. So once you pick a category, you should really pick one of those leaders as opposed to one of the abandoned databases. All right, so first let me tell you about these categories, and you can be thinking about which use case is correct for you. We'll talk about key-value first.
The key-value pairs are stored in a container called a bucket; the terminology differs across databases, but that's a common term. You only have consistency per key, so you can't put two key-value pairs in and, if one of them doesn't get committed, pull them both back out. Atomicity is per key.
But in these databases you can actually put any type of data you want in the value. The value is not examinable, so you can't say, "give me back all of the keys where the value equals the color blue." You can only say, "for this key, give me the value," and the value can be an object, a blob, JSON, or XML. In some of these key-value databases you can also attach a metadata item to the value. So for a name, you could attach a metadata item saying, for example, "this is the age of the person." That's almost getting into key-document territory, but think of it as key-value for now.
The use cases for this type of database are things like content caching, web session information, user profiles, preferences, and shopping carts. Here's an ideal use case: if you're developing a website, when somebody types in their username and password to log in, maybe you have customization on the front end where the background color is specific to different users, or the language is different, or the time zone settings affect how the web page is rendered. In that case, as soon as a person logs in, you want to retrieve their preferences for your site very quickly and dynamically render that HTML. So these are the cases where you want a very quick, simple lookup.
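The login-preferences pattern just described can be sketched in a few lines. The `KVStore` class below is a hypothetical in-memory stand-in, not any particular vendor's API; real key-value stores expose the same get/put shape through their own client libraries.

```python
class KVStore:
    """An in-memory stand-in for a key-value database with named buckets."""

    def __init__(self):
        self._buckets = {}

    def put(self, bucket, key, value):
        # Atomicity is per key: each put touches exactly one key.
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key):
        # The only supported lookup is by key. The value is opaque --
        # there is no "find all keys where value == X" query.
        return self._buckets.get(bucket, {}).get(key)

store = KVStore()
store.put("user_prefs", "alice", {"bg_color": "#222", "tz": "US/Pacific"})

# On login, one fast lookup fetches everything needed to render the page.
prefs = store.get("user_prefs", "alice")
print(prefs["bg_color"])  # -> #222
```

Note that the store can only answer "what is the value for this key?", which is exactly the limitation described above.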
Key-document is the other category. MongoDB and Couchbase are probably the really popular ones here. This is kind of like a key-value database, but the value is examinable. A very common misconception is that in a key-document database you're going to store documents like Word documents or PDFs. That is not the case. The documents tend to be XML or JSON or BSON; usually I see JSON, and the structure of the JSON or the XML does not have to be identical across the different keys. They should be similar, but they don't have to be identical. So here is a use case that I think exemplifies key-document databases.
Let's say that you are developing software that is going to profile children in refugee camps in Africa. You basically go to the camp, you interview children, and you find out as much information as you can, in hopes of reuniting each child with the child's parents. So in this case we have a key that I kind of randomly made; it's called 1. I know the first name, last name, and location of this child. I know that she speaks two languages. I know the mother's and the father's names, I know which camp they're in, and I have a blob picture that I put in. Now for another child, though, perhaps I don't know all that information; this other child is younger. So for Dee, I only know that she knows some Swahili. I don't know who her parents are.
So you see how the two schemas are somewhat similar, but they're not identical, and these types of databases are very tolerant of incomplete data. Later on, maybe somebody else in the camp will recognize Dee and tell you, "hey, by the way, I do know who her parents are," or, "I know some more information about her." In that case, you can go back to that value and append only what you want to it.
So the use cases here are things like event logging, content management systems, blogging platforms, or web analytics. And keep in mind that in this case you don't have to pull the entire value out; in a lot of these databases you can reach into the value and pick out only the specific things you want.
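The incomplete-data and reach-into-the-value behavior described above can be sketched like this. `DocStore` is hypothetical and illustrative only; real document stores such as MongoDB provide equivalent projection and partial-update operators through their query languages.

```python
class DocStore:
    """A toy key-document store: values are dicts you can examine and extend."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key, fields=None):
        # Unlike pure key-value, we can project out just the fields we want.
        doc = self._docs[key]
        return doc if fields is None else {f: doc[f] for f in fields if f in doc}

    def merge(self, key, partial):
        # Tolerant of incomplete data: append new facts as they arrive.
        self._docs.setdefault(key, {}).update(partial)

store = DocStore()
store.put("1", {"first": "Amina", "languages": ["Swahili", "French"]})
store.put("2", {"first": "Dee", "languages": ["Swahili"]})  # sparse doc, parents unknown

# Later, someone recognizes Dee -- append only the new information.
store.merge("2", {"mother": "Grace"})
print(store.get("2", fields=["mother"]))  # -> {'mother': 'Grace'}
```

The two documents share some fields but not all, matching the "similar but not identical" schemas described above.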
You don't want to use these types of databases for any complex transactions that span different operations, and you can't really have a strict schema here. If an item is missing, like in the key-value pair on the right, then there's no null stored there; it's just basically blank. So sometimes when you query it, you'll get nothing back.
In a graph database, a node has properties attached to it. Here I have a phone number attached, GPS coordinates for where the phone currently is, and the IMSI number, the international mobile subscriber number; and here's another node with a different phone number. So a node is basically like an instance of an object in an application. The next concept in graph databases is edges. Edges are also called relations. So here's an edge; they can be unidirectional or bidirectional, and a call is made from one phone to another. So I called that number.
The 415 number, and here is another node with a unidirectional edge to it; and now here is a node that is contacting me, so it's in the other direction. You can see that these edges have properties, either "called" or "texted," and here's a little bit more. You can add more properties to these edges. So maybe I want to keep track of how these edges were initiated: perhaps these were initiated by mobile phones, and this one edge in the middle was made from a landline. And here are some more properties.
You can add, say, the duration of a call to the edge, and the time the call was placed. So the use cases for this are connected data like social networks, shortest-path algorithms, recommendation engines, routing and dispatch, and location services. You don't want to use these for really large-scale examples.
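The property-graph model just described (nodes and edges both carrying key/value properties, queried by traversal) can be sketched as below. This class is a toy for illustration; real graph databases such as Neo4j expose this model through query languages like Cypher.

```python
class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> properties
        self.edges = []   # (src, dst, properties); direction matters

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, dst, **props):
        self.edges.append((src, dst, props))

    def neighbors(self, node_id):
        # A one-level traversal: whom did this node call or text?
        return [dst for src, dst, _ in self.edges if src == node_id]

g = PropertyGraph()
g.add_node("me", phone="415-555-0100")
g.add_node("n2", phone="415-555-0199")
# Edge properties record how the contact happened, as in the talk.
g.add_edge("me", "n2", kind="called", duration_sec=120, via="mobile")
g.add_edge("n2", "me", kind="texted", via="landline")

print(g.neighbors("me"))  # -> ['n2']
```

A deeper traversal ("people twice removed") would simply apply `neighbors` repeatedly, which is the kind of query FlockDB, mentioned below, cannot do beyond one level.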
Titan is a graph database on top of Cassandra, and InfiniteGraph is another graph database that can scale beyond one node. But the connections between the nodes and the relations are pretty complex, and these things usually run on one machine. Neo4j is probably the most popular one here. FlockDB is another one, but FlockDB only lets you query one level deep, as opposed to Neo4j and other graph databases, which let you do a traversal, which is what querying the graph is called.
You can do a traversal that says, "give me all of the people who are twice removed from the central node." Next, a much more emerging category is real time, and there's a little squiggly before it because technically the category here is near real time. In this example I'm talking about Storm. The way Storm works is that it's used to analyze data in motion, before you actually even commit it to the hard drive. So the data comes in from a spout.
Think of this like an application that you write that connects to the Twitter firehose streaming API and is downloading tens of thousands of tweets per second into this pipeline coming into your data center. Then, in the very middle, next to the spout, you have a bolt. A bolt is basically consuming the stream, and it is picking out the hashtags every time a tweet comes in with a pound symbol.
In a word, it's picking that out and basically counting the hashtags that are coming in: every time it sees a specific hashtag, it increments a counter for that hashtag and then stores it in a database down there. So it's taking the incoming stream and making two streams out of it. One of the streams continues onward into a database where you're storing the full tweets, but the hashtags are being aggregated, or counted, and they're going into a separate database.
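The spout/bolt split just described can be sketched in plain Python rather than Storm's actual API: one incoming tweet stream becomes two output streams, the raw tweets and the aggregated hashtag counts.

```python
from collections import Counter

def bolt(tweet_stream):
    """Consume a stream of tweets and fan it out into two streams."""
    hashtag_counts = Counter()
    full_tweets = []
    for tweet in tweet_stream:
        full_tweets.append(tweet)              # stream 1: the full tweet, stored as-is
        for word in tweet.split():
            if word.startswith("#"):
                hashtag_counts[word] += 1      # stream 2: increment the hashtag counter
    return full_tweets, hashtag_counts

tweets = ["loving #cassandra today", "#cassandra vs #hbase", "just lunch"]
stored, counts = bolt(tweets)
print(counts["#cassandra"])  # -> 2
```

In a real Storm topology, each return value would feed a separate downstream database, as the talk describes.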
These things can get pretty fast, but they're very much emerging; I think most of these are in alpha or beta stages, and there are almost no books written on them. Impala and Stinger are a little bit different from Storm: in Storm you're capturing the data before it even goes to the database, but in Impala and Stinger you basically do near-real-time queries on data at rest that's already in a database.
Next we come to column family. The main concepts here are that on the left you have rows; those are the row keys, and in a database like HBase you're actually going to group certain row keys, ranges of row keys, into something called a region. So in HBase, region one, rows one through three, will be stored on a specific slave machine, say machine 10, and the next region...
Rows four through six will be stored on another machine, maybe machine 20 in your HBase cluster. Okay, so region one is those three rows and all of the column families, all three column families for those three rows. So for row one, for example, you can see that there are three column families. A column family is basically just a bunch of related columns. The reason you want to be thinking about how you do the data model here, and why you would want to use multiple column families...
...is that you want to put data that you're going to query together into the same column family. So you kind of don't want to do queries that span column families, because each column family is essentially written into a file by itself. So over here we actually have two regions; let's say this is HBase. Region one is on one machine; region two is on a different machine.
Region one, on the first machine, is going to have three files, basically one for each column family, and the other machine will have three files, one for each column family. And if you end up doing a query across column families, which these databases allow, you're going to have to read across at least two files, and it could potentially be slower. Another reason to be thinking about column families is that you usually configure these databases per column family.
So you can put data that's highly compressible in column family one, and maybe when you query that, it'll be a little bit slower because you've got to decompress; column family two is maybe where you want the really high-performance stuff, so you don't compress that. You can do this in something like HBase and Cassandra.
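The layout just described (row keys grouped into regions, and each column family within a region persisted separately) can be sketched as below. The structure is illustrative only, not HBase's actual on-disk format, and the fixed region size is an assumption for the example.

```python
REGION_SIZE = 3  # rows 1-3 -> region 0, rows 4-6 -> region 1, ...

def region_for(row_key: int) -> int:
    """Map a row key to the region (range of row keys) that holds it."""
    return (row_key - 1) // REGION_SIZE

# region -> column family -> {(row, column): value}
storage = {}

def put(row, family, column, value):
    region = storage.setdefault(region_for(row), {})
    cf_file = region.setdefault(family, {})   # one "file" per column family
    cf_file[(row, column)] = value            # absent cells simply aren't stored

put(1, "cf1", "name", "alice")
put(1, "cf2", "raw_html", "<html>...</html>")
put(4, "cf1", "name", "bob")

print(sorted(storage))     # -> [0, 1]            (two regions, two machines)
print(sorted(storage[0]))  # -> ['cf1', 'cf2']    (separate per-family "files")
```

A query touching both `cf1` and `cf2` has to read two separate structures, which is why cross-family queries can be slower, as noted above.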
Also, these cells are versioned. Now, this is kind of internal to these databases; you shouldn't really be using these versions. You can do it if you want, as there are exposed Java methods, but it's not recommended; it's more of an internal thing within the database. A lot of people do it in HBase, I guess; less so in Cassandra, where it's really discouraged. But you can.
You can version your data, and when you do the versioning, you can set the version count per column family. So column family one is compressed, with just one version maintained; column family two, we're going to keep three versions and not compress it, and if you add a fourth version, it'll overwrite the oldest version. So some people describe the data access here as kind of like key-value; you can think of a key, see that down there.
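The per-column-family versioning described above can be sketched like this: each family keeps up to `max_versions` timestamped values per cell, and adding one more evicts the oldest. This is illustrative only, not HBase's real API.

```python
from collections import deque

class ColumnFamily:
    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.cells = {}  # (row, column) -> deque of (timestamp, value)

    def put(self, row, column, value, timestamp):
        versions = self.cells.setdefault(
            (row, column), deque(maxlen=self.max_versions))
        versions.append((timestamp, value))  # when full, the oldest falls off

    def get(self, row, column):
        # A default read returns only the newest version.
        return self.cells[(row, column)][-1][1]

cf2 = ColumnFamily(max_versions=3)       # like "column family two" above
for ts in range(4):                      # four writes, only three kept
    cf2.put("row1", "status", f"v{ts}", ts)

print(cf2.get("row1", "status"))          # -> v3 (newest survives)
```

The fourth write silently displaced `v0`, mirroring the "a fourth version overwrites the oldest" behavior described above.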
There are no nulls stored here, so one of the good things about these data models is that, say, in column family one I have four columns, but maybe for row one I don't have the data for those columns. I don't put anything there; there are no nulls there. Nulls have a real cost on disk in relational databases, but here there are no nulls: I only have x and y, so it's not putting nulls all over the place.
So it's good for sparse data models; perhaps later on I'll know what the other data columns are supposed to be, and I can populate them in there later. Other things you want to know about these column family databases: you should know your queries up front, and the read and write queries will determine how you're going to do the data modeling. That matters if you're a startup and you think you're going to be changing your queries around all the time.
With a relational database you could just do the join, right? But in these databases you really want to understand the architecture, at least to the level where, when somebody gives you their data model, tells you what the architecture of the system is like, and says, "this is a query I'm going to do," a good HBase developer or administrator will be able to say things like, "well, you know what, I'm not going to allow you to run that query, because that query is going to put a lot of I/O pressure..."
"...it's going to take too many disk seeks." Okay, that's kind of the background on column family databases. But if you think this is the kind of database you want, how do you pick one? I gave you about six examples on that first slide. Well, I think you should pick it based on three different things. The first thing is: where are these databases being adopted? So, this is indeed.com.
It's a job posting website, kind of like Monster or CareerBuilder, and I typed a bunch of column family databases into it. We can see very clearly that if you pick Accumulo or HPCC or Hypertable, you're going to be in a very small club. If you pick HBase or Cassandra (the orange is Cassandra and blue is HBase), you can see that they're really getting adopted, and they're really close to each other. So this is basically saying that, well, you can't take the raw search numbers at face value.
So you want to actually go to the left side of this page, and you can narrow it down to the computer and technology field, and then it'll do a somewhat better job. But even here you can see it's saying that Cassandra was getting a bunch of queries under databases, or computers and technology, but Cassandra didn't exist in 2005, right?
So you should maybe be thinking that up to this level right here, it might just be other people searching for other Cassandra-related stuff, and that's kind of a skew; this stuff down here is kind of skewed, I think. These databases were invented around this time; this is when the world found out about them, and this is when queries started to happen. And I was trying to figure out what this spike was.
This is around the time Cassandra became a top-level Apache project, so maybe people just put out a bunch of press releases.
This is showing Google Trends search volumes in different countries. I found it kind of interesting that, first of all, there's something happening in South Korea: they're doing the most Hadoop, Cassandra, and HBase searches, which is really interesting. And then this left one is Cassandra; it's adopted more in India. And over here you can see that HBase is adopted more in China, and India is kind of after Taiwan down here. So China prefers HBase, and India seems to prefer Cassandra.
Another thing you can do to decide which database you're going to use is your due diligence on the Apache user groups. This is the activity on the Apache user mailing lists. You can see that in May 2013 there were about 567 messages on the Cassandra mailing list, and HBase is somewhat similar. So these two databases are kind of neck and neck right now. So how would you actually go a little bit deeper to pick one?
Basically, Facebook took the storage engine out of Dynamo and married it to the data model of Bigtable for their inbox search application, and they open-sourced the Cassandra database and put it into Apache. They kind of somewhat abandoned it, and these days they've adopted HBase much more heavily at Facebook. But Jonathan Ellis, the CTO of DataStax, was the first non-Facebook committer to Cassandra, and he's brought this database very, very far from its humble origins when it was first released.
This is HBase; well, this is actually Hadoop. Hadoop's origins are in 2003: the Google File System paper. About 95% of that architecture is implemented as HDFS in Apache, and the MapReduce whitepaper becomes Apache MapReduce. Google did all of this coding in C++ and released the whitepaper architectures, but no code.
It was up to companies like Yahoo and, later on, Facebook, Twitter, and a bunch of other companies to implement the ideas of these whitepapers in Java, which they found to be a more robust language than C++ because of its garbage collection and handling of memory leaks.
Okay. Both Cassandra and HBase are written in Java; they're columnar databases; they've been proven to scale to more than a thousand nodes in production; they have very low-latency reads and writes, meaning probably under 100-millisecond reads and writes; they use this idea called log-structured merge trees, which I'll talk about in a minute; and they're atomic at the row level. So you cannot lock two rows and shuffle data around; you can only lock one row. If you want to lock two rows...
...you'll have to do it on the application side and then release the lock there. There's no support for joins, transactions, or foreign keys in either of these databases. Now, where they differ: Cassandra is more of a peer-to-peer architecture; HBase is master-slave. All the Google stuff was done with a master-slave design where, if the master node goes down, then everything kind of goes down, but these days there is more high availability in the Hadoop world. Cassandra comes from Dynamo, and Amazon did a purely peer-to-peer architecture, with no masters or slaves. Cassandra has tunable consistency.
HBase is strict consistency; I'll talk about that in a bit. Cassandra has secondary indexes natively available; HBase does not, though there are tricks to be able to do it in HBase. Cassandra writes the incoming data directly to the Linux file system, like ext3 or ext4; HBase writes it down to HDFS. Conflict resolution in Cassandra is handled during reads, and conflict resolution in HBase is handled during writes, and we'll talk about other ideas in a minute. All right, let's just think for a second about why Amazon invented the Amazon Dynamo database.
If you read the whitepaper, their use case was that they wanted to make a shopping cart service that will always allow people to add items to the cart. They never wanted to reject an item that somebody added to a shopping cart on their website, no matter what: no matter whether there are hurricanes taking down data centers, or power outages, or disk failures, or network failures. They needed that write to be successful so they wouldn't lose revenue on that sale.
So if somebody logged into the website from the browser on their computer and added something to their shopping cart, and then they logged in from their iPad and added something else to the shopping cart, perhaps there are conflicting versions of their shopping cart because they're logged into both. Then, when you go to the checkout, Dynamo will read both conflicting versions of the items, merge them together, and give that back to you. You can do different things; you can also...
...if there are conflicts, you can look at them, or you can look only at the timestamps and basically throw away the data with the old timestamp. So you have options for what you want to do. Dynamo was designed for applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.
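The two conflict-resolution options just described for a Dynamo-style read can be sketched side by side: merge the conflicting shopping-cart versions, or keep only the version with the newest timestamp (last-write-wins). The `(timestamp, items)` version format here is an illustrative assumption.

```python
def merge_versions(versions):
    # Union the items from every conflicting replica -- Amazon's cart
    # policy: never drop something a customer added.
    merged = set()
    for _, items in versions:
        merged |= set(items)
    return merged

def last_write_wins(versions):
    # Alternative: trust timestamps and discard the older versions.
    return max(versions, key=lambda v: v[0])[1]

browser = (100, ["book"])          # cart as seen from the laptop browser
ipad = (105, ["headphones"])       # cart as seen from the iPad, written later

print(sorted(merge_versions([browser, ipad])))  # -> ['book', 'headphones']
print(last_write_wins([browser, ipad]))         # -> ['headphones']
```

The merge keeps both items; last-write-wins silently loses the book, which is exactly why Amazon chose merging for carts.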
Other things they used Dynamo for were best-seller lists, customer preferences, sales ranks, product catalogs, and service management. Now, Google designed Bigtable mainly for keeping the entire web crawl cache data. They were basically doing a distributed, parallel wget and downloading the HTML into a column in Bigtable; that was its own column family, with one column holding the HTML files. Then they were basically running a MapReduce job against that column...
...that would derive statistics and metadata out of each HTML file and then populate a different column family with metadata: this row is in the English language; on row number two, it's English, and the geography says this is UK English, and it was last updated at this time. All this information goes into other metadata column families, which are not compressed. But Google compresses that first column family of just HTML; it is highly compressible because it's all text data.
At one point in time Google used Bigtable for many different products: Gmail, YouTube, Google Earth, Google Finance. And even though HBase is almost a clone of Bigtable, in recent years the architectures have started to diverge. The Bigtable whitepaper came out in 2006; we're now six or seven years away from that, and even though the early prototypes were a clone, now it's becoming a little bit different. Google also designed Bigtable to be able to handle a variety of workloads. Sometimes they wanted to do high-throughput batch-processing jobs, and other times they wanted to do really latency-sensitive serving of data for a real website. So it's kind of a diverse database.
In Cassandra, each node gossips about who it's gossiped with in the past. So basically, if node 4 gossips with node 8, node 4 might tell 8 a bunch of stuff, like, "hey, by the way, a few seconds ago, I, node 4, talked to 3, and actually he didn't respond back." So 8 will note that. And then maybe node 5 will talk to 8 the next second, and 5 will gossip with 8 and say, "I was able to talk to 3, but 4 was not able to talk to 3."
So node 3 is suspicious. But see, this is a more advanced way of determining the health of nodes than a simple heartbeat, where, if a heartbeat doesn't come, the node is probably dead. In this case it's more interesting, because with the gossip, maybe just the network connection, or the switch, between 4 and 3 was having a problem, but 5 to 3 was fine. So node 3 is not down, and node 8 can trust that 3 is still up; it's just that 4 can't get to it. The way the actual failure detection happens...
...is with this Japanese protocol called phi accrual failure detection, which allows you to set a suspicion level on a continuous scale to judge whether nodes are down or not. HBase doesn't do this; HBase is a little bit different architecture.
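The continuous suspicion scale just mentioned can be sketched roughly as below. This is a simplification of the actual phi accrual protocol (which models the full distribution of heartbeat arrival times); here we assume exponentially distributed arrivals, so phi grows linearly with silence.

```python
import math

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []       # observed gaps between heartbeats, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        # With exponentially distributed arrivals of mean m, the chance of
        # hearing nothing for t seconds is exp(-t/m); phi = -log10 of that.
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return -math.log10(math.exp(-elapsed / mean))

d = PhiAccrualDetector()
for t in (0, 1, 2, 3):            # heartbeats arriving once per second
    d.heartbeat(t)

# Suspicion is continuous: the longer the silence, the higher phi climbs.
print(d.phi(4.0) < d.phi(13.0))   # -> True
```

A node is declared down only when phi crosses a configured threshold, rather than on a single missed heartbeat.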
So HBase has a lot of moving parts. There are master machines and slave machines. The green name node is the master of the file system, and you can see that the master of the file system has a slave component, called the data node, on each node.
The data node heartbeats to the name node every three seconds, and each data node is telling the name node, "I'm up and running, and these are the things happening from a storage, and perhaps a network, perspective on me." Then there's the HBase master; you have a standby HBase master as well, and there's a standby name node. The HBase master actually does not heartbeat with the region servers. The slave component on each slave machine here is called a region server.
A region server is basically storing maybe 20 to 100 regions. Remember, in the previous slide, rows 1-3 were a region, and that region was actually made of three column families with three files coming out. Well, you basically put something like 10, 20, maybe 100 regions on each region server.
All right, you have to have ZooKeeper as well, and ZooKeeper should be run on an odd number of machines so that leader election will be able to work. If you have four machines, two can say one thing and two can say another thing, and that can cause problems; so run an odd number of machines: three, five, or seven.
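The odd-ensemble rule just described comes down to simple arithmetic: leader election needs a strict majority, and the fourth machine in an even ensemble buys no extra fault tolerance.

```python
def majority(n_nodes: int) -> int:
    """Votes needed for a strict majority in an n-node ensemble."""
    return n_nodes // 2 + 1

for n in (3, 4, 5):
    print(f"{n} nodes -> need {majority(n)} votes; "
          f"tolerates {n - majority(n)} failure(s)")
# 3 nodes -> need 2 votes; tolerates 1 failure(s)
# 4 nodes -> need 3 votes; tolerates 1 failure(s)   <- extra node buys nothing
# 5 nodes -> need 3 votes; tolerates 2 failure(s)
```

With four machines, a 2-2 split reaches no majority at all, which is exactly the "two say one thing, two say another" problem above.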
You can start with three. Sometimes you'll also need MapReduce, if you want to do full table scans, so now you need a JobTracker as well. There's no high availability for the JobTracker right now, and now you have a TaskTracker running on each machine, which will launch daughter JVMs for maps and reducers to actually run the table scans. So you have a lot of moving parts here.
You have a whole bunch of master machines if you want to be able to run this in a highly available manner. On the slave machines, each one of these components is a JVM running with its own heap, and the TaskTracker is launching a bunch of daughter JVMs, the maps and reducers; and you've got ZooKeeper up there. So learning this architecture is going to take more time. It took me something like six months to a year to learn HBase, but maybe only a couple of months to get comfortable with Cassandra.
Heartbeats from the region servers go to ZooKeeper in HBase, and then ZooKeeper will tell the HBase master if there's a problem with one of the slave machines, and then the master will make the decision to actually do something to recover from a region server's failure. Or maybe just one disk crashed and some regions got lost; the HBase master will help take care of that problem.
So, I mentioned effort to deploy; that's just some more information on that. Really quickly, I'm going to talk about how a write happens in HBase. You have a client there. The client wants to write row 500, some column in row 500, but the client doesn't know where the region is being hosted on the slave machines. So the client contacts ZooKeeper and says, "where is the root table?" ZooKeeper says the root table is on this machine over here.
The client goes to that machine, pulls the root table out, and now the root table is on the client. That's what the root table looks like; it doesn't show where row 500 is. Basically, you go to the root table and ask, "where's row 500?" and it says, "well, for that you need to go to META2, this other table, which should know where the region actually is." So now the client queries the root table for the META table...
...that knows where row 500 is, and the root table says, "it's over there; that's the META2 table." Then the META2 table is basically pulled into the client as well, and the client queries the META table and says, "okay, where is the region that has row 500?" The META table tells it, and then the client goes to the specific region server that is hosting the specific region, out of hundreds of regions, for row 500 (and maybe also the rows above and below it), and it actually sends that write to that machine. Then, what happens underneath HBase is that HBase just sends it from the HBase JVM down to the data node JVM for HDFS, and then it's the HDFS file system's responsibility to do a synchronous replication from that first machine to another machine in your Hadoop cluster, and then another replication happens.
Well, no; it happens in a pipelined way. HBase just writes to one machine; it's HDFS's responsibility to take that write and replicate it, in a pipelined manner, to two other machines to achieve the replication factor of three.
So basically, after that happens, the HBase region server there will tell the client, "okay, I'm done with the write." So it's a synchronous write that just happened underneath HBase; there's no control over the replication for each write, or the consistency for each write.
You can just set that for the entire thing. And these root and META tables are cached, so the client is never going to have to go back to ZooKeeper for the root table, and eventually it will learn about all the META tables; in the future, when the client does a write, it's just going to go directly to a slave machine. It doesn't have to keep going back to ZooKeeper. Cassandra is different. You've probably seen this before: in Cassandra, when the client does a write...
...let's say it does a write with replication factor three and consistency level one. The client will randomly pick a machine, called the coordinator machine. It doesn't have to be node seven, but it is in this case, and the coordinator will say, "okay, you want to write with replication factor three; okay, I'll send it to three different machines."
Basically, the row key gets partitioned with a Murmur hash or MD5, and whatever long number comes out of the hash will determine which machines the write will go to; replicas one, two, and three are now sent to three machines. Consistency level one means that as soon as one machine responds back and says, "I got that write," the coordinator will continue on with the client and report the write successful, and the other two writes are still going to be applied.
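The tunable-consistency write path just described can be sketched as below. The hashing and the ring layout are illustrative stand-ins (Cassandra's actual partitioners and replica placement differ); the point is that all RF replicas receive the write, but the client is acknowledged after only CL of them confirm.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4", "node5"]

def replicas_for(row_key: str, replication_factor: int):
    # Hash the row key to a position on the ring, then walk clockwise.
    token = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    start = token % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

def write(row_key, value, replication_factor=3, consistency_level=1):
    targets = replicas_for(row_key, replication_factor)
    acks = 0
    for node in targets:                # every replica always receives the write
        acks += 1                       # pretend each node acks in turn
        if acks == consistency_level:   # ...but we only wait for CL of them
            return {"acked_by": targets[:acks], "pending": targets[acks:]}

result = write("row500", "x", replication_factor=3, consistency_level=2)
print(len(result["acked_by"]), len(result["pending"]))  # -> 2 1
```

With `consistency_level=1` the client is released after the first ack; raising it to two or three trades write latency for a stronger durability guarantee, which is the flexibility discussed next.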
It's just not holding up the write. But if you want to hold up the write a little bit more, you can. So now we're doing replication factor three, just like before, but we're saying we want to get the write committed to at least two machines, because of some, you know, government or Sarbanes-Oxley requirement, and then, after two machines respond back, we tell the client we're done.
This is flexible, so you can do, say, replication factor four: four writes get sent, the consistency level is two, and after two acknowledgments you respond back to the client. This is a really important thing to understand about the differences: Cassandra has tunable, or elastic, consistency.
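To make the tunable-consistency idea concrete, here's a minimal sketch, assuming a toy in-memory model. The function and replica names are invented for illustration, and a real coordinator sends these writes in parallel rather than in a loop:

```python
# Toy coordinator: send a write to all RF replicas, but acknowledge the
# client once `consistency_level` replicas have responded.

def coordinate_write(replicas, key, value, consistency_level):
    """Return the number of acks collected before replying to the client."""
    acks = 0
    for replica in replicas:          # in reality these go out in parallel
        replica[key] = value          # replica "stores" the write
        acks += 1
        if acks >= consistency_level:
            break                     # client gets "success" here
    # the remaining replicas still receive the write asynchronously
    for replica in replicas[acks:]:
        replica[key] = value
    return acks

replicas = [{}, {}, {}]               # replication factor 3
acked = coordinate_write(replicas, "row-42", "x", consistency_level=1)
print(acked)                          # 1: client was told "success" after one ack
print(all(r.get("row-42") == "x" for r in replicas))  # True: all 3 eventually have it
```

Raising `consistency_level` to two or three is exactly the "hold up the write" knob from the talk: the client waits longer, but more replicas are guaranteed durable before the ack.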
A
When you do the read in HBase, to guarantee strong consistency you only have to read from one machine: some region server is responsible for that region, and only that machine can respond and give you back that write. Those other two copies are dormant, kind of hidden in HDFS; unless that first machine crashes, you don't read from those other two places at all. You only ever go to one machine to do the read. So that's how you get strong consistency in HBase.
A
The write was already strongly consistent, and the write was maybe a little bit slower as well, because you had to wait for all three replicas to happen. And when all three happen in HBase, you can choose whether you really want all three to go to disk: the first one will definitely go to disk, and for the second and the third you can configure it so the write goes into RAM and eventually gets flushed later.
A
All right, we have about 10 minutes left; let's see how much more I can cover here. This is an idea called log-structured merge trees, which comes from Bigtable. We have a key, like before, and we want to write this value X. We've already found the actual region server in HBase, and we've already located the node in the ring in Cassandra. Okay, this is the node that we want to do the write on.
A
Okay, the write comes to this node now, from either the coordinator machine in Cassandra or from the client in HBase, and it comes into a JVM in both Cassandra and HBase. Blue is going to be HBase and green will be Cassandra. From the JVM, the write gets duplicated to two places. It goes into something called a commit log, or a write-ahead log: in Cassandra you call it the commit log, in HBase you call it the write-ahead log. So the write went in there, and this is a sequential file.
A
Yes, that's why it's the same color as HBase up there, and green is the commit log for Cassandra. All right, so that memstore is usually a 64-megabyte memory buffer. The write is persisted to disk, so you know it's safe, and it's in a memory buffer; as soon as those two things happen, you basically report the write successful back to the client. Then, what happens in the background: if you're using HBase, remember, this is only one node.
A
The write always comes to one node, so you've got to do that synchronous replication into HDFS. With Cassandra, remember, the write went to two other nodes already, so you don't have to do that kind of thing: the coordinator already sent it to two other nodes, and those two other nodes wrote it to their own commit logs.
A
All right. Then that memory buffer is eventually going to get more data coming in, because more writes will come in, and the 64-megabyte memory buffer will get full. When it gets full, it gets flushed to a file. The file is called an SSTable in Cassandra, or an HFile in HBase, and remember, in HBase you still have to then replicate that synchronously to three different places. And for every column family in these databases you'll have a different memstore.
A
So this is an example right now of a table with two column families, and you see the first column family is flushing to its own file, and the other column family is flushing to its own, different file. When these flushes happen, in the file you also maintain a bloom filter and a block index. In Cassandra you can only do a row-level bloom filter; in HBase you can do a row or a column bloom filter. By the way, in HBase,
A
that flushed file is what's called the HFile, and the HFile has the block index and the bloom filter inside it. In Cassandra you actually have, like, three files: one file with the data, one file for the block index, and one file for the bloom filter. And the reason you have that is because, imagine these flushes keep happening: the memory buffer keeps getting full, and now you have a bunch of SSTables, and the way you...
A
It goes into memory, and then, when you want to do a read, you first go check all those bloom filters in memory really quickly and say: hey, which SSTables have row 500? Maybe one or two of them will say yes, and then you go directly into that SSTable, check the index really quickly to see where in the file the row is, do a disk seek into that file, and pull it out.
A
All right. Now, in Cassandra you can flush per column family. Remember, I had two memstores there; they can individually get filled, and when one of them gets filled, it gets flushed. But in HBase, if one flush happens for a column family, all the other memstores will get flushed as well.
A
So maybe one column family has, like, pictures you're storing in it, and the other column families are just tiny metadata stuff, so they're not getting full. But when you fill up the memstore with the pictures, now you're going to have to flush the other ones as well, and this will probably put a lot of memory pressure on you, because you're flushing all these little tiny files and synchronizing them with HDFS over the network.
A
There is work being done in HBase, under JIRA HBASE-3149, to do the flush per column family. Secondary indexes are natively supported in Cassandra. You want to use a secondary index in Cassandra when your column has a certain value that shows up multiple times. So a secondary index is a good use case for, say, a column that's storing states: there are only 50 possible values and you have a billion rows.
A
It's a good choice there. But if you're storing email addresses in the column, don't put a secondary index on there to query it out; use different techniques. There's no native secondary index support in HBase, but there's an idea called triggers, where when you update a value on a cell, you can cause a trigger to be launched to go update another column family, so you're doing a poor man's secondary index with a second column family; you're rolling your own index.
A
There's a really good talk by Rick Branson, 13 minutes long, on YouTube, about SSD support in Cassandra, so check that out. One thing you can do with Cassandra, though, is you can put the SSTables, those flushed files, on SSD, but put the commit log on spinning disks if you want; you can choose. In HBase you can't really do that, because HBase is kind of dumb.
A
As far as the file system goes, HBase just hands the write to HDFS, and HDFS is responsible for putting it on disks, on its own machine and on other machines. In HBase you can't really separate things so that, say, the write-ahead log goes on SSD and the flushed files go on spinning disk; you don't get much control there. However, the MapR and Intel distributions of HBase are starting to make some effort in this field, and in HDFS, under those two JIRAs,
A
there are preliminary discussions happening about being able to decide where you're going to put a file: whether it's going to be on an SSD on a node, or on a spinning disk. Compactions: I'm not going to talk much about this; I'll just tell you that in Cassandra you can do tiered and leveled compactions, whereas in HBase
A
you can only do tiered. If you want to learn more about what a compaction is, check out that blog from Jonathan Ellis. Reads after disk failures: in Cassandra, if a specific disk fails, and that happened to be the disk that had the data you wanted to query, it's not a big deal, because when you do the read, you're not necessarily going to that machine. You don't know which machine it is; you just kind of randomly pick a machine.
A
That's now the coordinator machine, and that coordinator machine will do a hash of the key; then it will know which three machines to do the query from, and if one of those machines can't respond, it doesn't matter: one of the other two will respond. In HBase, there's only one machine that can do that read, because that's the region server responsible for that region.
A
So if a disk fails, that region server is basically going to have to get the data from one of the other two machines over the network and send it back to you. So perhaps, after disk failures, HBase will respond a little bit slower than Cassandra will. A few more, two more slides, I believe. Data partitioning: in Cassandra, when you are writing your table out with the row keys, you can use either the ordered partitioner or the random partitioner. Random partitioner means... oops.
A
So in HBase, you very commonly will actually take the row key you want to write and do an MD5 hash or something of it; then you take the MD5 hash and prepend it to the beginning of the key, and then you send that into HBase. So now the key is kind of going to a random machine, and you're not getting a hot spot on just one machine. But it depends on your use case.
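That key-salting trick can be sketched in a few lines. This is a minimal illustration, assuming an invented helper name and an arbitrary four-character MD5 prefix; it is not part of any HBase API:

```python
# Salt a row key by prepending a short hash prefix, so that sequentially
# generated keys no longer sort next to each other and writes spread out
# across region servers instead of hot-spotting one of them.

import hashlib

def salted_key(row_key, prefix_len=4):
    """Return the row key with a short, deterministic hash prefix."""
    prefix = hashlib.md5(row_key.encode()).hexdigest()[:prefix_len]
    return f"{prefix}-{row_key}"

# Sequential user IDs would normally all land in one region; with the
# salt, their prefixes (and therefore their sort order) scatter widely:
prefixes = {salted_key(f"user{i}").split("-")[0] for i in range(100)}
print(len(prefixes) > 10)   # True: many distinct prefixes
```

The trade-off, which is why "it depends on your use case" matters: once keys are salted you give up cheap range scans over the original key order, because adjacent keys no longer sit next to each other on disk.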
A
So sometimes the ordered partitioner does make sense, but you should know that for Cassandra, like 95 percent of use cases use the random partitioner, not the ordered one. We have three more minutes. Triggers and coprocessors: this is under development for Cassandra, but it already exists in HBase. What this lets you do is basically send client-supplied code into the address space of the server to run. So basically, you can have a special column family and you say: look, anytime anybody does a put in this column family,
A
I want you to also run this other logic to do something else, and that all happens on the CPU of that machine. So you can say: anytime anybody does a put in this column family, I want you to also do something like increment a counter in a different column family. Compare-and-set is under development for Cassandra but supported in HBase. This is basically an atomic read-modify-write; it's also called compare-and-swap, or check-and-set. And finally, multi-data-center support, or disaster recovery, is significantly stronger in Cassandra.
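The compare-and-set semantics just mentioned can be sketched as follows. The class is invented for illustration, with a lock standing in for the server-side atomicity; in real HBase this shows up on the client API as `checkAndPut`:

```python
# Toy compare-and-set: the update is applied only if the cell still holds
# the value the caller expected, making read-modify-write atomic.

import threading

class Cell:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()   # stands in for server-side atomicity

    def compare_and_set(self, expected, new):
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False                # value changed underneath us; caller retries

cell = Cell(10)
print(cell.compare_and_set(10, 11))   # True: value matched, update applied
print(cell.compare_and_set(10, 12))   # False: value is now 11, no update
print(cell.value)                     # 11
```

The `False` return is the whole point: two clients racing to update the same cell cannot silently overwrite each other, because the loser's expectation no longer matches and it has to re-read and retry.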
A
It's very mature and well tested. See, initially, when Cassandra was developed at Facebook by former Amazon Dynamo engineers, Facebook needed multi-data-center support.
A
It was a global company with data stored all over the world, and you don't want to have European users doing a query to a data center in, like, Virginia; the latency is too high. So they needed multi-data-center support, and that was kind of one of the core things built into Dynamo and Cassandra from day one. Rackspace is also a big contributor to Cassandra because of its multi-data-center support. So with Cassandra you can have a recovery point objective of zero, meaning you can rest assured.
A
It will be a little bit slower to do these types of writes, because you're sending the writes to both data centers, but you can do a write where you say: I need the write to go to both data centers, and go to disk in both data centers, and only then go and tell the client that the write is done. In HBase, the replication is only asynchronous, so the recovery point objective cannot be zero; you might lag a few milliseconds, seconds, or minutes behind, unlike in Cassandra.
A
I'm also an authorized DataStax training partner, so I can teach the DataStax curriculum. You can shoot me an email if you're interested in getting any training classes from me; I do custom consulting and training. I used to work at Hortonworks, Accenture R&D, and Symantec. You can find me on LinkedIn and Twitter, and there is a YouTube video I have as well.