Description
Speaker: Ben Vanberg, Senior Software Engineer at FullContact
Here at FullContact we have lots and lots of contact data. In particular we have more than a billion profiles over which we would like to perform ad hoc data analysis. Much of this data resides in Cassandra, and we have many analytics MapReduce jobs that require us to iterate across terabytes of Cassandra data. To solve this problem we've implemented our own splittable input format which allows us to quickly process large SSTables for downstream analytics.
If you're interested in that kind of thing, or have problems similar to what a lot of people do with contacts, keep an eye on us — there's going to be some really cool stuff coming out in the near future. And if you need to contact me about any of this, you can find me on Twitter. So with that, I'm going to go ahead and get started. This slide — actually, this is cool — actually changed just last night; I changed some of these just last night.
So this was a typical introduction yesterday, and now that I've thought about it more, what I'm really trying to explain here is our journey in implementing a specific use case. It's a story about how we went from A to B and solved the problem we had. So I'm going to talk about that use case, then we'll talk about how we implemented it, and then at the end I'll go through some example code.
Those individual search targets then get rolled up into what we call a profile — the entire profile, which could contain many social profiles for an individual — and that gets funneled back as the response to the user, as what we call the full contact: basically the enriched data. So, email address in, rich contact out. This profiles Cassandra database is what I want to talk about today, and it's what this work is centered around. So with that, let's talk about the goal we set out to solve on our journey.
Basically, we have all this data in Cassandra and we want to perform analytics on it. That's a little hard to do with straight-up queries against your production Cassandra, so we thought about how we could do it other ways. The kinds of things we wanted to accomplish were, for example, counting how many of each profile type we have — the types I talked about, like the email queries to get a profile, or the Twitter queries to get a profile. So: what kinds of queries are people doing?
And what are we returning? Additionally, how many of those profiles have social data associated with them, and of what type, and so on — whatever we want to look at. That's the cool thing about what we wanted to solve: basically ad hoc data analytics, so we could do whatever kind of analysis on this data we wanted in the future.
Some key factors about our use case that I'm going to keep in mind for this talk, and for the implementation we built: we use Netflix's Priam for backups at FullContact. We really like Netflix's open source stuff, and this was a really good fit for us — we use a lot of their things, but Priam for backups. I should point out that those backups are both snapshotted and compressed.
Additionally, in Cassandra, our tables — those profile tables I mentioned — use size-tiered compaction, and we end up with SSTables on the order of 200 gigabytes right now. Those will continue to grow as time goes on; they're actually bigger than they were when we started this project. If you're familiar with size-tiered compaction, you basically have a certain number of files that grow in magnitude, and that biggest file keeps growing — it kind of tapers off, but that's what we're up against.
That's what allows us to do a lot of the stuff we do — it makes Cassandra really easy, MapReduce really easy, all those things. Before I move on, I'd just like to ask: how many people in the crowd have MapReduce experience or know how it works? Awesome — a good number, so I won't dive too deep into those details.

So here's where we started. We had a system that accomplished this goal — kind of. We would generate queries for our Cassandra database, and I guess I should touch briefly on what that means.
Basically, we had a data store full of all these profiles, and we don't really know which ones we want to query. We happen to have another system — which I won't talk about too much — that knows what we'd be interested in within that data store. So we would ask that system to generate the queries against the Cassandra data for us, and that could take a long time: MapReduce jobs that took days, because I think we're talking on the order of billions of contacts that we're going after here.
So that's why this takes such a long time. Moving on from there: taking the queries we generated for Cassandra, we would bring up a total mirror of our production cluster, because we didn't want to hit the production cluster with a thousand reads per second for days and days. I'm going to highlight some of the costs on the right-hand side that I want to keep an eye on as we go through this — they're not super high costs, but they're not small either.
The next thing we would do is a step that could take days as well: process that data, and the final result is what came out. Typically for us this was based around an export, or almost a match test, for a customer — customers are interested in that information before they come on board as a real customer.
So in this picture we have a MapReduce cluster up in AWS for days and days, we have this Cassandra cluster up for days and days, and all it's doing is spitting out this final report, essentially. That's where we started. I've already touched on the limiting factors of this implementation: three to ten days of total time, and around twenty-seven hundred dollars in extra AWS cost for us.
There wasn't a lot of flexibility there for ad hoc analytics, but the biggest thing to me is the engineering time. You can imagine this process: you build some MapReduce jobs, run them for three to ten days, and find out in your final output that you screwed something up — that sucks. So there's a lot of babysitting for this stuff, a lot of rerunning, and it's just painful. So, thinking about moving forward: how could we do this better?
We knew that querying Cassandra didn't scale at all — we'd already seen that. We knew about the Cassandra SSTables, and it would be really cool if we could somehow MapReduce across that data. As it turns out, other people have this same problem; maybe some of you do as well. A couple of key things we needed to have — and you're probably familiar with this, so I won't go too deeply into it — is that we really need those SSTables to be directly available on HDFS for MapReduce to read.
Like I said, these are quite sizable files for us. Using Priam, they reside in S3, so we can pull them from S3 into HDFS, but that still takes quite a bit of time — I think we have a nine-node cluster, so you get a 200-gig max SSTable size on each node, and that takes a good chunk of time. So, in addition to just having them on HDFS, we need to make them available as input to the MapReduce jobs, right?
So how could we do that nicely? When we first set out, we figured somebody must have built this already — so let's just use that. Like I said, we're big fans of Netflix OSS and we're using Netflix Priam, and this thing called Netflix Aegisthus solves exactly this problem, right? So we set out to test it and see how it would work for us. We ran into a couple of snags right from the get-go: as it turns out, it works really well for Netflix's use case, which is roughly Cassandra 1.0 and no compression.
We did take a look at it, and it worked pretty well for us, but I think what it came down to was compression — compression was the sticking point for us. If you're familiar with MapReduce, you know that it's really good at big data but it sucks at big files — really big files that you can't split into small chunks — which is exactly what we were dealing with at this point.
We had 200-gig SSTables being churned through on a single thread, and that took a long time. So we took a look at another solution, a really cool piece of software called Cassandra MR Helper. It does exactly what we want to do, including support for 1.2, which is where we currently reside.
These guys have written really good I/O code that knows how to read these tables really well — so why do it yourself? The limiting factor is that it doesn't support HDFS; you can't run it against an HDFS filesystem. So what Cassandra MR Helper would do is copy the SSTables out of HDFS: you've already copied them from S3 to HDFS, and then you copy them out of HDFS to the local file system — this is from our tests.
I don't know if those guys ran it exactly the same way we did, and the machines we were using don't have SSDs, I believe, so it's even a little harder. And this is the same problem you have with size-tiered compaction across the board: you need double the disk space to hold these things. In this case, because Priam compresses the backups as well — and ours were compressed — it would copy them off to the local filesystem and decompress them.
So now you have the decompressed version side by side with the original version; then you can nuke the old ones. At that point you need double your disk space, right? We kept running out of disk space and quickly decided this might not scale for us. So we took out that Priam decompression step and actually wrote a custom distributed copy that would bring those files from S3 and de-Snappy them right into HDFS.
That let us avoid that bit, but you still end up copying these files to the local file system. The biggest thing with Cassandra MR Helper, though, was that its input format was not splittable at all, so that wasn't going to work for us — we still had a single thread processing these big tables. Granted, we have a lot of different tables: with size-tiered compaction you've got small ones ramping up to really big ones, and those small ones would get chewed through pretty quickly.

So say you have a 24-node cluster — we had a nine-node Cassandra cluster and this 24-node MapReduce cluster. You'd chew through the small files really fast, but then you'd have these nine giant files being chewed by single threads, and at that point you're really not leveraging MapReduce for what it's good at — it's kind of a waste. But we did get the job done, and it took us only 60 hours; some of those single-threaded processes took a really long time.
So basically it starts out by just reading the SSTables — and I'm kind of glazing over the fact that we already have to stream these files into HDFS, that whole thing I talked about where we decompress them on the fly into HDFS. That's a whole other thing to think about, but I'm trying to simplify here. This takes many, many hours — most of the 60 hours I was talking about — and it's all done on HDFS; then it processes the data in hours. So we've actually made really good progress here.
I mean, if you think about it, we went from many days to 60 hours. That's still over two days, but it's a lot better, and our average cost on a MapReduce cluster was 350 bucks — pretty awesome. But we still thought we could do better. So we set out to do the same thing as Aegisthus and Cassandra MR Helper, but we wanted to handle those compressed SSTables — the ones with internal compression, not just the Priam compression.
We set out to make those splittable. I'll talk about this in a second, but we needed to make these splittable — compression being enabled obviously made that more difficult — and we needed the Cassandra I/O code to run on HDFS, which was really the big one, and then we needed a way to define the splits. Our approach was to leverage the SSTable metadata: there are lots of files that come along with SSTables, if you've seen them on the file system.
All of this basically ties into a MapReduce input format that we could just plug in and run with existing MapReduce code. We call it the Index Index, because you're basically creating an index over the Cassandra index, which is itself an index into the data file. A little confusing — it's one of those things where you come across the code and think somebody typed "index index" twice, why not delete it? And then you realize: oh, it really is an index of an index. I don't know if you're familiar with Hadoop-LZO, but for what it's worth, this implementation is very similar. The thing to point out is that LZO compression is similar to Cassandra's compressed tables, in that neither is a single giant compressed stream — the data is compressed in discrete blocks.
Then there's the index file — that's what we're going to use to index into the data file, to create our index and define our splits for Hadoop and MapReduce. And the compression info file that comes along with it — I'll just mention it really quickly — points to those compressed blocks within your data file that I was mentioning.
We didn't really have to worry about that too much, because the really good Cassandra I/O library uses it for us and abstracts away all the detail we didn't really care about — well, didn't want to care about; somebody's already written it really well, so why redo it? So here it is: we had to take that Cassandra I/O library and port it to HDFS. It's not too bad, but there are some tricky things about it — this was probably the hardest part.
Actually writing the code, and finding the bugs in it, can be tricky: they're using byte buffers and all kinds of stuff, and when you get giant ByteBuffer exceptions in your MapReduce logs, it gets kind of ugly. But anyway, this port allows random access — random reads into the data file leveraging that compression info file, so it can find the right blocks, scan to wherever it wants, and read the data, which is key for what we're trying to do.
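To make the role of that compression info file concrete, here is a minimal sketch of the lookup it enables — mapping a position in the uncompressed data to the compressed chunk that holds it. The class and field names are illustrative, not Cassandra's actual internals:

    // Conceptual sketch: Cassandra compresses the data file in fixed-size
    // chunks, and the compression info file records where each compressed
    // chunk starts. Names here are illustrative.
    final class ChunkLocator {
        private final long[] chunkOffsets;      // compressed offset of each chunk
        private final int uncompressedChunkLen; // uncompressed bytes per chunk

        ChunkLocator(long[] chunkOffsets, int uncompressedChunkLen) {
            this.chunkOffsets = chunkOffsets;
            this.uncompressedChunkLen = uncompressedChunkLen;
        }

        // Compressed-file offset of the chunk containing an uncompressed position.
        long chunkFor(long uncompressedPos) {
            return chunkOffsets[(int) (uncompressedPos / uncompressedChunkLen)];
        }
    }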
This slide shows, at the top, a couple of splits — obviously a simplified version. Within a split we have the index file, which is a sequence of key-and-offset pairs: a key and its offset into the data file, over and over. And I'll point out something that gets a little confusing to think about: the index offset into the data file is an offset into the uncompressed data, not into the compressed blocks.
So our splits just become a start offset and an end offset, and we can configure the splits to be as big as we want for Hadoop. It's a little fuzzy, because you're talking about SSTables and not really blocks of data in HDFS — so we kind of match the HDFS block size, but we kind of don't. It's close, but fuzzy, and it allows us to solve the problem, if that makes sense.
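As a rough sketch of that idea — assuming nothing about the real hadoop-sstable classes — splits can be built by walking the row offsets from the index and cutting a new split every time the configured size is reached:

    // Illustrative only: group consecutive row offsets (taken from the
    // SSTable's index, addressing the *uncompressed* data) into splits of
    // roughly splitSize bytes, always cutting on a row boundary.
    import java.util.ArrayList;
    import java.util.List;

    final class SSTableSplitSketch {
        final long start; // offset of the first row in this split
        final long end;   // offset just past the last row

        SSTableSplitSketch(long start, long end) {
            this.start = start;
            this.end = end;
        }

        static List<SSTableSplitSketch> build(long[] rowOffsets, long dataLength,
                                              long splitSize) {
            List<SSTableSplitSketch> splits = new ArrayList<>();
            long start = 0;
            for (long offset : rowOffsets) {
                if (offset - start >= splitSize) {
                    splits.add(new SSTableSplitSketch(start, offset));
                    start = offset; // next split begins at a row boundary
                }
            }
            splits.add(new SSTableSplitSketch(start, dataLength)); // tail split
            return splits;
        }
    }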
So I just want to really quickly go over the original solution again: we started with generating queries, which took days, and processing those queries, which took many days — and I should point out that we were limited to a thousand queries per second on that data store, I don't know if I said that before, and you're talking about millions, maybe billions of queries over many days. Then we'd process that data, which could take some days and cost us some money.
The final solution: first step, we index those SSTables. We have to run that Index Index, because the reader requires those indices to split the data across the MapReduce cluster and chew through it in parallel. That takes a couple of hours to run on those big giant files. Right now we have it implemented as a multi-threaded Java executable; we'll probably rewrite it as a MapReduce job itself so it can go even faster, because at that point you're just chewing through some text data — pretty easy to do.
Then we read the SSTables: we have the index, we can split the data across our cluster, and we read it all up. All of this is on HDFS — pretty awesome. For what it's worth, we use Elastic MapReduce in AWS. Then we process that data, which takes some hours, and our overall average cost — these numbers are just estimates — is about 165 to 200 bucks for a given run.
One thing I wanted to point out, which I probably didn't make clear from the beginning: when we had those nine files being processed single-threaded on a MapReduce cluster for a long, long period of time, you could add as many machines as you want and you're not going to go any faster. Now, with the splittable format, we can add machines and go faster. I think we're doing 48 machines now, and we go through this in 10 hours.
We could double that and probably go faster, but 10 hours is pretty good for us at this point. I'd also point out that we haven't invested a lot of time in further tuning yet, because we've decreased it so much that we're kind of like: all right, let's focus on other stuff for a while and leverage the benefits we've gained so far.
First thing to point out is this key type. Notice that in the mapper you get a ByteBuffer — that's your key — and you get an SSTableIdentityIterator — that's your row; it allows you to iterate over the columns. That guy comes straight from Cassandra. The key type is also straight from Cassandra, an AbstractType: we have a composite key, and both of the values in it, our columns, are UTF-8 types. That's all we're saying here, but it is a little goofy-looking and deserves explaining.
So here we use that key type, which lets us deserialize our ByteBuffer key and put it into a Text object that we can pass along to our downstream reducer. We go ahead and do the JSON column parsing, and we write that out — pretty straightforward stuff. At this point all we're doing is processing that SSTable data; you could process it however you like.
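Pulling those pieces together, here is a minimal mapper sketch under the assumptions above — a composite key of two UTF-8 components, deserialized via Cassandra's type system, with the row iterated as columns. The class name and output layout are hypothetical; the Cassandra types (CompositeType, UTF8Type, SSTableIdentityIterator, OnDiskAtom) are real, from the 1.2-era API:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Arrays;

    import org.apache.cassandra.db.OnDiskAtom;
    import org.apache.cassandra.db.marshal.AbstractType;
    import org.apache.cassandra.db.marshal.CompositeType;
    import org.apache.cassandra.db.marshal.UTF8Type;
    import org.apache.cassandra.io.sstable.SSTableIdentityIterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ProfileMapper
            extends Mapper<ByteBuffer, SSTableIdentityIterator, Text, Text> {

        // Composite row key with two UTF-8 components, as described in the talk.
        private static final AbstractType<?> KEY_TYPE = CompositeType.getInstance(
                Arrays.<AbstractType<?>>asList(UTF8Type.instance, UTF8Type.instance));

        @Override
        protected void map(ByteBuffer key, SSTableIdentityIterator row, Context ctx)
                throws IOException, InterruptedException {
            Text outKey = new Text(KEY_TYPE.getString(key)); // readable form of the key
            StringBuilder columns = new StringBuilder();
            while (row.hasNext()) {
                OnDiskAtom atom = row.next();
                // Real code would parse the column's JSON value here; this just
                // records the column name for illustration.
                columns.append(UTF8Type.instance.getString(atom.name())).append(' ');
            }
            ctx.write(outKey, new Text(columns.toString().trim()));
        }
    }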
It's worth pointing out that Netflix Aegisthus does this exact same thing, more or less. So let's look at the reducer. This one is really "do what you need to do here" — I didn't include the details, other than a comment that this is where you piece things together. Because we're reading with the Cassandra I/O libraries, we'll get every copy of a row that exists in the entire cluster.
So if you have three copies, you'll get three, and you need to look at the timestamps to determine which one is actually the valid row. You can deal with tombstones here as well: if you have tombstones in your data, you can iterate over your columns and nuke those. We don't need to deal with that in our use case, because we do a lot of writes and a bunch of reads but we don't really delete things — so it's something we kind of do, but we don't really need to.
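A minimal reducer sketch of that replica-merging step — assuming, purely for illustration, that each row copy arrives as "<write timestamp><TAB><serialized columns>" (the mapper sketch above would need to prepend the row's newest column timestamp for this layout):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ProfileReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // With replication factor 3 the same row key shows up to three
            // times; keep the copy with the newest write timestamp.
            long newest = Long.MIN_VALUE;
            String winner = null;
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                long timestamp = Long.parseLong(parts[0]);
                if (timestamp > newest) {
                    newest = timestamp;
                    winner = parts[1];
                }
            }
            if (winner != null) {
                // Tombstone filtering would also go here if your data has deletes.
                ctx.write(key, new Text(winner));
            }
        }
    }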
And if you want more detailed examples, I'd love some feedback on the open source stuff on GitHub — we're always looking for feedback and ways to make this thing better.
The only other thing I'm going to point out here is the MapReduce config — you're probably familiar with it: pass in our mapper, pass in our reducer. Then there's this SSTableRowInputFormat. That's the input format that lets you get a row, and it's very similar to what the Cassandra MR Helper does.

That's what allows us to plug in our new stuff. The cool thing is, if you happen to have used Cassandra MR Helper already, it's really easy to change your code to use this input format — your code should change very minimally. In our case we had a bunch of test code from the journey we went through, and once we had this built, it was just: switch out the config and go, and it works.
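A sketch of that job setup, wiring the input format to the hypothetical mapper and reducer sketched above — SSTableRowInputFormat is the class named in the talk (from FullContact's hadoop-sstable project; check it for the exact package), everything else is illustrative:

    // import the project's input format here, e.g. something like
    // com.fullcontact.sstable...SSTableRowInputFormat — see the hadoop-sstable repo.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SSTableJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sstable-profile-analytics");
            job.setJarByClass(SSTableJobDriver.class);
            job.setMapperClass(ProfileMapper.class);
            job.setReducerClass(ProfileReducer.class);
            // The splittable input format: hands each mapper a row key plus
            // an SSTableIdentityIterator over that row's columns.
            job.setInputFormatClass(SSTableRowInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // SSTable root in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }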
Two steps. First, run the indexer — really simple: hadoop jar, pass it in, run the indexer, and point it at the root where you've stored your SSTables in HDFS. Additionally, you can specify how big you want your splits to be, and the indexer will create them at the appropriate size. The default is 1024, which is what we use — kind of an arbitrary default, but it works well for us. Then you run the job from the command line.
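Something like the following — the jar and class names here are illustrative, so check the hadoop-sstable README for the exact invocation:

    # Step 1: index the SSTables already sitting in HDFS (class name illustrative).
    hadoop jar hadoop-sstable.jar com.fullcontact.sstable.index.SSTableIndexIndexer \
        hdfs:///data/sstables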
hadoop jar again, a simple example. The key thing to point out here is the create-table argument: you need to pass in your full CREATE TABLE CQL statement. That's so we can tell the Cassandra I/O code to build the column family metadata from that create statement, plug it into the random access reader, and read the SSTable. So that's it — very important.
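The invocation looks roughly like this — again, the property name and jar are illustrative rather than the project's exact CLI:

    # Step 2: run the job, passing the full CREATE TABLE CQL statement so the
    # Cassandra I/O code can build the column family metadata.
    hadoop jar my-analytics-job.jar SSTableJobDriver \
        -D hadoop.sstable.cql="CREATE TABLE profiles (id text, ...)" \
        hdfs:///data/sstables hdfs:///output/profiles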
A couple of tunings that we do: speculative execution off — that's nice, because some of these jobs get big. And this one's really important: mapred.job.reuse.jvm.num.tasks, which says how many times to reuse a JVM for running mappers. We set it to one, because the Cassandra I/O code uses a lot of off-heap memory — you've heard the Cassandra guys talking about how they leverage off-heap memory to keep garbage collection noise down, it's just more performant for them — and that off-heap memory can accumulate if you don't restart with a new JVM. We also did some io.sort tunings, pretty standard stuff for dealing with giant files.
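In configuration terms, that amounts to something like this (Hadoop 1.x-era property names; the io.sort values are illustrative, not the talk's exact numbers):

    import org.apache.hadoop.conf.Configuration;

    public final class SSTableJobTuning {
        /** Applies the tunings discussed above to the job configuration. */
        public static void apply(Configuration conf) {
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            // One task per JVM: Cassandra's I/O code allocates off-heap memory
            // that accumulates if task JVMs are reused.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);
            // Bigger sort buffer for the large rows coming out of the SSTables.
            conf.setInt("io.sort.mb", 256);
            conf.setInt("io.sort.factor", 100);
        }

        private SSTableJobTuning() {}
    }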
So it's really simple to put this stuff together — granted, with a lot of hand-waving and simple examples. But generally: you write your MapReduce job, you run the indexer across your SSTables, and then you run the SSTable reader. Then you have data you can process downstream, or you could even write your MapReduce jobs to process that stuff inline, however you like. It opens the door to a lot of ad hoc analytics, which is cool. So: goal accomplished.
These numbers are actually super cool, because I hadn't calculated the percentages until I put this slide deck together: a ninety-six percent decrease in processing time — awesome — and a ninety-four percent decrease in resource costs. Super cool; I need a raise. Reduced engineering time is, to me, the biggest one, because we're not spinning our wheels maintaining this, mucking with it, watching it for days and days, dealing with the issues we come across. For me that's really the big one — I mean, these other costs...
...they're relatively small — I showed you the numbers, they're not huge — but we did run these things monthly, and we're a small startup, so we're sensitive to cost. Those things are important. So we open sourced this thing, hoping some people can benefit from it. You can grab this slide deck and go look at hadoop-sstable — check it out right now. Our plans for the future: we're doing some cool stuff to bring it up to speed with 2.0.