Description
Speaker: Thomas Pinckney, Senior Director of Engineering at eBay
SlideShare: http://www.slideshare.net/planetcassandra/e-bay-nyc
Recommendation and personalization systems are an important part of many modern websites. Graphs provide a natural way to represent the behavioral data that is the core input to many recommendation algorithms. Thomas Pinckney and his colleagues at Hunch (recently acquired by eBay) built a large scale recommendation system, and then ported the technology to eBay. Thomas will be discussing how his team uses Cassandra to provide the high I/O storage of their fifty billion edge graphs and how they generate new recommendations in real time as users click around the site.
So when we think about a recommendation system, there are a bunch of different parts. There's understanding what the user's intent is: hey, they're looking for shoes. There's understanding their unique context, like they're a size 12, and so any shoe that we show them that isn't a size 12 is an irrelevant recommendation. And then there are matters of taste and aesthetics: what color, what style, what are all the intangibles in those shoes that are going to make the person really want them and like them?
Well, the ideal goal is that we have a list of everything that you like, and a list of everything you don't like. So my taste profile might be that I like hiking to this cabin in New Hampshire, I like Python, I like reading the New York Times; I don't like plaid shirts, I don't like sushi, and that was a terrible romance novel, I don't think it was one of her best. Now the problem is that this is kind of like having a full-scale map of the world.
It's perfectly accurate and not very practical. There's no way that you're really going to get this list from someone of everything they like and everything they don't like, even if you could get them to sit down and tell you these things. Study after study has shown that people aren't very consistent in their own opinions about what they like or don't like.
Just for example, there was a study where people were shown a list of things and asked whether they like them or don't like them, and then, two weeks later, they were shown the same list and asked again. Their answers were only the same about eighty percent of the time. So there's an upper bound on how much you can ask people, even if you could have infinite patience from them.
So our challenge is: how do we build a taste profile for someone? Our core thesis is that the things in your taste profile are not unrelated; your likes and your dislikes are all highly correlated. One of the things that we did previously, at a startup called Hunch that was bought by eBay about a year ago, was survey lots of people. We surveyed tens and tens of thousands of people with all sorts of questions.
One of the questions we would ask is: do you like President Obama, do you support him? And another question we might ask is: do you prefer arugula or iceberg lettuce? And what do you know, the old stereotype of arugula-loving liberals is true. So if you look like the guy at the bottom, like where I grew up in South Carolina, you statistically probably prefer iceberg lettuce, and if you voted for Obama, you have a bias toward actually liking arugula.
So this idea that things are correlated starts giving us a toehold, a way in, to try to build a taste profile for someone. So let's start thinking about this and trying to figure out how to formalize it a little bit. In this drawing, red circles represent users.
We have a user A and another person B, and then we have some things: we have, like, the Republican Party, and we have arugula, and we have all sorts of different things there. The green arrows represent a user liking something, an expressed preference; we've asked them, and they've told us they've liked this thing.
A red arrow represents a dislike; they've said they actually dislike something. So we have here user A, who said they like the Democratic Party and they like arugula, and we have some other user B, who said they like the GOP and they dislike arugula.
So say we have some new user C that we don't know a lot about; all we know is that they like the Democratic Party. Well, we sort of squint and look at this and go: from what I know here in this drawing, they probably would like arugula, because user A, the only other person in this drawing who liked the Democratic Party, also liked arugula. So this is getting at the idea that we're trying to find either somehow similar people, or similar patterns and connections to other people. So let's zoom in on a little bit more detail about how we try to think about that.
So imagine we plotted this in two dimensions, with the x-axis being, say, Obama space and the y-axis being arugula space. If we actually survey people, we see, maybe not exactly this data, but something similar to this, where maybe that user A from the prior drawing is in the upper right corner, where they like Obama a lot and they like arugula a lot; user B from the prior drawing, who liked the GOP, is in the lower left quadrant; and we have some other people in there. And now we have this new user C again, whose arugula preference we're trying to figure out.
Given that they're at a known point on the x-axis, they like Obama a certain amount, the question is: where on the y-axis do we think they fall? We can answer that question by doing something like a linear regression through the data points, and we see that there's really this better axis for understanding this data set. It's really more powerful to understand this as a matter of this concept of, maybe, political orientation, versus trying to think about it in terms of lettuce and specific candidates. We see that this user C falls on this regression line at a certain point, and we can now infer a y-value for them: how much they like arugula.
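The inference step described here can be sketched in a few lines. The survey points below are made up for illustration (the talk doesn't give actual numbers), and the fit is an ordinary least-squares regression.

```python
# Fit a line through surveyed (obama, arugula) preference points, then
# read off a predicted y-value (arugula) for a user whose x-position
# (Obama preference) is known. Data points are hypothetical.
points = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.2), (1.0, 0.9), (2.0, 2.1)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
         / sum((x - mean_x) ** 2 for x, _ in points))
intercept = mean_y - slope * mean_x

user_c_obama = 1.5                  # user C's known x-position
predicted_arugula = slope * user_c_obama + intercept
print(predicted_arugula)
```

With these made-up points the fitted line has slope close to 1, so user C's inferred arugula preference lands near their Obama preference, which is exactly the "project onto the regression line" idea in the talk.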
We can also do some other interesting things; we're not just plotting people on this line.
We can plot things like the lettuces and the different candidates, and so we can start intuitively thinking that similarity is somehow based on distance along this line. User A is close to arugula, which captures the idea that they like each other, that they're similar. User B is very far apart from user A on this line, capturing the idea that users A and B are very dissimilar from each other. And we don't have to just do this in one dimension, around political affiliation.
We can do this around many, many dimensions, so we can find what are called latent factors that help explain people's behavior. These are factors that are typically predictive but very hard to measure directly. It's very difficult to ask someone, you know, where are you on the introversion-to-extroversion scale? But understanding where they fit on that scale is generally very predictive of what they will like or dislike.
So what we're doing here is finding things that people have clear preferences on, a specific product, a specific concept, taking their known preferences from those very specific cases, and using that to map them onto these more predictive scales. One of the interesting things here is that we didn't pick these dimensions ourselves;
the machine picked them. The algorithms picked the dimensions that were most predictive for either the millions of Hunch users or the hundreds of millions of eBay users. Interestingly, the dimensions that get picked turn out to roughly match the dimensions that sociologists use to describe people. So if you look at, like, the big four personality model, most of those first four factors about people, one of which is introversion versus extroversion, also very much show up in our data.
So in some ways it's kind of an empirical validation that at least the first few of these dimensions are some of the basic ways to think about people and understand people's differences. Now, in our particular models for doing product recommendations, we use about 50 of these different factors. Once you get past the first maybe five or six of them, it becomes very hard to figure out what they are. The algorithms aren't labeling an axis as, say, the masculine-to-feminine axis.
What it is saying is: there's an axis that stretches all the way from lipstick and mascara to cordless drills and a lot of other stereotypically masculine products; these are all, in some ways, stereotypes that the data in some ways confirms.
So now, going back to this original idea of how we build a taste profile, we have a little bit of a simpler problem. Instead of enumerating everything you like and dislike, what we now need to do is plot everyone and everything into this high-dimensional taste space, because then it's all a matter of doing distance calculations to figure out what you like and what you don't like: things that are far away from you are things that go into the don't-like bucket.
This is a little bit quick, so I apologize if I'm skipping over some details here. But going back to that original example of users A, B, and C and the lettuces, what we can do is start adding some numbers here to build a little bit more of a formal model. For every user and for every thing, I'm showing coordinates; that's where they are in taste space. So user A is at coordinate location (1, -2), user B is at (-1, 2), and the arugula is at coordinate (1, -0.5). How did we come up with those coordinates? Well, I won't go into the details, but we have this requirement: we think of every green edge in this graph, every like, as having, somewhat arbitrarily, the value two, and every red edge, representing a dislike, as having the value negative two. Then we say that the user and the item connected by that edge have to have a dot product of their coordinates that equals that edge value.
So this means that, in the case of user A and the lettuce, the dot product means: take the x value of each, multiply them together, and add that to the product of the two y values. In this case, the two x's are 1 and 1, and the two y's are minus two and minus point five; you do the math, and that does indeed equal the edge value of two.
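As a quick check, here is that arithmetic in code, using the coordinates from the worked example above.

```python
# Likes have edge value +2, dislikes -2, and the user and item coordinates
# on each edge must dot-product to the edge value.
def dot(u, v):
    # Dot product in 2-D taste space: x*x' + y*y'
    return u[0] * v[0] + u[1] * v[1]

user_a  = (1.0, -2.0)    # user A in taste space
user_b  = (-1.0, 2.0)    # user B
arugula = (1.0, -0.5)    # the arugula node

print(dot(user_a, arugula))  # 1*1 + (-2)*(-0.5) = 2.0, the "like" edge
print(dot(user_b, arugula))  # -1*1 + 2*(-0.5) = -2.0, the "dislike" edge
```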
Using this, we can figure out what the coordinates are for the Democratic Party node in this drawing. Whatever its coordinates are, they have to satisfy the constraint that the dot product with user A is two, and the dot product with user C also has to be two. If you set up the two simultaneous equations that result from this, there's a solution to those equations: x equals 2, y equals 0.
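Those two constraints form a small linear system. Here is a sketch of solving it by hand; user C's own coordinates aren't given in the talk, so the (1, 0) used below is a hypothetical choice consistent with the stated answer.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    # Cramer's rule for two simultaneous linear equations:
    #   a11*x + a12*y = b1
    #   a21*x + a22*y = b2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det,
            (a11 * b2 - a21 * b1) / det)

# user A = (1, -2) likes the node (edge value +2):   1*x + (-2)*y = 2
# user C = (1, 0), assumed, likes it too:            1*x +   0*y = 2
x, y = solve_2x2(1.0, -2.0, 2.0,
                 1.0,  0.0, 2.0)
print(x, y)  # -> 2.0 0.0, matching the solution given in the talk
```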
So that's how, given a static configuration of a known set of preferences, you might come up with the coordinates for everything. But on a site like eBay, you don't just have a bunch of data about people ahead of time, where you take all this transaction history, look at purchases, call every purchase a like, and compute everything. You've got people buying stuff constantly, new users showing up constantly, and new things for sale coming up constantly, so we can't just statically compute everything.
We have to do this incrementally, in real time, and have it work. So imagine you start in a situation, on the left side, where you have some user C who likes this SLR camera, a user A who likes this camera lens, and a user B who doesn't like that camera lens but does like this point-and-shoot camera.
We know that the camera's coordinates are set by the requirement that it dotted with A equals two and it dotted with C is also two. I don't know if you can see it in the slides here, but the coordinates of the camera and of A have changed from the left-hand side to the right-hand side, based on this new piece of information. So this is the process of folding in a new incremental piece of information to change the coordinates of where things are in taste space.
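The talk doesn't spell out the exact update rule used for this fold-in, but one standard way to absorb a single new edge is a few small gradient steps on the squared dot-product error, nudging both endpoints of the edge toward agreement. The sketch below is that assumption, not eBay's actual code, and the camera coordinates are made up.

```python
# Fold a new edge (with target edge_value) into the model by moving both
# the user vector and the item vector a little toward satisfying
# dot(user, item) == edge_value. This is plain SGD on the squared error.
def fold_in(user, item, edge_value, lr=0.1):
    pred = sum(u * v for u, v in zip(user, item))
    err = edge_value - pred
    new_user = [u + lr * err * v for u, v in zip(user, item)]
    new_item = [v + lr * err * u for u, v in zip(user, item)]
    return new_user, new_item

user_b = [-1.0, 2.0]       # user B from the earlier example
camera = [0.5, 0.5]        # hypothetical point-and-shoot coordinates
for _ in range(50):        # repeated small updates converge on the constraint
    user_b, camera = fold_in(user_b, camera, 2.0)

print(sum(u * v for u, v in zip(user_b, camera)))  # close to 2.0
```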
So now I'm going to talk a little bit more about how we actually implement this, so less about the theory and more about how we're actually building it. There are two halves to the system. One half is this asynchronous updating process that happens decoupled from individual page loads on the site. If someone comes in and purchases something, an event is triggered, goes over an event bus, and is eventually received by our updating processes, and they then add this new edge into the graph structure.
We also store all of those updated coordinates into a separate database that is used purely for serving recommendations to customers at runtime. That's the second half of the problem: when a page load comes in, say someone is looking at this compact camera, that makes a call over to our recommendation engine to say, hey, I need, say, five recommended alternative cameras this user might want to look at. We go query this recommendation database; we know the user's coordinates, so we know the coordinates of the person who is looking at the page.
We then go find all of the other cameras that have coordinates very close to this user, because that represents similarity and predicted affinity between the user and those cameras, and then we show those; maybe we take the top five cameras that are closest to the user, and those become that user's recommendations. This side of the problem is optimized basically for just very fast read-only performance, because of the ballpark volume we will probably be serving.
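The read path sketched here amounts to a nearest-neighbor lookup over the item coordinates. A minimal version, with made-up item IDs and coordinates, looks like this:

```python
import math

# Rank candidate cameras by distance to the viewing user in taste space
# and keep the closest k as that user's recommendations.
def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def top_recommendations(user_coord, candidates, k=5):
    # candidates: {item_id: coordinate vector}
    ranked = sorted(candidates, key=lambda i: distance(user_coord, candidates[i]))
    return ranked[:k]

user = (1.0, -2.0)
cameras = {
    "slr-a": (0.9, -1.8),
    "slr-b": (1.1, -2.1),
    "point-and-shoot": (-1.0, 2.0),
}
print(top_recommendations(user, cameras, k=2))  # -> ['slr-b', 'slr-a']
```

A production system at this scale would use an index rather than a full sort, but the ranking criterion is the same.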
So that's why this whole path is just optimized for read-only scalability and making things as fast as possible. Zooming into that Cassandra piece, where we store our taste graph of every user and every thing and all the connections between them: in this first version that we've built, we have about 40 billion different edges in our graph, connecting about 2 billion eBay listings, things that are for sale or have been for sale, to about 200 million different users. That's about five terabytes of data.
We replicate everything twice, so it's about ten terabytes of data, and this is a very small starting piece, since we're only looking at a few key signals right now, things like purchases and bidding on an auction. There are a whole bunch of other signals that we want to bring into the graph to express new types of connections between people and things, and that will probably result in roughly a quadrupling of this. So we estimate we'll eventually be at around nearly 200 billion edges in our graph stored in Cassandra.
We're in the process of upgrading from a 16-machine to a 32-machine cluster. These are beefier machines, so we're looking forward to that beefier-node support in the future. The machines are connected to an SSD disk array over iSCSI, with 10 Gigabit Ethernet. This is not necessarily the ideal set of hardware SKUs for this, but for a variety of reasons it's a kind of preferred SKU within eBay, so it's what we use here. It's got a huge amount of I/O capacity.
The disk array can do about 400 to 500 thousand I/Os a second. There's also a lot of RAM per machine, obviously, so it's a little bit different than maybe the ideal SKU. We're using Cassandra 1.0.8. We're on size-tiered compaction for now, moving, hopefully shortly, to leveled compaction to improve some of the read latency. We want to get more apps, some other apps besides this recommendation app, running on this data store, and some of them have even stricter read-latency requirements.
So we think that leveled compaction will be great for us, and we're also really looking forward to bloom filters and things like that being off-heap. We're running with an eight-gigabyte heap right now; beyond that we start seeing some pauses that, again, affect latency. So we're really looking forward to getting stuff off the heap.
The schema for how we're storing this data is that the edges are basically wide rows. For a given user, if we represent, say, ten edges between them and ten things that they have bought or bid on on the site, we just keep adding new columns to represent each edge. And there are different types of edges, like a purchase versus a bid; we treat them differently in terms of how we weight them in some of our calculations, so we're recording the edge type.
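A plain-Python sketch of that layout (the actual Cassandra column names aren't given in the talk, so the ones here are assumptions): one wide row per user, with one column per edge keyed by item and edge type, so a single row read returns all of a user's edges.

```python
from collections import defaultdict

# Model of the wide-row layout: row key -> {column name: column value}.
# Each new purchase/bid appends a column to the user's row.
edges = defaultdict(dict)

def add_edge(user_id, item_id, edge_type, weight):
    edges[user_id][(item_id, edge_type)] = weight

add_edge("user:42", "item:1001", "purchase", 2.0)
add_edge("user:42", "item:2002", "bid", 1.0)   # bids weighted differently

# A single "read" of the row key gives every edge for that user.
print(edges["user:42"])
```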
Our nodes are not wide; there's basically one taste vector, one coordinate, per node. So our read performance, if you look at the number of keys read per second, is dominated by nodes, because a single read gives us all the edges for a user or an item. Today, at our busiest time, we're doing about fifty thousand reads a second off of the cluster, and this was actually benchmarked against the 16-machine cluster.
So we think that, as we double or quadruple the amount of data, we're going to continue to be able to get very good read scalability from our 32-machine cluster. Writes per second are now around somewhere between three and four thousand, and we're doing a write every time someone does some action on the site that causes us to update their coordinates, their position in taste space; so today, purchases, watches, bids, things like that.
So that's how we're thinking about building, and working on, our next-generation recommendation system, and how we're approaching the taste and aesthetic modeling part of the system at eBay. I'm happy to take some questions, and also happy to have Vulcan here, the engineer who actually built all of the Cassandra side; they can get into tons more details than I probably can.
[Audience question]

That's a great question. We think it's more powerful to understand the user versus understanding item-to-item affinities. It may be that you and I are both looking at a book, but when you think about the other books that you might want to recommend to us, that's partly informed by the book we're looking at, and it's also partly about who we are. If you just do item-to-item similarity, you lose that second dimension of the problem.
There might be many different types of people that are all interested in a particular book, but for slightly different reasons, and we want to be able to capture that, so that the other books we recommend take advantage of that additional information. There's another difference between the eBay recommendation problem and nearly every other retail recommendation problem out there, which is that eBay's listings are all sort of unique.
They have little in common with each other. So you're not going to see, like on Amazon, that 10,000 people bought this book and then a disproportionately large number of those 10,000 went off to buy this other book, because both books have been for sale on Amazon for two years; that's how you can accumulate 10,000 data points about each one, and that's why you can do that item-to-item similarity directly. That's never going to happen on eBay.
[Audience question]

No, the buffer cache still gets it; extra RAM still gets used for the buffer cache. The working set, the data file set, is bigger than physical RAM, and so that buffer cache is very valuable. There also is some ability to store some data off-heap already in the existing versions of Cassandra, so we use some of that. But the simple answer is: buffer cache.
[Audience question]

That's a good point, that there are several parts: you can't just make a recommendation, you need to be able to justify it and explain it to people. You might have the best idea in the world for what someone should buy, but if you can't explain it to them, then they're not going to necessarily understand your brilliance. So partly, one of the things we think about is how you try to justify this, and there are very simple things that people do, like
the old "people who bought this bought that". You're trying to provide an explanation: hey, many other people have done this, maybe you should look into it too. There are other things you can try to do; in the case of books, you can say, hey, more by this author. These are very basic things, but they do help explain why you're making the recommendation. The other part of it, though, is that if we do make a recommendation, we track and log it.
We know that we showed you these five things and you didn't click into any of them, or you clicked in, which is a little bit harder to interpret. If you didn't click into any of them, does that mean you just didn't look at them, and so didn't notice our recommendations, which may be lower on the page, or does it mean they were all really bad? It's hard to figure out. If you click on one of them but not the other four, then that actually tells us something pretty informative.
That means that, of those four or five recommendations we showed you, the one you clicked on was somehow the best one (if you went on to buy it, even better), and that somehow the other ones were less good recommendations, so we can infer some negative feedback against those from you. So over time, our model will start getting a little bit smarter about you not liking those kinds of items.
[Audience question]

It's one of these things: on the Hunch side, before we were part of eBay, we used things like who you followed on Twitter and what you had liked on Facebook, and that was very, very informative. It was less informative exactly about who your friends were on Facebook and more informative about what your interests were, and on Twitter it was very informative about who you followed. In the shopping context, you've got a bunch of different challenges. A lot of times people come to the site and they're not logged in.
The first thing you do when shopping usually isn't log in. Unfortunately, the way a lot of retail works is that you log in only at the end of the process, so even knowing who you are at the start, much less who your friends are and things like that, is not immediately available all the time.
The second thing is just that a lot of people say: look, why are you asking me who my friends are? I want to buy this pair of shoes. There's maybe five percent of people who are excited about that, and they want to say, hey, what do you think about these shoes I think I'm buying, look at the great deal I got; they do want to engage with their friends about it. The other ninety-five percent just want a good price, and they want to find it and get on with their life.
[Audience question]

Yeah, so the question was that dealing with finding latent factors in a large, sparse matrix is very hard. The matrix here is every user versus every thing, and for every cell we have: did you buy it or bid on it? The vast, vast majority of these cells are empty, because most people have not bought or bid on most things. You know, maybe one in a hundred thousand, one in a million, one in ten thousand of these cells are filled in, and all the others are unknown.
They're question marks, and our job is to fill in the rest of those cells with a prediction about whether you would like this or not. It's computationally very difficult to find latent factors in a very large, sparse matrix like this. So the high-level approach is a method called alternating least squares, where in some ways you can think of it as: start with a set of random factorizations, a bunch of random coordinates for people, and then incrementally improve them.
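A toy sketch of that alternation, using one latent factor per node and the arugula example from earlier in the talk: hold the item factors fixed and solve each user's factor exactly by least squares, then swap sides, and repeat. A fixed initialization replaces the random one mentioned above so the run is reproducible; real systems use many factors over a huge sparse matrix.

```python
# Known edges: +2 for a like, -2 for a dislike.
likes = {("a", "arugula"): 2.0, ("a", "dem"): 2.0,
         ("b", "arugula"): -2.0, ("b", "gop"): 2.0}

users = {"a": 1.0, "b": -1.0}                     # 1-D factors per user
items = {"arugula": 1.0, "dem": 1.0, "gop": -1.0}  # 1-D factors per item

def update_users():
    # With items fixed, the 1-D least-squares solution for each user is
    # sum(r * v) / sum(v * v) over that user's known edges.
    for u in users:
        num = sum(r * items[i] for (uu, i), r in likes.items() if uu == u)
        den = sum(items[i] ** 2 for (uu, i), r in likes.items() if uu == u)
        users[u] = num / den

def update_items():
    # Symmetric step: solve each item's factor with users held fixed.
    for i in items:
        num = sum(r * users[u] for (u, ii), r in likes.items() if ii == i)
        den = sum(users[u] ** 2 for (u, ii), r in likes.items() if ii == i)
        if den:
            items[i] = num / den

for _ in range(10):          # alternate the two half-steps
    update_users()
    update_items()

# Every known edge is now reproduced by the (1-D) dot product:
print(users["a"] * items["arugula"], users["b"] * items["arugula"])
```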
[Audience question]

A little bit of both. We did try a bunch of different systems, and generally, because of the scale, one of the things is that even if it doesn't necessarily at first seem to make sense to spend effort optimizing something, every five percent turns out to actually matter a lot here, because the size is so big. So building something ourselves, as long as it wasn't too complicated and too hard, made sense, and in this case it turned out it's actually not that hard to represent a graph in Cassandra; it's pretty straightforward.
It made sense for us to do that. We also already had all the algorithms we were running on top of that graph. One of the reasons to sometimes use an off-the-shelf graph database or graph package is not necessarily the graph representation but the algorithms that run on top of it, and we already had the algorithms we were running on top of it, so that just took away another reason to use one. But I'm not swearing that we made the right decision.
[Audience question]

So this sort of next-generation platform is basically in very limited testing in North America right now. At peak, maybe ten percent of people going to eBay, on certain pages, will see recommendations powered by this system; it's basically in the very, very early stages of being rolled out. I've now been at eBay for about a year.
So I was at this company Hunch before, and I know eBay had spent a number of years looking at both outside vendors for this problem as well as building an internal solution. I think the reason that, frankly, all of the external vendor solutions had failed previously was getting around these problems of how real-time this has to be: people and things for sale appear constantly, they disappear constantly, and then, separately, the idea that there's no catalog, no product catalog that says these are all the same item.
Maybe they're not the same; maybe they're subtly different due to condition, because they're used, and things like that. So a lot of the outside, sort of historic solutions that people had tried and looked at really depended on this item-to-item collaborative filtering, where you had a strong sales history for a SKU and you could see what other SKUs people went on to buy, and that pattern took months, quarters, years to develop and be discoverable. That's just not a situation that exists at eBay; nothing sticks around for months.
[Audience question]

Yep, and those are all really hard problems that I won't say we've solved. On one piece, you asked a couple of different things: for example, maybe I click into one item, but it's not really right for me. There are some simple things we can do, like how quickly you bounce back with the back button. If you click into something but immediately bounce off, that's actually probably more of a dislike than a like.
That gets hard, especially in a multi-device world, because I saw it on my computer, but then I searched for it on my iPhone when I was waiting for the train, and I did both while not logged in, so you don't know that it was Tom in both cases. The cases where it gets easier are where I do it on the same device, or I'm logged in in both cases, and you can sort of stitch sessions together over time and say that, well, actually, two weeks ago Tom was exposed to this.
So we're honestly using only a tiny fraction of the data we could be using, and that's why I call this kind of the v1 version, where we're looking at a couple of very basic factors or behaviors of people, like what they've purchased, what they've bid on, what they've watched. We have a list of literally nearly 100 additional factors that we would like to use.
It's just that we're taking a fairly deliberate process. Every time we consider a new factor we add to the model, we see if the model's predictive power goes up or down, through analyzing things like prior purchase histories, and doing things like crowdsourcing human judgments, where you get a few thousand people to look at the new algorithm versus the old algorithm. So it's a fairly laborious process to add additional factors to the model.
Sure. So this work sort of veers a little bit into all those other parts of making a good recommendation that are not about taste. Going back to, say, a book recommendation: part of the issue is, well, does the user even want books right now? Generally on eBay we sort of solve that by being fairly simplistic: if you're searching for books or looking at books, we take that as your intent, and we show you more things like that. So we can solve the intent part.
This is complicated a little bit by the fact that it's hard to actually figure out whether two books are exactly identical. They may have different ISBNs, but one's a hardback, one's a softback, one's an audiobook, and we don't want to still recommend that same book. So there's clustering and text-feature analysis that we try to do to understand that, still, topic-wise, these books are all too similar to what you've already bought.
Let's try to find other stuff, and then it becomes this taste question: of those other books that are maybe in the same genre that you've recently been looking into, say historical nonfiction, what are the books that taste-wise match you best? So we sort of think of it as generating a recall set first: here are a thousand books that represent all of the kind of objective criteria that we think you're going to want; it's a book, and it's historical nonfiction.
Maybe a thousand books, and that's one thing that, over time, we want to grow more and more, and there's a lot of work where we're trying to figure out how. This can leverage a lot of sort of geospatial indexing, because a lot of this querying becomes, you know, find all of the items in a sphere that are closest to me; so there are better things to do in terms of how to query efficiently.
[Audience question]

Basically every week, Vulcan here buys more machines for Cassandra, though, in fairness, we're also buying more MySQL machines. But to get to your original question about why we chose this: theoretically, we think that this is the right architectural solution for a very high write-scalability problem that inherently has a very data-parallel structure. Our recommendations, even though there's this graph that does connect everyone to everything, can still be kind of divided into nice parallel updates. When you do write updates, you update your immediate neighborhood, and when I do updates to myself, I update my immediate neighborhood, and those can go on concurrently in different parts of Cassandra. So architecturally, that high degree of write concurrency matched our problem, and we thought that Cassandra was architecturally the right solution to it.
[Audience question]

We are using it in other sections of eBay. We're using it here to store this graph of people and things, and, I'm ashamed to admit, I actually don't know in detail some of the other applications. I know one of the other ones is around some social signals: when people do Facebook Connect or tweet things, logging those connections. I think those applications are a little bit more logging-oriented, where the data is written and then queried less and overwritten less.
[Audience question]

I mean, I think one thing that we've thought about is whether we could push computation into the Cassandra cluster itself. Right now, our pattern is basically to read a ton of data from Cassandra into app servers; we generate a series of linear equations, we solve them, and we write the results back.