Description
Presented by Flora Dai, Pandora.
Today, data is produced on a massive scale, and the ability to retrieve it accurately and efficiently in real time is essential to making use of the information we store. We will discuss how to implement an efficient search with open-source technologies and how Pandora applied it to music curation.
About GitHub Universe:
GitHub Universe is a two-day conference dedicated to the creativity and curiosity of the largest software community in the world. Sessions cover topics from team culture to open source software across industries and technologies.
For more information on GitHub Universe, check the website:
https://githubuniverse.com
Hi, my name is Flora Dai, and I'm here to talk to you about searching music efficiently with open source. So, a little about me: that's my face. I'm a software engineer at Pandora, where I work on the content engineering team. I graduated from UC Berkeley, and I have one animal that's very near and dear to my heart: that's Alfred, or Alf for short. He did not willingly put on that costume, but in exchange, he's quite the cat. So, you win some.
So, in order to talk to you about how we built an efficient search system for the curation team at Pandora, I'm going to give you a little background on Pandora's content systems. We'll go over how this produced a search problem and how it informed our design goals. Then we'll cover the implementation and infrastructure, and wrap up with the results and their impact. So let's get started: at a high level, what makes up Pandora's content systems?
Historically, Pandora built its catalog through our curators physically buying CDs and ripping them. As music consumption moved away from CDs and toward MP3s, our curators got content by making purchases through online music retailers such as iTunes and Bandcamp. This music is curated and analyzed, and it forms what we refer to as the library: the content that has powered Pandora Radio and contributed to the Music Genome Project for many years. However, in the world of on-demand content and streaming, the need for music to be immediately available requires a different system for getting music.
So, with that, Pandora inked deals with content partners to provide music in the form of DDEX messages. So what are DDEX messages? DDEX stands for Digital Data Exchange, and these messages are an XML-based standardization of commercial data. For music, this is information like what you would normally see in a CD pamphlet or on a purchase page.
That's information like the title, the artists, the track listing, the producer, the musicians on each track, and so on and so forth. This new content delivery system, which I'll refer to as ingestion, provided content on a much larger scale than how our curators had gotten content into the library before. On average, our ingestion system handles the delivery and storage of thousands of tracks a day. But now we have two separate content systems storing our content.
We have the library, which has all the curation and musical analysis tied to it, as well as the ingestion system, with DDEX-formatted content and no musical analysis tied to it whatsoever. The tools our curators use are built on our legacy frameworks; they're built on older, outdated frameworks, and they're hard to use and clunky. They allow visibility into the library, but they don't have the means to actually see into our ingestion system. So we needed tools to help our curators and to facilitate getting content into Pandora.
So the question becomes: how do we bridge this gap? Well, first, we attempted the simple, naive solution. Why not do it the easy way? Utilizing existing infrastructure, we built a simple UI using existing APIs from our ingestion and library systems. This was the fast and easy way. So, obviously, what went wrong? Well, first off, the ingestion API's response times varied tremendously. If we were pulling a single object, it returned within normal milliseconds, but for search queries,
it would take seconds to minutes only to time out, and our curators would see a visible result showing that we couldn't get them their results because the request itself timed out. This led to our curators not being able to find content and going out to purchase it again, duplicating content within our system.
Another reason is that the search syntax itself was SQL-based. Ingestion used Postgres, and that meant our API requests had to send arguments in the form of SQL syntax.
This isn't the friendliest of user queries, and it added more to our curators' workload, as they had to become familiar with a clunky syntax. Thirdly, the ingestion API didn't have a good scoring system to determine which results mattered, so sorting on basic parameters like the release date, or sorting by artist name alphabetically, prompted expensive database queries, which again led to a delayed response or a timeout.
So, for example, if a curator was searching for Drake and saw all these results, they may actually be looking for the famous Drake that we all know, of Hotline Bling fame, or they might be looking to expand our catalog of underground rap artists and actually be looking for — I think it's pronounced Drakeo the Ruler. I did Google him; I'm not quite sure who he is, but he is a rap artist that you can apparently find on SoundCloud and YouTube.
So, with this, our curators had the problem of trying to find the content they were looking for while having to sift through a lot of noise. And then finally, we relied on asynchronous calls to provide our curators with a full view of content through the pipeline. In order to make these asynchronous calls to retrieve metadata, we rely on identifiers from our ingestion system, and because we weren't getting a response from our ingestion system, or the requests were timing out, this resulted in either delaying the request or not making it at all.
So, with all these issues, our curators still had to use their legacy tools, with no improvement to their workload and barely any visibility into our ingestion system. Clearly we needed a new solution, and we wanted to implement a search system that was actually usable. So what do we need? Well, first off, it needs to be fast — and FYI, these are all just going to be cat GIFs. The response needs to come back quickly. It needs to be reliable, just like this cat defending.
We want our response to actually return in time and be consistent. And leading into that, it also needs to be efficient. Content, as I said, is coming into our ingestion system at a high volume, and we need to be able to handle searching over a large, growing set of objects. It also needs to be up to date.
If we're storing all this information to allow our curators to search it, we want everything we retrieve to reflect what the data is at the time of the search. And then finally, it needs to be user-friendly. It doesn't make sense to build a search system that cannot be used by our users. Creating a search system from scratch is hard, complicated, and time-consuming.
So why reinvent it when there are other solutions out there? Let's dive into what went into building our search system. In deciding between open source and third-party, we considered the design goals mentioned before and what we needed to meet those requirements. We also considered that music is incredibly variable, and we wanted flexibility in our solution to handle any changes in content delivery.
We believed open source would match those design goals as well as provide the flexibility we were looking for, and in looking into open-source solutions, we went with Kafka and Solr. So let's go into more detail on Kafka and Solr and why they met our goals. First off, there's Kafka. For those of you who may not know, Kafka is a distributed streaming platform that uses message queues to support distributed data processing. At a high level, producers write messages to topics.
Consumers consume those messages, recording what they've read by tracking an offset, and they handle the data however they've been designed to. So why Kafka? Well, first off, it's an established open-source project, originally developed at LinkedIn. It's been around for a while, and it's used for multiple use cases, commonly log aggregation and stream processing of data. And leading into that,
Kafka messages are produced in near real time, so our search system gets its data in near real time. The retention time for Kafka messages can also be configured, so messages can be re-delivered and re-consumed; we can rely on Kafka to be reliable and give us fault tolerance. So Kafka delivers content data to us in a fast and reliable way.
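As a rough illustration of that consumer side, here is a minimal sketch, in plain Java with the standard Kafka client, of a consumer that subscribes to a couple of topics and commits its offset after handling each batch. The broker address, consumer group, and topic names are placeholders, not Pandora's actual configuration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IngestionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "content-search-indexer");    // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("enable.auto.commit", "false");            // commit offsets only after indexing

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic names; the talk only says "multiple topics".
            consumer.subscribe(List.of("ingestion.albums", "ingestion.tracks"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    // parse the payload and update Solr here
                }
                consumer.commitSync(); // record our offset once the batch is handled
            }
        }
    }
}
```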
So how do we handle that data once we have it? Where do we store it, and how do we surface it back to our users? Well, a big portion of that is due to Solr. Solr is likewise a reliable and scalable search platform, and it provides indexing and full-text search. A big reason we chose Solr — for those of you wondering, "why not Elasticsearch?" — is that we had previous developer experience with it.
Another reason is Solr's schema flexibility. For music, there are actually multiple types of objects that matter to our curators, such as albums and tracks. With a flexible schema, we can represent the multiple types of objects that we needed for our curators.
Also, Solr lets us customize the indexing, so we can normalize however we need using tokenizers and filters. Solr communicates through REST APIs, which allows easy integration into the web-based UI platform that surfaces the data to our curators. And finally, Solr has a set of features called SolrCloud that lets you configure what each node in a cluster does, as well as maintain replicas, so that we get fault tolerance.
So our search system is made up of the Kafka messages, a parser, and the Solr documents. Zooming in, we're going to follow some arrows: Kafka messages come in and are routed and parsed, and when we parse the data, we either update an existing Solr document or create a new one. Then, as users send in search queries, each user query is translated into a Solr query, and the Solr query returns results back to the user.
So let's break it down even further and focus on the Kafka and routing implementation. For our search system to get the data it needs, we listen to multiple Kafka topics. These topics have messages coming in at various rates — whether it's 100 per minute or 100 per second really depends on which system is producing and what it's handling at the moment. Once a new message comes in on any topic, we use Apache Camel and Spring to route messages to corresponding functions that parse the data in the message payload into key-value pairs.
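As a sketch of what that routing could look like with Apache Camel and Spring, the route below sends each topic's messages to a type-specific parser bean. The topic names, bean names, and endpoint options are assumptions for illustration, not the production routes.

```java
import org.apache.camel.builder.RouteBuilder;
import org.springframework.stereotype.Component;

// A minimal sketch of Camel routing from Kafka topics to type-specific parsers.
@Component
public class IngestionRoutes extends RouteBuilder {
    @Override
    public void configure() {
        from("kafka:ingestion.albums?brokers=localhost:9092&groupId=content-search")
            .routeId("album-route")
            .bean("albumParser", "toSolrFields");   // hypothetical parser bean

        from("kafka:ingestion.tracks?brokers=localhost:9092&groupId=content-search")
            .routeId("track-route")
            .bean("trackParser", "toSolrFields");   // hypothetical parser bean
    }
}
```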
Is that better? All right. Ironically, for a music company, we keep having audio issues. It's not my fault, but I am cursed like that. So, we're actually writing two different functions, because again we're trying to represent multiple objects — in this case, we're looking at albums and we're looking at tracks.
What's interesting about the message is that the payload of our Kafka messages is actually the API response for the object, in JSON, compacted with MessagePack. What works about this is that we don't have to make API requests to our ingestion system. As I mentioned before, our ingestion system had issues where we would have problems retrieving results, so this freed us from worrying about timeouts.
In that respect, it's also similar to how other message systems are handled, so we didn't have to ask the ingestion team to create new topics for us. What created some problems is that, even with the message payload compacted, the messages can be huge, and they can still arrive at an extremely high rate. Handling all this information,
or all these messages, our consumers could get pummeled, and we would have to manually intervene. To help monitor that, we use Kafka Manager, an open-source Kafka tool originally developed at Yahoo. In the screenshot, you can see that we can look at our lag, and we can look at the partitions that allow us to parallelize consuming messages from topics.
We also have an internal tool that allows us to grep by the message ID to view the message itself. Together, these let us monitor the lag and figure out why and where it's happening. So let's dive deeper into the Solr aspect — space joke.
For our Solr setup, our schema is thirty-some fields, and this is actually just a portion of the API response that was in the Kafka message. We worked with our curation team to pinpoint the fields that are relevant to them in search. From here, we use identifiers to see whether the document already exists.
If it does, we merge the document values, updating fields that have changed and keeping the fields that haven't. The reason we actually have to update Solr documents is that, with music, there are multiple reasons why we would get new deliveries or why content providers would resend messages. There could be a typo, or some business-logic change, and so they have to resend the message. That can affect values that we store, and so we have to update our documents.
If there are no matching identifiers and no Solr document exists, we create one. Then, finally, each field that we've updated or created goes through filtering and indexing.
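A minimal sketch of that create-or-update step using SolrJ is shown below: it fetches by identifier and either applies an atomic "set" update to the existing document or adds a new one. The collection URL, id scheme, and field handling are assumptions for illustration, not the actual Pandora code.

```java
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ContentIndexer {
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/content").build();

    public void upsert(String id, Map<String, Object> parsedFields) throws Exception {
        SolrDocument existing = solr.getById(id);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        if (existing != null) {
            // Atomic update: only "set" the fields that arrived in this delivery;
            // fields not mentioned keep their previous values.
            parsedFields.forEach((name, value) -> doc.addField(name, Map.of("set", value)));
        } else {
            parsedFields.forEach(doc::addField); // brand-new document
        }
        solr.add(doc);
        solr.commit();
    }
}
```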
We use a lot of the filters that come out of the box with Solr, such as the edge n-gram filter, which breaks words into n-grams that let us match queries in a search-as-you-type way. Another important one that we use in conjunction with that is the stop word filter.
For the stop word filter, we actually worked with our curation team to analyze which words are so common that they just add noise to a search result. Paired together, we score highest the results whose indexed n-grams match the search query most closely, against values that have been stripped of unhelpful words.
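To make the combination concrete, here is a toy, plain-Java illustration — not Solr configuration — of what indexing with a stop word filter plus edge n-grams produces: stop words are dropped, and each remaining term is expanded into its prefixes, which is what enables search-as-you-type matching. The stop word list and gram sizes are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class EdgeNGramDemo {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of"); // illustrative list

    public static List<String> analyze(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(term)) continue;      // stop word filter
            int max = Math.min(maxGram, term.length());
            for (int len = minGram; len <= max; len++) {
                grams.add(term.substring(0, len));         // edge n-grams (prefixes)
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints ["ho", "hot", "hotl", ..., "bl", "bli", "blin", "bling"]
        System.out.println(analyze("Hotline Bling", 2, 10));
    }
}
```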
We also designed our Solr schema as a single schema, and what I mean by this is that we represented multiple types of objects within one schema.
Whatever the release type is, we can now look across multiple release types, like albums or tracks — at one type specifically, or at everything in general. What's great about this is that we don't have to do cross-schema searches; just like joins in databases, those are expensive and slow. What's not so great about doing this is that we're overloading fields.
With SolrCloud, we have split our index into four shards, with a replica on each of three servers, which results in twelve shard replicas. This, again, gives us replication and fault tolerance.
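For reference, a layout like that could be created with the SolrCloud Collections API; the sketch below uses SolrJ with placeholder collection, config, and ZooKeeper names and is only meant to show the shard/replica arithmetic.

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateContentCollection {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                java.util.List.of("zk1:2181"), java.util.Optional.empty()).build()) {
            CollectionAdminRequest
                .createCollection("content", "content_config", /*numShards*/ 4, /*numReplicas*/ 3)
                .process(client);   // 4 shards x 3 replicas = 12 shard replicas
        }
    }
}
```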
And lastly, let's look at the parser, which doesn't really map to either Kafka or Solr. The reason for it becomes clear if we look at an example Solr query — this one is from the Solr documentation, and it's, well, quite a query.
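The parser is our own layer between the curators and that raw Solr syntax. As a hedged sketch of the kind of translation involved — the field names, boosts, and filter ranges here are invented for illustration — something simple a curator types, like an artist name and a year, gets expanded into the full Solr query:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class CuratorQueryTranslator {
    public SolrQuery translate(String artist, Integer year) {
        SolrQuery query = new SolrQuery();
        // Boost artist-name matches over title matches (illustrative fields/boosts).
        query.setQuery("artistName:(" + artist + ")^2 OR title:(" + artist + ")");
        if (year != null) {
            query.addFilterQuery("releaseDate:[" + year + "-01-01T00:00:00Z TO "
                                 + year + "-12-31T23:59:59Z]");
        }
        query.setSort("score", SolrQuery.ORDER.desc);
        query.setRows(25);
        return query;
    }
}
```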
So how did we do? Comparing our initial implementation — the UI on top of the existing APIs — versus our new search system, the results included speed: we improved our response times from a range of 3 seconds or longer to roughly 200 to 300 milliseconds. We became more reliable in that we no longer had to make calls to the ingestion API, resulting in no more timeouts.
We were efficient, in that Solr gave us a way to manage searching over a large, growing set of data and to provide those results in an up-to-date manner, because the messages coming in from Kafka arrive in near real time and the data is indexed by Solr. And then, finally, we created a user-friendly syntax that made more sense to our curators and allowed them to actually use this tool. Beyond that,
what our search system helped do is create a new interface with access to both content systems — the ingestion system and the library. This helped facilitate product launches into the on-demand streaming space and, most importantly, it has become the foundation for future internal-tool development for our curators. Currently, the search system is used not only by our curators but by other internal content teams at Pandora. As for improvements we've made on top of our Solr–Kafka search system: for Kafka, to handle errors outside of our system,
we use Apache Camel to implement a retry-and-backoff scheme, exponentially increasing the delay between reprocessing attempts for messages coming in on reprocessing topics.
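A minimal sketch of such a retry route in Camel, with made-up topic and bean names and delay values:

```java
import org.apache.camel.builder.RouteBuilder;

// Exponential retry/backoff for messages that fail while being reprocessed.
public class ReprocessingRoute extends RouteBuilder {
    @Override
    public void configure() {
        errorHandler(defaultErrorHandler()
                .maximumRedeliveries(5)
                .redeliveryDelay(1000)        // start at 1s ...
                .useExponentialBackOff()
                .backOffMultiplier(2));       // ... then 2s, 4s, 8s, 16s

        from("kafka:ingestion.reprocess?brokers=localhost:9092&groupId=reprocessor")
            .bean("reprocessor", "handle");   // hypothetical bean that may throw
    }
}
```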
For Solr, we decreased inconsistencies during reindexing. A reindex happens either because we are adding or changing fields in the schema, or because there is some mass update from an upstream system like ingestion, and this makes our search system inconsistent for a certain period of time.
We have implemented a manual fix to force-update documents as a temporary stopgap, but more could be done. We also need to manage our backups and log rotations: for our search system, our Solr cluster has actually gone down because we ran out of space and memory.
We could also improve our monitoring to watch our memory usage and make sure we avoid those outages before they happen. For Kafka, we could expand our usage into our legacy systems, so that we can move away from them by using Kafka to send those messages, and we could also improve the monitoring there.
As I mentioned, we're using Kafka Manager and an internal tool, but we could do more to monitor when we're getting high bursts of messages and coordinate with internal and external teams to manage when those high streams of messages come in. And there are alternatives: we specifically picked Kafka and Solr, but for your case, in creating a search system, maybe Elasticsearch makes more sense, maybe RabbitMQ makes more sense.
There are a lot of options out there thanks to open source, and that's the cool part of it. So, in summary: efficient search can be implemented with open source, and it's scalable and flexible to your design needs and goals. And lastly, I want to thank all the people at Pandora — my colleagues who have contributed to, and continue to use and improve, our search system — with special thanks to Sam Douglas, Omar Cruz, Josh Proline, Sharda Paige, Damon Wood, and Tom LeClair for their continued work on our search system. Yeah, that was super fast, guys. Thanks.