Description
In this quick deep dive, Tim will provide a peek into the system architecture and technical underpinnings of GitHub code search.
As always, feel free to leave us a comment below and don't forget to subscribe: http://bit.ly/subgithub
Thanks!
Connect with us.
Facebook: http://fb.com/github
Twitter: http://twitter.com/github
LinkedIn: http://linkedin.com/company/github
About GitHub
GitHub is the best place to share code with friends, co-workers, classmates, and complete strangers. Millions of people use GitHub to build amazing things together. For more info, go to http://github.com
This afternoon the sun's beautiful. I hope you had a chance to catch the keynote and to see Colin's demo of the new code search and navigation experience that we're releasing this week. In this session, we're going to take a quick peek into the system architecture and the technical underpinnings of that product. There's a lot to cover, because, as you might already know, we built our own search engine for this, so sit up straight and get ready to talk about inverted indices and trigrams.
Well, you can't say that we haven't tried, but if you've used GitHub search anytime in the last 14 years, you might have some complaints. The truth is, from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search: the user experience is poor, it's very, very expensive to host, and it's slow to index. And while there are some newer code-specific open source projects out there, they definitely don't work at our scale. So our motivation is in three parts.
Number one, we want to enable an entirely new user experience. It's about being able to ask questions of code and get answers through iteratively searching, browsing, navigating and reading that code. We understand that code search is uniquely different from general text search: code is already designed to be understood by machines, and we should be able to take advantage of that structure and relevance.
When we say scale, we mean hundreds of millions of repositories, billions of documents and terabytes of storage space for the indices, and we still want queries to be fast. So, inspired by a bunch of smart people, we built something from scratch, as we've talked about on the GitHub blog and in some of the other sessions. We call this search engine Blackbird. We hope you like it, and I'd like to give you just a little glimpse into how it works.
Let's start with a search query you might type into github.com: /arguments?/. Surrounding my query in slashes makes this a regular expression query. The question mark simply means the preceding character, s, is optional, so I'm looking for "argument" or "arguments" across all the code on GitHub that I have read access to. I want you to notice a few things.
First, note that we just ran a global regular expression query that matches over 50 million files and returned the top 100 results in less than a second. That's insane. Second, I want you to note that the results we get back are consistent on a commit-by-commit basis. This is not the case for the old search engine, or for many search engines in general.
Okay, if you don't have an index, then serving a query means scanning all of that content at query time. This is the experience you get when using a tool like grep or, my favorite, ripgrep. ripgrep is an amazing piece of software, but doing some napkin math on running rg over 76 terabytes of content, we can see that this isn't really going to work.
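To make that napkin math concrete, here's a rough sketch; the single-host scan rate below is an assumed number for illustration, not a measured ripgrep benchmark.

    # Napkin math: scanning 76 TB of content at query time,
    # assuming a hypothetical single-host scan rate of about 1 GB/s.
    content_bytes = 76e12        # 76 TB of content
    scan_rate = 1e9              # assumed bytes per second for one grep-style scan
    hours = content_bytes / scan_rate / 3600
    print(f"{hours:.0f} hours per query")   # roughly 21 hours, clearly not workable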
An n-gram index is a special type of inverted index that's well suited for looking up substrings in content. In this example, for the key "lim" we find millions of documents, again represented as lists of integer IDs, where the characters l, i and m appear in that exact sequence. Intersecting the results of multiple lookups gives us a list of documents where the string "limits" appears; with a trigram index you need four lookups, "lim", "imi", "mit" and "its", in order to fulfill this query.
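As a toy illustration of the idea (my own sketch, not Blackbird's actual data structures), a trigram index maps each three-character key to a posting list of document IDs, and a substring query intersects the lists for its trigrams:

    # Toy trigram index: gram -> set of document IDs (posting list).
    def trigrams(s):
        return {s[i:i+3] for i in range(len(s) - 2)}

    def build_index(docs):
        index = {}
        for doc_id, text in enumerate(docs):
            for gram in trigrams(text):
                index.setdefault(gram, set()).add(doc_id)
        return index

    def search(index, query):
        # "limits" needs the four lookups lim, imi, mit and its.
        lists = [index.get(g, set()) for g in trigrams(query)]
        return set.intersection(*lists) if lists else set()

    docs = ["the limits of my language", "limit break", "unlimited data"]
    print(search(build_index(docs), "limits"))   # candidate documents for "limits"

Note that the intersection only yields candidates; the engine still has to confirm the characters are actually adjacent in each document, which is the double-checking step described later in the talk.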
We never have to keep entire indices or posting lists in memory.

Okay, let's build an index. That seems like a good idea, but there are still so many repos. Are we going to get all that repository content indexed in a reasonable amount of time? Remember that this took months in our first iteration on Elasticsearch. How much disk space are we going to need? What about forks, which are not supported for searching in the current code search?
So we want our indexing and ingest to take advantage of these two insights. First, we're going to shard by blob SHA, which gives us a nice way of evenly distributing documents between shards while avoiding any duplication. Because the SHA is a SHA-1 hash of the file content, there won't be any hot servers due to special repos or special content, and we can easily scale the number of shards so that each host uses a reasonable amount of disk space.
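A minimal sketch of that routing idea (my own illustration; the shard count is made up): because the blob SHA is already a uniform hash of the content, deriving the shard from it both spreads load evenly and lets identical blobs land on the same shard exactly once.

    import hashlib

    NUM_SHARDS = 32   # assumed shard count, for illustration only

    def blob_sha(content: bytes) -> str:
        # Git-style blob SHA-1: hash of "blob <length>\0" followed by the content.
        return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

    def shard_for(content: bytes) -> int:
        # Identical content always maps to the same shard, so it is only indexed once.
        return int(blob_sha(content), 16) % NUM_SHARDS

    print(shard_for(b"MIT License\n"))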
Second, we're going to model the index as a tree and use delta encoding to reduce the amount of crawling that we have to do and to optimize the metadata in our index. For us, metadata is things like the list of locations where a particular document appears (which path, branch and repo) and information about those objects, like the repository name, its owner and whether it's public or private. This data can be quite large for popular content; think about something like the MIT license.
Since we have a global index, we also want to design the system so that query results are consistent on a commit-level basis. If you search a repo while your teammate is pushing code, your results shouldn't include documents from the new commit until that commit has been fully processed by our system.
In fact, while you're getting back results from a repo-scoped query, someone else could be paging through global results, looking at a different, prior, but still consistent state of the index. This is really tricky to do with other search engines, but Blackbird provides this level of query consistency for both repo-scoped and globally scoped queries.
In this case, the sets that we're comparing are the contents of each repo, which we represent as path and blob SHA tuples. Armed with that knowledge, we can now construct a graph where the vertices are repositories and the edges are weighted with this similarity metric. Calculating a minimum spanning tree of this graph, using similarity as the cost, and then doing a level-order traversal of the tree gives us the ingest order in which we can make the best use of delta encoding.
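Here's a small sketch of that ordering step. It's an illustration under my own assumptions (Jaccard overlap as the similarity metric, networkx for the graph work, made-up repo names), not Blackbird's actual crawler:

    import networkx as nx

    def similarity(a, b):
        # Assumed metric: Jaccard overlap of the repos' (path, blob SHA) sets.
        return len(a & b) / len(a | b)

    # Hypothetical repos represented as sets of (path, blob_sha) tuples.
    repos = {
        "upstream": {("README.md", "aa11"), ("src/main.c", "bb22")},
        "fork-1":   {("README.md", "aa11"), ("src/main.c", "cc33")},
        "fork-2":   {("README.md", "aa11"), ("src/main.c", "bb22"), ("docs.md", "dd44")},
    }

    G = nx.Graph()
    names = list(repos)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            # Lower cost = more similar, so the MST keeps the most similar pairs adjacent.
            G.add_edge(u, v, weight=1.0 - similarity(repos[u], repos[v]))

    mst = nx.minimum_spanning_tree(G)
    # A level-order (breadth-first) walk of the tree is the ingest order.
    order = ["upstream"] + [v for _, v in nx.bfs_edges(mst, "upstream")]
    print(order)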
That gets us ninety percent of the delta encoding benefit that we're going for. Each repo is then crawled by diffing it against its parent in this delta tree, and that means we only need to crawl the blobs that have changed. There's a little bit of tricky sequencing, and another novel data structure that we use, that allows the crawlers to share some state with the search index, but crawling then just involves fetching blob content from Git, analyzing it to extract symbols, and creating documents that will be the input to indexing.
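As a tiny illustration of that diffing step (again my own sketch, not the real crawler), only the (path, blob SHA) tuples that are new or changed relative to the parent in the delta tree need to be crawled:

    def blobs_to_crawl(parent, child):
        # Set difference of (path, blob SHA) tuples: unchanged blobs are skipped.
        return child - parent

    parent = {("README.md", "aa11"), ("src/main.c", "bb22")}
    child  = {("README.md", "aa11"), ("src/main.c", "cc33"), ("docs.md", "dd44")}
    print(blobs_to_crawl(parent, child))   # only the changed and added blobs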
These documents are published to another Kafka topic, and this is where we partition the data between the shards. Each shard consumes from its own single Kafka partition. The ordering of the topic matters and is how we get query consistency, and indexing is decoupled from crawling, allowing each shard to move forward at its own pace.
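A minimal sketch of that consumption pattern, using the kafka-python client as an assumed stand-in (the topic name, shard count and index_document helper are all hypothetical, not Blackbird's real ingest code): each shard pins itself to exactly one partition, so it sees documents in a stable order and can advance independently of the other shards.

    from kafka import KafkaConsumer, TopicPartition

    SHARD_ID = 3   # hypothetical: this shard owns partition 3 of the documents topic

    def index_document(doc_bytes):
        pass   # placeholder: hand the document to this shard's indexer

    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        group_id=f"blackbird-shard-{SHARD_ID}",   # hypothetical naming
        enable_auto_commit=False,
    )
    # Explicitly assign a single partition rather than joining a balanced group,
    # so this shard alone owns the partition and its ordering.
    consumer.assign([TopicPartition("code-documents", SHARD_ID)])

    for message in consumer:
        index_document(message.value)
        consumer.commit()   # each shard advances at its own pace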
Our n-gram indices are especially interesting. While trigrams are a known sweet spot in the design space (as Russ Cox and others have noted, bigrams aren't selective enough and quadgrams take up way too much space), they cause some problems at our scale. For common grams, like "er" followed by a space, trigrams just aren't selective enough: we get way too many false positives, and that turns into slow queries.
An example of a false positive is finding a document that contains each individual trigram, but where they aren't next to each other. You can't tell this until you fetch the content of the document and double-check it, at which point you've done a whole lot of work and you have to throw away that result.
The solution we came up with is something we call sparse grams, and it works a little bit like this. Assume you have some weight function that, given a bigram, gives you back a weight. Using these weights, we tokenize by selecting intervals where the inner weights are strictly smaller than the weights at the borders; the characters covered by that interval, inclusive, make up your gram. We apply this algorithm recursively until its natural end at trigrams.
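Here's one way to read that description as code. This is my own loose interpretation with a made-up weight function; the real tokenizer and its weights aren't spelled out in the talk.

    def bigram_weight(bigram):
        # Hypothetical weight function, for illustration only.
        return (hash(bigram) % 97) + 1

    def sparse_grams(s):
        # Emit every gram whose border bigrams strictly outweigh all the bigrams
        # strictly inside the interval. Adjacent bigram pairs (j == i + 1) have no
        # inner bigrams, so plain trigrams are the natural floor of the scheme.
        w = [bigram_weight(s[i:i+2]) for i in range(len(s) - 1)]
        grams = set()
        for i in range(len(w)):
            for j in range(i + 1, len(w)):
                if all(x < w[i] and x < w[j] for x in w[i+1:j]):
                    grams.add(s[i:j+2])   # characters covered by bigrams i..j, inclusive
        return grams

    print(sorted(sparse_grams("limits")))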
So let's trace this query. The high-level architecture of the query path looks a little bit like this: in between github.com and the shards is a query service. It coordinates taking the user query and issuing requests to the individual shards in the cluster. We use Redis to manage a little bit of state, like some quotas, and a cache of user permissions. You're going to type this into a search box during our tech preview this year.
In this case, you can see how rewriting ensures that I'll get results from public repos or any private repos that I have access to. Then we're going to fan out and send n concurrent requests, one to each shard in the search cluster. Remember how we sharded the content by blob SHA; because of that, a query request must be sent to every shard. On the individual shard, we then do some further conversion of the query in order to look up information in its indices.
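As an illustration of that fan-out (a simplified sketch, not the actual query service; the shard count and query_shard helper are made up), every shard receives the same rewritten query concurrently, because content is distributed by blob SHA rather than by repo:

    from concurrent.futures import ThreadPoolExecutor

    NUM_SHARDS = 32   # assumed shard count, for illustration

    def query_shard(shard_id, query):
        # Placeholder: send the rewritten query to one shard and return its hits.
        return []

    def fan_out(query):
        # One concurrent request per shard; results are aggregated afterwards.
        with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
            futures = [pool.submit(query_shard, s, query) for s in range(NUM_SHARDS)]
            return [f.result() for f in futures]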
Here you can see that the regex gets translated into a series of substring queries on the n-gram indices that we talked about before. Mapping a regular expression into substring queries is a topic for a whole other talk, but this simple example should give you a little bit of an idea. We don't just use trigrams, as I talked about before; we instead construct dynamic gram sizes, and in this case the engine uses the grams that you can see on the slide.
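To give a flavor of that translation (a simplified sketch in the spirit of Russ Cox's trigram-query writing, not Blackbird's actual planner, and using plain trigrams rather than sparse grams), the two literal alternatives of /arguments?/ each become an AND over their grams, OR'd together:

    def grams(s, n=3):
        return [s[i:i+n] for i in range(len(s) - n + 1)]

    # /arguments?/ matches the literals "argument" and "arguments", so candidates are
    # AND(grams("argument")) OR AND(grams("arguments")). Since every gram of
    # "argument" is also a gram of "arguments", this collapses to AND(grams("argument"));
    # the regex is still run over each candidate to confirm a real match.
    query = {"or": [{"and": grams("argument")},
                    {"and": grams("arguments")}]}
    print(query)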
The iterators from each clause run: AND means intersect, OR means union, and the result is a list of documents. We still have to double-check each document to validate the matches and detect ranges for them, before scoring, sorting and returning the requested number of results. Back in the query service, we then aggregate results from all the shards, re-sort again by score, filter to double-check your permissions, and return the top 100 or whatever was requested by the front end. github.com then still has quite a bit of work to do.
Our p99 response times from the individual shards are something on the order of 100 milliseconds. Total response times are definitely longer, because it takes a lot of time to aggregate those responses, check permissions and do things like syntax highlighting, but overall the experience is pretty fast. A single query ties up a single CPU core on one shard for about 100 milliseconds, so our 64-core shards have an upper bound of something like 640 queries per second that they can serve, compared to the grep approach that we talked about at the beginning.
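Spelling out that back-of-the-envelope arithmetic:

    cores_per_shard = 64
    cpu_seconds_per_query = 0.1              # ~100 ms of CPU time per query per core
    qps_per_shard = cores_per_shard / cpu_seconds_per_query
    print(qps_per_shard)                     # ~640 queries per second per shard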
Remember those 10 billion documents we want to index? Blackbird can index around 100,000 documents a second, so working through those 10 billion documents should take something like 28 hours. But due to deduplicating by blob SHA and some of the delta compression that we talked about, we reduce the number of documents we have to crawl by something like 50 percent, which means we can re-index the entire corpus in about 14 hours.
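The same numbers, worked through:

    documents = 10e9
    docs_per_second = 100_000
    hours = documents / docs_per_second / 3600
    print(f"{hours:.0f} hours")              # ~28 hours for a full crawl
    print(f"{hours * 0.5:.0f} hours")        # ~14 hours after ~50% dedup and delta savings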
Plus, I should note that our hosts right now are only around 10 percent utilized, so there's plenty of headroom to go faster and for future scaling. There are some really big wins on the size of the index as well. Remember that we started with 76 terabytes of original content that we want to search: our delta indexing brings that down to around 22 terabytes of unique content, and the index itself clocks in at just 20 terabytes, which includes not only all the indices, including those n-gram indices, but also a compressed copy of all unique content.
This means that our index is roughly a quarter of the size of the original data. Now we're getting somewhere: this is a system that can scale. Not only that, this is a system that is really delightful to use. It puts your code, your company's code, your team's code and the world's code at your fingertips, ready to search, and there's no setup required. I'd like to invite all of you to sign up for our beta.