Description
In this quick deep dive, Tim will provide a peek into the system architecture and technical underpinnings of GitHub code search.
As always, feel free to leave us a comment below and don't forget to subscribe: http://bit.ly/subgithub
Thanks!
Connect with us.
Facebook: http://fb.com/github
Twitter: http://twitter.com/github
LinkedIn: http://linkedin.com/company/github
About GitHub
GitHub is the best place to share code with friends, co-workers, classmates, and complete strangers. Millions of people use GitHub to build amazing things together. For more info, go to http://github.com
This afternoon the sun's beautiful. I hope you had a chance to catch the keynote and to see Colin's demo of the new code search and navigation experience that we're releasing this week. In this session, we're going to take a quick peek into the system architecture and the technical underpinnings of that product. There's a lot to cover, because, as you might already know, we built our own search engine for this, so sit up straight and get ready to talk about inverted indices and trigrams.
Well, you can't say that we haven't tried, but if you've used GitHub search anytime in the last 14 years, you might have some complaints. The truth is, from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search: the user experience is poor, it's very, very expensive to host, and it's slow to index. And while there are some newer code-specific open source projects out there, they definitely don't work at our scale. So our motivation is in three parts.
Number one, we want to enable an entirely new user experience. It's about being able to ask questions of code and get answers through iteratively searching, browsing, navigating and reading that code. We understand that code search is uniquely different from general text search: code is already designed to be understood by machines, and we should be able to take advantage of that structure and relevance.
When we say scale, we mean hundreds of millions of repositories, billions of documents and terabytes of storage space for the indices, and we still want queries to be fast. So, inspired by a bunch of smart people, we built something from scratch, as we've talked about on the GitHub blog and in some of the other sessions. We call this search engine Blackbird. We hope you like it, and I'd like to give you just a little glimpse into how it works.
Let's start with a search query you might type into github.com: /arguments?/. Surrounding my query in slashes makes this a regular expression query. The question mark simply means the preceding character, s, is optional, so I'm looking for "argument" or "arguments" across all the code on GitHub that I have read access to. I want you to notice a few things.
First, note that we just ran a global regular expression query that matches over 50 million files and returned the top 100 results in less than a second. That's insane. Second, I want you to note that the results we get back are consistent on a commit-by-commit basis. This is not the case for the old search engine, or for many search engines in general.
Okay, if you don't have an index, then serving a query means scanning all of that content at query time. This is the experience you get when using a tool like grep or, my favorite, ripgrep. ripgrep is an amazing piece of software, but doing some napkin math on running rg over 76 terabytes of content, we can see that this isn't really going to work.
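To make that napkin math concrete, here's a rough sketch; the single-host scan rate below is an assumed number for illustration, not a measured ripgrep benchmark.

    # Napkin math: scanning 76 TB of content at query time,
    # assuming a hypothetical single-host scan rate of about 1 GB/s.
    content_bytes = 76e12        # 76 TB of content
    scan_rate = 1e9              # assumed bytes per second for one grep-style scan
    hours = content_bytes / scan_rate / 3600
    print(f"{hours:.0f} hours per query")   # roughly 21 hours, clearly not workable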
An n-gram index is a special type of inverted index that's well suited for looking up substrings in content. In this example, for the key "lim" we find millions of documents, again represented as lists of integer IDs, where the characters l, i and m appear in that exact sequence. Intersecting the results of multiple lookups gives us a list of documents where the string "limits" appears; with a trigram index you need four lookups, "lim", "imi", "mit" and "its", in order to fulfill this query.
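As a toy illustration of the idea (my own sketch, not Blackbird's actual data structures), a trigram index maps each three-character key to a posting list of document IDs, and a substring query intersects the lists for its trigrams:

    # Toy trigram index: gram -> set of document IDs (posting list).
    def trigrams(s):
        return {s[i:i+3] for i in range(len(s) - 2)}

    def build_index(docs):
        index = {}
        for doc_id, text in enumerate(docs):
            for gram in trigrams(text):
                index.setdefault(gram, set()).add(doc_id)
        return index

    def search(index, query):
        # "limits" needs the four lookups lim, imi, mit and its.
        lists = [index.get(g, set()) for g in trigrams(query)]
        return set.intersection(*lists) if lists else set()

    docs = ["the limits of my language", "limit break", "unlimited data"]
    print(search(build_index(docs), "limits"))   # candidate documents for "limits"

Note that the intersection only yields candidates; the engine still has to confirm the characters are actually adjacent in each document, which is the double-checking step described later in the talk.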
We never have to keep entire indices or posting lists in memory.

Okay, let's build an index. That seems like a good idea, but there are still so many repos. Are we going to get all that repository content indexed in a reasonable amount of time? Remember that this took months in our first iteration on Elasticsearch. How much disk space are we going to need? What about forks, which are not supported for searching in the current code search?
So we want our indexing and ingest to take advantage of these two insights. First, we're going to shard by blob SHA, which gives us a nice way of evenly distributing documents between shards while avoiding any duplication. Because the SHA is a SHA-1 hash of the file content, there won't be any hot servers due to special repos or special content, and we can easily scale the number of shards so that each host uses a reasonable amount of disk space.
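A minimal sketch of that routing idea (my own illustration; the shard count is made up): because the blob SHA is already a uniform hash of the content, deriving the shard from it both spreads load evenly and lets identical blobs land on the same shard exactly once.

    import hashlib

    NUM_SHARDS = 32   # assumed shard count, for illustration only

    def blob_sha(content: bytes) -> str:
        # Git-style blob SHA-1: hash of "blob <length>\0" followed by the content.
        return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

    def shard_for(content: bytes) -> int:
        # Identical content always maps to the same shard, so it is only indexed once.
        return int(blob_sha(content), 16) % NUM_SHARDS

    print(shard_for(b"MIT License\n"))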
Second, we're going to model the index as a tree and use delta encoding to reduce the amount of crawling that we have to do and to optimize the metadata in our index. For us, metadata is things like the list of locations where a particular document appears (which path, branch and repo) and information about those objects, like the repository name, its owner and whether it's public or private. This data can be quite large for popular content; think about something like the MIT license.
Since we have a global index, we also want to design the system so that query results are consistent on a commit-level basis. If you search a repo while your teammate is pushing code, your results shouldn't include documents from the new commit until that commit has been fully processed by our system.
In fact, while you're getting back results from a repo-scoped query, someone else could be paging through global results, looking at a different, prior, but still consistent state of the index. This is really tricky to do with other search engines, but Blackbird provides this level of query consistency for both repo-scoped and globally scoped queries.
In this case, the sets that we're comparing are the contents of each repo, which we represent as path and blob SHA tuples. Armed with that knowledge, we can now construct a graph where the vertices are repositories and the edges are weighted with this similarity metric. Calculating a minimum spanning tree of this graph, using similarity as the cost, and then doing a level-order traversal of the tree gives us the ingest order in which we can make the best use of delta encoding.
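Here's a small sketch of that ordering step. It's an illustration under my own assumptions (Jaccard overlap as the similarity metric, networkx for the graph work, made-up repo names), not Blackbird's actual crawler:

    import networkx as nx

    def similarity(a, b):
        # Assumed metric: Jaccard overlap of the repos' (path, blob SHA) sets.
        return len(a & b) / len(a | b)

    # Hypothetical repos represented as sets of (path, blob_sha) tuples.
    repos = {
        "upstream": {("README.md", "aa11"), ("src/main.c", "bb22")},
        "fork-1":   {("README.md", "aa11"), ("src/main.c", "cc33")},
        "fork-2":   {("README.md", "aa11"), ("src/main.c", "bb22"), ("docs.md", "dd44")},
    }

    G = nx.Graph()
    names = list(repos)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            # Lower cost = more similar, so the MST keeps the most similar pairs adjacent.
            G.add_edge(u, v, weight=1.0 - similarity(repos[u], repos[v]))

    mst = nx.minimum_spanning_tree(G)
    # A level-order (breadth-first) walk of the tree is the ingest order.
    order = ["upstream"] + [v for _, v in nx.bfs_edges(mst, "upstream")]
    print(order)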
That gets us ninety percent of the delta encoding benefit that we're going for. Each repo is then crawled by diffing it against its parent in this delta tree, and that means we only need to crawl the blobs that have changed. There's a little bit of tricky sequencing, and another novel data structure that we use, that allows the crawlers to share some state with the search index, but crawling then just involves fetching blob content from Git, analyzing it to extract symbols, and creating documents that will be the input to indexing.
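As a tiny illustration of that diffing step (again my own sketch, not the real crawler), only the (path, blob SHA) tuples that are new or changed relative to the parent in the delta tree need to be crawled:

    def blobs_to_crawl(parent, child):
        # Set difference of (path, blob SHA) tuples: unchanged blobs are skipped.
        return child - parent

    parent = {("README.md", "aa11"), ("src/main.c", "bb22")}
    child  = {("README.md", "aa11"), ("src/main.c", "cc33"), ("docs.md", "dd44")}
    print(blobs_to_crawl(parent, child))   # only the changed and added blobs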
These documents are published to another Kafka topic, and this is where we partition the data between the shards. Each shard consumes from its own single Kafka partition. The ordering of the topic matters and is how we get query consistency, and indexing is decoupled from crawling, allowing each shard to move forward at its own pace.
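A minimal sketch of that consumption pattern, using the kafka-python client as an assumed stand-in (the topic name, shard count and index_document helper are all hypothetical, not Blackbird's real ingest code): each shard pins itself to exactly one partition, so it sees documents in a stable order and can advance independently of the other shards.

    from kafka import KafkaConsumer, TopicPartition

    SHARD_ID = 3   # hypothetical: this shard owns partition 3 of the documents topic

    def index_document(doc_bytes):
        pass   # placeholder: hand the document to this shard's indexer

    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        group_id=f"blackbird-shard-{SHARD_ID}",   # hypothetical naming
        enable_auto_commit=False,
    )
    # Explicitly assign a single partition rather than joining a balanced group,
    # so this shard alone owns the partition and its ordering.
    consumer.assign([TopicPartition("code-documents", SHARD_ID)])

    for message in consumer:
        index_document(message.value)
        consumer.commit()   # each shard advances at its own pace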
Our n-gram indices are especially interesting. While trigrams are a known sweet spot in the design space (as Russ Cox and others have noted, bigrams aren't selective enough and quadgrams take up way too much space), they cause some problems at our scale. For common grams, like "er" followed by a space, trigrams just aren't selective enough: we get way too many false positives, and that turns into slow queries.
An example of a false positive is finding a document that contains each individual trigram, but where they aren't next to each other. You can't tell this until you fetch the content of the document and double-check it, at which point you've done a whole lot of work and you have to throw away that result.
The solution we came up with is something we call sparse grams, and it works a little bit like this. Assume you have some weight function that, given a bigram, gives you back a weight. Using these weights, we tokenize by selecting intervals where the inner weights are strictly smaller than the weights at the borders; the characters covered by that interval, inclusive, make up your gram. We apply this algorithm recursively until its natural end at trigrams.
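Here's one way to read that description as code. This is my own loose interpretation with a made-up weight function; the real tokenizer and its weights aren't spelled out in the talk.

    def bigram_weight(bigram):
        # Hypothetical weight function, for illustration only.
        return (hash(bigram) % 97) + 1

    def sparse_grams(s):
        # Emit every gram whose border bigrams strictly outweigh all the bigrams
        # strictly inside the interval. Adjacent bigram pairs (j == i + 1) have no
        # inner bigrams, so plain trigrams are the natural floor of the scheme.
        w = [bigram_weight(s[i:i+2]) for i in range(len(s) - 1)]
        grams = set()
        for i in range(len(w)):
            for j in range(i + 1, len(w)):
                if all(x < w[i] and x < w[j] for x in w[i+1:j]):
                    grams.add(s[i:j+2])   # characters covered by bigrams i..j, inclusive
        return grams

    print(sorted(sparse_grams("limits")))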
So let's trace this query. The high-level architecture of the query path looks a little bit like this: in between github.com and the shards is a query service. It coordinates taking the user query and issuing requests to the individual shards in the cluster. We use Redis to manage a little bit of state, like some quotas, and a cache of user permissions. You're going to type this into a search box during our tech preview this year.
In this case, you can see how rewriting ensures that I'll get results from public repos or any private repos that I have access to. Then we're going to fan out and send n concurrent requests, one to each shard in the search cluster. Remember how we sharded the content by blob SHA; because of that, a query request must be sent to every shard. On the individual shard, we then do some further conversion of the query in order to look up information in its indices.
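As an illustration of that fan-out (a simplified sketch, not the actual query service; the shard count and query_shard helper are made up), every shard receives the same rewritten query concurrently, because content is distributed by blob SHA rather than by repo:

    from concurrent.futures import ThreadPoolExecutor

    NUM_SHARDS = 32   # assumed shard count, for illustration

    def query_shard(shard_id, query):
        # Placeholder: send the rewritten query to one shard and return its hits.
        return []

    def fan_out(query):
        # One concurrent request per shard; results are aggregated afterwards.
        with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
            futures = [pool.submit(query_shard, s, query) for s in range(NUM_SHARDS)]
            return [f.result() for f in futures]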
Here you can see that the regex gets translated into a series of substring queries on the n-gram indices that we talked about before. Mapping a regular expression into substring queries is a topic for a whole other talk, but this simple example should give you a little bit of an idea. We don't just use trigrams, as I talked about before; we instead construct dynamic gram sizes, and in this case the engine uses the grams that you can see on the slide.
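To give a flavor of that translation (a simplified sketch in the spirit of Russ Cox's trigram-query writing, not Blackbird's actual planner, and using plain trigrams rather than sparse grams), the two literal alternatives of /arguments?/ each become an AND over their grams, OR'd together:

    def grams(s, n=3):
        return [s[i:i+n] for i in range(len(s) - n + 1)]

    # /arguments?/ matches the literals "argument" and "arguments", so candidates are
    # AND(grams("argument")) OR AND(grams("arguments")). Since every gram of
    # "argument" is also a gram of "arguments", this collapses to AND(grams("argument"));
    # the regex is still run over each candidate to confirm a real match.
    query = {"or": [{"and": grams("argument")},
                    {"and": grams("arguments")}]}
    print(query)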
The iterators from each clause run: AND means intersect, OR means union, and the result is a list of documents. We still have to double-check each document to validate the matches and detect ranges for them, before scoring, sorting and returning the requested number of results. Back in the query service, we then aggregate results from all the shards, re-sort again by score, filter to double-check your permissions, and return the top 100 or whatever was requested by the front end. github.com then still has quite a bit of work to do.
Our p99 response times from the individual shards are something on the order of 100 milliseconds. Total response times are definitely longer, because it takes a lot of time to aggregate those responses, check permissions and do things like syntax highlighting, but overall the experience is pretty fast. A single query ties up a single CPU core on one shard for about 100 milliseconds, so our 64-core shards have an upper bound of something like 640 queries per second that they can serve, compared to the grep approach that we talked about at the beginning.
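Spelling out that back-of-the-envelope arithmetic:

    cores_per_shard = 64
    cpu_seconds_per_query = 0.1              # ~100 ms of CPU time per query per core
    qps_per_shard = cores_per_shard / cpu_seconds_per_query
    print(qps_per_shard)                     # ~640 queries per second per shard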
Remember those 10 billion documents we want to index? Blackbird can index around 100,000 documents a second, so working through those 10 billion documents should take something like 28 hours. But due to deduplicating by blob SHA and some of the delta compression that we talked about, we reduce the number of documents we have to crawl by something like 50 percent, which means we can re-index the entire corpus in about 14 hours.
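The same numbers, worked through:

    documents = 10e9
    docs_per_second = 100_000
    hours = documents / docs_per_second / 3600
    print(f"{hours:.0f} hours")              # ~28 hours for a full crawl
    print(f"{hours * 0.5:.0f} hours")        # ~14 hours after ~50% dedup and delta savings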
Plus, I should note that our hosts right now are only around 10 percent utilized, so there's plenty of headroom to go faster and for future scaling. There are some really big wins on the size of the index as well. Remember that we started with 76 terabytes of original content that we want to search: our delta indexing brings that down to around 22 terabytes of unique content, and the index itself clocks in at just 20 terabytes, which includes not only all the indices, including those n-gram indices, but also a compressed copy of all unique content.
This means that our index is roughly a quarter of the size of the original data. Now we're getting somewhere: this is a system that can scale. Not only that, this is a system that is really delightful to use. It puts your code, your company's code, your team's code and the world's code at your fingertips, ready to search, and there's no setup required. I'd like to invite all of you to sign up for our beta.