Description
We'll discuss our experience running a very large Git repository with many projects and contributors, weigh the pros and cons of large repositories, introduce some enhancements we are testing to improve fetch performance through journaled replication, and cover some other optimizations we are pursuing.
Without further ado, I'm going to invite our next speaker up. He works at Twitter; it's about scaling Git at Twitter. Some of you have heard of Twitter? Is there anybody who hasn't heard of Twitter? Okay, it's sort of a social app. It's not big in France, but it's quite big in the United States, so he's going to tell you all about how they scale Git at Twitter. So please welcome Wilhelm Bierbaum.
Yeah, I have worked previously on the front-end systems and the traffic management systems, and now I work on Git at Twitter. We've basically decided that it's really important to make Twitter a good place to work for developers, and source control was one of those areas where we were kind of lacking.
So Twitter now runs development of all of its services out of a single monolithic repository. It's one huge repository that used to be three somewhat large repositories, for better or worse. Working in a single repository is the way that people prefer to develop software when faced with developing hundreds of services and dealing with thousands of build targets. In addition to being in a single repository, at this point we have a single build system that helps us build everything consistently.
For instance, why would you put everything in one repository? Well, one thing is that it's visible: it's easier for people to find code if they're looking in one repository. While code search solutions that can target more than one repository obviously exist, especially on GitHub and these kinds of things for enterprises, they're not as fluent. So instead of asking which repository the code I'm looking for is in, it's just simpler to have it all in one place.
In addition, we run a single toolchain, so there's a single set of tools to build, test, deploy and operate the services that are developed in the repository. When people make improvements to these tools, everyone benefits, because everything's on a single toolchain. We also rely quite a bit on IDL, interface definition language.
It's also easier to understand the impact and scope of changes that you make, so using a single build tool and repository has benefits for the planning aspect of coding. Since everything can be compiled together, it's easy to make a change and see what breaks. Rather than having to submit changes to multiple repositories, build those repositories, and change the dependencies among them, we can just edit files, land changes that might affect multiple systems all within a single commit, and then run the tests and see what happens.
There are many objects required for the complete representation of an entire repository's history, and those occupy considerable storage resources. Great numbers of them can cause normal operations, like git status, to perform quite slowly. Tuning the file system only goes so far, and that's why people end up trimming things out of history and possibly partitioning their repositories.
One example of partitioning a large repository that many people might be familiar with is what's done in the Android project. The maintainers of the Android project have chosen to partition their build tree into many smaller repositories, and then they use an external tool called repo to synchronize the projects that are at the top level. So there may be, you know, Android at the top level and the code that's part of that.
The reality is that a lot of developers don't feel very comfortable with submodules, despite the fact that submodules have really improved a lot over the course of the history of Git's development. The commands that you use in everyday Git, such as add or commit, don't recognize submodule boundaries very well. So if you make changes in a submodule and you're at the top level, you won't be able to actually add and make those commits in parallel.
So I'm going to talk about how we use Git at Twitter for a minute. We use Git in a very centralized way, and unfortunately, and I think this is the case for a lot of people, certainly people who use GitHub, we don't really benefit from the fact that it's a distributed version control system, beyond the fact that it's good at managing and merging patches.
For development of the services, we try to do it as close to master as possible, and we discourage maintaining long-running branches. The goal is to have as close as possible a view of the entire system in a single version. Having everyone work against master limits the amount of coordination that's necessary between developers to make changes, and has a secondary effect of minimizing the number of conflicts developers have to resolve when they merge their changes.
Our topology is that we have a lot of read-only replicas that mirror changes from a highly available read/write server. As developers push changes, those changes are written through to the read-only cluster. In any given Git installation, it's likely that there are a lot of applications, for instance continuous integration and tooling, that read data more than they write it, and so scaling out our read-only cluster has helped us meet the demand.
In the context of our read-heavy workload, we use reference repositories extensively when we're doing parallel test runs and these kinds of things, where you have many copies of a repository that need to be present on a machine, but, you know, they don't actually necessarily need an entire clone each. Reference repositories, if you're not familiar with them, allow you to have a shared object backing store with a separate working copy, a separate log, and all these things.
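As a minimal sketch of how test tooling might set this up (the paths and URL here are hypothetical, not Twitter's): a shared bare clone serves as the object store on each machine, and every worker clone borrows objects from it via `git clone --reference`.

```python
import subprocess

# Hypothetical paths: a shared, pre-fetched object store on the test
# machine, and the upstream URL it mirrors.
CACHE = "/var/cache/git/source.git"
UPSTREAM = "https://git.example.com/source.git"

def make_worker_clone(workdir: str) -> None:
    """Create a cheap working copy that borrows objects from CACHE.

    `git clone --reference` records the cache in
    .git/objects/info/alternates, so the new clone shares the object
    backing store instead of copying the whole history.
    """
    subprocess.run(
        ["git", "clone", "--reference", CACHE, UPSTREAM, workdir],
        check=True,
    )
```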
So if there are a lot of changed branches and you haven't fetched for quite some time, you might run into some locking problems, because there are pretty low file descriptor limits on certain operating systems; Mac OS X, in particular, runs into these problems in its default configuration. When you change references, you actually have to take a lock file for each reference, and you have to take a lock file, possibly, for the packed refs, and so on.
So when you have a lot of these, you might run into the file descriptor limit and transactions might fail, which is not great. Common commands like status also sometimes take quite a while in the presence of many objects and references, regardless of whether the repository is well packed.
We're experimenting with several changes to improve the performance of these repositories. Our goal is to make the performance of fetch, push, status, commit and branch faster specifically, since these are the most commonly used commands. To improve status performance: some people might be familiar with file alteration monitors. Facebook has actually put this into action in their Mercurial implementation; they have developed a file alteration monitor called Watchman that Mercurial consults to see which files have changed since the last time it asked, or there are some other semantics, but largely this is the case.
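A rough sketch of that model, using Facebook's pywatchman bindings (this assumes the bindings are driven this way, and the repository root is made up; it is illustrative only, not how Mercurial's integration is wired up):

```python
import pywatchman

root = "/home/dev/source"  # hypothetical repository root

client = pywatchman.client()
# Ask the watchman daemon to watch the tree (it dedupes repeat calls).
client.query("watch-project", root)

# Record the daemon's logical clock now; later queries with "since"
# return only files changed after this point, so the caller never has
# to stat() the whole tree itself.
clock = client.query("clock", root)["clock"]

result = client.query("query", root, {
    "since": clock,
    "fields": ["name"],
})
changed = result.get("files", [])
```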
So what are we doing in git status? One of the most expensive things is making the system calls to stat files, to see which ones have changed since last time, and Watchman alleviates that by pulling all of that into a user-land process. This has a pretty pronounced effect on Mac OS X, which pretty much every one of our developers uses, since HFS+ has pretty poor performance compared to Linux and ext4 when the default kernel configuration is used.
We also have a new index format. The index tracks the state of files in Git, and we've adopted a faster hashing algorithm that uses native instructions, so it allows our index to perform a little bit better. So, as I mentioned before, when you connect to the server, it sends you all these things that you probably already have: as soon as clients connect, they receive the list of references.
In the context of large repositories with many references, that list can be huge; in our case, it's about 13 megabytes of raw data. All this data has to be sent each time, despite the fact that only a small fraction of branches and tags might have changed between fetches. To address sending this huge piece of data, we've started experimenting with having clients send a bloom filter representing the present state of their references.
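The idea, sketched below, is that a few kilobytes of filter can stand in for megabytes of ref advertisement. This is just the concept, not Twitter's actual wire format:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over ref states (concept only)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from salted SHA-1 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Client side: insert every local ref as b"refname@sha".
# Server side: advertise only refs whose current "refname@sha" is NOT
# in the filter. A false positive hides a changed ref until the next
# full advertisement, so the filter must be sized to keep that rate tiny.
```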
So this means that when you're actually fetching, it can take minutes to negotiate which data you actually want, and then there's the transfer time on top of that to get the data. Bitmap indices help in this area, but they're pretty computationally expensive for us to keep up to date, especially when they need to be repacked, and we have a central deployment with pretty significant requirements to have the same data on multiple machines.
After the pack is appended, we also append a record of which reference was modified, so you have the entire data necessary to extract all the changes and to know which refs changed, and these are saved to something called the extents file. So there are two files: there's the extents file and there's the journal file.
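Something like the following captures the relationship between the two files. The record layout here is purely illustrative; the real on-disk format is Twitter's own:

```python
import struct

# journal file: raw pack data from each push, appended in order.
# extents file: one fixed-size record per ref update, pointing into
#               the journal.
EXTENT = struct.Struct(">QQ40s64s")  # journal offset, pack length,
                                     # new SHA-1 (hex), ref name (padded)

def append_push(journal, extents, pack: bytes, ref: str, sha: str) -> None:
    """Append one push: pack bytes to the journal, then its record."""
    offset = journal.seek(0, 2)      # append-only: write at the end
    journal.write(pack)
    extents.write(EXTENT.pack(offset, len(pack),
                              sha.encode("ascii"), ref.encode("ascii")))
```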
If the clients are configured to use these journals, when they connect to the server, they attempt to retrieve the extents file. All the data in these files is append-only, so this synchronization can actually be achieved purely by requesting the bytes beyond the length of the current extents file that the client already downloaded or was provisioned with. I'll talk a little bit more about the client provisioning later. So after they receive that data in the extents file, they know which references and which parts of the journals have changed.
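That append-only property makes the client's sync step almost trivial; a minimal sketch, with a hypothetical URL and local path, might look like this:

```python
import os
import urllib.request

def sync_extents(url: str, local_path: str) -> bytes:
    """Fetch only the extents bytes we don't have yet.

    Because the file is append-only, a single HTTP Range request from
    our current file length to the end is a complete sync. (Error
    handling is omitted; a 416 response would just mean nothing new.)
    """
    have = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib.request.Request(url, headers={"Range": f"bytes={have}-"})
    with urllib.request.urlopen(req) as resp:  # expect 206 Partial Content
        new_bytes = resp.read()
    with open(local_path, "ab") as f:
        f.write(new_bytes)
    return new_bytes
```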
They try to retrieve the journal or journals, and then they extract data from them. The way that they extract data is what we call replaying the transactions: the packs and their indices are extracted from the journal, and the refs are updated in batch through ref update, after they're pre-processed to reduce the unnecessary number of transitions. For instance, mutations of a branch that is later deleted turn into a deletion only. After this, they basically write down how far into the extents log they've gotten.
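The pre-processing step amounts to keeping only the last state of each ref; here's a sketch of that reduction, with `git update-ref --stdin` as one plausible way to apply the batch (the helper is hypothetical):

```python
def collapse_ref_updates(transactions):
    """Reduce a replayed sequence of ref updates to final states only.

    `transactions` is an iterable of (refname, new_sha) pairs in journal
    order, with new_sha of None meaning a deletion. Later entries win,
    so intermediate mutations of a branch that is ultimately deleted
    collapse into a single delete, as described above.
    """
    final = {}
    for ref, sha in transactions:
        final[ref] = sha
    return final

# The collapsed map can then be applied in one batch, e.g. by feeding
# "update <ref> <sha>" / "delete <ref>" lines to `git update-ref --stdin`.
```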
So they can remember, next time, where to start in the extents log. And to appease the garbage collector and to prevent the proliferation of really tiny pack files representing each individual transaction, we invoke another process, like a repack, that runs in the background and combines all the small pack files into a larger one. This kind of doesn't play well with the heuristics of pack files, but in practice people end up doing full repacks often enough that this isn't much of a problem.
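Stock Git can already do the consolidation part; a sketch of what that background step might boil down to (the wrapper is hypothetical, the repack flags are standard):

```python
import subprocess

def consolidate_packs(repo_path: str) -> None:
    """Fold many tiny replay packs into one larger pack.

    `git repack -a -d` rewrites all reachable objects into a single
    pack and drops the now-redundant small ones, which is roughly what
    the background process described above needs.
    """
    subprocess.run(["git", "-C", repo_path, "repack", "-a", "-d", "-q"],
                   check=True)
```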
So the fact that this is log-structured gives us a few operational benefits. It's really cheap to serve thousands of clients off one machine, and the data they read is pretty much always going to be in the file system cache. So all we're doing to replay, to give them the data they want, is sending files, sending bytes over the wire. And since all the files are essentially static data on the disk, there's the possibility of introducing intermediate HTTP caches that understand range requests between the client and the server.
So we can set up somewhat dumb proxies in far-off places that yield much better performance for the people that we're trying to serve changes to. Basically, without these journaled fetches, our repository won't fetch well at all, and this gives us a much more predictable runtime and is often faster than git fetch.
Instead of cloning with a regular git clone, we have people fetch initial snapshots through BitTorrent, so that it doesn't take so long to actually extract all this stuff. For some users, fetching and replaying all the changes might be overkill, so you can use this on a per-ref basis. You might have this set up so that master always gets journals, and then the rest of the branches are fetched from a differently configured remote where the journals aren't necessary.
That way you don't end up transferring everybody's branches along with, you know, the code that you're actually going to share. And finally, and this is the most important thing that people point out, it doesn't support redaction. So if people push stuff that shouldn't be there, like private keys and such, we don't have any way to delete it. But realistically, in Git, people could have already read those objects and have them on their local machine, and there's no way to actually enforce that objects are deleted on clients.
We're not sure this stuff is right for everybody. We've never really wanted to maintain a fork; we kind of do right now, but if any of these optimizations appeal to people, and I think that possibly the ref mutation optimizations might, we'd love to work on upstreaming those. By design, we've tried to make all this stuff have a pretty small footprint in the codebase.
B
We're
modifying
existing
programs
is
concerned
and
integrating
these
changes
should
be
pretty
easy.
All
these
optimal.
These
optimizations
are
totally
optional
and
their
can
configure
through
our
control
through
configuration
and
with
that
anybody
has
any
questions.
I
don't
have
that
much
time
left.
But
if
you
have
questions-
and
you
want
to
talk
to
me
afterwards-
that'll
be
fine.
Does any of it make sense to upstream?

I think that, specifically, the thing where we send a bloom filter that indicates which branches need to come down is a huge optimization that could be upstreamed. The file alteration monitor is a change that we tried to upstream, but other people were working on that at the same time, and in Git there's a huge tendency to not want to take dependencies outside of Git itself, which is fine, but yeah.
It would make sense to upstream a few of these changes. The journal isn't for everybody; it's great for situations where you have a huge repository and everybody's getting the same thing all the time, which is the case that we had, and it's also great for mirroring: it's much faster than using the regular fetch mechanism for mirroring. But it might not make as much sense for everybody.
I'd imagine that it could be separated out and sent as a contribution, or as a separate extension to Git, such as what happens with git-annex and these kinds of things, but it might never have a place in Git proper. It's still possible to achieve these things within the scope of, you know, an extension; there aren't significant modifications necessary to make this happen. The only real modifications are to the tools that you use to fetch and pull, for instance. Cool.