From YouTube: Git at Google: Making Big Projects (and Everyone Else) Happy, Dave Borowitz - Git Merge 2015
Description
Google likes to push the boundaries of what's possible with Git. With big projects like Android and Chrome, we need some pretty big hosting infrastructure: we built a custom globally replicated storage layer, added bundles and bitmaps to reduce server load, and built specialized tools like Gerrit Code Review. But working on large Git-based projects is still not as nice as we'd like it to be. In this talk I'll discuss what we've done to make large projects happy at Google, and what we hope to do to make large projects happy everywhere.
At the contributor summit yesterday, there was a really interesting discussion about all the ways in which Git is slow. I'm going to talk about some of them, but I'm only scratching the surface; there's a lot of pain that people are feeling that is not exactly the same pain that we're feeling. But I do want to keep this kind of positive. Not everything is terrible.
There are some small improvements that we've made, and that we can make in the future, to make life on a big project easier, and we'll talk about some of those. And the last thing I want to say before I really get started is: your project may or may not be as big as Google's, but everybody in this room is going to feel some kind of pain working with Git at scale at some point in your career, so hopefully something I describe here will be relevant to you.
A quick history of the stuff we've been doing at Google with Git. In 2008 we did the initial release of something you may have heard of called the Android Open Source Project. This was originally hosted at Oregon State University, using the same servers as kernel.org. In 2011 there was a little bit of a security and scalability issue with kernel.org, so we launched a new Git service on Google's infrastructure called googlesource.com. That's grown a little bit over the years; most notably, in 2014 the Chromium project completed a big migration from Subversion to Git, which we'll talk about a little later. Some numbers here, the first of several slides that may make your eyes glaze over: we don't serve a whole lot of repositories, we're not as broad a base as something like GitHub, but we do serve some pretty big repositories. Android has a few hundred, Chrome has a few hundred, tens of gigabytes of data, and we're serving several terabytes of traffic a day.
Android is kind of big as far as Git projects go. There are 800 repositories, and the default checkout gives you about 400 or 500 of those. It includes everything that you need to make a phone operating system: the core OS, all the bundled apps (there's a whole copy of Chrome in there), device images for Nexus devices, compilers, testing tools, third-party dependencies; you name it, it's in Android. These range in size from really, really tiny repositories to pretty big ones, over 10 gigs in size.
There's also a complete internal fork of all of that, which is where we do our quasi-closed-source development before our periodic releases to the open source project, and we also have to collaborate with a lot of partners: hardware manufacturers, OEMs, chipset manufacturers, things like that. That introduces some more challenges.
So why is Android special? I have a little asterisk here, because Android is not actually that special. It's not that unlike a lot of projects, maybe the ones that you work on, or the size that yours will grow to be in the future. It's pretty big: the total checkout, just what you get if you follow the instructions for downloading the Android source, is about 17 gigabytes. And we have to work with lots of partners.
These are pretty big companies, and when you work with big companies you have to sign contracts with them. We have a really complicated permission system, because you sign a contract that says these assets can only be shared with this group of people, and these other assets can only be shared with this other group of people. So this is sometimes a source of confusion. And we also really want to strive, in the Android ecosystem, to make sure all the tools we use are open, for open source contributors and partners alike.
They want to use the same stuff internally in their organizations that Google is using in our organization. So it doesn't do us any good if we build something that can only be used by Google; we want to help everybody who participates in Android equally. The other big project I'm going to touch on briefly here is Chrome, the browser project that many of you probably use. There are a couple of big repositories: there's chromium, the core browser project.
Blink is a fork of the WebKit rendering engine, which was started by Apple and forked by Google, and Chromium OS is a sort of packaged version of Chrome that runs on a laptop or a desktop machine, and therefore has some of the same problems you might have with Android in terms of working with partners and things like that.
B
So
why
is
Chrome
special
again?
Not
actually
that
special,
it's
kind
of
big,
but
it's
not
completely
unreasonable.
They
just
completed
a
giant
migration
from
all
this
history,
a
subversion
data
into
get,
which
is
something
that
a
lot
of
people
go
through.
Taking
lots
of
old
development
history
and
the
developers
of
chrome
were
actually
like
very
set
in
their
ways.
They
built
all
these
workflows
around
this,
like
web
front-end
to
subversion.
B
B
These repositories are a little bit big: chromium is about three gigabytes in size, Blink is about five gigabytes in size. And sometimes I think they're really just trying to make my life difficult, because they decided: we have two giant repos, and what's better than two giant repos? Combining them together. But this is something that Git supports very well, and there are definitely benefits to having a large monolithic codebase. You may have heard some talks from Facebook about how they're scaling Mercurial to work with a codebase that is actually much larger than this in size.
B
So
those
benefits
in
terms
of
atomic
refactoring
or
something
we
want
to
support
for
our
developers
to
make
them
more
productive
and
not
let
the
tools
get
in
their
way,
and
the
last
thing
that
I
previously
mentioned
is
Chrome.
Os
is
like
basically
like
Android.
It
has
all
the
same
problems
with
partners
with
heart,
unreleased
hardware,
with
contracts
and
access
controls,
so
they
they
feel
some
of
the
same
pain
that
the
Android
team
is
feeling
as
well.
So a theme I'll hit on a few times in this talk is that being big is hard. It's hard when you're a Git client: just running Git commands on big repositories becomes hard, and this is where I'm only going to touch on a few things; it's actually hard in way more ways than I'm going to talk about. Just the first thing you do when you download Android is you need to get this 17 gigabytes of data from our servers to your desktop somehow.
If you're lucky enough to have a gigabit Ethernet connection that goes to a machine in your office that you're not sharing with anybody else, you can download this in two minutes. That seems pretty reasonable. But more likely you have a broadband internet connection where you get something like 20 megabits, and then all of a sudden we're talking about three hours.
Actually, the average broadband speed here in France is about seven megabits a second, so take those numbers and multiply them by three. And you should really hope that your internet connection is not flaky at all, because, as we all know, Git clones are not resumable. So if you start to download two gigabytes of repository and then it craps out 1.8 gigabytes in, you have to start from scratch, which is not super great when we're talking about repositories of these sizes.
Once you get all this data, you have to actually do some delta resolution, which, depending on your knowledge of Git internals, you may or may not understand, but you do understand that it is a painful step you have to go through. Checking out files, copying them from the packed representation onto disk, can take a really long time. Android has about half a million files of varying sizes, and if your operating system is something like Windows, where opening a file for writing takes a long time, this can be kind of painful.
Before I go too much further, I want to make this kind of meta point about this pain, which is that as tool developers and server administrators, you can only get so far by saying: oh, don't do that, you're holding it wrong. Don't put your giant binaries in Git repositories; don't import a million commits of history. But it's really hard to tell somebody that when really they're just trying to get their job done.
B
At
the
end
of
the
day
we
built
a
service
that
can
like
handle
these
repositories,
so
you
can
bet
that
somebody
is
going
to
try
to
push
a
repository
that
reaches
this
maximum
size
back
some
size
or
goes
over
that
and
I
don't
want
to
be
the
person
telling
these
people.
No,
you
can't
do
that.
I'd
rather
build
a
service
and
work
on
tools
that
are
able
to
handle
something
of
this
scale.
We also found that one reason you can't tell people "don't do that" is that they really don't like learning multiple tools. It was bad enough trying to get the Chromium developers to migrate from Subversion to Git; asking them to use a brand new tool to handle some of their source control, just because we've run into scalability limits that we should really be able to work around, can cause them a little bit of pain.
B
So
one
thing
we
can
do
to
to
reduce
the
size.
The
size
causing
pain
for
users
is
just
fetching
less
data
that
would
really
be
nice.
Github
just
had
a
really
exciting
announcement
yesterday
about
storing
large
files
in
a
way
that
you
don't
have
to
download
them
all
the
time.
That
would
actually
be
really
nice,
but
obviously
we
have
not
yet
integrated
that
into
Android
one
trick
you
can
use
today
if
you're
not
familiar
with
it,
is
a
shallow
clone
which
only
gets
a
little
slice
of
history.
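As a minimal sketch of what a shallow clone looks like, using a throwaway local repository (a file:// URL is used because `--depth` is ignored for plain local-path clones):

```shell
set -e
rm -rf /tmp/shallow-demo && mkdir /tmp/shallow-demo && cd /tmp/shallow-demo
git init -q src
for i in 1 2 3; do
  git -C src -c user.name=demo -c user.email=demo@example.com \
      commit -q --allow-empty -m "commit $i"
done

# --depth 1 fetches only the tip commit of each branch:
git clone -q --depth 1 "file:///tmp/shallow-demo/src" shallow

# Only one commit of history is present in the shallow clone:
git -C shallow rev-list --count HEAD
```

The history can later be deepened in place with `git fetch --deepen <n>`, or completed entirely with `git fetch --unshallow`.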
It saves you maybe fifty to ninety percent of the data transfer, which actually, if you think about it, is not as much savings as you might expect: one commit has half the data of a hundred thousand commits. It turns out that Git has really, really, really good delta compression. But shallow clones have their own problems. Some commands, like log, just don't work like you would expect.
You can't see all the history if there's only one commit of history, and they're not really good about aggressively garbage collecting old data as you fetch new data, so that could be improved a little bit. Narrow clones are a topic that comes up periodically on the Git mailing list; we had a nice long talk about this yesterday.
B
Also,
this
is
the
idea
that
you
can
check
out
only
instead
of
just
a
small
slice
of
history,
you
check
out
only
a
small
slice
of
the
trees
and
download
only
the
files
you
need
to
check
those
out
best.
I
can
say
about
this
problem.
Is
it's
turned
out
to
be
non-trivial
which,
as
we
all
know,
is
programmer
slang,
for
we
have
no
idea
when
this
is
going
to
happen,
but
we
actually
do
have
some
like
stopgap
measures
that
we
can
implement.
In
the
meantime,
we
split
repositories
up
into
multiple
repositories.
You already knew this, because I told you Android checks out 500 repositories when you download it. The way that we do this in the Android project is we wrote just a little Python wrapper around Git called repo. If you've ever used repo, you know that it's a little more than just a little Python wrapper; it's got a whole bunch of extra features. Every software project of any size grows until it contains an XML parser.
B
But
it
actually
does
a
lot
of
useful
things.
This
is
a
little
bit
of
XML.
That
I
hope
makes
your
eyes
glaze
over,
because
if
I
were
sitting
there
at
9:00
in
the
morning
10:00
in
the
morning,
my
eyes
would
glaze
over
looking
at
XML,
but
it's
a
simple
way
to
represent
a
list
of
projects
that
can
all
be
checked
out
at
once.
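A manifest of the shape he's describing might look something like this (an illustrative subset; the real Android manifest lists hundreds of projects):

```shell
# Write a tiny repo manifest. <remote> names a server, <default> sets
# shared settings (including sync parallelism via sync-j), and each
# <project> maps a checkout path to a repository on the remote.
cat > /tmp/default.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <remote name="aosp" fetch="https://android.googlesource.com"/>
  <default remote="aosp" revision="master" sync-j="4"/>
  <project path="build" name="platform/build"/>
  <project path="frameworks/base" name="platform/frameworks/base"/>
</manifest>
EOF

# With a manifest repository containing this file, a checkout is:
#   repo init -u <manifest-repo-url>
#   repo sync
```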
It has some defaults; you can set parallelism; you can set which repositories are going to be checked out. And this is very representative of the kind of approach Android has taken to solving these giant scalability problems: they actually need to be able to store this number of repositories today, and waiting for some solution that's going to magically provide us with narrow clones is not going to work. That would have been nice eight years ago when Android was originally released, and it will be great in the future, but we're just not quite there yet, so we have to have these little stopgap measures.
The one thing I'll say about repo is that it's this Python wrapper that's grown a lot over time; it's kind of crufty, but there's really nothing in repo that couldn't just be implemented in Git. You might ask what we're going to replace it with, or maybe some of you already know, and the answer, which I think will cause some groans, is git submodule. If you've ever worked with git submodule: it's not great.
Instead of having a little XML-based configuration, you have a little git-config-based configuration, but there are a lot of pain points working with it right now. I like to say it's not that it's painful, it just needs a little bit of love. We need to handle recursively checking out all of the sub-repositories. Sorry, I should have said: if you don't know, git submodule is a way of embedding one repository, or many repositories, inside of another repository.
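A minimal sketch of that embedding, using throwaway local repositories (the `protocol.file.allow` override is needed on recent Git versions, which otherwise refuse local-path submodule URLs):

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/sm-demo && mkdir /tmp/sm-demo && cd /tmp/sm-demo

# A small repository to embed:
git init -q lib
git -C lib commit -q --allow-empty -m "lib: initial commit"

# The superproject that embeds it as a submodule:
git init -q super
cd super
git commit -q --allow-empty -m "super: initial commit"
git -c protocol.file.allow=always submodule add /tmp/sm-demo/lib third_party/lib
git commit -q -m "Add lib as a submodule"

# A fresh checkout needs the recursive step the talk describes;
# without --recursive the submodule directory comes back empty:
cd /tmp/sm-demo
git -c protocol.file.allow=always clone -q --recursive super super-clone
```

After a plain (non-recursive) clone, the equivalent is `git submodule update --init --recursive`.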
So you have the superproject, which has a bunch of submodules inside of it, and whenever you're operating on the top-level project, you really kind of want to operate recursively. If you make one commit in the superproject, you also need to make commits in all the subprojects that have been modified, and the Git commands don't handle this very well right now. We don't have a really easy way to say: only check out this subset of repositories.
B
You
can
sort
of
do
like
all-or-nothing
or
manually
check
out
some,
but
Android
has
eight
hundred
and
we
wanted
to
check
out
five
hundred
by
default.
There's
no
easy
configuration
way
to
do
that.
We
also
really
need
to
paralyze
this.
We
can
squeeze
a
lot
more
Network
and
CPU
performance
out
of
repo
by
just
simply
running
a
bunch
of
git
commands
in
parallel,
but
there's
no
reason
that
gets
sub-module
has
to
do
everything
in
serial.
We
just
haven't
gotten
there
yet,
and
a
lot
of
people
really
like
to
hate
on
get
sub-module.
I
know.
B
I've
certainly
been
like
that
in
the
past,
but
again
I
want
to
keep
this
talk
kind
of
optimistic
make.
Man
maybe
make
a
humble
request.
If
you
were
thinking
about
doing
some
horrible
workaround
for
the
fact
that
git
sub-module
doesn't
know,
doesn't
do
what
you
want
to
do
like,
for
example,
writing
a
completely
different
application
in
Python
to
replicate
some
of
the
features
of
git
sub-module.
Consider
whether
it
might
be
more
useful
to
contribute
back
to
the
get
upstream
project
and
make
sub-module
nicer
for
everybody.
It's
really.
It's
really
there's
a
bunch
of
low-hanging
fruit.
It doesn't have to be as bad as it is today; it could be way more fun to use. That's about all I have to say about the client side. Being big is also hard for servers. I spend a lot of my time running Git servers and doing DevOps-y kind of stuff, so this is a topic that's near and dear to my heart. You may remember from earlier versions of Git this phase, when you did a clone of something like the Linux kernel, where it says "Counting objects" for a really long time.
For Linux, just counting the number of objects that need to be sent in a pack file over the wire would take about 60 seconds, which is not good for users, but it's also not good for server administrators, as this is taking up an entire CPU for one minute every time you clone the Linux kernel. So if you're serving hundreds of clones concurrently, that's a lot of CPUs that you could be using for other things, and Linux is not even big compared to some of the projects we're talking about here.
Now, that traffic can come from a lot of sources. It can come from users, which are the people that you care about. That's not true: we care about everybody. But our end users, the people who are downloading the code, we want them to be happy. There are also a lot of automated tools, continuous integration, things like that, that can put tons of load on a server. A fun story about Chrome one time: they have this giant farm of build bots, and in a previous iteration of their software, you had all these build bots constantly deleting their three-gigabyte repositories and running git clone again. And this was particularly bad for us, because we actually have this system in place for setting a limit on how much data you can download, but we had turned that off for Chrome, because, you know, we trust you guys.
B
We
trusted
them,
and
then
we
had
to
turn
the
quota
back
on
and
tell
them
to
change
the
architecture
a
little
bit,
but
I
mean
this
had
worked
for
them
because
we
built
this
service
where,
under
normal
operation,
it
was
like
ok
for
them
to
build
or
to
clone
3
gigabytes
at
once,
so
they
just
naturally
used
the
thing
that
worked.
It's
really
hard
to
blame
them,
for
that
garbage
collection
is
another
thing
that
can
really
take
up
a
lot
of
CPUs
in.
B
If
you
want
to
serve
a
git
repository
efficiently,
you
periodically,
you
have
to
just
go
through
and
compact
all
the
stuff
together
to
do
a
bunch
of
Delta
compression
and
get
the
really
good
compression
ratios
that
get
is
used
to.
But
you
don't
want
to
share
the
same
CPUs
that
are
doing
user-facing
traffic
and
making
your
users
happy
with
this,
like
grungy
background
work,
but
that's
a
problem
with
the
classical
guitar
kotecha.
B
So
one
way
we've
cut
down.
Cpu
usage
is
a
wonderful
new
feature
and
get
2.0
called
reach
ability
bitmaps
I
shamelessly
stole
on
this
slide
from
Sean
Pierce,
who
shame.
We
stole
this
diagram
from
Scott's
book
and
the
general
idea
of
a
reach
ability.
Bitmap
is
instead
of
just
naively
walking
through
an
entire
git
repository
tree
and
Counting
all
the
objects.
We
store
this
optimized
data
structure
where
you
can
for
each
up
commit
in
the
repository
you
store
the
list
of
all
objects,
reachable
from
that
commit
and
by
organizing
them
in
a
particular
way.
B
You
can
make
the
counting
objects
phase
like
really
really
fast.
It's just a cute little bit of algorithmic magic, and I'd like to talk about it in more detail, but I've got a lot of things to talk about, so ask me after if you're interested. This was originally implemented in JGit, the Java implementation of Git that runs our infrastructure, and Vicent Martí at GitHub was kind enough to port it to C, and it's now part of Git 2.0.
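For server operators, turning this on is roughly a repack away (a sketch; the exact option names are worth checking against your Git version, since the feature landed around Git 2.0):

```shell
set -e
rm -rf /tmp/bitmap-demo
git init -q /tmp/bitmap-demo && cd /tmp/bitmap-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Repack everything into a single pack and write the reachability
# bitmap index (.bitmap) alongside it; subsequent fetch/clone requests
# can then skip the slow "Counting objects" walk:
git repack -a -d --write-bitmap-index
ls .git/objects/pack/
```

The same effect can be made permanent with `git config repack.writeBitmaps true`, so every future `git repack -a -d` or `git gc` refreshes the bitmap.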
Another trick we use to really just get load off of the servers, both CPU and network load, is to use a bundle file. A bundle is a little pre-built pack file; you can create one by running git bundle create. It includes a pack file and the refs that that pack contains, and you can just pass this file around and run clone on it: just pass it the filename, and it'll clone from the bundle.
B
So
we
do
is
we
actually
redirect
all
of
the
clone
bundle
URLs
for
each
repository
to
a
special
URL
on
a
static
hosting
service?
In
our
case,
a
content,
delivery
network,
that's
used
by
the
same
one
used
by
YouTube,
but
this
could
be
any
static
hosting
service.
The
client
downloads
that
file
then
does
a
little
incremental
fetch
to
pick
up
everything
that's
been
added
to
the
repository
since
that
bundle
was
created.
This
is
really
great
for
us
because
it's
like
essentially
zero
server.
B
We
have
to
do
a
little
bit
of
redirecting,
but
compared
to
actually
shipping
these
gigabytes
two
gigabytes
of
data.
It's
really
much
better.
You
can,
if
you
remember
from
one
of
my
earlier
slides,
we're
actually
serving
like
three
quarters
of
Android
traffic
is
coming
off
in
the
CDN.
Instead
up
off
of
our
core
data
centers,
which
is
really
nice
from
a
server
administrator
perspective.
Also,
it's
better
for
users.
B
These
bundle
files
are
just
static
files,
they're
resumable,
you
can
start
downloading
a
two
gigabyte
file
and
if
you
get
interrupted
halfway
through,
you
can
resume
right
where
you
left
off,
which
is
which
is
pretty
nice.
I
know
that
you
know
has
some
really
exciting
ideas
about
how
to
take
these.
This
resumable
bundle
idea
and
apply
it
to
general
get
fetch,
but
that's
like
another
thing.
That
is
an
idea,
but
it's
a
little
way
off
I.
That's about all we can do to just sort of shunt work off of the core servers. We'd also like to spread load among a bunch of different servers running in the data center, using some sort of shared file system, where you just have different Git processes talking to the same underlying file system storage. That would be really nice; it would allow us to spread the CPU load. You'd still have sort of a disk read bottleneck, because you still have a single shared disk.
B
You
could
do
garbage
collection
in
separate
workers,
so
you
don't
have
to
worry
about
garbage,
collect
collection,
backing
up
and
user
facing
requests.
You
can
do
this
sort
of
right
now
with
NFS
I
say
it
works
because
there
there
are
some
performance
problems
when
you
have
high
throughput
repositories
running
on
NFS
I
know
that
github
has
like
something
nicer
than
NFS
for
for
doing
a
shared
file
system,
kind
of
approach,
I'm,
not
sure
what
the
like
open
source
state
of
the
art
is
here.
B
You
can
do
replication
between
multiple
servers,
so
you
have
one
master
that
handles
all
of
the
right
traffic
going
into
the
master
and,
as
that's
receiving
writes
it
pushes
out
those
to
a
bunch
of
slave
servers
and
those
slave
servers
can
serve
read
traffic.
This
is
nice.
You
can
share
a
lot
of
the
read
work.
You
still
have
this
bottleneck
for,
writes
and
you
can
have
problems
with
replication
lag
if
the
pushes
are
very
large
than
replicating
those
out
to
slaves
can
take
a
little
while
so.
There's a lot of really interesting stuff in here, and I could talk about this all day, but the stuff I want to focus on is these little yellow bits. We have a single shared file system and database, using Google Bigtable and the Google File System, and we have a bunch of Git front-ends that talk to that shared file system.
B
Each
of
these
front-ends
can
serve
any
number
of
repositories
in
any
repository
can
leave
in
lots
of
front-ends
the
way
that
we
sort
of
manage
this
at
a
high
level.
Is
we
have
a
get
aware,
load
balancer
that
redirects
requests
for
certain
repositories
to
certain
front-ends,
depending
on
load
depending
on
the
size
of
them,
and
we
built
this
distributed
file
system
layer
on
top
of
jacott
the
java
implementation
of
git?
That
is
able
to
page
in
files
from
a
slow
file
system.
B
This
GFS
is
like
way
slower
for
opens
and
reads
than
you
would
expect
from
a
normal
posture
file
system.
So
we
need
to
aggressively
prefetch
stuff
and
cache
it
as
necessary
in
the
git
front
ends.
We
have
a
completely
separate
pool
of
garbage
collection
workers,
which
is
nice
when
you
are
garbage
collecting,
100
gigabyte
repositories
and
another
cool
feature
we
have
is
before
we
accept
the
write.
We
actually
do
some
replication
to
a
remote
data
center.
B
All
the
stuff
outside
of
this
box
lives
in
a
single
data
center,
but
we
actually
have
six
of
these
worldwide
and
anytime.
You
do
a
push
to
us
before
we
even
say
yes,
that
push
was
accepted.
We've
actually
replicated
that
out
to
three
other
data
centers
around
the
world.
So
this
is
like
this
gives
us
good
performance
when
you're
is
sitting
in
Europe
and
you
don't
have
to
talk
to
a
server
across
the
Atlantic.
B
If
you
want
to
download
all
of
Android,
we
also
have
some
data
centers
in
Asia
that
are
useful
for
a
lot
of
our
Asian
partners
for
Android,
and
we
also
have
really
good,
really
really
good
availability.
Since
there
are
six
these
data
centers.
If
one
goes
down,
we
like
almost
don't
notice
and
that
happens
kind
of
more
often
than
you
might
think.
B
Fortunately,
we
don't
usually
get
more
than
one
going
out
at
the
same
time,
so
you
might
be
wondering
how
you
can
do
this
at
home.
I
would
really
love
to
stand
up
here
and
say
you
can
just
like
download
this
package
and
push
out
a
bunch
of
docker
images
and
have
this
running
the
reality.
Is
they
like?
Some
of
this
is
a
one
source,
the
jacott
DFS
stuff.
B
If
you
are
interested
in
a
sort
of
dynamic
caching
strategies,
that's
some
interesting
code
to
look
at,
but,
like
there's
a
lot
more
stuff,
we
need
to
do
to
get
this
open
sourced
a
lot
of
the
sort
of
global
database
glue
and
the
replication
glue.
We
should
open-source
but
haven't,
had
an
opportunity
yet
there's
some
like
secret
sauce
that
were,
unfortunately
not
in
a
position
to
share
with
you,
like
the
big
table
implementation.
But
there
are
open
source
equivalents
of
this.
We would probably build a reference implementation on HBase and HDFS, rather than Bigtable and GFS. That's about all I have to say about servers, but like I said, this is a topic that's very interesting to me, so if you have more questions about it, I'll be happy to talk after the break. The last way I want to talk about being big being hard is for humans. We're all humans in this room, as far as I'm aware, and we find it difficult when there are hundreds of repositories to manage. Even if you have a tool that understands this perfectly, how do you as a human know what you need to modify? This is a problem that you have even if you merge everything into a single repository: just working in a large code base is hard, and I don't have any silver bullets for dealing with that.
Android has this additional problem that they have an internal fork and an external fork. For example, if they get a contribution to the open source project, somebody needs to manually cherry-pick that back onto the internal pre-release branch, which may have diverged in the meantime, and this is just some pain that Android has semi-automated, but not really automated, tools to deal with. And access controls: man, access controls, just between you and me, I think they're kind of a mess, but in public I won't really say that.
B
So
how
can
we?
How
can
we
make
this
better?
How
do
we
ensure
that
what
goes
into
the
repository
history
is
good
and
when
I
say
quality
history
I'm
talking
about
like
this
true
source
of
truth
repository,
we
all
know
that
git
is
a
distributed
version
control
system,
but
generally
speaking,
especially
in
a
large
organization,
you
have
like
one
true
source
of
truth.
Like
Lena
says
kernel
repository
or
the
git
repository
maintained
by
Junio,
so
our
solution
at
Google
is
we
built
this
tool
called
Gerrit
code
review,
which
some
of
you
may
have
seen.
You may love it, you may hate it. If you haven't seen it before, here's a little side-by-side review; I made some comments on some of my colleague Stefan's code, and you can see the progression of this change as it changed over time. It's a neat little interface for doing side-by-side code review. It's got a lot of features; I mean, it's a bit of a Swiss Army knife. It does access controls, the stuff that we use to implement all of our contractual obligations.
It doesn't matter to me; you do what makes code review easy for your organization. But I will say, and I'm not going to have time to talk about this, that one project I've been working on in the past few months is this: I think code review should really be interoperable. You shouldn't have to choose to do all of your reviews in Gerrit or all of your reviews in GitHub, so I've been doing some work on a git-notes-based format for sharing code review metadata inside of a git repository.
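The underlying mechanism, git notes, attaches metadata to a commit without rewriting it. A minimal sketch (the `refs/notes/review` ref name and the note text here are illustrative, not the actual format from the talk):

```shell
set -e
rm -rf /tmp/notes-demo
git init -q /tmp/notes-demo && cd /tmp/notes-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "change under review"

# Attach review metadata to the commit in a dedicated notes namespace;
# the commit's own SHA-1 is untouched:
git -c user.name=demo -c user.email=demo@example.com \
    notes --ref=review add -m "Reviewed-by: demo <demo@example.com>" HEAD

# Read the metadata back:
git notes --ref=review show HEAD
```

Because notes live on an ordinary ref, they can be pushed and fetched between hosts like any other branch, which is what makes this shape plausible for interoperable review metadata.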
So that's about all I have. If you're a Git contributor, you may think I was a little light on technical details, or if you're a Git user, I may have been too heavy on technical details. Either way, I'm happy to elaborate on any of these things during the break. If you have questions, or if you have complaints about the tools Google has built, I'm a great person to throw your rotten vegetables at, and I really like hearing other people's horror stories.