From YouTube: The Challenge of Monorepos: Strategies from git-core and Open Source, John Garcia - Git Merge 2015
Description
The problem of monorepos, and the related large-file and locking issues, is the Achilles heel of the Git DVCS. We lay down the specifications for our imagined ideal large-file extension for Git. We'll talk about locking and transmission strategies, as well as our current work in the area.
And the next person that I'd like to introduce works for a great company. GitHub and Atlassian, I think, are sort of, you know... people always ask me how we get along with Atlassian, and they're the best. We actually did a karaoke contest not too long ago in San Francisco. Unfortunately, GitHub lost; it may have been because I was on the team. If you'd like to see a demonstration of those skills later on tonight, you probably couldn't stop me from doing it, so it may happen. So, here he is.
All right, thanks for coming out today. I hope everyone's enjoying Paris; this is a beautiful city. Hopefully you have time to get around and see some of the great sights. I took in the Eiffel Tower when I first got here, and that was outstanding. But I'd like to begin by introducing my talk: I'd like to talk about, as many people here have, the challenge of monorepos, or dealing with large objects in Git.
This is a problem that many of us face. I'd like to introduce myself: I'm a developer philosopher, which means that I look at what the humanities and the arts have to say about development and try to synthesize that into a philosophy about how software can best be developed. I got my start at a very young age with this machine.
If you don't recognize it, this is the TI-99/4A. This is one of the first machines you could just throw on your living room floor, plug into a regular television, and do whatever you wanted with. I learned TI BASIC on it. Fast forward 20 years, and now I'm working at Atlassian on Bitbucket. We are a large DVCS provider, and a lot of folks use us as the source-of-truth repo for their team.
We have a lot of products that work in sync with Git and enable teams, both large and small, to use Git to the best of its capabilities. So I'd like to talk about why we love Git, and that has a lot to do with the Git feature set. It's important to look at data integrity as a large part of that feature set because, as each file is handled by Git, it's checksummed into a hash, and that allows us to efficiently make good choices about how we store the file.
It has an advanced branch model, which allows distributed work among teams who can then merge their work later without the necessity of scheduling that work. And it has file splaying and chunking capabilities to enhance filesystem performance and to prevent storing too many files in the same folder.
So, getting a little bit more into depth about these topics. Data integrity: when Git operates on a file, it makes a checksum of the file, and if it recognizes that this checksum is identical to the checksum of another file it's already storing, it knows it only has to store that file once.
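You can see this content addressing directly with git hash-object; a minimal sketch (the file names here are just illustrative):

    $ echo 'hello' > a.txt
    $ cp a.txt b.txt
    $ git hash-object a.txt      # checksum of the file's content
    ce013625030ba8dba906f756967f9e9ca394464a
    $ git hash-object b.txt      # identical content, identical hash: stored once
    ce013625030ba8dba906f756967f9e9ca394464a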
Another thing we love is the branch model. The branch model enables work to be done specifically on one branch and to be compartmentalized, and it generally allows the end user to control the amount of data that they share with the server, thus restricting bandwidth use and making good choices with your resources; we can ignore other branches as needed.
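For instance, pushing just the branch you're working on shares only that branch's new objects with the server (the branch name here is illustrative):

    $ git push origin feature/locking-design   # other local branches stay local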
I'd also like to talk about file splaying. File splaying is the process of storing the files in the Git repo in folders that start with the prefix of the hash, so that we don't end up with a folder with a million files in it; we can have 256 folders, each with a subset of the repository.
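Git's own loose-object store works exactly this way: the first two hex characters of the hash (256 possibilities) become the folder name. A quick look:

    $ git hash-object -w a.txt
    ce013625030ba8dba906f756967f9e9ca394464a
    $ find .git/objects -type f
    .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a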
This is a work of art made by humans; it's huge, and it has no place in your Git repository, but sometimes that's exactly what you end up with. So we looked at the folks that are reporting these issues to us and took a look at the taxonomy of what the objects look like. For us, generally, our problems come with package repositories, such as Bower, or with archives.
Other use cases will have problems with either audio formats or video formats that take up a large amount of space, but then there are also scientific computing needs, such as MATLAB and Simulink and the like, and large database setups, like MongoDB, that use huge amounts of structured data.
So over time, I did some research in our support database, and over the course of two calendar years we saw 5.2 issues per week related to large repositories; that's more than one every business day. Quite literally every day, somebody is calling us to talk about their large repo, and they're sad. They're very sad, and they say things that make us sad sometimes, namely that they're going to leave Git.
The idea is that we can store the text artifacts in the Git repository but store the binary artifacts in an off-site storage solution, such as S3, or possibly even on a local drive if that's appropriate. We've been doing a lot of research, and we've given a lot of thought to this problem, and I want to show you what we think about it, and maybe a little bit of a proof of concept at the end. But I want to be really clear that we don't intend to prescribe where this conversation goes.
We really just want to put the results of our research out for the community to see, to hopefully help the community find the best solution, because, and I think everyone here can agree, it's really best for the basic, fundamental tools that we use to be tools that we can all use and we can all contribute to. So in our research we identified a few potential areas of improvement, the first of which is cross-platform support.
We feel that any solution to the large-binary problem should be platform agnostic, to the degree that it's able to work with POSIX and non-POSIX systems, and should also work with a variety of backends: your S3, your local storage, or, if you want, something like a FUSE filesystem. We feel that that sort of interoperability is really crucial.
Finally, we think that a complete solution would probably address the question of file locking, and of course that's a very difficult question. It's certainly a contentious topic in the Git community, and really in distributed version control thinking at large, because locking doesn't fit so well with a distributed model.
So, on performance: the observation that came up for us was that existing solutions use smudge and clean filters when they operate on files, and this is very much like what Rick was talking about a moment ago. When you make a checkout, you download the repository, and if you're using an external storage solution, you'll generally have files that hold references to the external objects that you need for your repository; on checkout, the program transforms those into the actual artifacts. On commit, the opposite is done: the artifact is distilled into a checksum that's put into a text file. What can happen is that if you have a huge number of objects and you have to iterate over each of those objects in sequence, that's a lot of serial operations, and even the slightest overhead in starting your process will add up over time.
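Wiring up such a filter is just Git configuration; a minimal sketch, where "largefiles" and the two helper commands are hypothetical names standing in for whatever an extension would ship ("%f" is Git's placeholder for the file path):

    $ git config filter.largefiles.clean  'largefiles-clean %f'    # hypothetical helper
    $ git config filter.largefiles.smudge 'largefiles-smudge %f'   # hypothetical helper
    $ echo '*.mp4 filter=largefiles' >> .gitattributes

Note that Git launches the filter command once per file, which is exactly where the per-process startup overhead described above accumulates.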
Let's talk about portability as well. We feel it's critical that a great solution will work with all types of computers, but also with all types of backends: it'll work with cloud storage and local storage, and ideally it would even work with Samba or any sort of storage that you have. And finally, much as Rick said, we find that a package that's easy to distribute, one that can be compiled into a static binary rather than dealing with runtime dependencies or some sort of installer, makes a much more compelling case for uptake. It makes it a lot easier for people to package it and for users to download and install it, and generally it's a much better experience all around.
So I want to talk about file locking, because sometimes you just can't merge. As we can see here, there are file types where, of course, everything can be merged, but where, if human reviewers must review the changes, the cost of merging is prohibitive.
So it's important that we at least consider what we might do to help proactively prevent merge conflicts for certain file types. We've looked at what kinds of files this happens with, and it's the usual culprits: the large binaries, such as your audio files and your video files. But in addition, some of the larger expressions of Markdown or XML, or even files auto-generated by Unity and SolidWorks, can be very difficult to merge as well.
For starters, you download your repo, and when you make your clone, the file types that are controlled in this way would be locked for modification. Because of the process they've been put through, these files have, say, a serial commit graph, so that even though each commit may come from a different branch, each commit is an atomic expression of the new file.
So if you need to check out this file, or you need to unlock the file, we would expect that you would download the newest copy of the file; when you do, we release the lock to you and allow you to make changes. Once you've made changes and you're ready to commit, you commit that to the newest position in the repo, and it becomes the new position of the file.
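As a sketch of what that flow might look like on the command line (the "git lock" command and its subcommands are invented here purely for illustration, not an existing tool):

    $ git lock acquire assets/intro.mp4    # hypothetical: fetch newest copy, take the lock
    ... edit the file ...
    $ git commit -am 'Trim the intro video'
    $ git push                             # the commit becomes the file's new tip
    $ git lock release assets/intro.mp4    # hypothetical: hand the lock back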
Of course, any solution like that, one that's prescriptive, should follow the regular Git rules, and the regular Git rules include a functional force option to allow users to make sensible decisions when there are unexpected circumstances. So, for instance, if somebody locks a file and then goes on vacation, or leaves the organisation, users need to be able to say: we understand that there is a cost involved in allowing a concurrent modification; however, we are willing to bear that cost. So I want to talk a little bit more about expanding the object model as well.
Another important part of the model that we consider is local object retention, which is to say that it's important to consider which files you would want to keep and which files you would not, so that you don't fill your hard drive entirely with files. This is really a local process that I'm going to describe, although the brave at heart may apply something like it on the server end as well. We break the commit history up into three distinct bands: a near-term band, a midterm band and a long-term band.
In the midterm, we want to make sure that we keep the files on the branches that we're using; files that belong to a different branch, we may not be concerned with. And files beyond 90 days, we're pretty sure nobody is concerned with. So this kind of provides a paradigm for garbage collection, or for reclaiming local storage.
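Expressed as configuration, the bands might look something like this (a hypothetical sketch; none of these keys are a real Git or extension feature):

    [largefiles]
        # near term: keep everything reachable from checked-out heads
        retain-checked-out = true
        # midterm: keep objects on branches touched in the last 30 days
        retain-branch-days = 30
        # long term: anything older than 90 days is eligible for local GC
        retain-max-days = 90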
So I'd like to show you a proof of concept that we've been working on in our lab and give you an idea of what our thinking is on the subject. Here I am creating the large object store on my Raspberry Pi using SSHFS, which is a FUSE implementation; creating the repository in Bitbucket; initializing the repository on my local machine; and then adding and committing my first files. I used my Raspberry Pi to point out that this works really elegantly with any sort of back-end storage; that was really one of our core design features.
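The setup steps amount to roughly the following (the host name, paths and file names are illustrative):

    $ sshfs pi@raspberrypi:/srv/large-objects ~/lob-store   # FUSE mount of the Pi
    $ git init presentation && cd presentation
    $ git add slides.key intro.mp4
    $ git commit -m 'First cut of the deck'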
We want it to not be locked into anything: any sort of service, any sort of server. We want to keep it as simple as possible. So here I'm adding the parameters to my .git/config, and I'm going to do my initial push of large objects. This will take a while, but I fast-forwarded it for your viewing pleasure.
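Those parameters would be something along these lines, in the same hypothetical vein as the retention settings above (the section and keys are invented for illustration):

    [largefiles]
        store = sshfs:///home/pi/large-objects   # any mounted path or backend URL
        track = *.mp4 *.key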
So we see there are 66 megabytes to upload, and that's now done; we can see there's now a file on the Pi. Then we copy a new file into the project and commit that new file, and this uses the clean filter to turn the file into an actual repo item. And now, when we push, we see that we're pushing up two commits, but we start at 51%, because we've already pushed the first file that belongs to this commit.
Now we've pushed the binaries to the origin, and it's time for me to open the file and do some work. So here's a previous version of this presentation that we're watching right now, and it came to me that this would be a great slide for the demo, rather than a quote. So I'm going to make a new branch for the video, commit that, and then also store the artifacts.
So now that that's pushed to the origin, I'm going to make a new clone of the repo so that I can simulate what it would be like if my coworker Nicola were to work on this. So he pulls it up, adds in his configuration, saves the modified files and pulls the binaries, and of course this takes some time.
He's got to download the entire set of binaries, because he does not yet have the objects that I've stored, but once he's downloaded those he's ready to start working. So he'll open the file... oh wait, no, that's the wrong one! That's the wrong branch! Let's change branches. Notice that changing branches took just a moment.
So now we're on the new branch. We want to put a placeholder in for where the video scene will go; then we'll save it, commit it, and push the repo up to the origin and the large objects up to their storage. You'll notice again this starts at 75%, because three out of the four objects are already found at the storage location.
All right, now that that's complete, I want to look at the large object store and see what's on my Pi. It looks like I have four folders now, one for each of the objects. The files in the store are chunked, which is to say that they are broken into smaller pieces, so that if a part of the file doesn't actually make it, that chunk can be uploaded separately in the event that there is some sort of connection interruption.
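Chunking itself is simple to picture; a minimal sketch with a fixed 4 MB chunk size (real implementations typically name each chunk after its own hash so uploads can be resumed and deduplicated):

    $ split -b 4m intro.mp4 intro.mp4.chunk-
    $ sha256sum intro.mp4.chunk-* > intro.mp4.manifest   # verify, and re-send only missing chunks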