From YouTube: Native Git support for large objects - Git Merge 2019
Description
Presented by Terry Parker, Engineering, Google
"Large binary objects pose a special challenge for Git. This talk will explain how Git’s new partial clone feature and a new proposal to use content distribution networks can help."
About GitMerge
Git Merge is the pre-eminent Git-focused conference: a full-day offering technical content and user case studies, plus a day of workshops for Git users of all levels. Git Merge is dedicated to amplifying new voices in the Git community and to showcasing the most thought-provoking projects from contributors, maintainers and community managers around the world. Find out more at git-merge.com
Okay, so I'm Terry Parker. I work at Google as a software engineer, and I'm going to talk today about handling large files in Git using some new protocol features. I manage the Git core team that works on the open source project upstream, and I'm a tech lead manager of the server team that runs the large hosting service that Ivan and Minh talked about, with the variety of clients we support there.
So the agenda today is to define terms: what are we talking about with large files? Why do I consider native support for large files to be important? And then I'm going to talk about a couple of new features in Git that are emerging. The first one, partial clones, is an emerging thing just introduced in Git 2.17 or 2.18 (this is why I need my speaker notes), introduced in April of last year. And the second one is a feature that's a work in progress. It isn't out there anywhere yet; it's just being proposed on the Git open source project, which is using content distribution networks for cloning.
So what do we mean by "large file"? How are we going to define this? Well, it can be by extension: things named .bin, or generally non-text files, so you might say every .bin file is large whether its byte count is large or not. You may have specific VM image extensions, or Android APK files. Or you can define it by size, whether you want to set that threshold at 100 kilobytes or a megabyte. I think it's important to be flexible here, so clients can make the trade-offs they need and servers can make the trade-offs that they need.
So am I talking about Git LFS? I'm not talking about Git LFS; I don't consider LFS to be a native understanding of Git. LFS uses some pre-existing hooks called smudge and clean filters, which are basically pre and post hooks. So when you're about to push a large object, Git LFS substitutes a URL that points to that object in place of the object, and the problem with that is that you have to have everything pre-configured for it to work correctly. So there's a scenario where you've configured LFS for all of your .bin files.
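For reference, that pre-configuration is usually a .gitattributes rule that routes matching paths through the LFS smudge and clean filters; a minimal sketch:

    # .gitattributes: send every .bin file through Git LFS
    *.bin filter=lfs diff=lfs merge=lfs -text

    # this is the line that "git lfs track '*.bin'" writes for you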
Something that's really important to me is being able to serve lots of customers effectively, and if we have people doing large clones, they may be coming across, you know, slow or moderate-speed connections. If you're cloning ten gigabytes, you're not going to get that done very quickly, and you're tying up a thread on the server and using lots of bandwidth to do it.
So partial clones are, as I said, an emerging feature that was introduced in 2.17 or 2.18 in April, and we've been improving support over time, so it's best to use Git 2.20 if you want to test this out. The way to think about it is that it allows the client to say: hey, I want a clone of this repository, but I want to filter out certain objects, and I will come back to you if I need them later. So, on-demand downloading by the client.
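As a concrete command line (the repository URL here is just a placeholder), a size-filtered partial clone looks like this:

    # fetch history as usual, but defer any blob larger than 1 MB
    # until something actually needs its contents
    git clone --filter=blob:limit=1m https://example.com/repo.git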
So in this case we have 1M.bin, which is, you know, a megabyte-sized binary, and a help text describing the updates that were made to it each time. In this command line we're cloning with the filter "blob limit equals one megabyte," and if you take a look, we have a couple of gray ovals now, and those are the things that are not being downloaded by this command. Now you may notice that the 1M.bin'' is being downloaded, and that's because a clone is actually two operations.
The first operation is a fetch: it fetches all the content into the .git directory in your local client. And the second operation is checking out a branch, so by default it's going to check out whatever branch HEAD points to, which by default is master. So in this case the initial fetch into the .git directory did not fetch that 1M.bin'' file; it was the checkout that said, hey, this is something that I actually need to populate in the work tree, and that did it as a second transaction.
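To make that concrete, a plain clone behaves roughly like the following two-step sequence (a simplified sketch with a placeholder URL, ignoring details like remote HEAD detection):

    # phase 1: fetch everything into the .git directory
    git init repo && cd repo
    git remote add origin https://example.com/repo.git
    git fetch origin

    # phase 2: check out a branch, which populates the work tree
    git checkout -b master origin/master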
Here's the second command: git clone with no checkout. The no-checkout option stops that second phase, so it's just going to fetch things into the .git directory, and in this case we said "filter blob none," so we're downloading all the commits and all the trees, but none of the file contents. And here's a further command with a newer feature, one that was probably only made available in 2.20, which is "filter tree equals zero," and that's saying not to download any of the trees.
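Spelled out as command lines (again with a placeholder URL), the variants just described look something like this:

    # download all commits and trees, but no file contents,
    # and skip the checkout phase entirely
    git clone --no-checkout --filter=blob:none https://example.com/repo.git

    # go further and defer the trees as well (needs a recent Git, around 2.20)
    git clone --no-checkout --filter=tree:0 https://example.com/repo.git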
You know, if you're working on Office, you don't need the Windows stuff; you don't have to have all the Windows stuff. So this is a facility to allow the further development of these large monorepos and to allow our developers to be efficient.
So the second feature I want to talk about is, and this is just a work in progress, a proposal that has been made on the Git upstream list, and the response has been pretty receptive. We think this is going to happen. But let's talk about why using a content distribution network is important.
Content distribution networks are really good at handling high peak-volume loads. I actually wanted to use that viral surprise-kitty video here, but I'm a mere software engineer who doesn't understand copyright, and I didn't know whether I could actually use it with attribution, so we just get a cute puppy instead. But they are very good at scaling up and load balancing, and they also do a good job of moving the content close to where the user is requesting the data.
Every time you do a clone, the server is crafting a custom pack file with all the latest commits, right up until, you know, the second before you requested it. And if you get 99% of the way through a 10-gigabyte transfer and then you have some kind of failure, you have to start from scratch; it's not resumable at all. Content distribution networks have the nice property that they serve plain HTTP GET requests, so if you fail 9 gigabytes into a 10-gigabyte transfer, you can pick up from where you left off.
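To illustrate the resumability point with plain HTTP (a sketch against a hypothetical CDN URL, not the proposed Git protocol itself):

    # first attempt dies partway through
    curl -O https://cdn.example.com/repo/pack-1234.pack

    # rerun with -C - to resume from the bytes already on disk
    curl -C - -O https://cdn.example.com/repo/pack-1234.pack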
So, you know, my takeaway message is that we, the Git project, are trying to deal with these things. Initially Git was intended for source code; people are putting a lot of other different types of assets in there, and Git hasn't always adapted well, but the Git community is cooperating to make sure these things work. So you can look for these features in a release near you.
There are lots of companies that publish blog posts about this; GitHub does.