GitHub Git Merge 2015, 12 May 2015

From YouTube: Building a Git Extension with First Principles, Rick Olson - Git Merge 2015

Description

Git is a great distributed version control system used by software developers around the world. But, its user base is expanding beyond the core developers, with new problems to solve. Luckily, Git is very extensible. This talk will cover techniques for building a Git tool your users will love, and that feels like a natural part of Git.

A

I'd like to introduce our next speaker. He's also going to be talking about a scalability issue with Git and how we've started to address that at GitHub, so I'll let Rick take it from there. The other notable thing about Rick is that he learned Vim in the Air Force. So I'm going to welcome Rick Olson from GitHub.

B

All right, cool. So hi. I apologize for the vague talk title; until yesterday I couldn't actually mention what I was going to talk about. This is a blog post that we shipped last night, Paris time, about Git Large File Storage. It's an open source Git extension that I've been working on, so I want to talk about building it and why it works the way it does. I'm Rick Olson, I go by technoweenie on the net, and I work at GitHub.

B

I'm a back-end Ruby and Go developer.

B

My team works primarily on the features of GitHub that work with binary files, but not in Git: things like avatars, release binary uploads, and now Git LFS. So what problem am I trying to solve? What scaling issue? A lot of teams are using Git successfully right now. They're doing it correctly: after they've gone through the learning process, they're working with text source code files and documentation.

B

If they're working on websites, they're working with small images, and Git works great for that. That's exactly what it is designed to do.

B

Problems come up when people start working with bigger files. I'm not talking about big repositories, like Will with the Twitter repository; I'm talking about storing individual files that are really big, beyond 10 or 50 or 100 megabytes. The real trouble here is that you don't even notice it's a pain point at first. You're just committing these files happily, and then slowly you start noticing issues, like it takes longer and longer to do a fresh clone.

B

So the first thing that we did is we set up server-side limits that analyze your git push on the server. If you have any files over 50 megabytes, we print a little warning in your terminal, and if a file is over 100 megabytes, we just reject the push. This was nice because then people weren't pushing giant files onto our servers and causing problems for us.
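
The two thresholds described here can be sketched in a few lines. This is only an illustration, not GitHub's server code: `check_push` and its input format are hypothetical, and treating a megabyte as 1024 * 1024 bytes is an assumption.

```python
# Thresholds from the talk; the exact byte values are an assumption.
WARN_BYTES = 50 * 1024 * 1024
REJECT_BYTES = 100 * 1024 * 1024


def check_push(file_sizes):
    """Classify a push from a mapping of path -> blob size in bytes.

    Hypothetical helper: a real server-side hook would read the pushed
    objects from Git instead of taking a dictionary.
    Returns (accepted, messages).
    """
    accepted = True
    messages = []
    for path, size in sorted(file_sizes.items()):
        if size > REJECT_BYTES:
            # One oversized file is enough to reject the whole push.
            accepted = False
            messages.append(f"error: {path} exceeds the 100 MB limit")
        elif size > WARN_BYTES:
            # Between the two thresholds the push succeeds with a warning.
            messages.append(f"warning: {path} is larger than 50 MB")
    return accepted, messages
```

A push with a 60 MB file would be accepted with one warning, while a push with a 101 MB file would be rejected.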

B

Things like making our servers generate giant pack files on clones. It's also giving feedback to the teams: hey, you're using Git incorrectly, you're about to have a bad time, so maybe you should look at your development process. But I never really liked that solution.

B

You get these teams that have crossed the first hurdle of learning how Git works. They start using it, they run into this issue immediately, and then they get to learn about git filter-branch and rewriting their repository history. That's no fun, and I think it gives a bad initial perception of Git. So I really wanted to provide a better experience for these users. A lot of the open source projects that I've been involved in are all about scratching my own itch.

B

These are problems that I'm experiencing myself, so I can make better decisions on how to solve them. But this was not like that. This isn't a problem that I was having, or really that anyone at GitHub was having. It's really our users coming to us about it. We were also talking to other people that work on Git servers, like Atlassian.

B

They're getting the same feedback. So I reached out to Chrissy. She's on our user experience team, and I remember when she first joined the company I was like, why do we need a user experience research person?

B

She spoke to the company about turning everyone that worked on product into user researchers, and I was like, ah, that's BS. So I made it a point to reach out to her and see if she could help me out with this project: talk to our users, get insight into their workflow, and see what we can do to help them out. We started out looking at metrics.

B

Server-side metrics and the number of support requests. Then we started reaching out to people: people that email us or tweet at us, or maybe they just post in some Git forum.

B

So we tried to build up a diverse set of teams to interview, and we just talked to them and heard their gripes. One of my favorite stories was from a team in South Africa. They were having an issue that I called push sniping.

B

The problem they're having is that they'll have an artist or someone that's working on some big file, like a Photoshop file, and they go to push it. Pushing from South Africa to github.com in America is really slow, especially if you're pushing a giant file. Push sniping is when, while they're making their push, someone else changes a README or does a normal git push, and that push completes while the other one is still uploading.

B

And then, when that one finishes, it looks at the master ref or whatever they're updating and it says: oh well, the ref has changed, you've got to pull and start over now. So yeah, that was a really unique problem. I really liked talking to those guys. At the end of this, Chrissy and I prepared a final report with recommendations and aspirations.

B

This helped inform my team about some of the things that we wanted to experiment with. It also informed the rest of the company and helped prove that this is something worth taking on.

B

So what are first principles? This is actually a physics term. It's something we talk about internally in the company, but I couldn't find any reference of us talking about it publicly. I did find this article, an interview with Elon Musk, the CEO of Tesla and SpaceX, where he talks about first principles. He was talking about how they design their batteries: basically ignoring the current understanding of building batteries.

B

They broke the problem down to its very basic elements and re-examined it with a clear focus, and then they're able to build really good batteries and all that stuff. So in our user research, there are two themes that came up, things that were very important to me and that I thought needed to be a focus for this. The first one is usability.

B

There are actually a couple of tools that exist now, git-media and git-annex and a few others, and they're built for Git experts. They require a lot of upfront configuration in the repository, and they sometimes introduce new commands. They don't always work with existing workflows. They don't work with any of the hosted services. I really wanted to solve that problem.

B

I really wanted to make it easy for people using Git to get up and going as fast as possible and get their work done. So, as a Ruby guy: convention over configuration. This is a common theme for Ruby on Rails, and I really like that. I wanted to apply that to Git Large File Storage.

B

The second thing: GitHub hosts code, so I was really interested in finding a way for this Git Large File Storage extension to work with github.com. And I wanted to do it in a way that didn't have any vendor lock-in, no proprietary solutions. I wanted it to be an open API, just like Git. Because all it is is your client speaking over some defined API to another server.

B

You could be talking to GitHub, you could be talking to Bitbucket or any of the other Git servers. Internally, behind the scenes, they are dealing with the Git repositories very differently, but they're all supporting that same API, and I really, really like that.

B

So how does... awesome, the animation works. This is a diagram of how Large File Storage works at a high level. You've got your local repository and your remote at the top.

B

Your code files just go directly into the Git repository. But for larger files, say a Photoshop file, a pointer (that's what we're calling it) goes into the Git repository. It's really just like a link: it's not the actual file, it's a substitute. And then the actual file goes up to a Large File Storage server.

B

Yeah, let's see. Okay. The really cool thing about this is it works without adding a lot of extra stuff to your Git workflow. So this is what the setup looks like. When you first install the tool you need to run git lfs init, and this sets up some global Git configuration values.

B

Then you need to tell the .gitattributes file what file types you want to store in the Git Large File Storage server. This shows the track command, which does that for you, or you can just open up the .gitattributes file and edit it yourself. And this is what a clone is like. It's simple: the same clone command you're used to, but then at the bottom you see it downloading some file.zip. That's where Git LFS is doing its thing.
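
As a rough illustration of what tracking a file type amounts to, the sketch below appends one pattern line to the text of a .gitattributes file. The helper name `track` is hypothetical, and the attribute string `filter=lfs -text` is an assumption: the exact attributes written by the real track command are not stated in the talk and may differ.

```python
def track(gitattributes_text, pattern):
    """Return .gitattributes text that routes `pattern` through an "lfs" filter.

    Hypothetical helper; the attribute string below is an assumption,
    not a statement of what the real command writes.
    """
    line = f"{pattern} filter=lfs -text"
    lines = gitattributes_text.splitlines()
    if line not in lines:
        # Only add the pattern once, so repeated calls are harmless.
        lines.append(line)
    return "\n".join(lines) + "\n"


# Example: start from an empty file and track Photoshop files.
updated = track("", "*.psd")
```

Editing the file by hand, as the talk mentions, would mean adding the same kind of line yourself.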

B

And then this is the really simple pull request workflow. There are no new commands or anything like that. You just create your branch, add your file, commit, and push, and then you see the uploading message at the top. So your standard Git workflow will keep working.

B

So how does it do that? Well, gitattributes, and specifically the smudge and clean filters. These are kind of awkward to talk about, and people get them confused. I like to think of the Git repository as a clean room where everything is sterile.

B

When you're running the git add command, you have this dirty file in the working directory, and as you add it, it gets cleaned up into the Git repository. What that's doing is converting it to that text pointer. Then when you check out, it does the reverse. You've got that clean text pointer, and it goes through the smudge process as Git writes it to your working directory, and that spits out your actual large file.
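
The clean and smudge round trip can be sketched as a pair of functions. This is a simplified model under stated assumptions: the in-memory `STORE` dictionary stands in for wherever the real client keeps large objects, the two-line pointer is a reduced format, and real filters are separate processes that Git runs with the file content on standard input and output.

```python
import hashlib

# Hypothetical stand-in for the client's local large-object storage.
STORE = {}


def clean(content: bytes) -> bytes:
    """Working directory -> repository: swap the large file for a small pointer."""
    oid = hashlib.sha256(content).hexdigest()
    STORE[oid] = content  # keep the real bytes outside the Git repository
    return f"oid sha256:{oid}\nsize {len(content)}\n".encode()


def smudge(pointer: bytes) -> bytes:
    """Repository -> working directory: swap the pointer back for the large file."""
    fields = dict(line.split(" ", 1) for line in pointer.decode().splitlines())
    oid = fields["oid"].split(":", 1)[1]
    return STORE[oid]


# Round trip: what git add stores, and what checkout writes back.
original = b"\x00" * 1_000_000
pointer = clean(original)
restored = smudge(pointer)
```

Git only ever stores the output of `clean`, which is why the repository stays small regardless of the size of the tracked file.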

B

So this is what the text pointer looks like. It's similar to the git-media one, but we added some more metadata to it. One is the version string. This gives us flexibility to increment the text pointer format in the future. Also, if you happen to clone a repo and you've never heard of Git LFS, you just see these tiny PSD files, and you open them up and they don't open in Photoshop.

B

You can at least look at them and say, oh well, there's a URL here, let me put that in my browser and see what it's about.
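
A pointer with a version string can be built and parsed as below. The field layout (`version`, `oid`, `size`) and the version URL are assumptions modeled on the pointer format Git LFS has published; the pointer shown on the slide is not reproduced in this transcript and may differ in detail.

```python
import hashlib

# Assumed version string; it doubles as a URL a curious user can visit.
SPEC_URL = "https://git-lfs.github.com/spec/v1"


def build_pointer(content: bytes) -> str:
    """Build the small text file that stands in for `content` in the repository."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {SPEC_URL}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> dict:
    """Parse a pointer into its fields, rejecting text without a version line."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if not fields.get("version", "").startswith("http"):
        raise ValueError("not a pointer file")
    _algorithm, _, digest = fields["oid"].partition(":")
    return {"version": fields["version"], "oid": digest, "size": int(fields["size"])}
```

Checking the version line first is what lets a future client recognize an older or newer pointer format and handle it differently.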

B

So another really important part is native app support. This is a screenshot from the GitHub desktop client for the Mac. I don't know if you can see it, but that's a progress bar at the bottom, and it's downloading a large asset.

B

So it's integrated just fine in the GitHub for Mac and Windows clients, and I think they were all released last night too.

B

Yes, all right. The interesting thing is the GitHub for Mac client was written before libgit2, so it still shells out to git in a few spots, and they were actually the first ones to implement Git LFS support.

B

The Windows client was a little more difficult, because from day one it was using libgit2, and it turns out the clean and smudge filter support in libgit2 was not great.

B

This is a pull request from Amy on the desktop team. She's adding support for the improved clean and smudge filter implementation in LibGit2Sharp.

B

I think the problem was libgit2 would buffer the contents in memory and then pass it to some function in C, and the updated version streams it, so we don't eat up all the memory.

B

But the really exciting part for me is the API. This is really just a JSON API front end for your Git server. We have an implementation for github.com, of course, but it's designed in a way that any Git server could implement it.

B

Other cloud hosts, or on-premise installs like GitHub Enterprise, or even the really small ones like Gitolite, could in theory implement this. And this should be run next to your Git server, so it can take advantage of that server's built-in authentication and authorization code.

B

So when you access Git LFS through the API, or when the client does, it knows who you are. It uses the same access controls that github.com does. And the client itself doesn't need to support all these different back ends. It just supports this one API, and then anyone can implement this API and use Git LFS.

B

And one of the cool things about running this server next to your Git server is that now your Git host can understand these large objects. They're not just text blobs. This is a Photoshop file that is being viewed through our render feature, and that file is stored in Git LFS.

B

I don't think you can see it, but up there in the corner is the actual file size, not the file size of the text pointer, which would be not even 200 bytes. So on github.com, the file looks just like a real file. It's not a text pointer.

B

And yeah, so here's what the API looks like. If you're not a JSON API person, then this is probably kind of Greek to you. This is an API call to download a file, and the server returns some JSON properties. You get the object ID, which is a SHA-256 signature of the object contents, then the file size, and then there's that links property, which includes some hypermedia links.

B

That basically just means the Git LFS API is telling the client how it can download the file. In this example, it's saying you can get it from git-lfs-server.com at this URL. It can also specify the HTTP headers to set. Here we're saying: set your Authorization header to this token so that you have access to download this file. Then the client will follow that link and download the file.

B

An upload request is similar, but you're sending the OID and the size to the Git LFS server. This request is saying: hey, I want to upload this file, tell me where to put it. That JSON output has more hypermedia links. There's an upload link, and that's saying: yeah, you can put that file on this git-lfs-server.com URL.

B

Again, it can pass in whatever headers it needs. And then there's also an optional verify hook.
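
To make the download case concrete, here is a sketch of a client turning such a response into an HTTP request. The JSON field names (`_links`, `download`, `href`, `header`), the URL path, and the token value are assumptions for illustration; the talk only says the response carries an object ID, a size, and hypermedia links that may specify headers. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical response body; every field name and value here is an assumption.
EXAMPLE_BODY = json.dumps({
    "oid": "0" * 64,  # placeholder for a SHA-256 hex digest
    "size": 123,
    "_links": {
        "download": {
            "href": "https://git-lfs-server.com/objects/" + "0" * 64,
            "header": {"Authorization": "Token example-token"},
        }
    },
})


def download_request(body: str) -> urllib.request.Request:
    """Build (but do not send) the GET request described by the download link."""
    meta = json.loads(body)
    link = meta["_links"]["download"]
    # The server, not the client, decides the URL and any required headers.
    return urllib.request.Request(
        link["href"], headers=link.get("header", {}), method="GET"
    )


request = download_request(EXAMPLE_BODY)
```

Because the link carries its own headers, the client does not need to know which storage service sits behind the URL.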

B

If the location of the files is separate, the client may need to talk back to the Git LFS API to say: hey, I uploaded the file, you can make it available. As a real-world example, on github.com, when you use this, we're going to return S3 links with the headers necessary to sign the request, and that will give you just temporary access to that key, to either upload or download it.

B

But we don't have access to S3, really. I mean, it is our S3 account and we can set up some stuff on the back end, but I wanted to build this into the API in case people want to put this in front of other storage services. So once the client has uploaded to S3, it talks back to the Git LFS API and says: hey, I'm done, verify it.

B

Then once GitHub has verified it, it can mark the object as ready for other clients to download.
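
The upload flow with its optional verify step could be planned like this. As before, the field names (`_links`, `upload`, `verify`, `href`, `header`) and the HTTP methods (PUT for the upload, POST for the verify call) are assumptions, and the requests are only constructed, never sent.

```python
import json
import urllib.request


def plan_upload(body: str) -> list:
    """Turn an upload response into the ordered requests a client would send.

    Hypothetical helper: the first request stores the file, and the second
    (present only when the server asks for it) reports back to the API.
    """
    links = json.loads(body)["_links"]
    upload = links["upload"]
    requests = [
        urllib.request.Request(
            upload["href"], headers=upload.get("header", {}), method="PUT"
        )
    ]
    verify = links.get("verify")
    if verify is not None:
        # Storage is separate from the API, so tell the API the upload finished.
        requests.append(
            urllib.request.Request(
                verify["href"], headers=verify.get("header", {}), method="POST"
            )
        )
    return requests
```

When the response has no verify link, the plan is a single request, which matches the "optional" nature of the hook.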

B

So, authentication. This was a big part of it. It's making API calls from the Git LFS client, and those API calls will require some form of authentication. But we did not want people to set up dual passwords. Since the server is hosted alongside your Git server, it should be able to, depending on the implementation, take the same passwords or tokens or whatever.

B

So if you're using HTTPS remotes, there's an internal git credential command, and it can say: hey, I want to talk to github.com, do you have a stored password? And git credential will just send it back, because if you're using HTTPS remotes, you've probably already entered your password before.

B

Yeah, so the way it knows where the LFS server is: it takes your remote URL and, by default, if there's no other configuration, it adds a suffix to it. So here, for the default, it adds this /info/lfs extension.
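
The default rule is simple enough to write down directly. This sketch covers only HTTPS remote URLs and assumes the suffix is appended to the remote URL as-is; `default_lfs_url` is a hypothetical name, and the real client may normalize the URL further.

```python
def default_lfs_url(remote_url: str) -> str:
    """Derive the LFS API endpoint from a Git remote URL by adding a suffix."""
    # Strip a trailing slash so the result never contains "//info/lfs".
    return remote_url.rstrip("/") + "/info/lfs"


# Example with a made-up repository URL.
endpoint = default_lfs_url("https://github.com/user/repo.git")
```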

B

So on our servers, we have HAProxy set up with all these links, and it's looking at the different URLs. If it's github.com, just like the home page, it's going to the Rails app. If it's a Git URL, then it sends it to our Git service. And now, if it has this /info/lfs suffix, it sends it to our LFS server.

B

But we don't want to force this on people, because you may be using a Git host that doesn't support the Git LFS API. Today there are no services besides our reference implementation; this isn't even quite available yet on github.com. So you can set lfs.url in the Git config and it will use that instead. This could be a server on Heroku or whatever, using whatever back end you want.

B

You can also set a custom LFS URL per remote. So maybe your origin is from GitHub, and then you have another branch going to a separate Git server, say Bitbucket, but maybe they don't have LFS support yet. Or maybe you use GitHub but you don't want to put your files on our service.

B

You just want to use S3 or whatever. You can do that too.
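
The lookup order for the LFS endpoint might be sketched as follows, with the Git config modeled as a plain dictionary. The key `lfs.url` comes from the talk; the per-remote key name `remote.<name>.lfsurl` and the precedence order (global setting first, then per-remote, then the derived default) are assumptions for illustration.

```python
def resolve_lfs_url(config: dict, remote: str) -> str:
    """Pick the LFS endpoint for `remote` from a dict of Git config values."""
    # 1. An explicit lfs.url setting (assumed here to win over everything).
    if "lfs.url" in config:
        return config["lfs.url"]
    # 2. A per-remote override; the key name is an assumption.
    per_remote = config.get(f"remote.{remote}.lfsurl")
    if per_remote:
        return per_remote
    # 3. Otherwise derive it from the remote URL with the default suffix.
    return config[f"remote.{remote}.url"].rstrip("/") + "/info/lfs"


# Example: origin uses the default, a second remote points somewhere custom.
config = {
    "remote.origin.url": "https://github.com/user/repo.git",
    "remote.other.url": "https://bitbucket.org/user/repo.git",
    "remote.other.lfsurl": "https://lfs.example.com/user/repo",
}
```

With this config, `origin` resolves to the derived default and `other` resolves to its custom URL, so files can live on a different service from the Git host.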

B

Also, not everybody uses HTTPS remotes. A lot of people use SSH. So part of Git LFS is a new SSH command that it runs, and basically it returns the header necessary to authenticate with the API, so you don't have to mess with git credential setup. The initial announcement and release is just version 0.5 of the client library, and we don't have full support on github.com yet. There's a waiting list.

B

I should have had the URL somewhere: you can go to git-lfs.github.com or go to the blog and read about it, and when we open up the wait list, then you can start using it on GitHub. But the project is still new, so I want to go over some of the bigger ideas with it.

B

I feel like... oh geez. So one of the things I want is narrow downloads: the idea that when you check out a repository with lots of files, you don't necessarily need to download everything.

B

Maybe you're a music composer and you just want the audio files, or whatever is in a specific directory. So that's one of those ideas that we're kicking around. Another one: this is not a popular idea in Git, because branches should make this obsolete, right? But this is something very important to these users, because maybe two people start touching the same Photoshop file, and that's not a format that you can really merge.

B

So it'd be nice if the second person could be notified: hey, someone else is in that file, maybe you should talk to them, or wait, or do it in another branch. Then you're not competing with them.

B

Another thing too: the actual Git LFS client right now is written in Go. I love Go, but for this project that doesn't really matter. What matters is that we can put out a statically compiled binary that users can download. They don't need to install Go, and they don't need to install Ruby or Python with the right version and all the dependencies. As a Ruby developer, it's not that difficult, but it's not something I would wish on someone that isn't a Ruby developer.

B

So yeah, that's my talk. We'd love feedback. I would love for other hosting services to look at this, and maybe we can come up with some solution that we can all use and agree on. This is the current core team: that's myself and rubyist. And that's it. Let's go drink.

B

I guess I have three minutes for questions, or you can just grab me later. There's a question down here. Okay, so he's asking what happens if you have an existing repo with large files.

B

Then you have to go through the one-time, painful process of rewriting your history, pulling those objects out, and retroactively adding LFS support. Or you could just say screw it, we've been using this repository and, starting today, we'll just use Git LFS. That's not something I would ever do on the GitHub repo, because we have vendored gem files. Cool, anyone else?

B

He's suggesting that we use BFG, so yeah, I'll look into that.

A

All right, cool. Another question up here? Oh, we have time for one more.

B

I'm sorry, what about garbage collection? Oh, garbage collection, yeah. So are you talking about on the local machine or on the server? Yeah, I mean, it's up to the server to implement it. So they would have to know that, if a branch gets deleted, they'd have to go through and delete the.

B

Yeah, so Dave is asking about garbage collection on the client and prefetching. I haven't even thought about prefetching right now. I would love to talk more about that. And I think John, who's going up after me, has got some ideas on garbage collection on the client, so you'll hear what he has to say in a bit, I guess.

B

All right, cool. Thank you.
