GitHub Git Merge 2016, 16 Nov 2016

From YouTube: Adopt Git at Scale and Stay Sane, Lars Schneider - Git Merge 2016

Description

Git is hard to grasp. This is especially true if you've worked for decades with centralized version control systems such as Perforce, CVS or SVN. This talk is about our journey to on-board 4000+ engineers with 200+ different code bases to Git at Autodesk. Amongst other things, Lars will cover how we manage the Git client setup at that scale as well as Autodesk's experiences with large Git repositories.

Lars Schneider is the technical lead for Git at Autodesk, working out of Berlin, Germany. His current work involves migrating large codebases to Git and establishing Git workflows within teams. He is a Git contributor, the author of the ShowInGitHub Xcode plugin, and is an avid kiteboarder.

A

Welcome, everyone. My name is Lars Schneider and I'm the tech lead for Git and GitHub solutions at Autodesk, and today I want to talk about how we use Git at scale. So why is adopting Git a challenge for us? Well, Autodesk has been around for more than 20 years, and we have more than four thousand engineers working on hundreds of projects. Some of these projects you probably know, like AutoCAD, Inventor, and Revit. And for historical reasons,

A

These projects use all kinds of different version control systems. We used to have plenty of Perforce servers, Subversion servers, Mercurial servers, TFS servers, Git servers, pretty much anything you could think of. We realized that this is bad for collaboration, and for that reason we decided last year to move all our development to Git on top of GitHub Enterprise.

A

So what are my topics today? First, I want to talk about how we ease the Git onboarding for our engineers. Keep in mind that most of our engineers haven't used Git before, at least not professionally, so we want to make it as smooth as possible for them to onboard onto Git. The second part of the talk is about our Git usage recommendations.

A

As you all know, Git is a pretty complicated tool and you can use it in all kinds of different ways. We want to make sure that our engineers use it in the best way possible, so that they are productive right from the start. And last but not least, I want to show you how we monitor our Git usage.

A

This is important because with the monitoring we can identify inefficiencies early on, and then we can talk to the teams and developers and help them use Git in a better way.

A

So, let's get started with the onboarding. The first thing you need to decide at a company is what kind of protocol you recommend to your engineers when they use Git. The Git protocol is no real option because we require authentication and it offers none. The SSH protocol is great, but the initial setup, especially on Windows, is kind of tricky. For that reason we recommend the HTTPS protocol, which works great on all platforms. So our engineers started to use the HTTPS protocol, and pretty quickly they asked us:

A

Why do I have to type my password all the time? We got plenty of support tickets asking this question, and as all of you know, the answer is pretty easy: you just need to set up your credential helper and then you don't have this problem anymore. Another support ticket that we got quite a couple of times was: why are my commits not associated with me? Again, for an experienced Git user that is easy to solve: you just need to set up your Git config properly.

A

We got these really tiny, trivial problems all day long, and we realized that trivial tasks performed by more than 4,000 people actually generate many support tickets. So we thought: what can we do to make this better? We came up with Enterprise Config for Git. What is this? It is a painless Git setup with an easy way to share Git configs and scripts within a company. So how does it work?

A

Enterprise Config is built around a setup command, and in order to make sure our engineers really get that this setup command sets up their environment for Autodesk, we call this setup command git adsk. When you run it, a couple of things happen. First, we self-update, so that we ensure that the engineer has the latest version of our config.

A

Second, we check the local Git version that is installed. This is important: as you have probably heard, there is this security vulnerability in all Git versions prior to 2.8. For these kinds of things it matters, because we can actually tell the engineer: your machine is outdated, do something about that. We also check Git LFS. We check if it is installed; if it's not installed, we install it, and if it is installed, we check that the right version is installed.
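
A version check like the one described could be sketched as follows. This is only an illustration, not the actual Enterprise Config code: the function names are made up here, and the minimum version is simply the "prior to 2.8" figure mentioned in the talk.

```python
import re

# Assumed threshold, taken from the "prior to 2.8" remark in the talk.
MINIMUM_GIT_VERSION = (2, 8, 0)


def parse_git_version(output: str) -> tuple:
    """Extract (major, minor, patch) from text such as 'git version 2.7.4'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    # A missing patch component is treated as 0.
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


def is_outdated(output: str, minimum: tuple = MINIMUM_GIT_VERSION) -> bool:
    """Return True if the reported version is older than the minimum."""
    return parse_git_version(output) < minimum
```

The text to check would typically come from running `git --version`, for example via `subprocess.run(["git", "--version"], capture_output=True, text=True).stdout`.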

A

This is really important because we have a lot of repositories that use LFS and we want our engineers to have a seamless experience when they interact with them. The third thing that it does is set up your local user properly. This is really easy to do in an automatic fashion, because we use GitHub Enterprise and our Enterprise instance already knows your name and your email address. So we can just use the GitHub API to fetch this information and run the proper git config calls.

A

And last but not least, we set up the credential helper for the respective machine. The credential helper is actually a pretty interesting topic. If you use the HTTPS protocol on github.com, then I want to encourage you to try this little Bash snippet in your shell.

A

What you will realize when you run it is that you can see your HTTPS credentials in plain text. This is bad, but it is especially bad in an enterprise environment, because there these credentials are usually used for pretty much anything: email, HR records, anything.

A

Fortunately, GitHub already knows about this problem and they came up with a really good solution: personal access tokens. What our Enterprise Config tool does is issue a new token for you, for your machine, and set up your credential helper with this token, so your domain password is kind of secure.

A

These four things are rather general and could be interesting to other companies as well, but we do some additional stuff on top. For instance, our engineers have to sign a source code agreement and have to renew it, I think, every six months or so, and we check these kinds of things in our setup step as well. So if they have to renew it, we can give them clear instructions on what they need to do, and again we reduce the number of support tickets, which is our main task here.

A

So how does Enterprise Config for Git actually work? In essence, it is only a Git config, and it's a Git config that we distribute to basically the entire company. In this Git config we add an alias that we call adsk, and this alias calls the setup script, which does all these things. This company Git config is then included, with the Git config include mechanism, into the engineer's global Git config. So why do we do it that way? Well, first, this is cross-platform.

A

So it works on Windows, Mac, and Linux. That's important for us because we use all these platforms. Second, it works out of the box in any shell, so you don't need to set up some path or something; that is just not necessary. It works wherever Git works, so that's great. It's also very easy to install. This is what we tell our engineers: they need to clone our enterprise configuration repository and then they need to set up

A

This include path for their global Git config, and that's it; then it just works. This works really well on Mac and on Linux, because Git is usually already installed. On Windows that is not the case, and for that reason we made a custom version of the Git for Windows installer, which basically does this automatically, so our Windows setup is really small. And today we made this Enterprise Config for Git actually open source.

A

You can go to github.com, search for Autodesk or for Enterprise Config for Git, and you can see how we implemented it. Maybe it's even useful for you: you can fork it and then adjust it to your own needs.
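
The two pieces described above, a shared config file that defines an alias and an include of that file from the global config, could look roughly like this. It is a minimal sketch under stated assumptions: the file and script paths are hypothetical, the helper names are invented for illustration, and the real Enterprise Config alias may be defined differently. Only the adsk alias name comes from the talk.

```python
def shared_config_text(alias: str, script_path: str) -> str:
    """Text of a shared Git config file that defines one alias running a script."""
    # A leading '!' makes Git run the alias value as a shell command.
    return f"[alias]\n\t{alias} = !sh {script_path}\n"


def include_command(shared_config_path: str) -> list:
    """Command line that includes the shared file in the user's global config."""
    return ["git", "config", "--global", "include.path", shared_config_path]


if __name__ == "__main__":
    # Hypothetical locations inside a cloned configuration repository.
    print(shared_config_text("adsk", "~/enterprise-config/setup.sh"))
    print(" ".join(include_command("~/enterprise-config/config.include")))
```

Because the alias lives in a file that Git itself reads, nothing has to be added to the shell's PATH, which matches the "works wherever Git works" point made in the talk.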

A

All right, now I want to come to the second part of the presentation: what are the usage recommendations for Git that we give our engineers? The first one is pretty obvious to probably most of you: create branches.

A

Why do we need to tell this to our engineers? Well, as I said earlier, our engineers come from many different version control systems, and most of them don't support really easy branch creation. So that is something an engineer needs to learn: creating branches in Git is actually pretty easy and fun.

A

The second recommendation is: create many commits and push often to your private branches.

A

That is something really important, because your machine might die or your laptop might get stolen, and then your work is gone. If you push often, your work is always secure, and that is of course important for the company.

A

The third recommendation is rather controversial: we recommend merge instead of rebase, at least until the engineer really knows what rebase actually does. We realized that this helps a lot to avoid trouble with Git repositories, because with rebase one engineer can actually screw up the experience for the entire team, which is not good. GitHub is actually helping in that area

A

A lot, too, with a recently introduced feature: the protected branches feature. I really recommend you enable it for all your shared branches, because then, if people play around with rebase, the other engineers are safe. The fourth recommendation is kind of connected to rebase: we tell our engineers to avoid cherry-picking if possible. Why do we do this?

A

Cherry-picking sounds very interesting to people who come from version control systems that are delta-based, like Perforce. But when you cherry-pick a change from one branch to another, the change is equal, but the commit is not the same. That's a very important thing, because when you cherry-pick, you have the same change multiple times in your Git repository, and this is bad for two reasons.

A

First, when you look at the history, it might be confusing, because the same changes with the same commit messages are in there with different hashes, and you wonder what's going on. More importantly, when you have a conflict in this particular change, then you need to resolve the conflict in every branch that you cherry-picked the change to. So this really destroys the good merge experience in Git, and that's why we don't recommend cherry-picking for your default workflow.

A

You can use it in an exceptional case, but not by default. The fifth recommendation is again rather obvious: use .gitignore. Gitignore is a very good mechanism to avoid bad things in your repository. If you have a good ignore pattern, your people won't commit big temporary files that you then have to deal with. We always recommend this very good service, gitignore.io. If you don't know it, check it out; it's great for generating perfect ignore patterns.

A

Sixth, and this is a recommendation that really hurts us and is really complicated for us: avoid many files. If you have a Git repository that contains more than a hundred thousand files, then Git operations tend to get slow, and this is especially the case on Windows. Unfortunately, there is not too much you can do about it besides using an SSD or something.

A

The only real thing that you can do is split your code base, so we recommend our engineers, too, to take components of less than a thousand files out of their big repositories and put them somewhere separate. One good way to componentize a repository is to use submodules.

A

Submodules are kind of controversial because, I think, you only start to like them when you really understand how they work; otherwise they are just a little bit confusing at times. But for us they work pretty well, although one word of advice: they only work really well if you use a lot of automation around them. A couple of examples of what we do here: for instance, we detect if a submodule was updated, and then we generate a pull request against

A

The parent repository with the submodule update. Then an engineer sees the update and can just press merge to update the submodule. So this works great. Whenever you have updated a submodule on GitHub, you have probably seen that the submodule update is just a change from one hash to another hash, and that really doesn't tell you anything; it's actually quite confusing. For that reason we take these two hashes,

A

The old hash and the new hash, and generate the diff URL of the submodule and then add this as a comment automatically to the pull request. This way a reviewer can just click on this diff URL and see what has actually changed in the submodule, and that works great. One catch there, too: when people are not careful with their commits, they might revert the submodule to an older version, and this kind of stuff

A

We detect, too, and then warn the reviewer about that. So again, automation helps a lot with submodules.
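
Building such a diff link from the old and new submodule hashes could be sketched like this. The helper name and the example host are assumptions for illustration; the URL shape follows GitHub's compare view (`/compare/<old>...<new>`), and nothing here is taken from the actual Autodesk automation.

```python
def submodule_diff_url(repo_url: str, old_sha: str, new_sha: str) -> str:
    """Build a GitHub compare URL covering the range old_sha...new_sha."""
    base = repo_url.rstrip("/")
    # Submodule URLs often end in '.git'; the web URL does not.
    if base.endswith(".git"):
        base = base[: -len(".git")]
    return f"{base}/compare/{old_sha}...{new_sha}"
```

For example, `submodule_diff_url("https://github.example.com/team/lib.git", "abc123", "def456")` yields a link that a bot could post as a pull request comment.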

A

When you use submodules, we also recommend the following things. First, avoid nesting them. Consider three levels of submodules. If you make a tiny change on the third level, you have one commit; then you need to update the submodule pointer on the second level, that's two commits; and you need to update the submodule pointer on the root level.

A

That's a third commit. So you make a tiny change and need three commits. That doesn't really scale and it's not fun to use, so avoid nesting them; only use one level of submodules. Another piece of advice is to avoid more than 25 submodules. We made this number up, but it's kind of the number that we found works great.

A

If you use more than 25 submodules, then it really becomes slow. The main reason is that the submodule system in Git is implemented in Bash, and this is not particularly fast on Windows. But Stefan Beller from Google is improving this right now, and hopefully we will see a good speed improvement there in the future. And, last but not least, use Git LFS for large files. That's something we do and it works really well.

A

So now I want to look at two of these recommendations in more detail and show you how our Enterprise Config tool helps with them. The first one is git push. If you think about it, git push is actually pretty scary in an enterprise environment.

A

Consider the case of an engineer who just started with Git and is working on a secret company project. He has a Git problem, goes to Stack Overflow, finds a solution, and copy-pastes the solution, and for some reason it changes the remote to some public github.com URL. Then he runs git push. Well, the damage is done. But not if you use our Enterprise Config. In this case, you would get a fatal error like this: Attention,

A

Do you really want to push to github.com? In some cases pushing to github.com is actually legitimate, for instance if you make an open-source contribution or something. So we also let the engineer know: if you know what you're doing, you can run this command and then the push will actually go through. Unfortunately, there is no such thing as push protection in Git right now. So what did we do?

A

We had to be a little bit creative. We use the URL rewrite mechanism of Git config to rewrite the github.com URL to this very long message, "Attention, something something". When we run git push, Git will actually try to push to this very long message, and of course this will fail, because it's no real Git repository. But luckily Git shows the message, and that's what we want.
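
To make the trick concrete, here is a small simulation of prefix-based URL rewriting in the style of Git's `url.<base>.pushInsteadOf` setting, where a push URL that starts with a configured prefix has that prefix replaced. The talk does not say which exact setting is used, so treating it as pushInsteadOf is an assumption, and the warning text below is invented for illustration.

```python
def rewrite_push_url(url: str, rules: dict) -> str:
    """Rewrite url using rules that map a URL prefix to its replacement."""
    matches = [prefix for prefix in rules if url.startswith(prefix)]
    if not matches:
        return url
    # As in Git, the longest matching prefix wins.
    longest = max(matches, key=len)
    return rules[longest] + url[len(longest):]


# Hypothetical warning; the real message used at Autodesk is not in the transcript.
WARNING = "ATTENTION: do you really want to push to github.com? "
RULES = {"https://github.com/": WARNING}

if __name__ == "__main__":
    print(rewrite_push_url("https://github.com/someone/project.git", RULES))
```

The rewritten "URL" is not a valid repository location, so the push fails, and the error output contains the warning text.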

A

So this works great on the command line, and it works great in all the GUIs that use Git under the hood, like SourceTree. That's git push protection. I think push protection would be a pretty good thing for Git, and for that reason I wrote a proposal for a Google Summer of Code and Outreachy project, and we already have a student looking into this. So maybe we will have this feature in Git core pretty soon.

A

The second thing I want to look at in more detail is how we use Git LFS, so Git and large files. That's a topic you can read a lot about on the internet, and I would like to summarize the problem a little bit: files that change often and are large after compression are bad for Git. What do we mean by that? If you have a 10-megabyte file that you committed once and never touch again, that's no problem for a Git workflow.

A

Really, even if it's a big binary file, no problem. On the other hand, if you have a 100-megabyte XML file that you maybe change often, that's no problem either, because XML usually compresses really well and Git stores everything compressed.

A

For that reason, this whole Git and large files thing is not a really tangible topic. In order to make it more tangible for our users and to give them a rule of thumb, we came up with this little formula to help them: take the number of binaries that you have in your repository,

A

Multiply it by the average number of changes per year that you make to these binaries, multiply it by the average size in megabytes of these binaries, and then, if the result is smaller than one, you're good to go: you can just put everything in Git and everything is fine. You won't get into trouble anytime soon. If it's bigger than 100, then you need to find another solution for these files, and that other solution is Git LFS.
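
As a worked example, the rule of thumb can be written down directly. The function names are made up here, and the two thresholds (1 and 100) are exactly the ones that appear in this transcript; what to do for values in between is not stated, so the sketch reports that case separately rather than guessing.

```python
def binary_score(num_binaries: int, avg_changes_per_year: float, avg_size_mb: float) -> float:
    """Number of binaries x average changes per year x average size in MB."""
    return num_binaries * avg_changes_per_year * avg_size_mb


def advice(score: float) -> str:
    """Map a score to the guidance given in the talk (thresholds as transcribed)."""
    if score < 1:
        return "plain Git is fine"
    if score > 100:
        return "use another solution such as Git LFS"
    return "not covered by the rule as transcribed"
```

For instance, 50 binaries changed 12 times a year at 5 MB each give a score of 3000, well above 100, so those files would be candidates for Git LFS.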

A

Git LFS is actually so popular at Autodesk that we have repositories with hundreds of files in LFS, and then we realized that a git clone operation can be really, really slow. The reason is that Git LFS is based on Git's clean and smudge filters, and it processes these files sequentially and individually. So if you have a lot of files, it takes a long time. So I contacted the Git LFS core developer Rick Olson, and he gave me good advice.

A

He said: set this GIT_LFS_SKIP_SMUDGE environment variable and run git lfs pull afterwards, and it will speed up your cloning. This worked, but unfortunately only for macOS and Linux; on Windows it was still slow. So I experimented around and finally found a solution that is actually 13 times faster, and I wrote a new command for this using the Enterprise Config that we have. I call this git adsk clone, and it is actually lightning fast and works really well for our engineers.

A

Luckily, Steve Streeting from Atlassian picked up the idea and put this into Git LFS core, which is awesome, and it will apparently be released with the next Git LFS version, 1.2, in the near future.
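
The first piece of advice, cloning with the smudge filter skipped and then fetching the LFS content in one go, could be scripted along these lines. This sketch covers only that two-step approach, not the faster git adsk clone command, whose implementation is not described in the talk; the helper names are illustrative.

```python
import os


def skip_smudge_env(base_env=None) -> dict:
    """Copy of the environment with the Git LFS smudge filter disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_LFS_SKIP_SMUDGE"] = "1"
    return env


def clone_steps(repo_url: str, target_dir: str) -> list:
    """(argv, cwd) pairs: clone without LFS downloads, then pull LFS files in bulk."""
    return [
        (["git", "clone", repo_url, target_dir], None),
        (["git", "lfs", "pull"], target_dir),
    ]
```

Each step can then be executed with something like `subprocess.run(argv, cwd=cwd, env=skip_smudge_env(), check=True)`.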

A

Okay, now the last part of the presentation: our Git usage monitoring. When you want to find out what your users are actually doing with Git and GitHub Enterprise, you have two ways. One way is rather obvious: you can enable log forwarding, and we forward our logs to Splunk. So what can you do there? For instance, you can answer the question: who is cloning all the time? Why is this important?

A

Well, people who come from version control systems like Perforce have this idea: when I want to make a clean build, I delete everything and then I get everything from the server. Of course, with Git that is quite a bad idea, because, as you all know, you transfer your entire history, and that generates a lot of traffic that you don't want. Some people are a little bit more knowledgeable and they use the git clone --depth command to limit the amount of history that is transferred to the client.

A

This is good, but in order to process this, the Git server needs to create a special pack file for you, which requires quite some CPU cycles. So if everyone is doing that, it's not good for your Git server.

A

We want to find people who are actually cloning all the time. Then we look at the repositories that they are cloning and at the size of these repositories, and we generate a daily list. These are the teams that we talk to, and we try to find solutions for them that work better with Git.
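
A daily "who is cloning all the time" list could be derived from forwarded logs roughly as follows. The event shape used here, (user, repository, operation) tuples, is purely an assumption for the sketch; the real log format and the Splunk queries used at Autodesk are not shown in the talk.

```python
from collections import Counter


def top_cloners(events, limit=10):
    """Count clone operations per (user, repository) pair, most frequent first.

    events is an iterable of (user, repository, operation) tuples, a
    hypothetical shape chosen for this example.
    """
    counts = Counter((user, repo) for user, repo, op in events if op == "clone")
    return counts.most_common(limit)
```

Combining these counts with repository sizes would give a list like the one described, ordered by how much traffic each team's cloning generates.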

A

A better way, we think, is for instance to use git fetch and update your local repositories this way. In many cases this is possible; you just need to show the teams how it works. This is good for them, because their processes get way faster, since git fetch is faster than clone, and it's good for us, because we save bandwidth. The second, maybe not so obvious, usage monitoring way is the GitHub Enterprise backup.

A

When you create a backup, GitHub Enterprise will actually store a MySQL database dump that you can import into your own MySQL database, and then you can write all kinds of queries to mine this data. One of these queries is, for instance: who is actually using pull requests?

A

We think pull requests are a great thing, and we want to make sure our engineers know about them and leverage them. So we actually chart which teams are using pull requests, how many pull requests they use per day, how many they merge per day, and so on. This is very interesting to us, because we have a big number of offices all around the world and it's very hard to reach all of them.

A

But with this kind of data mining we can actually see which office is using pull requests and which office is not, and we can find them, talk to them, and maybe show them ways that they didn't know even existed. So this works pretty well for us.

A

So, to recap, what are the takeaways of this talk? First of all, script your Git onboarding. Git is already hard enough to use, so make it as easy as possible for your engineers. Second, credential helpers are not really secure, so use tokens. That's important.

A

Third, articulate your Git usage recommendations to help your engineers get started. And fourth, monitor your Git usage to help your engineers improve the way they interact with Git. Okay, thank you.

A