GitHub Git Merge 2018, 1 Apr 2018

From YouTube: Annotating diffs - Git Merge 2018

Description

Presented by Grant Mathews, Software Engineer, Atlassian

About GitMerge
Git Merge is the pre-eminent Git-focused conference: a full-day offering technical content and user case studies, plus a day of workshops for Git users of all levels. Git Merge is dedicated to amplifying new voices in the Git community and to showcasing the most thought-provoking projects from contributors, maintainers and community managers around the world. Find out more at git-merge.com

A

All right, hello. My name is Grant, I'm a developer with Bitbucket Cloud, and today I want to talk about annotating diffs. I spend a lot of my time reviewing code, and that typically means looking at changes in the form of diffs. So here's a diff. This is a minor variant on the unified diff format. It has a diff header, which has some header lines that we don't really care about, and it has header lines that tell you which file is changing.

A

Those we do care about. Diffs normally have one or more diff hunks. A diff hunk has all the changes for one part of the file. It has a hunk header, which tells you where those changes are occurring, and of course it has all the added lines and removed lines that you would expect to see, and it often has context lines, which are just unchanged lines. So hopefully none of that is too new for anybody, but that's unified diff.
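The parts of a unified diff named above (diff header, hunk header, added, removed and context lines) can be told apart by their leading characters. The following is a minimal sketch, assuming Git-style output in which each file section starts with a `diff ` line; the function name and the sample diff are made up for illustration.

```python
def classify_diff(lines):
    """Label each line of a Git-style unified diff by its role."""
    kinds = []
    in_hunk = False
    for line in lines:
        if line.startswith("diff "):
            # Start of a new file section: back to header lines.
            in_hunk = False
            kinds.append("diff header")
        elif line.startswith("@@"):
            in_hunk = True
            kinds.append("hunk header")
        elif not in_hunk:
            # index / --- / +++ lines before the first hunk.
            kinds.append("diff header")
        elif line.startswith("+"):
            kinds.append("added")
        elif line.startswith("-"):
            kinds.append("removed")
        elif line.startswith("\\"):
            # e.g. "\ No newline at end of file"
            kinds.append("other")
        else:
            kinds.append("context")
    return kinds


# Hypothetical sample diff, not taken from the talk's slides.
SAMPLE = [
    "diff --git a/greeting.txt b/greeting.txt",
    "index 1111111..2222222 100644",
    "--- a/greeting.txt",
    "+++ b/greeting.txt",
    "@@ -1,3 +1,3 @@",
    " hello",
    "-foo",
    "+baz",
    " world",
]
```

Tracking whether we are inside a hunk avoids misreading a removed line whose content itself starts with `--` as a file header.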

A

The changes that I look at are rarely this simple, though. They tend to look something more like this. This is just a small patch series that I pulled out of the Linux kernel for example purposes. There are only seven commits there, there are about 80 lines changing, and this is relatively clear, clean code, if you can see it. But when I'm looking at this amount of changes, I often want more context for a particular subset.

A

For example, I might want to know what the commit message is for a particular set of changes, and I don't want to have to go through every commit to find that. Oftentimes it's important to see if changes happened simultaneously, all in the same commit; that can be very important sometimes. And oftentimes there's just a lot of detail, a lot of information associated with a particular line. This one-line change has 137 lines in the commit message.

A

Most projects aren't like the Linux kernel, though. They'll typically put that level of detail in a pull request on a website somewhere. So when I review code, it looks more like this, where this is just part of a diff in a PR on Bitbucket. Even in this view, I often want more context. There's a drop-down menu that has extra stuff, and I think it would be cool if we could add a "Show commits" item to the drop-down menu, like that. And this isn't a product announcement.

A

I'm not promising anything. I just think it'd be cool to have the ability to show the commits that are associated with the changes that I'm looking at. And if we're being really ambitious, we could put those commits right in line with those changes, like that: just have a link straight from "this change is in this commit". That link would, of course, just take you to the commit page, where you may or may not have a useful commit message. But the Bitbucket team tends to have something very useful installed on their repositories.

A

It's an integration called Pull Request Commit Links. For every commit in the repository, it will try to find all the pull requests that that commit showed up in. So that's really useful.

A

For example, if you have a release pull request, which is essentially one pull request that is ten or twenty other pull requests all together, it's very convenient to be able to go and say, "Oh, where were these changes reviewed?" I use that a lot. So, back to the diff view: I think it'd be cool if that integration could have its own item in the menu, like that.

A

That might be useful. And if we're again being really ambitious, we could put the link to the PRs right next to the lines changed. I think that would be very useful functionality. So the question is: how do we tie the commits to the changes that we're looking at? What do we do? How would we represent that? I want to suggest an annotated diff format. This is what I think an annotated diff should look like. It kind of looks like a unified diff.

A

It has a header that kind of looks like a unified diff header. It has one or more hunks that kind of look like unified diff hunks. You may have already noticed the primary difference.

A

Each line gets prefixed with a commit, and that's the annotation. Also useful, I think, would be the original line number from when that change was introduced. That would be very useful if you're in a PR on a website and you want to attach a comment to a particular line of a change, and maybe have that comment show up in different contexts, maybe in subsequent PRs or if somebody's just browsing commits. So I think that'd be cool. There's one other piece of information we need, though.

A

To really make that work, we also need the file name, which can change during a diff, so we need to track the renames. We just throw that in the header, which already has that name information. A detail that will become important later: I feel that removed lines should show the commit that removed them. That just seems like the most logical thing to do for that annotation. Less important is how context lines get annotated. For my use cases, I don't particularly care about context. This could be all zeros.

A

It could be spaces; it's not important here. I just gave it the same hash as everything else, because that was literally the easiest thing to do. Not critical, but I think that would be a useful way to represent this information. Let's look at a slightly bigger example. The previous annotated diff only had one commit. This one has four, and you can see that all four show up in the rename list, which again just feels useful. And then you can look at the line numbers, and they do jump around.

A

We are tracking the original line numbers. Small detail, but again, I think that would be really cool to be able to generate and use in displaying diffs, for PRs in particular. So then the question is: how do we generate these annotations? What tools can we use? And Git, helpfully, already provides an annotate command.
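As a rough picture of the annotated format described so far, each diff line carries a commit and an original line number in addition to the usual `+`, `-` or space. The slides' exact layout is not in the transcript, so the field names and the rendering below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedLine:
    kind: str     # "+", "-" or " " as in a unified diff
    commit: str   # commit that added the line (or removed it, for "-")
    line_no: int  # line number in that commit
    text: str     # line content without the diff prefix


def render(line, hash_width=7):
    """Render one annotated line; this column layout is hypothetical."""
    return f"{line.commit[:hash_width]} {line.line_no:>4} {line.kind}{line.text}"
```

For example, `render(AnnotatedLine("+", "c0ffee1234", 3, "baz"))` gives a line that starts with the abbreviated hash, then the line number, then the ordinary diff line.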

A

So let's explore that and see if we can use it for generating annotated diffs. What git annotate does is look at a file and go through history, where in Git history is represented, of course, as a directed acyclic graph, with each commit as a node in that graph. You don't really need to know too many details about that, except that it's often called the DAG, and going through history is often called walking the DAG, so I'll probably say that a lot. So git annotate starts with a file, walks

A

the DAG, looks at lines that are introduced, and tries to match those lines to the lines that it has in the file. Eventually it will spit out an annotated version of that file, and here we see that there are hashes and there are line numbers. This is literally half of what we need. This is great. This is every added line, and even every context line, which we don't really care about, but still, cool for a diff. So let's see if we can really squeeze diff annotation in, given what Git already gives us.

A

So we generate a diff to start with; that seems like a good place to start. Then we call git annotate once for each file that shows up in the diff, so maybe a little slow, but that will give us half of what we need. Great.

A

The only thing left to do is get the removed lines, and git annotate has a reverse option, which will walk the DAG backwards, which is forwards through history. If you give it a file and a place to start, it will start at that place, look at the lines of that file, and then track the last time that it saw each line as it goes forwards through history. So it's not quite what we want. It doesn't annotate when the line was removed. It annotates when it was last seen.

A

So it's off by one in terms of what we want, but it's close. So we can't quite get what I really want out of the built-in Git commands and maybe a Perl script wrapped around those. And there are two problems. One, we would have to call git annotate twice for each file that shows up in the diff, so it'd be slow. Not too bad. But two, it's also not the annotations that we want.
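The two calls per file described above could be assembled roughly as follows. This sketch uses `git blame`, which the Git documentation describes as differing from `git annotate` only in output format; the helper names are made up, and the sketch ignores renames (the reverse call would need the file's old path).

```python
import subprocess


def annotate_commands(old_rev, new_rev, path):
    """Build the forward and reverse blame commands for one file."""
    rev_range = f"{old_rev}..{new_rev}"
    # Forward: which commit in the range introduced each line of the new file.
    forward = ["git", "blame", rev_range, "--", path]
    # Reverse: the last commit in the range in which each old line still existed.
    reverse = ["git", "blame", "--reverse", rev_range, "--", path]
    return forward, reverse


def run(command, repo_dir):
    """Run one command inside a repository and return its output as text."""
    result = subprocess.run(
        command, cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return result.stdout
```

As the talk notes, the reverse output reports where a line was last seen, not the commit that removed it, so it would still be off by one for this purpose.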

A

So, all right, in theory, I think we can get what we want and be slightly more efficient about it. In theory, we'll try to replicate most of what git annotate does, but do it so that it generates the thing that we want. So we start with a diff. Reasonable. Then we walk the DAG,

A

just like git annotate does, and we know that git annotate does exactly what we want for added lines, so that shouldn't be contentious. And when we implement this ourselves, at each step in the walk we can look at every file that we care about, instead of having to do multiple walks per file, so it's slightly more efficient. And we can walk backwards through the DAG, just like git annotate's reverse option does, but instead of comparing to a file, we can compare to the diff: take the diff at each point and see,

A

"Oh, this removed line matches up with that removed line," and in theory we can annotate a diff. So that seems like a reasonably obvious approach to take. And when I was looking at doing this, I thought, okay, I can probably be clever and be faster about it. So I worked on an optimized approach, and one of the requirements to be faster is to walk the DAG once. DAG walks can be slow, so I wanted to minimize the number of DAG walks. And, of course, any real implementation wouldn't actually walk the DAG twice.

A

It would generate a cache of things that it had seen on the first walk. And I didn't want to use a cache, because that is differently expensive. For example, I think two weeks ago there was a Bitbucket Cloud support ticket about somebody who had an 11-million-line diff that wasn't rendering properly for some reason, and I don't want to store 11 million lines, especially multiple times, in a cache somewhere where I'm blocking other things. So: one walk, no cache trickery.

A

That was my goal. I was thinking about it, and I thought, okay, we can do this if we just build up the annotated diff incrementally as we go. And I'm not sure if I implied it, but annotating a single commit is kind of trivial.

A

Everything gets the same annotation, and so you can do that for every commit that you see. Then all we have to do is figure out a way to combine those as we walk through, to build up a diff, and that should, without using a cache, allow us to get away with one walk. And if we're building up an annotated diff as we go, and we have a diff at each step, we don't even need that initial diff.

A

That becomes extraneous, and getting rid of it is kind of in line with another performance constraint: never keep more data than we need. So that's no cache, no extra diff, one DAG walk: the super clever optimized approach. I came up with that approach without really digging into the problem, and Donald Knuth has a quote that comes around from time to time. He says, "Premature optimization is the root of all evil." So when I talk about my optimized approach for the rest of the talk, you should keep that quote in mind.

A

That is important. But moving on, let's look at the simplified single-file case and see if we can actually make this work. Is it actually going to do what I want it to do? So here we have three commits, C on top of B on top of A, and then the state of a file at each commit. We want the diff from A to C, just dropping foo and adding baz, and we want to know: can we annotate that incrementally, step by step? So let's try it.

A

Let's step to C and generate the diff from C to B. We know that those changes were all introduced only by commit C, so the annotation is kind of trivial. Boom, there it is. And it's worth noting that if we were just annotating from B to C, just that, we would spit that out and we'd be done.

A

That would be a completely valid annotated diff. But we're going all the way to A, so we have to step to B and generate the diff from B to A. We know that those changes were only introduced by commit B, so it's trivial to annotate. Cool. Now all we have to do is figure out how to combine these two annotated diffs. All right, you can kind of look at it and stare and say, all right, foo with foo, baz with baz. It should look something like that.
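The single-commit step above can be sketched with Python's standard `difflib`: every changed line in the diff between a commit and its parent gets that commit as its annotation. The file states are not visible in the transcript, so the three states below are hypothetical, chosen only so that the A to C diff drops foo and adds baz.

```python
import difflib

# Hypothetical file contents at commits A, B and C.
STATE_A = ["foo", "bar"]
STATE_B = ["bar"]
STATE_C = ["bar", "baz"]


def annotate_step(commit, parent_lines, commit_lines):
    """Annotate every changed line between a parent and its commit."""
    annotated = []
    matcher = difflib.SequenceMatcher(None, parent_lines, commit_lines, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # context lines are not annotated in this sketch
        annotated += [(commit, "-", text) for text in parent_lines[i1:i2]]
        annotated += [(commit, "+", text) for text in commit_lines[j1:j2]]
    return annotated


step_c = annotate_step("C", STATE_B, STATE_C)  # changes introduced by C
step_b = annotate_step("B", STATE_A, STATE_B)  # changes introduced by B
```

Here the two steps do not interact, so placing B's removal before C's addition already matches the A to C diff. The harder cases, where lines from different steps collide, are what the rest of the talk deals with.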

A

That seems doable; that's a reasonable sequence of steps to take. So this looks promising. This looks like, oh, I can actually maybe code this up. But before I do that, let's see if I can ignore anything to make it easier. So, a list of things to not worry about. First off, multiple files. Am I going to have a weird situation where I have a ton of files to track, and they overlap, and weird stuff happens? And it turns out,

A

No, of course not. Each file is entirely independent in the diff. Just in generating a diff, you have independent files, and it tells you, "Hey, there's your file." All you have to do is track it. Computers are good at that. So that turns out to be easy.

A

What about renames? Renames get weird. All sorts of things happen. It's kind of difficult to detect renames. And again, it's the same case: just generating the diff, yeah, it's hard for that code, but it spits out the rename, so I never have to worry about that. And since each file is entirely independent, it turns out all of the interesting cases happen just in one file. So I don't have to worry about that case at all. Easy.

A

What about merges? Now, merges are very important. That was my release pull request case, which is basically just a series of merges, and I really want that to be annotated, and annotated correctly. So let's quickly look at that. If you have a merge, and there's your DAG, it's not clear that that can work just by going incrementally.

A

One by one, step by step, you kind of have that left part that looks like, okay, yeah, I know how to annotate that part, and maybe that right part I could also annotate, and they start and end at the same place. So I'll assume that I can handle that later. Just to get started, I will ignore it.

A

Hopefully I'll come back to it and get that working. Branches, on the other hand: what if you have the heads of two branches of development, and you want to compare those, and you want to annotate the diff? It's very easy to generate a diff between two heads; you do it all the time. But what would the annotation for that look like? At first, it kind of seems like the merge case. You have that left side, where, oh yeah, I can generate an annotation for that, and then you have the right side.

A

Oh yeah, I could build up something there. But they don't share the same starting point, so it's not clear that they could be easily combined. You might be able to get around that, say, by flipping the direction of one of the walks, as in, you go down, then up. But that gets weird, where you're going backwards. So are you changing your added and removed lines as you go along? Say you annotate a diff and you get a removed line with a commit.

A

If you go to that commit, will it be an added line? Thinking about this one hurt my head, so I just said no. Forbidden. My code will just crash. Well, it will crash with a nice message: "Yeah, don't do that." But that really simplifies the problem. That means, okay, we only care, for now, about the linear single-file case.

A

Great. All that's left to do is line up the lines. What that means is that we have to track changes to line numbers through history. We'll be coming across added and removed lines as we walk the DAG, and we just need to know how things shift, and keep track of how things shift, so that we can match up lines: oh, these lines collide. And when that happens, we have to decide what to do with the lines.

A

That's a little hand-wavy, so let's look at a more concrete example. Say you're walking through the DAG, and you have an annotated diff here, and later in that same file you run into these changes.

A

Can we line those lines up and decide what to do with them? Looking at the top annotated diff, you might say, "Oh, B1 bar, that doesn't seem to match any line in the other diff." It's annotated as B1 bar, meaning it goes to line 1 in commit B. But there's more we can get just by parsing that hunk.

A

We know that it's actually from line 2 as well, and so we track all the line numbers we possibly can for this. And "from line 2" matches up directly with "to line 2" in that next change. So, oh, that actually matches up nicely. We know how to combine those. Great. Let's look at a slightly different case. What if there's a bunch of lines coming in, shifting things down? What happens if we have this annotated diff, and then we run across these changes later?

A

You might say, "Oh, D3 baz, there is no line 3 in that other diff." But it turns out this is actually the same case. D3 is to line 3 in commit D, but it's actually from line 1, and we know that with no magic, just by reading the hunk header. And so we know that D3, or rather "from line 1" in D, matches directly with the two C1 changes in the next diff.

A

Great. So there are two things to note here. First off, we only have to track line numbers. Well, I mean, I have to look at line contents to verify things, but the computer can get away with only tracking line numbers. Nice, efficient. And second off, you can't just ram these two together. You'll end up with a removed line in the middle of a bunch of added lines. That's bad diff etiquette!

A

You need to sort lines or track contiguous hunks of changes. Details. But once you handle those, you can get a nice annotated diff out of that. Cool.
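The "from" and "to" line numbers used in the matching above can be read off a hunk without looking at line contents: the hunk header gives the starting positions, and each line advances one or both counters. A minimal sketch, with a made-up function name and example hunk:

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def number_hunk_lines(header, body):
    """Return (sign, from_line, to_line, text) for each line of a hunk."""
    match = HUNK_HEADER.match(header)
    if match is None:
        raise ValueError(f"not a hunk header: {header!r}")
    old_no, new_no = int(match.group(1)), int(match.group(3))
    # With a count of 0, the start names the line *before* the change.
    if match.group(2) == "0":
        old_no += 1
    if match.group(4) == "0":
        new_no += 1
    numbered = []
    for line in body:
        sign, text = line[0], line[1:]
        numbered.append((sign, old_no, new_no, text))
        if sign in " -":   # context and removed lines exist in the old file
            old_no += 1
        if sign in " +":   # context and added lines exist in the new file
            new_no += 1
    return numbered
```

For the hypothetical hunk `@@ -1,2 +1,2 @@` with body `-foo`, `+bar`, ` qux`, the added line `bar` goes to line 1 but sits at old-side position 2, which mirrors the talk's "goes to line 1, but is from line 2" observation.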

A

So this seems like something where it's easy to come up with a finite number of cases and figure it out, maybe a write-it-on-the-back-of-a-napkin type scenario. And so I did that. I went through every case: plus meets plus, minus meets context, that sort of thing. I came up with "these cases are obvious, these cases are forbidden," coded it up, ran it through some tests, and found out that, of course, the forbidden cases aren't so forbidden. That's fine. You just go back.

A

You re-examine your model, which is still relatively simple, change all the code, run it through some more tests, and find out, oh no, this isn't quite tracking line numbers correctly. That's fine. You figure out those details. And I only rewrote this code about three or four times; it kind of felt like thirty or forty. But at the end of this, I got a good set of code that, hey, this will annotate diffs, and it works almost every time.

A

I say almost because there was a problem, and that problem would show up if you're going through the DAG, you run across a change, and then later in that same file you run into the opposite change. If somebody makes a change, then undoes that change, you would expect that to show up in a diff as either context or to just not show up at all, because nothing changed. Of course, what my code was doing was ramming all the lines together, which is technically what happened.

A

I mean, that's technically accurate, but nobody cares that it's technically correct. They just want it to look like they expect it to. So I needed a way to fix this problem, to make it go away. So, fixing the problem: nobody cares that it's not technically broken, but I just needed to figure out which lines were duplicates. Really simple, and computers have a great way of doing that: it's diff. So in generating this diff, after calling diff all these times,

A

I then had to re-diff the results, and I had to do that after each step in the walk to avoid these sorts of things snowballing, because if you have an error and you keep doing stuff, it will build. What that looks like is, if I have a suspicious set of changes, I look at that: okay, there might be some repeated lines. Pow, run it through diff, and okay.

A

There we are: out pops a better diff. And the important thing here is that, besides calling diff (not the command, but still the algorithm) multiple times, which slows things down, we're now looking at the contents of the line. We can no longer get away with just tracking line numbers. So for an approach that was supposed to be optimal, or at least optimized and fast, this is very bad. It slows things down quite a bit. So that was a depressing discovery, but at this point the code was almost feature complete.
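The re-diff step can be illustrated with `difflib`: take the removed lines and the added lines of a suspicious region and diff them against each other, so that a change followed by its undo collapses back into context. This is a simplified sketch under that assumption, not the speaker's code, and it drops the commit annotations for brevity.

```python
import difflib


def rediff(removed, added):
    """Re-diff the removed and added sides of one region of changes."""
    result = []
    matcher = difflib.SequenceMatcher(None, removed, added, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Removed and then re-added unchanged: really just context.
            result += [(" ", text) for text in removed[i1:i2]]
        else:
            result += [("-", text) for text in removed[i1:i2]]
            result += [("+", text) for text in added[j1:j2]]
    return result
```

For instance, `rediff(["foo"], ["foo"])` yields a single context line. The cost the talk points out is visible here too: this step has to compare line contents, not just line numbers.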

A

It handles linear paths just fine. What about merges? So briefly: if you're walking through the DAG, you've got some changes that you're tracking, and you run across a merge, what do you do? A merge, when you're going backwards through history, kind of looks like a branch: you get two separate paths that pop up. All right, at this point I had given up on being clever, and I had given up on being efficient.

A

I just did the easiest possible thing: copy the entire annotated diff once for each parent. Great, now I have multiple annotated diffs. For each one, you just track the parent that it expects to see next, so that when you walk the DAG and run into a commit, you feed it to the correct annotated diff, annotate, and combine as usual. Great. Keep walking, get another commit, feed it to the correct annotated diff, annotate and combine. Cool. Keep walking.
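The copy-per-parent bookkeeping might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `parents` mapping, the newest-first `order`, and the `annotate` callback standing in for the real annotate-and-combine logic. It stops before the recombination at the common ancestor.

```python
import copy


def walk_with_merges(order, parents, start, annotate):
    """Route each commit to the annotated diff copies that expect it.

    order:    commits newest-first, excluding the base of the diff
    parents:  commit -> non-empty list of parent commits
    annotate: function (annotated_diff, commit) -> new annotated_diff
    """
    walkers = [{"expects": start, "diff": []}]
    for commit in order:
        # Snapshot the list: copies created below wait for later commits.
        for walker in [w for w in walkers if w["expects"] == commit]:
            walker["diff"] = annotate(walker["diff"], commit)
            first, *others = parents[commit]
            for parent in others:
                # Merge commit: one full copy of the diff per extra parent.
                clone = copy.deepcopy(walker)
                clone["expects"] = parent
                walkers.append(clone)
            walker["expects"] = first
    return walkers
```

With a merge M of L and R, both on top of a base commit, the walk ends with two copies that each expect the base next, which is the point where the talk's recombination step would take over.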

A

So the interesting part here comes when you run across that common ancestor, the original branch point, and you have to recombine the annotated diffs that you've been building up so carefully. But it turns out that's just line matching and decision logic, and I've already done a lot of that. This is slightly different, but not terribly different. You need to keep track of things like which commits you've run across that were merge commits, so that you can decide the proper annotation for every case, but it ends up being relatively straightforward.

A

You can combine those diffs, no problem. So, cool, I ended up with a usable, well, an okay implementation of annotated diffs. Let's do a scorecard here. How did I do? First off, I got away with one DAG walk. Cool. I got it without even using a cache. Nice.

A

One issue, though, is that it generates an unexpected diff. There's no uniqueness guarantee in diffs. You can represent the same changes many different ways, and since I was generating diffs in an entirely different way than Git generates diffs, I was probably going to end up generating something slightly different in, say, which line is removed versus which is context. So that's not great. That's not what users expect.

A

That's bad. And then there are the extra diff steps, where I had to re-diff all the lines, and again, for an optimized approach, that is very, very bad.

A

So overall, this is clearly not optimal, but it was functional. At some point after writing this code and coming to this conclusion, I got roped into going to a Mercurial sprint, and at some point during the Mercurial sprint, I got roped into talking about annotated diffs. So I presented my "oh hey, cool, there's this incremental thing that I'm doing," and I got some feedback on my approach, and I said, "Yeah, no, you're right, this is pretty bad. But you guys are DAG experts. You live and breathe DAGs."

A

"Tell me the correct way to do this." And they said, "Okay, so you start with a diff." I'm like, okay, yeah, sure, sounds reasonable. "And then you walk the DAG forwards." I'm like, okay, I'm with you, sure. And they said, "Then you walk it backwards." I'm like, whoa, backwards DAG walk, too slow, can't do that. And they said, "Of course you wouldn't walk the DAG backwards. You would build a cache on the first walk." And, whoa, a cache?

A

No. And I was really digging in my heels about what they were recommending being the slow way. But I realized afterwards that I hadn't actually gone back and asked myself, wait, why is it so slow? I hadn't re-examined it in depth. So let's do that now. This is the slow way, the obvious way, the initial, theoretical, straightforward approach to annotating diffs. So, starting with the diff: what does that cost us? It turns out it limits the number of lines you examine.

A

If you have the diff, all the changes that you care about, and you step through that, walking the DAG, and you find a set of changes that aren't in that diff, you don't care about them. You have to track line numbers, but you can throw everything else out: okay, cool, skip it. And that works at the file level too.

A

If you run across a file, and that file is not in the diff, throw the whole thing out. So that's occasionally a nice optimization that you can do. You can also skip context during the walk. The only reason that we tracked context in the incremental approach is that we need to spit context out at the end.

A

We have to keep that and move it along with everything else, so that we generate a reasonable diff for the user. But generating that initial diff gives us all the context we'll ever need, and since we don't really care about annotating context, or at least I didn't, you can just skip it when you're looking at everything else. And since some diffs are 80 or 85 percent context lines, if you have a ton of single-line changes, that's not a bad optimization. But, most importantly, it's consistent with user expectations.
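The two pruning ideas above (drop files that are not in the initial diff, and drop context lines) could be sketched like this. The input shape is an assumption for illustration: each commit's changes arrive as a mapping from path to already-numbered lines of the form `(sign, from_line, to_line, text)`, and `wanted_paths` is assumed to already account for renames.

```python
def prune_commit_changes(commit_changes, wanted_paths):
    """Keep only changed lines in files that appear in the initial diff."""
    pruned = {}
    for path, lines in commit_changes.items():
        if path not in wanted_paths:
            continue  # file is not in the diff: throw the whole thing out
        # Context is skipped; the initial diff already carries all of it.
        changed = [line for line in lines if line[0] in ("+", "-")]
        if changed:
            pruned[path] = changed
    return pruned
```

Because the surviving lines keep their line numbers, skipping context does not lose the positional information needed for matching. Changes within a wanted file that fall outside the diff still shift line numbers, so they cannot simply be discarded at this stage.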

A

It generates the diff that the user expects to see. So you can do things like generate a diff and then say, "Oh, I want to annotate that," and then get the same diff, annotated. Way more useful than what I was doing. So that turns out to be a very important step not to skip. And then let's look at the real performance killer, the double DAG walk. So, of course, any real implementation would not do a double DAG walk.

A

It would use a cache generated on the first walk, and it turns out that cache is actually really efficient. It only needs to store line numbers, and since we're only looking at removed lines at that point, we only need line numbers for removed lines. And we don't need them per line; we can just store ranges of lines. So that's a really efficient cache, and we know that we'll never need line contents, because we already have all the line contents we would ever care about in that initial diff that we generated.
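A cache of that shape can be very small: per commit and file, just ranges of removed line numbers. The talk says this approach was not yet implemented, so the following is only a sketch of the data structure, with made-up names and the assumption that removals are recorded as `(commit, path, line_no)` triples during the first walk.

```python
from collections import defaultdict


def to_ranges(line_numbers):
    """Collapse line numbers into sorted, inclusive (first, last) ranges."""
    ranges = []
    for number in sorted(set(line_numbers)):
        if ranges and ranges[-1][1] + 1 == number:
            ranges[-1] = (ranges[-1][0], number)  # extend the current range
        else:
            ranges.append((number, number))       # start a new range
    return ranges


def build_removed_cache(removed):
    """Group removed line numbers by (commit, path) and store them as ranges."""
    grouped = defaultdict(list)
    for commit, path, line_no in removed:
        grouped[(commit, path)].append(line_no)
    return {key: to_ranges(numbers) for key, numbers in grouped.items()}
```

A block of a thousand consecutive removed lines then costs one tuple instead of a thousand entries, and no line contents are stored at all.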

A

So this turns out to be kind of a stupid thing to skip. It's a very important step, and this isn't so much the slow way; it's probably the fast way. I say probably because I haven't gotten around to implementing it yet. I don't know how I can screw this one up yet, but it seems like the correct thing to do. And so hopefully I've convinced you that annotated diffs are interesting and useful and can be generated efficiently, even if I have yet to do so myself. So thank you.