Last night I went to a coding party at a friend's house. I was on my laptop and didn't have any of my regular projects with me, so instead I spent some time figuring out some code for another person at the party.
The code in question was an implementation of Patience Diff, a diffing algorithm written by Bram Cohen of BitTorrent fame, and used in (at least) the Bazaar version control system.
The common diff algorithm is based on the longest common subsequence problem: given (in this case) two documents, find the longest sequence of lines that occurs in both, in the same order. That is, construct a third document such that every line in it appears in both of the original documents, in the same order. Once you have the longest common subsequence, all that remains is to describe the differences between each document and the common document, a much easier problem since the common document is a subsequence of each of the originals.
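To make this concrete, here is a minimal sketch of the classic dynamic-programming solution to the longest common subsequence problem (the function name `lcs` is mine, not from any diff implementation):

```python
def lcs(a, b):
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover one longest common subsequence.
    result = []
    i, j = len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

old = ["a", "b", "c", "d"]
new = ["a", "c", "b", "d"]
print(lcs(old, new))  # → ['a', 'b', 'd']
```

Note that the LCS need not be unique: here `['a', 'c', 'd']` is equally long, and the tie-breaking in the walk-back determines which one is reported.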
While the diffs generated by this method are minimal, they tend not to be very human-readable.
Patience Diff also relies on the longest common subsequence problem, but takes a different approach. First, it only considers lines that are (a) common to both files, and (b) appear only once in each file. This means that most lines containing a single brace or a newline are ignored, but distinctive lines like a function declaration are retained. Computing the longest common subsequence of the unique elements of both documents leads to a skeleton of common points that almost certainly correspond to each other. The algorithm then sweeps up all contiguous blocks of common lines found in this way, and recurses on the parts that were left out, in the hope that in this smaller context, some of the lines that were ignored earlier for being non-unique are found to be unique. Once this process is finished, we are left with a common subsequence that corresponds more closely to what humans would identify.
[Figure: Find all unique, common lines.]

[Figure: Take all common, unique lines from the first document and order them according to their appearance in the second document.]
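The two steps illustrated above can be sketched in a few lines of Python. This is my own illustration, not Bazaar's code; the helper name `unique_common_lines` is hypothetical:

```python
from collections import Counter

def unique_common_lines(a, b):
    # (a) keep only lines appearing exactly once in each document
    count_a = Counter(a)
    count_b = Counter(b)
    unique = {line for line in count_a
              if count_a[line] == 1 and count_b[line] == 1}
    # (b) pair each such line's position in the first document with its
    # position in the second, scanning the first document in order
    index_b = {line: j for j, line in enumerate(b)}
    return [(i, index_b[line]) for i, line in enumerate(a) if line in unique]

a = ["def f():", "    x = 1", "", "def g():", "    x = 1", ""]
b = ["def g():", "    y = 2", "", "def f():", "    x = 1", ""]
print(unique_common_lines(a, b))  # → [(0, 3), (3, 0)]
```

Only the two `def` lines survive the uniqueness filter; the braces-and-blank-line noise (here, the repeated `    x = 1` and empty lines) drops out, exactly as the prose describes.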
We then perform the patience sorting algorithm to find the longest increasing subsequence of first-document line numbers, which corresponds to the longest common subsequence of unique lines. Because the LCS for the above example is quite obvious (everything but line 19), I'll use a randomly generated sequence as an example. In the example, consider the number on a card to be the line number in the first document, and its position in the stack to correspond to the order of the lines in the second document.
The algorithm builds a set of piles, with each card being placed on the leftmost pile whose top card is greater than it. When a card is placed on a pile, a backreference is made to the topmost card on the previous pile, if any. At the end of the algorithm, we work backwards from the topmost card on the rightmost pile, thus building a longest increasing subsequence. As can be seen in the example, it may not be the only increasing subsequence of that length, but there are none longer.
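The pile-building and walk-back steps can be sketched as follows (my own sketch, assuming the card values are distinct, as line numbers are):

```python
import bisect

def patience_lis(seq):
    pile_tops = []       # value of the top card on each pile
    pile_top_index = []  # index into seq of each pile's top card
    back = [None] * len(seq)
    for i, card in enumerate(seq):
        # leftmost pile whose top card is greater than this card
        # (for distinct values, bisect_left finds exactly that pile)
        p = bisect.bisect_left(pile_tops, card)
        if p > 0:
            # backreference to the top card of the previous pile
            back[i] = pile_top_index[p - 1]
        if p == len(pile_tops):
            pile_tops.append(card)
            pile_top_index.append(i)
        else:
            pile_tops[p] = card
            pile_top_index[p] = i
    # work backwards from the top card of the rightmost pile
    result, i = [], pile_top_index[-1]
    while i is not None:
        result.append(seq[i])
        i = back[i]
    return result[::-1]

print(patience_lis([2, 8, 4, 9, 5, 1, 7, 3]))  # → [2, 4, 5, 7]
```

The number of piles equals the length of the longest increasing subsequence, and with binary search for pile placement the whole thing runs in O(n log n).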
Now that we have our longest common subsequence of unique lines, we can recurse on the ranges that were excluded earlier. When we are finished, we will have a list of corresponding lines between the two documents, from which a diff can be computed using familiar methods. The Bazaar implementation creates a SequenceMatcher for use with Python's difflib.
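For a rough sense of that final stage, here is what difflib's edit-script step looks like in the standard library (this is plain `SequenceMatcher` usage, not Bazaar's subclass):

```python
import difflib

a = ["one", "two", "three"]
b = ["one", "2", "three"]
sm = difflib.SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, a[i1:i2], b[j1:j2])
```

Each opcode tags a pair of ranges as `equal`, `replace`, `insert`, or `delete`; once matching lines are known, emitting a human-readable diff from these opcodes is straightforward.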