Previous Entry | Next Entry

Patience Diff, a brief summary

World Domination

Last night I went to a coding party at a friend's house. I was on my laptop, and so didn't have any of my regular projects with me, so instead I spent some time figuring out some code for another person at the party.

The code in question was the implementation of Patience Diff, a diffing algorithm written by Bram Cohen of Bittorrent fame, and used in (at least) the Bazaar version control system.

The common diff algorithm is based on the longest common subsequence problem. Given (in this case) two documents, finding all lines that occur in both, in the same order. That is, making a third document such that every line in the document appears in both of the original documents, and in the same order. Once you have the longest common subsequence, all that remains is to describe the differences between each document and the common document, a much easier problem since the common document is a subset of the other documents.

While the diffs generated by this method are efficient, they tend not to be as human readable.

Patience Diff also relies on the longest common subsequence problem, but takes a different approach. First, it only considers lines that are (a) common to both files, and (b) appear only once in each file. This means that most lines containing a single brace or a new line are ignored, but distinctive lines like a function declaration are retained. Computing the longest common subsequence of the unique elements of both documents leads to a skeleton of common points that almost definitely correspond to each other. The algorithm then sweeps up all contiguous blocks of common lines found in this way, and recurses on those parts that were left out, in the hopes that in this smaller context, some of the lines that were ignored earlier for being non-unique are found to be unique. Once this process is finished, we are left with a common subsequence that more closely corresponds to what humans would identify.

00 #include <stdio.h>
01
02 // Frobs foo heartily
03 int frobnitz(int foo)
04 {
05     int i;
06     for(i = 0; i < 10; i++)
07     {
08         printf("Your answer is: ");
09         printf("%d\n", foo);
10     }
11 }
12
13 int fact(int n)
14 {
15     if(n > 1)
16     {
17         return fact(n-1) * n;
18     }
19     return 1;
20 }
21
22 int main(int argc, char **argv)
23 {
24     frobnitz(fact(10));
25 }
00 #include <stdio.h>
01
02 int fib(int n)
03 {
04     if(n > 2)
05     {
06         return fib(n-1) + fib(n-2);
07     }
08     return 1;
09 }
10
11 // Frobs foo heartily
12 int frobnitz(int foo)
13 {
14     int i;
15     for(i = 0; i < 10; i++)
16     {
17         printf("%d\n", foo);
18     }
19 }
20
21 int main(int argc, char **argv)
22 {
23     frobnitz(fib(10));
24 }
Find all unique, common lines.
00 #include <stdio.h>
01
02 // Frobs foo heartily
03 int frobnitz(int foo)
04 {
05     int i;
06     for(i = 0; i < 10; i++)
07     {
08         printf("Your answer is: ");
09         printf("%d\n", foo);
10     }
11 }
12
13 int fact(int n)
14 {
15     if(n > 1)
16     {
17         return fact(n-1) * n;
18     }
19     return 1;
20 }
21
22 int main(int argc, char **argv)
23 {
24     frobnitz(fact(10));
25 }
00 #include <stdio.h>
01
02 int fib(int n)
03 {
04     if(n > 2)
05     {
06         return fib(n-1) + fib(n-2);
07     }
08     return 1;
09 }
10
11 // Frobs foo heartily
12 int frobnitz(int foo)
13 {
14     int i;
15     for(i = 0; i < 10; i++)
16     {
17         printf("%d\n", foo);
18     }
19 }
20
21 int main(int argc, char **argv)
22 {
23     frobnitz(fib(10));
24 }
Take all common, unique lines from the first document and order according to appearance in the second document
00 #include <stdio.h>
19     return 1;
02 // Frobs foo heartily
03 int frobnitz(int foo)
05     int i;
06     for(i = 0; i < 10; i++)
09         printf("%d\n", foo);
22 int main(int argc, char **argv)

We then perform the Patience algorithm to find the longest common subsequence. Because for the above example, the LCS is quite obvious (everything but line 19), I'll use a randomly generated sequence as an example. In the example, consider the number on the card to be the line number in the first document, and its order in the stack to correspond to the order of the lines in the second document.


The algorithm builds a set of piles, with each card being placed on the left-most pile whose top card is greater than it. When a card is placed on a pile, a backreference is made to the top-most card on the previous pile, if any. At the end of the algorithm, we work backwards from the top-most card on the right-most stack, thus building a longest increasing subsequence. As can be seen in the example, it may not be the only increasing subsequence of that length, but there are none longer.

Now that we have our longest increasing common subsequence of unique lines, we can recurse on those ranges that were excluded earlier. When we are finished, we will have a list of corresponding lines between the new functions, from which a diff can be computed using familiar methods. The Bazaar implementation creates a SequenceMatcher for use with Python's difflib.

Tags:

Comments

( 6 comments — Leave a comment )
sbranzei
Feb. 20th, 2008 10:02 pm (UTC)
This is awesome :)
You should write more often :-)
bramcohen
Feb. 25th, 2008 06:38 pm (UTC)
That's a good summary, although you skipped the final step, where it matches lines at the beginning and end, which is meant to match, for example, the first five lines of a diff between five blank lines and six blank lines.
(Anonymous)
Feb. 10th, 2009 08:55 am (UTC)
this really helps with resolving conflicts!

i can imagine what the effect is, but perhaps a nice standard-diff vs patience-diff at the end of this article would have been cool!

i really hated the standard diff when i had to resolve multiple function renaming and moved conflicts of similiar functions!
i wish i had that patience-diff back then ... (this unique-common-line thing is really a great idea!)

(Anonymous)
Aug. 24th, 2010 04:11 pm (UTC)
what if matching the following?
file a

aaaaaa
aaaaaa
bbbbbb
bbbbbb
cccccc
cccccc
abc

file b

abc
aaaaaa
aaaaaa
bbbbbb
bbbbbb
cccccc
cccccc


would the matching result be

-aaaaaa
-aaaaaa
-bbbbbb
-bbbbbb
-cccccc
-cccccc
abc
+aaaaaa
+aaaaaa
+bbbbbb
+bbbbbb
+cccccc
+cccccc
M. Emin Akşehirli
Jun. 22nd, 2011 09:02 am (UTC)
Re: what if matching the following?
Yes it is.

git diff --patience HEAD HEAD^
diff --git a/a.txt b/a.txt
index accc3bd..3299d68 100644
--- a/a.txt
+++ b/a.txt
@@ -1,7 +1,7 @@
+aaaaaa
+aaaaaa
+bbbbbb
+bbbbbb
+cccccc
+cccccc
abc
-aaaaaa
-aaaaaa
-bbbbbb
-bbbbbb
-cccccc
-cccccc
ironcream
Aug. 16th, 2011 05:38 pm (UTC)
Thanks for the explanation!
( 6 comments — Leave a comment )

Latest Month

December 2008
S M T W T F S
 123456
78910111213
14151617181920
21222324252627
28293031   
Powered by LiveJournal.com
Designed by Ideacodes