Write your own diff for fun

December 27, 2017

On the other day, when I was looking at git diff, I thought “How does it work?“. Brute-force idea of comparing all possible pairs of lines doesn’t seem efficient and indeed it has exponential algorithmic complexity. There must be a better way, right?

As it turned out, git diff, like a usual diff tool is modeled as a solution to a problem called Longest Common Subsequence. The idea is really ingenious – when we try to diff 2 files we see it as 2 sequences of lines and try to find a Longest Common Subsequence. Then anything that is not in that subsequence is our diff. Sounds neat, but how can one implement it in an effective way (without that exponential complexity)?

LCS problem is a classic problem that is better solved with dynamic programming – somewhat advanced technique in algorithm design that roughly means an iteration with memoization.

I’ve always struggled with dynamic programming because it’s mostly presented through some (in my opinion) artificial problem that is hard for me to work on. But now, when I see something so useful that can help me write a diff, I just can’t resist.

I used a Wikipedia article on LCS as my guide, so if you want to check the algorithm nitty-gritty, go ahead to the link. I’m going to show you my implementation (that is, of course, available on GitHub) to demonstrate how easily you can solve such seemingly hard problem.

I’ve chosen Python to implement it and immediately felt grateful because you can copy-paste pseudocode and use it with minimal changes. Here is the diff printing function from Wikipedia article in pseudocode:

function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i > 0 and j > 0 and X[i] = Y[j]
        printDiff(C, X, Y, i-1, j-1)
        print "  " + X[i]
    else if j > 0 and (i = 0 or C[i,j-1] ≥ C[i-1,j])
        printDiff(C, X, Y, i, j-1)
        print "+ " + Y[j]
    else if i > 0 and (j = 0 or C[i,j-1] < C[i-1,j])
        printDiff(C, X, Y, i-1, j)
        print "- " + X[i]
    else
        print ""

And in Python:

def print_diff(c, x, y, i, j):
    """Print the diff using LCS length matrix by backtracking it"""

    if i >= 0 and j >= 0 and x[i] == y[j]:
        print_diff(c, x, y, i-1, j-1)
        print("  " + x[i])
    elif j >= 0 and (i == 0 or c[i][j-1] >= c[i-1][j]):
        print_diff(c, x, y, i, j-1)
        print("+ " + y[j])
    elif i >= 0 and (j == 0 or c[i][j-1] < c[i-1][j]):
        print_diff(c, x, y, i-1, j)
        print("- " + x[i])
    else:
        print("")

This is not the actual function for my diff printing because it doesn’t handle few corner cases – it’s just to illustrate Python awesomeness.

The essence of diffing is building the matrix C which contains lengths for all subsequences. Building it may seem daunting until you start looking at the simple cases:

  • LCS of “A” and “A” is “A”.
  • LCS of “AA” and “AB” is “A”.
  • LCS of “AAA” and “ABA” is “AA”.

Building iteratively we can define the LCS function:

  • LCS of 2 empty sequences is the empty sequence.
  • LCS of “${prefix1}A” and “${prefix2}A” is LCS(${prefix1}, ${prefix2}) + A
  • LCS of “${prefix1}A” and “${prefix2}B” is the longest of LCS(${prefix1}A, ${prefix2}) and LCS(${prefix1}, ${prefix2}B)

That’s basically the core of dynamic programming – building the solution iteratively starting from the simple base cases. Note, though, that it’s working only when the problem has so-called “optimal” structure, meaning that it can be built by reusing previous memoized steps.

Here is the Python function that builds that length matrix for all subsequences:

def lcslen(x, y):
    """Build a matrix of LCS length.

    This matrix will be used later to backtrack the real LCS.
    """

    # This is our matrix comprised of list of lists.
    # We allocate extra row and column with zeroes for the base case of empty
    # sequence. Extra row and column is appended to the end and exploit
    # Python's ability of negative indices: x[-1] is the last elem.
    c = [[0 for _ in range(len(y) + 1)] for _ in range(len(x) + 1)]

    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if xi == yj:
                c[i][j] = 1 + c[i-1][j-1]
            else:
                c[i][j] = max(c[i][j-1], c[i-1][j])
    return c

Having the matrix of LCS lengths we can now build the actual LCS by backtracking it.

def backtrack(c, x, y, i, j):
    """Backtrack the LCS length matrix to get the actual LCS"""

    if i == -1 or j == -1:
        return ""
    elif x[i] == y[j]:
        return backtrack(c, x, y, i-1, j-1) + x[i]
    else:
        if c[i][j-1] > c[i-1][j]:
            return backtrack(c, x, y, i, j-1)
        else:
            return backtrack(c, x, y, i-1, j)

But for diff we don’t need the actual LCS, we need the opposite. So diff printing is actually slightly changed backtrack function with 2 additional cases for changes in the head of sequence:

def print_diff(c, x, y, i, j):
    """Print the diff using LCS length matrix by backtracking it"""

    if i < 0 and j < 0:
        return ""
    elif i < 0:
        print_diff(c, x, y, i, j-1)
        print("+ " + y[j])
    elif j < 0:
        print_diff(c, x, y, i-1, j)
        print("- " + x[i])
    elif x[i] == y[j]:
        print_diff(c, x, y, i-1, j-1)
        print("  " + x[i])
    elif c[i][j-1] >= c[i-1][j]:
        print_diff(c, x, y, i, j-1)
        print("+ " + y[j])
    elif c[i][j-1] < c[i-1][j]:
        print_diff(c, x, y, i-1, j)
        print("- " + x[i])

To invoke it we read input files into Python lists of strings and pass it to our diff functions. We also add some usual Python stanza:

def diff(x, y):
    c = lcslen(x, y)
    return print_diff(c, x, y, len(x)-1, len(y)-1)

def usage():
    print("Usage: {} <file1> <file2>".format(sys.argv[0]))

def main():
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    with open(sys.argv[1], 'r') as f1, open(sys.argv[2], 'r') as f2:
        diff(f1.readlines(), f2.readlines())

if __name__ == '__main__':
    main()

And there you go:

$ python3 diff.py f1 f2
+ """Simple diff based on LCS solution"""
+ 
+ import sys
  from lcs import lcslen
  
  def print_diff(c, x, y, i, j):
+     """Print the diff using LCS length matrix by backtracking it"""
+ 
       if i >= 0 and j >= 0 and x[i] == y[j]:
           print_diff(c, x, y, i-1, j-1)
           print("  " + x[i])
       elif j >= 0 and (i == 0 or c[i][j-1] >= c[i-1][j]):
           print_diff(c, x, y, i, j-1)
-          print("+ " +  y[j])
+          print("+ " + y[j])
       elif i >= 0 and (j == 0 or c[i][j-1] < c[i-1][j]):
           print_diff(c, x, y, i-1, j)
           print("- " + x[i])
       else:
-          print("")
- 
+         print("")  # pass?

You can check out the full source code at https://github.com/alexdzyoba/diff.

That’s it. Until next time!