Re: [dev] [sbase] diff from Markus Wichmann on 2016-01-28 (dev mail list archive)

From: Markus Wichmann <nullplan_AT_gmx.net>
Date: Thu, 28 Jan 2016 16:48:22 +0100

On Wed, Jan 27, 2016 at 11:22:55PM +0100, Mattias Andrée wrote:
> Perhaps I should describe how the program works
> (although it is very simple.) The documents are
> compared just like of they were words, but with
> lines rather than characters, and only add and
> remove are valid changes, not substitution. This
> is implemented using dynamic programming. It
> procedure fills in two matrices. One that describes
> how many changes have to be made to get from the
> old document to the new document. In the other it
> is registered changes have to be made. This can be
> extrapolated from the first matrix, but to keep it
> simple it is written to the another matrix.
> For any index-pair (i, j) in the matrices, the
> corresponding element describes the changes to get
> from the i first lines of old document to the j first
> lines of the new document. The next part of the
> program is just print the differences. Any time an
> edit is about to be printed, it is cached, then when
> all directly subsequent edits have been cached, it
> prints the line removals followed by the line
> additions. This is to make the output more readable.
>

Some comments on this:

First, we allow C99 extensions, right? C99 has not only variadic arrays,
but pointers to them as well, so your matrix could be declared and
allocated as

size_t (*matrix)[a + 2][b + 1] = malloc(sizeof(size_t[a + 2][b + 1]));

Then access array elements as (*matrix)[i][j] and you're done. This
saves you the trouble of allocating the pointer block, or allocating a
single-dimensional array with enough elements and doing the arithmetic
by hand.

Second, wikipedia describes this algorithm to initialize the topmost
line and the leftmost column with zero, not with their indices.

Third, this algorithm needs quadratic space, i.e. on a 32-bit machine,
it is incapable of comparing two 200,000 line files (it would need
40,000,000,000 machine words for that, and that would just die). So
instead I'm going to explain what busybox diff does:

The two files to be compared are loaded into memory, whereby each line
is annotated with its line number (actually, that's a lie, busybox only
saves a hash of the line and really, really hopes, there are no
collisions, but hush). Then both files are sorted by line content first
and line number second. Then a mapping is calculated, that maps each
line of the old file to its corresponding line in the new one, if any.
This is done by merging the two files, i. e. (pseudocode)

memset(map, -1, sizeof map)
new file pointer = first new line
for every old line:
    while (*new file pointer < old line)
        new file pointer ++
    if (*new file pointer == old line)
        map[old line number] = new line number

Afterwards the files are reread (to undo any transformation that might
have happened to the lines. E.g. if "ignore whitespace" was requested,
the whitespace was actually filtered out back in step 1) and then
basically the mapping is printed: If the current old line has no
mapping, or maps to a line before the current new line, it was deleted,
if it maps to the current new line it stayed the same, and if it maps to
a new line after the current one, all new lines between the current one
and the mapped one were added.

I can't prove that this algorithm will conclusively always calculate the
longest common subsequence, but as long as the common subsequence
arrived at in this way is at least up there with the longest, the number
of useless changes (where a line was removed and then added verbatim)
should be pretty low, so this would only be a minor problem over all.
But that algorithm only requires the memory to hold both files with line
numbers, and the map.

Comments?

Ciao,
Markus
Received on Thu Jan 28 2016 - 16:48:22 CET

This archive was generated by hypermail 2.3.0 : Thu Jan 28 2016 - 17:00:15 CET