Re: [boost] [algorithm] Review Request: edit_distance

1 Feb 2014

      ----- Original Message -----
...
- It would be great to see a benchmark against GNU diff!
So I cooked up an implementation of unix diff:
https://github.com/erikerlandson/algorithm/blob/edit_distance/sequence/examp...

It produces standard default unix diff output:
http://en.wikipedia.org/wiki/Diff#Usage

That is to say, edit_distance_diff_example will reproduce the output of diff, at least when both programs are given just two files as arguments:

$ edit_distance_diff_example foo.txt bar.txt > ed.out
$ diff foo.txt bar.txt > diff.out
$ diff ed.out diff.out
$
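
For anyone who doesn't have the "normal" diff format in their head, here is a minimal, self-contained sketch of how those hunks (NaM, N,MdL, N,McL,R) can be emitted. To be clear, this is not the code from edit_distance_diff_example and it does not use the edit_distance API: it swaps in a plain quadratic LCS table to find the matching lines, so it is only suitable for small inputs, and where several minimal edit scripts exist it may pick different hunks than GNU diff does. The names normal_diff and range are made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Format a 1-based inclusive line range the way diff does: "3" or "3,5".
static std::string range(std::size_t first, std::size_t last) {
    std::ostringstream os;
    os << first;
    if (last > first) os << ',' << last;
    return os.str();
}

// Hypothetical helper: return the "normal" format diff of a against b.
// Uses a simple O(n*m) LCS table, not the algorithm from the library.
std::string normal_diff(const std::vector<std::string>& a,
                        const std::vector<std::string>& b) {
    const std::size_t n = a.size(), m = b.size();

    // lcs[i][j] = length of the longest common subsequence of a[i..) and b[j..)
    std::vector<std::vector<std::size_t>> lcs(
        n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = m; j-- > 0;)
            lcs[i][j] = (a[i] == b[j])
                            ? lcs[i + 1][j + 1] + 1
                            : std::max(lcs[i + 1][j], lcs[i][j + 1]);

    std::ostringstream out;
    std::size_t i = 0, j = 0;
    while (i < n || j < m) {
        // Matching lines are not printed in normal format; just skip them.
        if (i < n && j < m && a[i] == b[j]) { ++i; ++j; continue; }

        // Start of a hunk: consume deleted/added lines until the next match.
        const std::size_t i0 = i, j0 = j;
        while ((i < n || j < m) && !(i < n && j < m && a[i] == b[j])) {
            if (j == m || (i < n && lcs[i + 1][j] >= lcs[i][j + 1]))
                ++i;  // line a[i] was deleted
            else
                ++j;  // line b[j] was added
        }

        const bool deleted = i > i0, added = j > j0;
        if (deleted && added)
            out << range(i0 + 1, i) << 'c' << range(j0 + 1, j) << '\n';
        else if (deleted)
            out << range(i0 + 1, i) << 'd' << j0 << '\n';
        else
            out << i0 << 'a' << range(j0 + 1, j) << '\n';

        for (std::size_t k = i0; k < i; ++k) out << "< " << a[k] << '\n';
        if (deleted && added) out << "---\n";
        for (std::size_t k = j0; k < j; ++k) out << "> " << b[k] << '\n';
    }
    return out.str();
}
```

To try it on files, read each file into a std::vector<std::string> with std::getline and print the returned string. The sketch does not handle the "\ No newline at end of file" marker that diff emits.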

I generated a few benchmarking data sets by editing /usr/share/dict/words. On my machine (F18), the 'words' file has 479,828 lines (one word per line) and is 4,953,680 bytes, so it's a decent-sized file in both line count and bytes. I ran a couple of different experiments: in one I compared against a variation that differed by around 150 lines, and in another against a variation that differed by around 1200 lines. I also varied the input size by simply duplicating the contents, so the result was twice as long in both lines and bytes.

Across the variations of input size and file difference that I tested, edit_distance_diff_example consistently finished in 40% to 60% less time than unix diff. So, roughly speaking, it's performing about twice as fast.

And it's less than 150 lines of code!


Erik Erlandson