Re: [boost] [algorithm][string] Edit distance algorithm

15 May 2007

JD wrote:
...
...
I have an implementation of the "diff" (i.e. edit script) algorithm here:
http://svn.anyterm.org/anyterm/trunk/common/diff.hh
http://svn.anyterm.org/anyterm/trunk/common/diff.cc
The "edit distance" is in there somewhere.
I would welcome this sort of thing in Boost.  But it needs to be the 
best-known algorithm.  IIRC, there was one improvement that I found in 
the literature but didn't implement because I didn't fully understand it.
Please correct me if I'm wrong, but I get the impression that your code 
is O(NM), where N and M are the sizes of the two input strings.  The 
algorithm that I implemented is O((N+M)D) in the worst case, where D is 
the edit distance, and typically O(N+M+D^2).  The algorithm is 
described in the .cc file linked above.  Also, your code seems to have 
an O(NM) space requirement; I'm not sure what the lower bound is, but I 
think it's probably O(N+M).
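
For illustration, one well-known algorithm with these bounds is Myers' greedy search for the shortest edit script. Whether diff.cc implements exactly this is an assumption; the function name below is hypothetical, and this sketch counts insertions and deletions only (a substitution costs 2).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: length D of the shortest insert/delete edit script from a to b.
// Hypothetical helper, not taken from diff.cc.
std::size_t shortest_edit_script_length(const std::string& a, const std::string& b)
{
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int max_d = n + m;
    // v[k + offset] is the furthest x reached so far on diagonal k = x - y.
    const int offset = max_d + 1;
    std::vector<int> v(2 * max_d + 3, 0);

    for (int d = 0; d <= max_d; ++d) {
        for (int k = -d; k <= d; k += 2) {
            int x;
            if (k == -d || (k != d && v[k - 1 + offset] < v[k + 1 + offset]))
                x = v[k + 1 + offset];      // move down: take a character from b
            else
                x = v[k - 1 + offset] + 1;  // move right: drop a character from a
            int y = x - k;
            // Follow the "snake" of matching characters at no cost.
            while (x < n && y < m && a[x] == b[y]) { ++x; ++y; }
            v[k + offset] = x;
            if (x >= n && y >= m)
                return static_cast<std::size_t>(d);
        }
    }
    return static_cast<std::size_t>(max_d);  // not reached: d = n + m always succeeds
}
```

The outer loop runs D+1 times and the inner loop visits at most 2D+1 diagonals, which is why the work grows with D rather than with N*M when the inputs are similar. The array v uses O(N+M) space.
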
I improved the algorithm and now the space requirement falls to O(m). 
The computational cost is still the same, however. Having had a very 
quick look at your algorithm (in diff.cc), I can see it's more complex, 
to the point that it is not understandable at first read.
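
As a minimal sketch of the kind of space reduction described here (JD's actual code is not shown in this thread, so the function name and details below are assumptions), the classic dynamic-programming edit distance can keep a single row of the table instead of the whole N x M matrix:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Sketch only: Levenshtein distance (insert, delete, substitute, each cost 1)
// in O(N*M) time and O(min(N, M)) space. Hypothetical helper name.
std::size_t levenshtein_distance(const std::string& s, const std::string& t)
{
    // Put the shorter string along the row so the row is as small as possible.
    const std::string& longer  = s.size() >= t.size() ? s : t;
    const std::string& shorter = s.size() >= t.size() ? t : s;

    std::vector<std::size_t> row(shorter.size() + 1);
    std::iota(row.begin(), row.end(), std::size_t{0});  // distance from empty prefix

    for (std::size_t i = 1; i <= longer.size(); ++i) {
        std::size_t diag = row[0];  // cell (i-1, j-1) of the full table
        row[0] = i;
        for (std::size_t j = 1; j <= shorter.size(); ++j) {
            const std::size_t above = row[j];  // cell (i-1, j)
            const std::size_t cost = (longer[i - 1] == shorter[j - 1]) ? 0 : 1;
            row[j] = std::min({above + 1, row[j - 1] + 1, diag + cost});
            diag = above;
        }
    }
    return row.back();
}
```

Only the distance survives this reduction; recovering the edit script itself needs extra bookkeeping.
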
Indeed it's quite complex, but in the comments I refer to a couple of 
papers which I suggest that you read.  The operation is much easier to 
understand if you can visualise it, and my attempts at ASCII-art 
diagrams are not as good as the ones in those papers.

There is probably other material in the literature; does Knuth have a 
chapter on this subject?
...
As I can see in the Boost Guideline 
(http://www.boost.org/more/lib_guide.htm#Guidelines):
"Aim first for clarity and correctness; optimization should be only a 
secondary concern in most Boost libraries."
So unless there is an obvious way to reduce the computational cost 
to O(m) (in the same way that it was obvious how to reduce the space 
usage to O(m)), I think we should stick to the original algorithm.
I would disagree with that view.  Let's quantify how much the 
performance of the algorithms differs.  My application was to compare 
the contents of a 25x80-character terminal window before and after the 
user has changed something; typically they have just typed one or two 
characters.  So N = M = 25x80 = 2000, and D is typically 2.  Your O(NM) 
algorithm takes about 4 000 000 steps, while mine typically takes 
O(N+M+D^2) = 4004 steps, i.e. about 1000 times faster, and in the worst 
case O((N+M)D) = 8000 steps, i.e. 500 times faster.
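
The arithmetic above can be checked at compile time. This is only a sketch: the function names are hypothetical, and the figures are step counts taken straight from the big-O expressions with constant factors ignored, so real timings would differ.

```cpp
#include <cstdint>

// Step-count estimates with constant factors ignored (hypothetical helpers).
constexpr std::uint64_t quadratic_steps(std::uint64_t n, std::uint64_t m)
{
    return n * m;                      // O(NM)
}
constexpr std::uint64_t typical_steps(std::uint64_t n, std::uint64_t m, std::uint64_t d)
{
    return n + m + d * d;              // O(N+M+D^2)
}
constexpr std::uint64_t worst_case_steps(std::uint64_t n, std::uint64_t m, std::uint64_t d)
{
    return (n + m) * d;                // O((N+M)D)
}

// N = M = 25 * 80 = 2000 characters, D = 2 edits.
static_assert(quadratic_steps(2000, 2000) == 4000000, "O(NM) estimate");
static_assert(typical_steps(2000, 2000, 2) == 4004, "typical estimate");
static_assert(worst_case_steps(2000, 2000, 2) == 8000, "worst-case estimate");
static_assert(quadratic_steps(2000, 2000) / worst_case_steps(2000, 2000, 2) == 500,
              "worst-case ratio");
```
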

I am not sure what the author of the guideline that you have quoted had 
in mind, but I would have thought that they were not thinking of 
big-O complexity issues.  My personal approach to optimisation has 
always been to make sure that the big-O complexity is right; it is then 
rarely necessary to consider optimising anything else.

Regards,

Phil.

Phil Endecott