Re: [boost] RFC: edit_distance / edit_alignment library

23 Jul 2013

      ----- Original Message -----
...
Hi Erik,
Erik Erlandson wrote:
...
Boost Community,
I cooked up a library that provides a pair of functions: edit_distance()
and edit_alignment().  The library currently lives here:
https://github.com/erikerlandson/edit_distance
I have had a quick look.  It seems that you've implemented the
Needleman-Wunsch
algorithm, which takes quadratic time and space.  There are less
complex algorithms,
at least for similar input sequences, e.g. Myers - "An O(ND) Difference
Algorithm
and Its Variations"; I have an implementation of that here:
http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh.
Than you for the Myers paper!   I've been collecting some thoughts on this topic.

I would certainly want Myers' algorithm (or something like it) in a Boost sequence alignment library for its performance on large sequence data (previously I had been thinking in terms of Hirschberg, which is less efficient).  I do also wish to provide an algorithm that supports substitution as an edit operation, and configurable cost functions whose output may vary with the sequence elements.  Replacing Needleman-Wunsch with a Dijkstra/SSSP implementation under the hood seems like a good idea.

I am grappling with how best to represent the returned "edit script".   Considering the Myers paper, some kind of run-compression seems useful.   The information required depends on what operation.  To represent a run of 'equal elements', you need begin/length for sequence 1 and sequence 2.   Substitution is similar.  For insertion and deletion, you only need begin/length for one sequence.  One might use boost::any to address the differing information, although it seems to involve overhead.  Not sure if there is a clear boost-style answer that stands out.

In my draft, I also experimented with allowing selection of varying edit-script information via function adaptors.   The useful options there differ between Myers, where edit costs are constant, and my desired variant that allows non-constant edit costs.

The minimal edit script output in any case is a pair (op-code, run-length).  If one wants additional location information, then the op-dependent variation appears, adding begin-index for either one or two sequences.   In the case of Myers, edit cost information is not relevant.   I see no real problem with supporting different options for different algorithms, as long as the interface is as consistent as possible.   I still like this as a feature, although with run compression one might argue that the benefits decrease.