----- Original Message -----
Hi Erik,
Erik Erlandson wrote:
Boost Community,
I cooked up a library that provides a pair of functions: edit_distance() and edit_alignment(). The library currently lives here: https://github.com/erikerlandson/edit_distance
I have had a quick look. It seems that you've implemented the Needleman-Wunsch algorithm, which takes quadratic time and space. There are less complex algorithms, at least for similar input sequences, e.g. Myers - "An O(ND) Difference Algorithm and Its Variations"; I have an implementation of that here: http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh.
Than you for the Myers paper! I've been collecting some thoughts on this topic. I would certainly want Myers' algorithm (or something like it) in a Boost sequence alignment library for its performance on large sequence data (previously I had been thinking in terms of Hirschberg, which is less efficient). I do also wish to provide an algorithm that supports substitution as an edit operation, and configurable cost functions whose output may vary with the sequence elements. Replacing Needleman-Wunsch with a Dijkstra/SSSP implementation under the hood seems like a good idea. I am grappling with how best to represent the returned "edit script". Considering the Myers paper, some kind of run-compression seems useful. The information required depends on what operation. To represent a run of 'equal elements', you need begin/length for sequence 1 and sequence 2. Substitution is similar. For insertion and deletion, you only need begin/length for one sequence. One might use boost::any to address the differing information, although it seems to involve overhead. Not sure if there is a clear boost-style answer that stands out. In my draft, I also experimented with allowing selection of varying edit-script information via function adaptors. The useful options there differ between Myers, where edit costs are constant, and my desired variant that allows non-constant edit costs. The minimal edit script output in any case is a pair (op-code, run-length). If one wants additional location information, then the op-dependent variation appears, adding begin-index for either one or two sequences. In the case of Myers, edit cost information is not relevant. I see no real problem with supporting different options for different algorithms, as long as the interface is as consistent as possible. I still like this as a feature, although with run compression one might argue that the benefits decrease.