Re: [boost] [gsoc 2013] Approximate string matching

16 Apr 2013


      Dne 15.4.2013 15:11, Adam D. Walling napsal(a):
...
Jan Strnad <hanny.strnad <at> gmail.com> writes:
...
Dne 12.4.2013 19:50, Michael Marcin napsal(a):
...
On 4/12/13 4:52 AM, Jan Strnad wrote:
...
...
...
...
Now more specifically. I would like to implement approximate string
matching algorithm(s) based on the NFA. I would like to support various
approximate distances, such as Hamming distance [1], Levenstein
distance
[2], Damerau-Levenstein distance [3], Delta distance, Gamma distance
and
finally (delta, gamma) distance.
...
Sounds very useful!
I guess this would fit in the string algorithm library?
That seems to me reasonable as well.
...
What would the interface look like?
Is it as simple as:
namespace boost {
namespace algorithm {
template< typename Range1T, typename Range2T >
std::size_t levenstein_distance( Range1T first, Range2T second );
}
}
Sure, that would be possible, but I was originally aiming at something
little bit different. Basically, I want to design an algorithm that
would find all (or just the first) occurrences of a given pattern with
given distance in some string.
So to your question, simple interface could look like this:
namespace boost
{
    namespace algorithm
    {
    	template <typename T>
    	class approximate_distance;
template <typename TRange>  	
      find_iterator<TRange> approximate_find
  (TRange text,
           approximate_distance<TRange>& model);
    }
}
...
I would recommend starting off with the basic implementations of these
distance algorithms if possible. There are several situations where I
currently use my own Levenshtein distance implementation, and could use
other alternative 'distance' measures, but the approximate_find is useless
here.
Of course it could be useful in other situations, but I would suggest
exposing the underlying distance algorithms first, so they can then be
adapted into further utilities like approximate_find or used directly as
necessary.
-Adam D. Walling
OK, I should definitely include these "word" distance functions as well.

Jan Strnad