Re: [boost] [gsoc-2013] proposal for approximate string matching

27 Apr 2013

      2013/4/27 Erik Erlandson <eerlands@redhat.com>
...
...
Here is a prototype for the interface:
Part A. basic similarity metrics (use edit distance as an example)
(1) int edit_distance(const Range1T&, const Range2T&); // return edit
distance.
(2) int edit_distance(const Range1T&, const Range2T&, int threshold);
// return edit distance if it is no larger than threshold, else
return -1 or threshold + 1.
(3) int edit_distance(const Range1T&, const Range2T&, equal_func);
// use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned
integer.  The ideal type might be the size_type of implied sequence
containers, if it can be acquired, since the maximum possible distance is a
function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a
better choice. For some other
similarity metrics whose result is a real number in [0,1], I will choose
double as the return type.
...
...
(4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func,
delete_cost_func, substitute_cost_func); // use three cost functions to
calculate the cost for one insert/delete/substitute operation.
For this last one, I would want the return type for distance to be
governed by the return type of the cost functionals.  So, if
insert_cost_func and friends return float, then the edit distance would
also return float.
Even more generally, I might want the final return type for all
edit_distance variants to be selectable as a template parameter (and
perhaps influence the internal types used to compute it).  I'm open to
ideas here.
I agree to modify the return type for function (4). So now I think your
suggestion for using a class to supply
these 3 functions together is a better choice. In case (1)(2)(3), I think
return size_type seems to be better since
the algorithm itself can only output an integer as the result?
...
(5) some functions to return the sequence of operations?
I think that since you are focusing on supporting the search and join
algorithms, distance variants that can return the sequence of edit-ops
ought to be low priority.  They could be added in the future, without
impacting the library design.
...
Part B. string similarity search and join algorithms, focus on edit
distance
...
class searcher
{
  void insert(IDType id, const string &word); // insert a word to the
index,
id is used as an identifier of this word.
  void erase(IDType id); // erase a word from the index, id should be the
same one used in insertion process.
  vector<Result> search(const string &query, int threshold);
// performing similarity search, which will return all the strings which
are similar to the query string. Result may be a pair of the id and real
edit distance.
};
This searcher class has the feel of some kind of container class, and
maybe ought to be designed with an eye toward satisfying one of the
standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether
I can support one of the standard container
interfaces. Probably I can. With the candidate strings are inserted and
removed, I should update the index for them.
...
I noticed you were considering only supporting one category of sequence
...
distance function (edit_distance) here, but if you make the
distance-function into a template parameter, you might be able to use it
with all your planned distance functions "for free"
Good suggestion. But it seems to be hard. Maybe a high performance
algorithm can only support a specific
metric(I will not implement the brute-force method here). Since the edit
distance is the most useful one, I'd like to write an algorithm for it.

Thanks a lot!

Re: [boost] [gsoc-2013] proposal for approximate string matching

Yu Jiang