2013/4/27 Erik Erlandson
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance is a function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a better choice. For some other similarity metrics whose result is a real number in [0,1], I will choose double as the return type.
(4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func, delete_cost_func, substitute_cost_func); // use three cost functions to calculate the cost for one insert/delete/substitute operation.
For this last one, I would want the return type for distance to be governed by the return type of the cost functionals. So, if insert_cost_func and friends return float, then the edit distance would also return float.
Even more generally, I might want the final return type for all edit_distance variants to be selectable as a template parameter (and perhaps influence the internal types used to compute it). I'm open to ideas here.
I agree to modify the return type for function (4). So now I think your
suggestion for using a class to supply these 3 functions together is a better choice. In case (1)(2)(3), I think return size_type seems to be better since the algorithm itself can only output an integer as the result?
(5) some functions to return the sequence of operations?
I think that since you are focusing on supporting the search and join algorithms, distance variants that can return the sequence of edit-ops ought to be low priority. They could be added in the future, without impacting the library design.
Part B. string similarity search and join algorithms, focus on edit
distance
class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be the same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether I can support one of the standard container interfaces. Probably I can. With the candidate strings are inserted and removed, I should update the index for them.
I noticed you were considering only supporting one category of sequence
distance function (edit_distance) here, but if you make the distance-function into a template parameter, you might be able to use it with all your planned distance functions "for free"
Good suggestion. But it seems to be hard. Maybe a high performance algorithm can only support a specific metric(I will not implement the brute-force method here). Since the edit distance is the most useful one, I'd like to write an algorithm for it. Thanks a lot!