[gsoc-2013] proposal for approximate string matching
Hi,
I'm a student from Tsinghua University in china and interested
in the idea about "approximate string matching". Below is my
proposal for this. Could you please view it and give me any
kinds of suggestions on it? Thanks!
(1) Personal Details
* Name: Yu Jiang
* College/University: Tsinghua University, China
* Course/Major: Computer Science and Technology
* Degree Program: M.Sc.
* Email: sunlightt07@gmail.com
* Homepage: do not have one
* Availability:
- How much time do you plan to spend on your GSoC?
Since I have a long summer holiday, I can spend 30 hours a week.
- What are you intended start and end dates?
I will follow the calendar of GSoC. I have plenty of time from June
to September.
- What other factors affect your availability (exams, courses,
moving, work, etc.)?
I have some courses this term which will end before June. 15th.
However, these courses don't have heavy workload, I still have
enough time to work on GSoC.
(2) Background Information
* Please summarize your educational background (degrees earned,
courses taken, etc.).
I majored in Computer Science and got my bachelor degree in Tsinghua
University from 2008 to 2012. I learned some basic courses such as
algorithm,
data structure, algebra and logic. I also learned many advanced courses
such as database, operating system, network, formal language, compiler,
computer architecture. Currently I'm working for my master degree.
My research interest is database.
* Please summarize your programming background (OSS projects,
internships, jobs, etc.).
This is the first time I plan to take part in an OSS projects.
I have worked as an intern in Hulu, an online video website,
for one year, mainly focus on frontend webpages.
I have used C++ to finish many course projects, such as a simple
ftp server and client, completing a tiny os and so on.
Recently I am learning the new features in C++11 and get to know
much more about the C++ language.
* Please tell us a little about your programming interests.
I really love to optimize on the implantation of an algorithm by
making full use of the compiler and the architecture(e.g. cache, SIMD).
Besides, I like to write webpages using different kinds of excellent
frameworks.
* Please tell us why you are interested in contributing to the
Boost C++ libraries.
Boost is the most famous C++ libraries. It is powerful and can be
used in many different kinds of programs. It's my honor to add
some new features into it.
* What is your interest in the project you are proposing?
It's related to my research interest. Nowadays data are produced
rapidly. To clean or integrate them so that we can get a better view,
"Approximate string matching" is rather a powerful tool. I'd like to
see such functions in boost library.
* Have you done any previous work in this area before or on
similar projects?
Yes. My lab has focused on researching on such algorithms for 3-5 years.
The "best" algorithm for general string similarity search and join is
called "passjoin", which is invented by our lab. I have implemented it
and done a lot of optimizations. Early this year we attended a string
similarity search & join competition[1] and achieved champion.
The code is mainly written by me.
[1]
http://www2.informatik.hu-berlin.de/~wandelt/searchjoincompetition2013/Resul...
* What are your plans beyond this Summer of Code time frame for
your proposed work?
I will make efforts to make it accepted by the boost library. If
it can be released someday, I will maintain it.
* Please rate, from 0 to 5(0 being no experience, 5 being expert),
your knowledge of the following languages, technologies, or tools:
-C++: 3
-C++ Standard Library: 3.5
-Boost C++ Libraries: 2
-Subversion: 3
* What software development environments are you most familiar with?
Eclipse. I use it in both windows and linux.
* What software documentation tool are you most familiar with?
Doxygen.
(3) Project Proposal
Approximate string matching is a "hotspot" in recent years. It's widely
used in big data processing such as search engine, DNA matching. It also
can be used in simple tasks like spell checking. However, there isn't a
high performance and well-designed library in such area. I wish to see
such algorithms in boost so that many people can benefit from this!
My proposal contains two parts:
A. To implement basic similarity metrics. Edit distance and Jaccard
may be the most popular ones. Well, I'll implement more similarity metrics,
such as Hamming distance, dice and cosine.
B. To implement the fastest algorithm for string similarity search and join.
Similarity metrics will be edit distance only since it is most commonly
used.
Similarity search is to find out all the strings in a given set S, which are
"similar" to a query string. Similarity join is to find out all the string
pairs
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance is a function of the maximum supported sequence size.
(4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func, delete_cost_func, substitute_cost_func); // use three cost functions to calculate the cost for one insert/delete/substitute operation.
For this last one, I would want the return type for distance to be governed by the return type of the cost functionals. So, if insert_cost_func and friends return float, then the edit distance would also return float. Even more generally, I might want the final return type for all edit_distance variants to be selectable as a template parameter (and perhaps influence the internal types used to compute it). I'm open to ideas here.
(5) some functions to return the sequence of operations?
I think that since you are focusing on supporting the search and join algorithms, distance variants that can return the sequence of edit-ops ought to be low priority. They could be added in the future, without impacting the library design.
Part B. string similarity search and join algorithms, focus on edit distance class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be the same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces. I noticed you were considering only supporting one category of sequence distance function (edit_distance) here, but if you make the distance-function into a template parameter, you might be able to use it with all your planned distance functions "for free"
class joiner { vector<PairResult> join(const string_and_id_container &R, const string_and_id_container &S, int threshold); // performing similarity join. Usually we are doing on-line join, which means we cannot pre-build the index. So we only need this one function. It will return all the
pair where r is in R and s is in S and r is similar to s. PairResult may be a tuple of r's id, s's id and real edit distance. }; (4) Proposed Milestones and Schedule Now – June 15 To be more familiar with the Boost library. Discuss the details and interfaces with my mentor and on the develop mail list. June 15 – July 15 Develop a beta version of the approximate matching library. Most functions will be done at this time. July 15 – August 1 Review my code and improve its performance and quality. Discuss with my mentor for mid-term evaluation and future works. August 1 – September Write documentation, test code for my library. Make it "boost" like.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
2013/4/27 Erik Erlandson
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance is a function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a better choice. For some other similarity metrics whose result is a real number in [0,1], I will choose double as the return type.
(4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func, delete_cost_func, substitute_cost_func); // use three cost functions to calculate the cost for one insert/delete/substitute operation.
For this last one, I would want the return type for distance to be governed by the return type of the cost functionals. So, if insert_cost_func and friends return float, then the edit distance would also return float.
Even more generally, I might want the final return type for all edit_distance variants to be selectable as a template parameter (and perhaps influence the internal types used to compute it). I'm open to ideas here.
I agree to modify the return type for function (4). So now I think your
suggestion for using a class to supply these 3 functions together is a better choice. In case (1)(2)(3), I think return size_type seems to be better since the algorithm itself can only output an integer as the result?
(5) some functions to return the sequence of operations?
I think that since you are focusing on supporting the search and join algorithms, distance variants that can return the sequence of edit-ops ought to be low priority. They could be added in the future, without impacting the library design.
Part B. string similarity search and join algorithms, focus on edit
distance
class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be the same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether I can support one of the standard container interfaces. Probably I can. With the candidate strings are inserted and removed, I should update the index for them.
I noticed you were considering only supporting one category of sequence
distance function (edit_distance) here, but if you make the distance-function into a template parameter, you might be able to use it with all your planned distance functions "for free"
Good suggestion. But it seems to be hard. Maybe a high performance algorithm can only support a specific metric(I will not implement the brute-force method here). Since the edit distance is the most useful one, I'd like to write an algorithm for it. Thanks a lot!
On Sat, Apr 27, 2013 at 3:00 PM, Yu Jiang
2013/4/27 Erik Erlandson
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance
is a
function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a better choice. For some other similarity metrics whose result is a real number in [0,1], I will choose double as the return type.
For those who use STL distance is signed integer. If you choose unsigned integer, then I suggest reconsider function names to avoid potential confusion for users. For example, is it "deviation" that you name "distance"?
class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be
Part B. string similarity search and join algorithms, focus on edit distance the
same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether I can support one of the standard container interfaces. Probably I can. With the candidate strings are inserted and removed, I should update the index for them.
IMO, container should be parameterized to allow users choosing the best performing container. Strings based on augmented data structures offer a wide set of very efficient search and update operations, including logarithmic time splice and split. They can give significant performance improvement against strings based on basic data structures, such as std::string, std::vector<char>, std::list<char> etc. Regards, Vadim Stadnik
2013/4/27 Vadim Stadnik
On Sat, Apr 27, 2013 at 3:00 PM, Yu Jiang
wrote: 2013/4/27 Erik Erlandson
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance
is a
function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a better choice. For some other similarity metrics whose result is a real number in [0,1], I will choose double as the return type.
For those who use STL distance is signed integer. If you choose unsigned integer, then I suggest reconsider function names to avoid potential confusion for users. For example, is it "deviation" that you name "distance"?
Thanks! I would prefer not to change the function name since in such area "edit distance" is used everywhere so user may be familiar with this word. IMO, STL distance returns signed integer because the order of the two parameters matters. For this edit distance function, usually the order of the two strings may not affect the result unless in (4). So I may prefer to return a unsigned integer.
class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be
Part B. string similarity search and join algorithms, focus on edit distance the
same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether I can support one of the standard container interfaces. Probably I can. With the candidate strings are inserted and removed, I should update the index for them.
IMO, container should be parameterized to allow users choosing the best performing container. Strings based on augmented data structures offer a wide set of very efficient search and update operations, including logarithmic time splice and split. They can give significant performance improvement against strings based on basic data structures, such as std::string, std::vector<char>, std::list<char> etc.
Sorry for this. It seems I didn't describe my idea clearly. I'd like to do something like this: S is a large container of strings and when the user performing searching, we can return all the strings in S which are similar to the given string(maybe the edit distance between them is less than 2). So here I need to build a index first. In fact I want to supply some functions to manipulate the index, since the user may want to change the strings in S slightly. So maybe I'll make the "searcher" class look like a container, but there isn't a real container in it. Thanks.
For those who use STL distance is signed integer. If you choose unsigned integer, then I suggest reconsider function names to avoid potential confusion for users. For example, is it "deviation" that you name "distance"?
I thought the signed result type was "difference", not "distance." Formally, any proper distance metric is always >= 0. String edit distance is such a metric, and I'd advocate both for continuing to call it edit_distance<>, and that it be an unsigned result.
On Sun, Apr 28, 2013 at 4:20 AM, Erik Erlandson
For those who use STL distance is signed integer. If you choose unsigned integer, then I suggest reconsider function names to avoid potential confusion for users. For example, is it "deviation" that you name "distance"?
I thought the signed result type was "difference", not "distance."
Formally, any proper distance metric is always >= 0. String edit distance is such a metric, and I'd advocate both for continuing to call it edit_distance<>, and that it be an unsigned result.
I agree with both comments, STL distance does not represent a "metric". Regards, Vadim Stadnik
I agree to modify the return type for function (4). So now I think your suggestion for using a class to supply these 3 functions together is a better choice. In case (1)(2)(3), I think return size_type seems to be better since the algorithm itself can only output an integer as the result?
I think in most cases the result would be integer. I did once implement a variation where I used (1.5) for the cost of substitution, and (1.0) for insertion/deletion. I no longer remember the details, but I do recall that using (2.0) didn't work as well, it wanted to be (1.5). At any rate, the resulting distance in such a case would be floating point, not integer. If it's reasonably easy to support configurable compute/return types, including non-integers, I think it might prove useful. I know that you are going to be focusing on actual strings for your applications, but there are other kinds of edit distance application, for example a use case like the unix diff command, where you are doing edit-distance-like computations on two sequences of lines of text, and in fact the edit cost between any two lines may itself be computed as an edit distance. So I think designing the routines to be as general as possible (within reason) will pay dividends for the boost community, and improve their adoption.
2013/4/28 Erik Erlandson
I agree to modify the return type for function (4). So now I think your suggestion for using a class to supply these 3 functions together is a better choice. In case (1)(2)(3), I think return size_type seems to be better since the algorithm itself can only output an integer as the result?
I think in most cases the result would be integer. I did once implement a variation where I used (1.5) for the cost of substitution, and (1.0) for insertion/deletion. I no longer remember the details, but I do recall that using (2.0) didn't work as well, it wanted to be (1.5). At any rate, the resulting distance in such a case would be floating point, not integer. If it's reasonably easy to support configurable compute/return types, including non-integers, I think it might prove useful.
Yes, I totally agree with you. In case (1)(2)(3), there isn't a parameter for user to change the default cost(1/1/1/0). So maybe return an integer value is enough. In case (4), the user can change the return type if he likes.
I know that you are going to be focusing on actual strings for your applications, but there are other kinds of edit distance application, for example a use case like the unix diff command, where you are doing edit-distance-like computations on two sequences of lines of text, and in fact the edit cost between any two lines may itself be computed as an edit distance. So I think designing the routines to be as general as possible (within reason) will pay dividends for the boost community, and improve their adoption.
I may not want to focus on actual strings for Part A. similarity functions, since here actual strings and a container are similar to deal with. I also think it's easy for me to compute the sequence for the actions.
Maybe the new interface can be like this:
Part A, similarity metrics(use edit distance as an example):
template....
(1) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&);
(2) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&,
(size_type or unsigned) threshold);
(3) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&,
equal_func);
(4) (user-defined type) edit_distance(const Range1T&, const Range2T&, cost);
the cost struct may be like this:
template
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Part A, similarity metrics(use edit distance as an example): template.... (1) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&); (2) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&, (size_type or unsigned) threshold); (3) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&, equal_func); (4) (user-defined type) edit_distance(const Range1T&, const Range2T&, cost);
the cost struct may be like this: template
struct cost { V substitution(const T&, const T&) const; V insertion(const T &a) const; V deletion(const T &a) const; }
This seems like a good approach to me.
(5) pair
edit_distance_with_operation(const Range1T&, const Range2T&); (6) pair edit_distance_with_operation(const Range1T&, const Range2T&, threshold); (7) pair edit_distance_with_operation(const Range1T&, const Range2T&, equal_func); (8) pair edit_distance_with_operation(const Range1T&, const Range2T&, cost); the "operation" may be a pair<"operator", "position">, where "operator" may be an enum of insertion, deletion and substitution, "position" is an unsigned value.
In the past, I would have said "returning a container on the stack will be an efficiency problem", however with the new "move" semantics, I *think* it will OK if it is implemented properly. I have only read about the new "move" semantics, and not used them, so if I'm wrong about that somebody please set me straight :)
2013/4/29 Erik Erlandson
Part A, similarity metrics(use edit distance as an example): template.... (1) (size_type or unsigned) edit_distance(const Range1T&, const
Range2T&);
(2) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&, (size_type or unsigned) threshold); (3) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&, equal_func); (4) (user-defined type) edit_distance(const Range1T&, const Range2T&, cost);
the cost struct may be like this: template
struct cost { V substitution(const T&, const T&) const; V insertion(const T &a) const; V deletion(const T &a) const; } This seems like a good approach to me.
(5) pair
edit_distance_with_operation(const Range1T&, const Range2T&); (6) pair edit_distance_with_operation(const Range1T&, const Range2T&, threshold); (7) pair edit_distance_with_operation(const Range1T&, const Range2T&, equal_func); (8) pair edit_distance_with_operation(const Range1T&, const Range2T&, cost); the "operation" may be a pair<"operator", "position">, where "operator" may
be an enum of insertion, deletion and substitution, "position" is an unsigned value.
In the past, I would have said "returning a container on the stack will be an efficiency problem", however with the new "move" semantics, I *think* it will OK if it is implemented properly.
I have only read about the new "move" semantics, and not used them, so if I'm wrong about that somebody please set me straight :)
Thanks a lot for your valuable advice! Now I also have much clear thought for Part B (similarity search). Several days ago you suggested me to make the "searcher" class behave like a container. Currently I think similarity search is very like exact search, where we usually use the "unordered_map". In fact, when we call the functions "find" or "operator[]" of an "unordered_map", exactly matched values are returned. Here, what I want to implement is just to return similar matched values. So here I'd like to make my searcher class like an "unordered_map". I 'll support as much as possible functions in the "unordered_map" as long as it's useful and compatible.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Now I also have much clear thought for Part B (similarity search). Several days ago you suggested me to make the "searcher" class behave like a container. Currently I think similarity search is very like exact search, where we usually use the "unordered_map". In fact, when we call the functions "find" or "operator[]" of an "unordered_map", exactly matched values are returned. Here, what I want to implement is just to return similar matched values. So here I'd like to make my searcher class like an "unordered_map". I 'll support as much as possible functions in the "unordered_map" as long as it's useful and compatible.
Yes, I think what you have here is a new variation of Associative Container concept. Maybe a subclass of Unique Associative Container, augmented with a distance-metric functor, and having new query methods augmented with a radius. "Neighborhood Associative Container"? "Metric Associative Container"? I'm not familiar with how the similarity search container will want to work - whether it is a Simple Associative Container (like set<>), or a Paired Associative Container (like map<>). Seems as though it has set-like behavior, where you're maintaining a set of objects, and then you query it for all the members that are with some radius of a given query-object.
2013/4/30 Erik Erlandson
I'm not familiar with how the similarity search container will want to work - whether it is a Simple Associative Container (like set<>), or a Paired Associative Container (like map<>). Seems as though it has set-like behavior, where you're maintaining a set of objects, and then you query it for all the members that are with some radius of a given query-object.
Thanks for your advice these days and sorry for reply this email late. I
finally decide to implement both a set like container (contains string only) and a map like container (use "string" as the key type and an arbitrary type as the value type). I have just submitted my final proposal. Thanks you a lot!
participants (4)
-
Erik Erlandson
-
Erik Erlandson
-
Vadim Stadnik
-
Yu Jiang