
Hi, I'm a student from Tsinghua University in china and interested in the idea about "approximate string matching". Below is my proposal for this. Could you please view it and give me any kinds of suggestions on it? Thanks! (1) Personal Details * Name: Yu Jiang * College/University: Tsinghua University, China * Course/Major: Computer Science and Technology * Degree Program: M.Sc. * Email: sunlightt07@gmail.com * Homepage: do not have one * Availability: - How much time do you plan to spend on your GSoC? Since I have a long summer holiday, I can spend 30 hours a week. - What are you intended start and end dates? I will follow the calendar of GSoC. I have plenty of time from June to September. - What other factors affect your availability (exams, courses, moving, work, etc.)? I have some courses this term which will end before June. 15th. However, these courses don't have heavy workload, I still have enough time to work on GSoC. (2) Background Information * Please summarize your educational background (degrees earned, courses taken, etc.). I majored in Computer Science and got my bachelor degree in Tsinghua University from 2008 to 2012. I learned some basic courses such as algorithm, data structure, algebra and logic. I also learned many advanced courses such as database, operating system, network, formal language, compiler, computer architecture. Currently I'm working for my master degree. My research interest is database. * Please summarize your programming background (OSS projects, internships, jobs, etc.). This is the first time I plan to take part in an OSS projects. I have worked as an intern in Hulu, an online video website, for one year, mainly focus on frontend webpages. I have used C++ to finish many course projects, such as a simple ftp server and client, completing a tiny os and so on. Recently I am learning the new features in C++11 and get to know much more about the C++ language. * Please tell us a little about your programming interests. I really love to optimize on the implantation of an algorithm by making full use of the compiler and the architecture(e.g. cache, SIMD). Besides, I like to write webpages using different kinds of excellent frameworks. * Please tell us why you are interested in contributing to the Boost C++ libraries. Boost is the most famous C++ libraries. It is powerful and can be used in many different kinds of programs. It's my honor to add some new features into it. * What is your interest in the project you are proposing? It's related to my research interest. Nowadays data are produced rapidly. To clean or integrate them so that we can get a better view, "Approximate string matching" is rather a powerful tool. I'd like to see such functions in boost library. * Have you done any previous work in this area before or on similar projects? Yes. My lab has focused on researching on such algorithms for 3-5 years. The "best" algorithm for general string similarity search and join is called "passjoin", which is invented by our lab. I have implemented it and done a lot of optimizations. Early this year we attended a string similarity search & join competition[1] and achieved champion. The code is mainly written by me. [1] http://www2.informatik.hu-berlin.de/~wandelt/searchjoincompetition2013/Resul... * What are your plans beyond this Summer of Code time frame for your proposed work? I will make efforts to make it accepted by the boost library. If it can be released someday, I will maintain it. * Please rate, from 0 to 5(0 being no experience, 5 being expert), your knowledge of the following languages, technologies, or tools: -C++: 3 -C++ Standard Library: 3.5 -Boost C++ Libraries: 2 -Subversion: 3 * What software development environments are you most familiar with? Eclipse. I use it in both windows and linux. * What software documentation tool are you most familiar with? Doxygen. (3) Project Proposal Approximate string matching is a "hotspot" in recent years. It's widely used in big data processing such as search engine, DNA matching. It also can be used in simple tasks like spell checking. However, there isn't a high performance and well-designed library in such area. I wish to see such algorithms in boost so that many people can benefit from this! My proposal contains two parts: A. To implement basic similarity metrics. Edit distance and Jaccard may be the most popular ones. Well, I'll implement more similarity metrics, such as Hamming distance, dice and cosine. B. To implement the fastest algorithm for string similarity search and join. Similarity metrics will be edit distance only since it is most commonly used. Similarity search is to find out all the strings in a given set S, which are "similar" to a query string. Similarity join is to find out all the string pairs<r,s>, where r is in set R and s is in set S and r is similar to s. They are the most useful operators in large data processing. However, the brute-force method for such problem is too slow. And it's not easy to write a correct and high performance algorithm without reading the details in many research papers and debugging them for a long time. To avoid reinventing the wheel, I strongly propose to implement it in boost library. Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not. (4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func, delete_cost_func, substitute_cost_func); // use three cost functions to calculate the cost for one insert/delete/substitute operation. (5) some functions to return the sequence of operations? Version (1) may be the most common case. Version (2) can use the given threshold to do optimization for the speed (O(m*n) -> O(threshold * min(m, n))). It's useful since sometimes we only care about the exact edit distance if it is small. Version (3) and (4) extend the flexibility of this utility algorithm. I'm not sure whether to support (5) or not. Part B. string similarity search and join algorithms, focus on edit distance class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be the same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. }; class joiner { vector<PairResult> join(const string_and_id_container &R, const string_and_id_container &S, int threshold); // performing similarity join. Usually we are doing on-line join, which means we cannot pre-build the index. So we only need this one function. It will return all the <r,s> pair where r is in R and s is in S and r is similar to s. PairResult may be a tuple of r's id, s's id and real edit distance. }; (4) Proposed Milestones and Schedule Now – June 15 To be more familiar with the Boost library. Discuss the details and interfaces with my mentor and on the develop mail list. June 15 – July 15 Develop a beta version of the approximate matching library. Most functions will be done at this time. July 15 – August 1 Review my code and improve its performance and quality. Discuss with my mentor for mid-term evaluation and future works. August 1 – September Write documentation, test code for my library. Make it "boost" like.