[boost] [gsoc-2013] proposal for approximate string matching

26 Apr 2013

      Hi,

I'm a student from Tsinghua University in china and interested
in the idea about "approximate string matching". Below is my
proposal for this. Could you please view it and give me any
kinds of suggestions on it? Thanks!

(1) Personal Details
  * Name: Yu Jiang
  * College/University: Tsinghua University, China
  * Course/Major: Computer Science and Technology
  * Degree Program: M.Sc.
  * Email: sunlightt07@gmail.com
  * Homepage: do not have one
  * Availability:
    - How much time do you plan to spend on your GSoC?
Since I have a long summer holiday, I can spend 30 hours a week.
    - What are you intended start and end dates?
I will follow the calendar of GSoC. I have plenty of time from June
to September.
    - What other factors affect your availability (exams, courses,
moving, work, etc.)?
I have some courses this term which will end before June. 15th.
However, these courses don't have heavy workload, I still have
enough time to work on GSoC.

(2) Background Information
  * Please summarize your educational background (degrees earned,
courses taken, etc.).
I majored in Computer Science and got my bachelor degree in Tsinghua
University from 2008 to 2012. I learned some basic courses such as
algorithm,
data structure, algebra and logic. I also learned many advanced courses
such as database, operating system, network, formal language, compiler,
computer architecture. Currently I'm working for my master degree.
My research interest is database.

  * Please summarize your programming background (OSS projects,
internships, jobs, etc.).
This is the first time I plan to take part in an OSS projects.
I have worked as an intern in Hulu, an online video website,
for one year, mainly focus on frontend webpages.
I have used C++ to finish many course projects, such as a simple
ftp server and client, completing a tiny os and so on.
Recently I am learning the new features in C++11 and get to know
much more about the C++ language.

  * Please tell us a little about your programming interests.
I really love to optimize on the implantation of an algorithm by
making full use of the compiler and the architecture(e.g. cache, SIMD).
Besides, I like to write webpages using different kinds of excellent
frameworks.

  * Please tell us why you are interested in contributing to the
Boost C++ libraries.
Boost is the most famous C++ libraries. It is powerful and can be
used in many different kinds of programs. It's my honor to add
some new features into it.

  * What is your interest in the project you are proposing?
It's related to my research interest. Nowadays data are produced
rapidly. To clean or integrate them so that we can get a better view,
"Approximate string matching" is rather a powerful tool. I'd like to
see such functions in boost library.

  * Have you done any previous work in this area before or on
similar projects?
Yes. My lab has focused on researching on such algorithms for 3-5 years.
The "best" algorithm for general string similarity search and join is
called "passjoin", which is invented by our lab. I have implemented it
and done a lot of optimizations. Early this year we attended a string
similarity search & join competition[1] and achieved champion.
The code is mainly written by me.
[1]
http://www2.informatik.hu-berlin.de/~wandelt/searchjoincompetition2013/Resul...

  * What are your plans beyond this Summer of Code time frame for
your proposed work?
I will make efforts to make it accepted by the boost library. If
it can be released someday, I will maintain it.

  * Please rate, from 0 to 5(0 being no experience, 5 being expert),
your knowledge of the following languages, technologies, or tools:
  -C++: 3
  -C++ Standard Library: 3.5
  -Boost C++ Libraries: 2
  -Subversion: 3

  * What software development environments are you most familiar with?
Eclipse. I use it in both windows and linux.

  * What software documentation tool are you most familiar with?
Doxygen.

(3) Project Proposal
Approximate string matching is a "hotspot" in recent years. It's widely
used in big data processing such as search engine, DNA matching. It also
can be used in simple tasks like spell checking. However, there isn't a
high performance and well-designed library in such area. I wish to see
such algorithms in boost so that many people can benefit from this!

My proposal contains two parts:
A. To implement basic similarity metrics. Edit distance and Jaccard
may be the most popular ones. Well, I'll implement more similarity metrics,
such as Hamming distance, dice and cosine.
B. To implement the fastest algorithm for string similarity search and join.
Similarity metrics will be edit distance only since it is most commonly
used.
Similarity search is to find out all the strings in a given set S, which are
"similar" to a query string. Similarity join is to find out all the string
pairs<r,s>, where r is in set R and s is in set S and r is similar to s.
They are the most useful operators in large data processing. However,
the brute-force method for such problem is too slow. And it's not easy to
write a correct and high performance algorithm without reading the details
in many research papers and debugging them for a long time. To avoid
reinventing the wheel, I strongly propose to implement it in boost
library.

Here is a prototype for the interface:
Part A. basic similarity metrics (use edit distance as an example)
(1) int edit_distance(const Range1T&, const Range2T&); // return edit
distance.
(2) int edit_distance(const Range1T&, const Range2T&, int threshold);
// return edit distance if it is no larger than threshold, else
return -1 or threshold + 1.
(3) int edit_distance(const Range1T&, const Range2T&, equal_func);
// use uqual_func to determine whether two elements are equal or not.
(4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func,
delete_cost_func, substitute_cost_func); // use three cost functions to
calculate the cost for one insert/delete/substitute operation.
(5) some functions to return the sequence of operations?
Version (1) may be the most common case. Version (2) can use the given
threshold to do optimization for the speed (O(m*n) -> O(threshold * min(m,
n))).
It's useful since sometimes we only care about the exact edit distance if
it is small. Version (3) and (4) extend the flexibility of this utility
algorithm. I'm not sure whether to support (5) or not.

Part B. string similarity search and join algorithms, focus on edit distance
class searcher
{
  void insert(IDType id, const string &word); // insert a word to the index,
id is used as an identifier of this word.
  void erase(IDType id); // erase a word from the index, id should be the
same one used in insertion process.
  vector<Result> search(const string &query, int threshold);
// performing similarity search, which will return all the strings which
are similar to the query string. Result may be a pair of the id and real
edit distance.
};
class joiner
{
  vector<PairResult> join(const string_and_id_container &R,
const string_and_id_container &S, int threshold);
// performing similarity join. Usually we are doing on-line join, which
means
we cannot pre-build the index. So we only need this one function. It will
return all the <r,s> pair where r is in R and s is in S and r is similar to
s.
PairResult may be a tuple of r's id, s's id and real edit distance.
};

(4) Proposed Milestones and Schedule
Now – June 15
To be more familiar with the Boost library. Discuss the details and
interfaces with my mentor and on the develop mail list.
June 15 – July 15
Develop a beta version of the approximate matching library. Most functions
will be done at this time.
July 15 – August 1
Review my code and improve its performance and quality. Discuss with my
mentor for mid-term evaluation and future works.
August 1 – September
Write documentation, test code for my library. Make it "boost" like.