[boost] [gsoc 2013] Approximate string matching proposal

27 Apr 2013

      Hi, I've already posted this but it probably got buried. My proposal
for GSOC 2013 is also available at
http://webdev.fit.cvut.cz/~strnaj11/gsoc/.

Regards, Jan Strnad

GSOC 2013 Proposal - Approximate string pattern matching
========================================================

Abstract
--------

My name is Jan Strnad and this is my proposal for GSOC 2013.
I would like to implement a set of approximate string matching
algorithms. There are two categories of these algorithms. A
representative of the first category takes two strings and computes
so-called distance between those two strings. A member of the second
category takes some text and a pattern and finds all approximate matches
of the pattern in the text according to some given distance. I'm
planning to support various distances, such as Hamming, Levenstein,
Damerau–Levenshtein, Jaro, Delta, Gamma ... I'm proposing to implement a
general solution that supports multiple string containers, not only
std::string. There are many possible applications such as spell
checking.

Personal Details
----------------

Name: Jan Strnad

University: Czech Technical University in Prague, Faculty of Information
Technology, The Czech Republic

Major: Theoretical Informatics / System Programming - studying both in
parallel

Degree program: MSc.

Email/XMPP: hanny.strnad@gmail.com

ICQ: 335127474

homepage: none

Mailing address: Okružní 508, 788 13 Vikýřovice, Czech Republic

Availability:
* Start: at worse by 17th June but probably earlier
* End: 21st September
* I have no other duties during this period but I may take a one week
vacation.
* I'm planning to spend 40-45 hours per week.

Background
----------

Please summarize your educational background

I'm first year master's student at CTU in Prague with major in
Theoretical Informatics and System Programming. At this university I
also gained my BSc. degree (Software Engineering, with honors) in June
2012. Here's an incomplete list of courses I have taken so far: Data
structures and algorithms, Graph theory, Complexity theory, Automata and
grammar theory, Compiler design course, Software engineering courses
(including practical ones), various programming languages (C, C++, Java,
Scala, Common Lisp, Haskell, Erlang, SML), various programming paradigms
(imperative, object-oriented, functional, logical), math & statistics
courses, Data compression, String pattern matching course, Computer
architecture related courses, Parallel algorithms course and many
others. I also spent 4 months as an exchange student in Uppsala
University, Sweden where I took Constraint programming and Software
verification courses.

Please summarize your programming background

I'm able to code in many languages, but I'm most experienced in C++ (5
years) and Scala languages (1.5 years). My C++ experience is gained
mostly from school's projects an my small own projects (usually some
kind of small tweaks of third party programs). I have an experience with
C++ templates, multi-threading using both POSIX threads and openMP, STL
and with some of the Boost libraries (Program Options, Smart Ptr). I
definitely not know everything that would be needed for my GSOC project,
but I'm quick learner.

Please tell us a little about your programming interests. Please tell us
why you are interested in contributing to the Boost C++ Libraries. What is
your interest in the project you are proposing?

As you may have guested, my professional interests are programming
languages (especially their paradigms), concurrent & parallel
programming and finally everything related to theoretical informatics.
That includes things such as data compression or pattern matching. This
is also one of the reasons why I want to contribute to Boost. I want to
share my knowledge of pattern matching algorithms and I believe that the
Boost project is the right place. I'm seeking for an off the school
programming experience. Another reason is that since I'm a long time
user of various open source software, I believe that it is time to start
to contribute to community as well. From this program I mostly expecting
gaining some real world experience. My C++ and especially Boost
knowledge should improve significantly.

Have you done any previous work in this area before or on similar projects?

I think I have everything that is needed to successfully complete this
project. I have a theoretical foundations of approximate string matching
algorithms from my university courses as well as I'm able to implement
them using C++ without any major issues.

What are your plans beyond this Summer of Code time frame for your
proposed work?.

I know that it may take a long time for a project to become a part of
Boost library. So, if selected, my intentions after the summer is over
are:
1. Release my project as an open-source, not necessary as a standard
part of Boost library.
2. Do everything necessary to include my work as a part of standard part
of Boost library eventually.
3. Further development: adding support of more distances, implementing
another string algorithms such as repetitions searching. This heavily
depends on users' interest.

I know that my work won't end by the end of summer. I'm willing to spend
a couple of hours per week (say 2 or 3) to keep my project updated and
most importantly alive.

Languages, technologies, tools

My regular C++ development environment is consisted of Linux, git,
CMake, Sublime text, gdb and valgrind. I have also used Eclipse and svn
before. For documentation purposes I'm most familiar with Doxygen.

Here is a self-evaluation of desired skills:

C++ 3

STL 4

Boost libraries 1

Subversion 3

Proposal
--------

Approximate string matching is used in many areas. To name some of them,
there are spell checking and DNA processing. I can also imagine many
applications in human-to-computer interaction.
I'm proposing an extension of Boost String algorithms library. I would
like to implement algorithms of two categories:
   1. Compute distance of two strings. Imagine a function that takes two
strings a returns a number describing their distance (various distances
are described below).
   2. Find all approximate matches of a pattern in some text using given
distance. Example of this can be found in API interface usage example.

Distance is defined as a number (or a tuple of numbers) describing
similarity of two strings.

Here is a list of all distances I'm planning to implement:
   1. Hamming - a number of replace (single character only) operations
needed so that the strings match exactly.
   2. Levenstein - a number of insert, delete and replace operations.
   3. Generalized Levenstein (Weighted Levenstein) - same as before but
each operation may have different cost.
   4. Damerau-Levenstein - a number of insert, delete, replace and
transpose (of two adjacent characters) operations.
   5. Weigted Damerau-Levenstein - same as before but each operation may
have different cost.
   6. Jaro - please see [1]
   7. Jaro-Winkler - also see [1]

(following assumes we can compute distance of two characters, eg. c - a
= 2)
   8. Delta - the maximum number of distance between two characters, eg.
Delta(aaa,bbb) = Delta(aaa,aab) = 1.
   9. Gamma - a cumulative distance between two characters, Gamma(aaa,bbb)
= 3.
   10. (Delta, Gamma) - a combination of two preceding distances.

Distance 10 is not exactly needed, but I included it because it is quite
common to compute both Delta and Gamma distances at the same time, so I
provided efficient algorithm for this.
As noted, when using distances 8-10 we must be able to compute distance
between two characters. This behavior should be customizable as well.

Simple API definition follows. It's very immature and it will change but
it should give you general idea:

~~~~ {.cpp}
namespace boost{
      template <typename TRange>
      std::size_t hamming_dist(TRange first, TRange second);

      template <typename TRange>
      std::size_t levenstein_dist(TRange first, TRange second);

      template <typename TRange, typename Type>
      Type weighted_levenstein_dist(TRange first, TRange second, Type
replace_cost, Type insert_cost, Type delete_cost);

      template <typename TRange>
      std::size_t damerau_levenstein_dist(TRange first, TRange second);

      template <typename TRange, typename Type>
      Type weighted_damerau_levenstein_dist(TRange first, TRange second,
Type replace_cost, Type insert_cost, Type delete_cost, Type transpose_cost);

      template <typename TRange>
      std::size_t delta_dist(TRange first, TRange second);

      template <typename TRange>
      std::size_t gamma_dist(TRange first, TRange second);

      template <typename TRange>
      float jaro_dist(TRange first, TRange second);

      template <typename TRange>
      float jaro_winkler_dist(TRange first, TRange second, float weight);

      template <typename TRange>
      std::pair<std::size_t, std::size_t> deltagamma_dist(TRange first,
TRange second);

      template <typename Type>
      class approximate_distance {
          // ...
      };

      class hamming_distance : public approximate_distance<std::size_t>{};

      template <typename Type, typename TRange>
      class approx_pattern
      {
      public:
        approx_pattern(approximate_distance<Type>& distance, Type
max_allowed, TRange pattern);
      };

      template <typename Type, typename TRange>
      class sapprox_match_iterator
      {
      public:
        sapprox_match_iterator(TRange text, approx_pattern<Type, TRange>);
        // overloaded ++ and * operators
      };

      template <typename Type, typename TRange>
      class sapprox_match_result
      {
      public:
        Type distance() const;
        std::size_t startIndex() const;
        std::size_t endIndex() const;
        TRange str() const;
      };
}
~~~~

Here is an example of usage:

~~~~ {.cpp}
string a = "aaabc";
string b = "aaacb";
cout << boost::hamming_dist(a, b) << endl;
// prints 2
cout << boost::damerau_levenstein_dist(a, b) << endl;
// prints 1
string pattern = "acbdef";
string text = "aacdefbcdefabccef";
boost::approx_pattern ap(boost::hamming_distance, 1, pattern );
boost::sapprox_match_iterator begin( text, ap );
boost::sapprox_match_iterator end;
std::for_each( begin, end,
[]( const boost::sapprox_match_result& what )
{
       cout << what.str() << " starting at " << what.startIndex() << endl;
} );
// aacdef at 0 , fbcdef at 5, abccef at 11
~~~~

As you can see, I would like my implementation to be general so that
other string containers can be used instead of std::string. A will also
take care of proper design in a way that further distances can be easily
added later.

There are multiple implementation approaches that can be used: NFA
simulation, Shift-or algorithm and dynamic programming approach. I will
probably choose the dynamic programming - it has low time and space
complexity as well as it does not require any kind of preprocessing.
Thus searching all occurrences of pattern in a text has O(m*n)
complexity, m is length of pattern, n is length of text.

Related work

There is currently only one C++ library I know of: Flamingo [2]. From my
point of view, its usage is a kind of complicated. You can also find
implementations of single functions on-line, but they lack of multiple
distances support as well as they are not general (eg. designed for
C-like strings only).

Another related project is stringmetrics [3]. This project is written in
Scala so it is not related to C++ programmers.

Schedule
--------

This is an approximate schedule I would like to keep on. Since I will
probably hit many unexpected issues I have reserved some time for it.

17th June : start

17th June - 27th June : prototyping and designing API interface as well
as auxiliary functions / classes

28th June - 30th June : setting up testing environment, writing basic
unit tests

1st July - 14th July : implementation of general dynamic programming
model that will be used by most distances

15th July - 28th July : implementation of distances 1 - 5

29th July - 4th August : implementation of distances 8 - 10

5th August - 11th August : implementation of distances 6,7

12th August - 1st September : implementation of approximate pattern
search in a text for all distances

2nd September - 15th September : wrapping up, end user documentation

16th September - 21st September : reserved

During all implementation phases I'll be also performing testing and
source code documentation.

References
----------

[1] http://en.wikipedia.org/wiki/Jaro-Winkler_distance

[2] http://flamingo.ics.uci.edu/releases/4.1/

[3] http://rockymadden.com/stringmetric/