
Hi, I've already posted this but it probably got buried. My proposal for GSOC 2013 is also available at http://webdev.fit.cvut.cz/~strnaj11/gsoc/. Regards, Jan Strnad GSOC 2013 Proposal - Approximate string pattern matching ======================================================== Abstract -------- My name is Jan Strnad and this is my proposal for GSOC 2013. I would like to implement a set of approximate string matching algorithms. There are two categories of these algorithms. A representative of the first category takes two strings and computes so-called distance between those two strings. A member of the second category takes some text and a pattern and finds all approximate matches of the pattern in the text according to some given distance. I'm planning to support various distances, such as Hamming, Levenstein, Damerau–Levenshtein, Jaro, Delta, Gamma ... I'm proposing to implement a general solution that supports multiple string containers, not only std::string. There are many possible applications such as spell checking. Personal Details ---------------- Name: Jan Strnad University: Czech Technical University in Prague, Faculty of Information Technology, The Czech Republic Major: Theoretical Informatics / System Programming - studying both in parallel Degree program: MSc. Email/XMPP: hanny.strnad@gmail.com ICQ: 335127474 homepage: none Mailing address: Okružní 508, 788 13 Vikýřovice, Czech Republic Availability: * Start: at worse by 17th June but probably earlier * End: 21st September * I have no other duties during this period but I may take a one week vacation. * I'm planning to spend 40-45 hours per week. Background ---------- Please summarize your educational background I'm first year master's student at CTU in Prague with major in Theoretical Informatics and System Programming. At this university I also gained my BSc. degree (Software Engineering, with honors) in June 2012. Here's an incomplete list of courses I have taken so far: Data structures and algorithms, Graph theory, Complexity theory, Automata and grammar theory, Compiler design course, Software engineering courses (including practical ones), various programming languages (C, C++, Java, Scala, Common Lisp, Haskell, Erlang, SML), various programming paradigms (imperative, object-oriented, functional, logical), math & statistics courses, Data compression, String pattern matching course, Computer architecture related courses, Parallel algorithms course and many others. I also spent 4 months as an exchange student in Uppsala University, Sweden where I took Constraint programming and Software verification courses. Please summarize your programming background I'm able to code in many languages, but I'm most experienced in C++ (5 years) and Scala languages (1.5 years). My C++ experience is gained mostly from school's projects an my small own projects (usually some kind of small tweaks of third party programs). I have an experience with C++ templates, multi-threading using both POSIX threads and openMP, STL and with some of the Boost libraries (Program Options, Smart Ptr). I definitely not know everything that would be needed for my GSOC project, but I'm quick learner. Please tell us a little about your programming interests. Please tell us why you are interested in contributing to the Boost C++ Libraries. What is your interest in the project you are proposing? As you may have guested, my professional interests are programming languages (especially their paradigms), concurrent & parallel programming and finally everything related to theoretical informatics. That includes things such as data compression or pattern matching. This is also one of the reasons why I want to contribute to Boost. I want to share my knowledge of pattern matching algorithms and I believe that the Boost project is the right place. I'm seeking for an off the school programming experience. Another reason is that since I'm a long time user of various open source software, I believe that it is time to start to contribute to community as well. From this program I mostly expecting gaining some real world experience. My C++ and especially Boost knowledge should improve significantly. Have you done any previous work in this area before or on similar projects? I think I have everything that is needed to successfully complete this project. I have a theoretical foundations of approximate string matching algorithms from my university courses as well as I'm able to implement them using C++ without any major issues. What are your plans beyond this Summer of Code time frame for your proposed work?. I know that it may take a long time for a project to become a part of Boost library. So, if selected, my intentions after the summer is over are: 1. Release my project as an open-source, not necessary as a standard part of Boost library. 2. Do everything necessary to include my work as a part of standard part of Boost library eventually. 3. Further development: adding support of more distances, implementing another string algorithms such as repetitions searching. This heavily depends on users' interest. I know that my work won't end by the end of summer. I'm willing to spend a couple of hours per week (say 2 or 3) to keep my project updated and most importantly alive. Languages, technologies, tools My regular C++ development environment is consisted of Linux, git, CMake, Sublime text, gdb and valgrind. I have also used Eclipse and svn before. For documentation purposes I'm most familiar with Doxygen. Here is a self-evaluation of desired skills: C++ 3 STL 4 Boost libraries 1 Subversion 3 Proposal -------- Approximate string matching is used in many areas. To name some of them, there are spell checking and DNA processing. I can also imagine many applications in human-to-computer interaction. I'm proposing an extension of Boost String algorithms library. I would like to implement algorithms of two categories: 1. Compute distance of two strings. Imagine a function that takes two strings a returns a number describing their distance (various distances are described below). 2. Find all approximate matches of a pattern in some text using given distance. Example of this can be found in API interface usage example. Distance is defined as a number (or a tuple of numbers) describing similarity of two strings. Here is a list of all distances I'm planning to implement: 1. Hamming - a number of replace (single character only) operations needed so that the strings match exactly. 2. Levenstein - a number of insert, delete and replace operations. 3. Generalized Levenstein (Weighted Levenstein) - same as before but each operation may have different cost. 4. Damerau-Levenstein - a number of insert, delete, replace and transpose (of two adjacent characters) operations. 5. Weigted Damerau-Levenstein - same as before but each operation may have different cost. 6. Jaro - please see [1] 7. Jaro-Winkler - also see [1] (following assumes we can compute distance of two characters, eg. c - a = 2) 8. Delta - the maximum number of distance between two characters, eg. Delta(aaa,bbb) = Delta(aaa,aab) = 1. 9. Gamma - a cumulative distance between two characters, Gamma(aaa,bbb) = 3. 10. (Delta, Gamma) - a combination of two preceding distances. Distance 10 is not exactly needed, but I included it because it is quite common to compute both Delta and Gamma distances at the same time, so I provided efficient algorithm for this. As noted, when using distances 8-10 we must be able to compute distance between two characters. This behavior should be customizable as well. Simple API definition follows. It's very immature and it will change but it should give you general idea: ~~~~ {.cpp} namespace boost{ template <typename TRange> std::size_t hamming_dist(TRange first, TRange second); template <typename TRange> std::size_t levenstein_dist(TRange first, TRange second); template <typename TRange, typename Type> Type weighted_levenstein_dist(TRange first, TRange second, Type replace_cost, Type insert_cost, Type delete_cost); template <typename TRange> std::size_t damerau_levenstein_dist(TRange first, TRange second); template <typename TRange, typename Type> Type weighted_damerau_levenstein_dist(TRange first, TRange second, Type replace_cost, Type insert_cost, Type delete_cost, Type transpose_cost); template <typename TRange> std::size_t delta_dist(TRange first, TRange second); template <typename TRange> std::size_t gamma_dist(TRange first, TRange second); template <typename TRange> float jaro_dist(TRange first, TRange second); template <typename TRange> float jaro_winkler_dist(TRange first, TRange second, float weight); template <typename TRange> std::pair<std::size_t, std::size_t> deltagamma_dist(TRange first, TRange second); template <typename Type> class approximate_distance { // ... }; class hamming_distance : public approximate_distance<std::size_t>{}; template <typename Type, typename TRange> class approx_pattern { public: approx_pattern(approximate_distance<Type>& distance, Type max_allowed, TRange pattern); }; template <typename Type, typename TRange> class sapprox_match_iterator { public: sapprox_match_iterator(TRange text, approx_pattern<Type, TRange>); // overloaded ++ and * operators }; template <typename Type, typename TRange> class sapprox_match_result { public: Type distance() const; std::size_t startIndex() const; std::size_t endIndex() const; TRange str() const; }; } ~~~~ Here is an example of usage: ~~~~ {.cpp} string a = "aaabc"; string b = "aaacb"; cout << boost::hamming_dist(a, b) << endl; // prints 2 cout << boost::damerau_levenstein_dist(a, b) << endl; // prints 1 string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::approx_pattern ap(boost::hamming_distance, 1, pattern ); boost::sapprox_match_iterator begin( text, ap ); boost::sapprox_match_iterator end; std::for_each( begin, end, []( const boost::sapprox_match_result& what ) { cout << what.str() << " starting at " << what.startIndex() << endl; } ); // aacdef at 0 , fbcdef at 5, abccef at 11 ~~~~ As you can see, I would like my implementation to be general so that other string containers can be used instead of std::string. A will also take care of proper design in a way that further distances can be easily added later. There are multiple implementation approaches that can be used: NFA simulation, Shift-or algorithm and dynamic programming approach. I will probably choose the dynamic programming - it has low time and space complexity as well as it does not require any kind of preprocessing. Thus searching all occurrences of pattern in a text has O(m*n) complexity, m is length of pattern, n is length of text. Related work There is currently only one C++ library I know of: Flamingo [2]. From my point of view, its usage is a kind of complicated. You can also find implementations of single functions on-line, but they lack of multiple distances support as well as they are not general (eg. designed for C-like strings only). Another related project is stringmetrics [3]. This project is written in Scala so it is not related to C++ programmers. Schedule -------- This is an approximate schedule I would like to keep on. Since I will probably hit many unexpected issues I have reserved some time for it. 17th June : start 17th June - 27th June : prototyping and designing API interface as well as auxiliary functions / classes 28th June - 30th June : setting up testing environment, writing basic unit tests 1st July - 14th July : implementation of general dynamic programming model that will be used by most distances 15th July - 28th July : implementation of distances 1 - 5 29th July - 4th August : implementation of distances 8 - 10 5th August - 11th August : implementation of distances 6,7 12th August - 1st September : implementation of approximate pattern search in a text for all distances 2nd September - 15th September : wrapping up, end user documentation 16th September - 21st September : reserved During all implementation phases I'll be also performing testing and source code documentation. References ---------- [1] http://en.wikipedia.org/wiki/Jaro-Winkler_distance [2] http://flamingo.ics.uci.edu/releases/4.1/ [3] http://rockymadden.com/stringmetric/