[gsoc 2013] Approximate string matching proposal
Hi, I've already posted this but it probably got buried. My proposal
for GSOC 2013 is also available at
http://webdev.fit.cvut.cz/~strnaj11/gsoc/.
Regards, Jan Strnad
GSOC 2013 Proposal - Approximate string pattern matching
========================================================
Abstract
--------
My name is Jan Strnad and this is my proposal for GSOC 2013.
I would like to implement a set of approximate string matching
algorithms. There are two categories of these algorithms. A
representative of the first category takes two strings and computes
so-called distance between those two strings. A member of the second
category takes some text and a pattern and finds all approximate matches
of the pattern in the text according to some given distance. I'm
planning to support various distances, such as Hamming, Levenstein,
Damerau–Levenshtein, Jaro, Delta, Gamma ... I'm proposing to implement a
general solution that supports multiple string containers, not only
std::string. There are many possible applications such as spell
checking.
Personal Details
----------------
Name: Jan Strnad
University: Czech Technical University in Prague, Faculty of Information
Technology, The Czech Republic
Major: Theoretical Informatics / System Programming - studying both in
parallel
Degree program: MSc.
Email/XMPP: hanny.strnad@gmail.com
ICQ: 335127474
homepage: none
Mailing address: Okružní 508, 788 13 Vikýřovice, Czech Republic
Availability:
* Start: at worse by 17th June but probably earlier
* End: 21st September
* I have no other duties during this period but I may take a one week
vacation.
* I'm planning to spend 40-45 hours per week.
Background
----------
Please summarize your educational background
I'm first year master's student at CTU in Prague with major in
Theoretical Informatics and System Programming. At this university I
also gained my BSc. degree (Software Engineering, with honors) in June
2012. Here's an incomplete list of courses I have taken so far: Data
structures and algorithms, Graph theory, Complexity theory, Automata and
grammar theory, Compiler design course, Software engineering courses
(including practical ones), various programming languages (C, C++, Java,
Scala, Common Lisp, Haskell, Erlang, SML), various programming paradigms
(imperative, object-oriented, functional, logical), math & statistics
courses, Data compression, String pattern matching course, Computer
architecture related courses, Parallel algorithms course and many
others. I also spent 4 months as an exchange student in Uppsala
University, Sweden where I took Constraint programming and Software
verification courses.
Please summarize your programming background
I'm able to code in many languages, but I'm most experienced in C++ (5
years) and Scala languages (1.5 years). My C++ experience is gained
mostly from school's projects an my small own projects (usually some
kind of small tweaks of third party programs). I have an experience with
C++ templates, multi-threading using both POSIX threads and openMP, STL
and with some of the Boost libraries (Program Options, Smart Ptr). I
definitely not know everything that would be needed for my GSOC project,
but I'm quick learner.
Please tell us a little about your programming interests. Please tell us
why you are interested in contributing to the Boost C++ Libraries. What is
your interest in the project you are proposing?
As you may have guested, my professional interests are programming
languages (especially their paradigms), concurrent & parallel
programming and finally everything related to theoretical informatics.
That includes things such as data compression or pattern matching. This
is also one of the reasons why I want to contribute to Boost. I want to
share my knowledge of pattern matching algorithms and I believe that the
Boost project is the right place. I'm seeking for an off the school
programming experience. Another reason is that since I'm a long time
user of various open source software, I believe that it is time to start
to contribute to community as well. From this program I mostly expecting
gaining some real world experience. My C++ and especially Boost
knowledge should improve significantly.
Have you done any previous work in this area before or on similar projects?
I think I have everything that is needed to successfully complete this
project. I have a theoretical foundations of approximate string matching
algorithms from my university courses as well as I'm able to implement
them using C++ without any major issues.
What are your plans beyond this Summer of Code time frame for your
proposed work?.
I know that it may take a long time for a project to become a part of
Boost library. So, if selected, my intentions after the summer is over
are:
1. Release my project as an open-source, not necessary as a standard
part of Boost library.
2. Do everything necessary to include my work as a part of standard part
of Boost library eventually.
3. Further development: adding support of more distances, implementing
another string algorithms such as repetitions searching. This heavily
depends on users' interest.
I know that my work won't end by the end of summer. I'm willing to spend
a couple of hours per week (say 2 or 3) to keep my project updated and
most importantly alive.
Languages, technologies, tools
My regular C++ development environment is consisted of Linux, git,
CMake, Sublime text, gdb and valgrind. I have also used Eclipse and svn
before. For documentation purposes I'm most familiar with Doxygen.
Here is a self-evaluation of desired skills:
C++ 3
STL 4
Boost libraries 1
Subversion 3
Proposal
--------
Approximate string matching is used in many areas. To name some of them,
there are spell checking and DNA processing. I can also imagine many
applications in human-to-computer interaction.
I'm proposing an extension of Boost String algorithms library. I would
like to implement algorithms of two categories:
1. Compute distance of two strings. Imagine a function that takes two
strings a returns a number describing their distance (various distances
are described below).
2. Find all approximate matches of a pattern in some text using given
distance. Example of this can be found in API interface usage example.
Distance is defined as a number (or a tuple of numbers) describing
similarity of two strings.
Here is a list of all distances I'm planning to implement:
1. Hamming - a number of replace (single character only) operations
needed so that the strings match exactly.
2. Levenstein - a number of insert, delete and replace operations.
3. Generalized Levenstein (Weighted Levenstein) - same as before but
each operation may have different cost.
4. Damerau-Levenstein - a number of insert, delete, replace and
transpose (of two adjacent characters) operations.
5. Weigted Damerau-Levenstein - same as before but each operation may
have different cost.
6. Jaro - please see [1]
7. Jaro-Winkler - also see [1]
(following assumes we can compute distance of two characters, eg. c - a
= 2)
8. Delta - the maximum number of distance between two characters, eg.
Delta(aaa,bbb) = Delta(aaa,aab) = 1.
9. Gamma - a cumulative distance between two characters, Gamma(aaa,bbb)
= 3.
10. (Delta, Gamma) - a combination of two preceding distances.
Distance 10 is not exactly needed, but I included it because it is quite
common to compute both Delta and Gamma distances at the same time, so I
provided efficient algorithm for this.
As noted, when using distances 8-10 we must be able to compute distance
between two characters. This behavior should be customizable as well.
Simple API definition follows. It's very immature and it will change but
it should give you general idea:
~~~~ {.cpp}
namespace boost{
template <typename TRange>
std::size_t hamming_dist(TRange first, TRange second);
template <typename TRange>
std::size_t levenstein_dist(TRange first, TRange second);
template
participants (1)
-
Jan Strnad