Re: [boost] [GSoC] Some new ideas for this year's GSoC

8 Apr 2010

      Phil Endecott wrote:
...
Stefan wrote:
...
* Optimize Boost String Algorithm finders to use an efficient 
substring matching algorithm as opposed to the naive algorithm. Most 
likely because of the fact that boost has implemented the finders for 
strings as generic algorithms, they haven't used efficient algorithms 
such as KMP, Boyer-Moore, Rabin-Karp etc.
  \see boost/algorithm/string/detail/finder.hpp
  The idea is to implement efficient algorithms such as the ones 
mentioned above to replace the current naive implementation. Note: it 
may be difficult to find nth matching substring position using some of 
these algorithms.
  Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The 
suffix array, after being precomputed on a certain string, allows for 
very efficient substring searches in the future. This is useful when 
the string doesn't change too often but there are plenty of substring 
to search against.
Hi Stefan,
Have you seen this thread, from a couple of years ago? :
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
I don't know what has changed in string_algo since then, but I think 
there is probably plenty of opportunity for easy optimisation.
Do you have a motivating application for this?  A lot of this stuff is 
rather data-dependant, so having a particular application that can 
provide a benchmark is a good idea.  I guess that a suffix array library 
would be reasonable for a GSoC project, if you're good!  You might also 
like to consider variants that index at e.g. word boundaries; I have 
used something like that and it works well for some kinds of search.
Regards,  Phil.
_______________________________________________
Unsubscribe & other changes: 
http://lists.boost.org/mailman/listinfo.cgi/boost
I've just implemented KMP and benchmarked against 
boost::algorithm::find_first using a vector of 500 strings and 500 
substrings, taking all possible pairs (500*500 possibilities). 250 of 
the strings were generated randomly, whereas the other 250 were 
generated by concatenating parts of some of the generated substrings. It 
took KMP around 16 seconds and boost around 54 seconds. This is still 
not very bad for boost's implementation, since the data is random, which 
is quite close to best-case. If however substrings were chosen so that 
there were many of them to partly match in the string, but not fully, 
then worst-case would be triggered.
It is obvious why an algorithm in O(nm) having its worst-case triggered 
for relatively big values of n and m (I used m=1..100 n=1..1000) would 
be a disaster.

P.S. I believe Rabin-Karp or suffix arrays would have performed better 
on random data than my KMP did.

P.P.S. I'll share the benchmarking code tomorrow if you're interested.