Re: [boost] [GSoC] Some new ideas for this year's GSoC

8 Apr 2010


      Stefan wrote:
...
Phil Endecott wrote:
...
Stefan wrote:
...
* Optimize Boost String Algorithm finders to use an efficient 
substring matching algorithm as opposed to the naive algorithm. Most 
likely because of the fact that boost has implemented the finders for 
strings as generic algorithms, they haven't used efficient algorithms 
such as KMP, Boyer-Moore, Rabin-Karp etc.
  \see boost/algorithm/string/detail/finder.hpp
  The idea is to implement efficient algorithms such as the ones 
mentioned above to replace the current naive implementation. Note: it 
may be difficult to find nth matching substring position using some 
of these algorithms.
  Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The 
suffix array, after being precomputed on a certain string, allows for 
very efficient substring searches in the future. This is useful when 
the string doesn't change too often but there are plenty of substring 
to search against.
Hi Stefan,
Have you seen this thread, from a couple of years ago? :
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
I don't know what has changed in string_algo since then, but I think 
there is probably plenty of opportunity for easy optimisation.
Do you have a motivating application for this?  A lot of this stuff is 
rather data-dependant, so having a particular application that can 
provide a benchmark is a good idea.  I guess that a suffix array 
library would be reasonable for a GSoC project, if you're good!  You 
might also like to consider variants that index at e.g. word 
boundaries; I have used something like that and it works well for some 
kinds of search.
Regards,  Phil.
_______________________________________________
Unsubscribe & other changes: 
http://lists.boost.org/mailman/listinfo.cgi/boost
I've decided to implement Rabin-Karp today and after benchmarking I 
realized that on my pseudo-randomly generated data it performs worse 
than boost. My Rabin-Karp implementation will sure need some profiling 
before it can be used in benchmarking. Also some better data sets than 
what I came up with.
Here it is:
...
I doubt rabin-karp normally performs so badly, therefore my 
implementation definitely needs improvement.
_______________________________________________
Unsubscribe & other changes: 
http://lists.boost.org/mailman/listinfo.cgi/boost
Sorry for double-posting, no idea why it happened. Also, one can neglect 
the comments "//! @TODO: choose smaller primes (so that 2*prime fits in 
uint32)" and "//! Heuristic specialization, omit testing character by 
character." as they are incorrect. The former has been solved whereas 
the latter is not a specialization, although was meant to be at one point