Re: [boost] [GSoC] Some new ideas for this year's GSoC

7 Apr 2010

      Stefan wrote:
...
...
...
* Optimize Boost String Algorithm finders to use an efficient 
substring matching algorithm as opposed to the naive algorithm. Most 
likely because of the fact that boost has implemented the finders for 
strings as generic algorithms, they haven't used efficient algorithms 
such as KMP, Boyer-Moore, Rabin-Karp etc.
  \see boost/algorithm/string/detail/finder.hpp
  The idea is to implement efficient algorithms such as the ones 
mentioned above to replace the current naive implementation. Note: it 
may be difficult to find nth matching substring position using some of 
these algorithms.
...
I don't have an application proving it's inefficient yet, but a O(nm) 
algorithm would almost always perform worse than a O(n+m) one.
So n = size of input and m = size of search pattern.

If m is typically small, e.g. 2, then O(n+m) could easily beat O(nm) if 
the constant factors are in its favour.

Even if m is large, then there will often be a dominant case where the 
potential match can be rejected after looking at just the first element 
(i.e. if I'm looking for xyz in abcdefgxyzlmnop, then a "worse case 
O(nm)" algorithm will actually take O(n)).

I think that having a test data set in mind in advance would make this 
much more concrete to reason about.  Does anyone have any suggestions?
...
I've just looked at the posted thread, but the algorithm is_any_of only 
looks for a character in a string, not for a substring in a string.
I guess my point was that even the simpler problem of finding a single 
character has lots of potential for interesting work.

Regards,  Phil.

Re: [boost] [GSoC] Some new ideas for this year's GSoC

Phil Endecott