
Thomas Klimpel wrote:
> Phil Endecott wrote:
>> I think that having a test data set in mind in advance would make this much more concrete to reason about. Does anyone have any suggestions?
>
> Some people regard using an algorithm with a really bad complexity as some sort of bug, no?
Yes, indeed!
> So the question would be how this "bug" can be exposed, in case it really is a bug. One suggestion in this direction might be a denial-of-service attack. But in my own experience, such "bugs" tend to cause damage in more subtle and unexpected ways. Perhaps somebody can answer the opposite question for me: what is the advantage of having subtle bugs in your algorithms that only wait to bite you when you least expect it?
Well, the not-100%-serious answer would be "because they might be faster the rest of the time".

I think these algorithms would be useful contributions to Boost. But because their complexity benefits may not show up for common usage patterns (e.g. short search patterns), I think that studying their actual performance with some realistic test data (and not synthetic pseudo-random patterns!) should be part of the project.

I also wonder about other variations, e.g. avoiding recomputing the table in the KMP or Boyer-Moore algorithms for repeated searches with the same pattern:

    string text = content_of_file("long_text.txt");
    kmp_pattern pat("search for this");  // compute the table here
    string::const_iterator i = text.begin();
    while (i != text.end()) {
      i = kmp_search(i, text.end(), pat);
      ...
    }

If N = the length of the text, M = the number of matches and P = the length of the pattern, I think this reduces the complexity from O(N+MP) to O(N+P). Do this with TMP and a compile-time pattern, and you've got something xpressive-like.

But in my experience, when I have had to do fast searching it has always been repeated searches for different patterns, so I've used some sort of index. Stephan mentioned suffix arrays, and I think that would be an even more valuable contribution.

Regards, Phil.
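P.S. The kmp_pattern / kmp_search names above are just placeholders I invented for the example. To make the idea concrete, here is a rough, untested sketch (not a proposed interface) of a pattern object that pays the O(P) table construction once in its constructor and then reuses the table on every search:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Illustrative sketch only: holds the pattern together with its KMP
    // failure table, so the O(P) table construction is paid once per
    // pattern rather than once per search.
    class kmp_pattern {
    public:
        explicit kmp_pattern(const std::string& p) : pat_(p), fail_(p.size(), 0) {
            // Standard KMP failure function: fail_[i] is the length of the
            // longest proper prefix of pat_[0..i] that is also a suffix.
            std::size_t k = 0;
            for (std::size_t i = 1; i < pat_.size(); ++i) {
                while (k > 0 && pat_[k] != pat_[i]) k = fail_[k - 1];
                if (pat_[k] == pat_[i]) ++k;
                fail_[i] = k;
            }
        }
        const std::string& str() const { return pat_; }
        const std::vector<std::size_t>& fail() const { return fail_; }
    private:
        std::string pat_;
        std::vector<std::size_t> fail_;
    };

    // Returns an iterator to the start of the first occurrence of pat in
    // [first, last), or last if there is none.  Each call only scans the
    // text; the precomputed table is reused.
    std::string::const_iterator
    kmp_search(std::string::const_iterator first,
               std::string::const_iterator last,
               const kmp_pattern& pat)
    {
        const std::string& p = pat.str();
        if (p.empty()) return first;
        std::size_t k = 0;  // number of pattern characters currently matched
        for (; first != last; ++first) {
            while (k > 0 && p[k] != *first) k = pat.fail()[k - 1];
            if (p[k] == *first) ++k;
            if (k == p.size())
                return first - static_cast<std::ptrdiff_t>(p.size() - 1);
        }
        return last;
    }

With something like this, the only per-search cost is the scan of the searched range, which is where the O(N+MP) to O(N+P) saving comes from; a compile-time variant would move the table construction into the type of the pattern.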