david v wrote:
Hello, As James pointed out the right place to start with will be suffix tree algorithms. The only thing i don't know if this is going to be very time consuming ???
######For Brian.... 1) What's the longest string you're searching for? For the moment is limited to 6 to 8 characters.
2) What's the smallest bit-width of any cpu on which you're running your code? I'm on 64 dual opteron machine
3) How familiar with string searching/matching are you? Quite familiar but was looking for a cpp library that could handle it instead of trying to rewrite things.
Best, david
David, In my experience, the bit-parallel algorithm of Navarro/Hyyro is significantly faster to build and smaller in memory than the suffix trees. As a third bonus, it's a faster search technique. Two papers: 1) "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" by Heikki Hyyro of the Dept of Computer and Information Sciences, University of Tampere Finland. 2) "Faster Bit-parallel Approximate String Matching" by Hyyro and Gonzalo Navarro. I think that paper is from the Proc. 13th Combinatorial Pattern Matching Both should show up on a google search if you use the exact title. Since you've a 64-bit machine, you could handle up to 63 characters of a search pattern at a max, with any length text to search. Apologies that I can't share my implementations with you on this, they're not for searching but for string matching. I seem to remember that the papers are oriented towards your type of searching across a large text stream. regards, and congrats in advance, Brian