
Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set.
It would be nice if things like this could be at least as efficient as the equivalent C functionality. I've just done some quick benchmarks of possible is_any_of implementations. My test splits my /usr/share/dict/words at every lower-case vowel; it's about a megabyte and splits into 300,000 pieces. First using std::set: struct stdset_is_any_of { std::set<char> s; stdset_is_any_of(string pat) { for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) { s.insert(*i); } } bool operator()(char c) { return s.find(c) != s.end(); } }; Relative performance is 0.6. For small search sets, just storing the string and looking though it each time should be faster. This is of course linear in the search set size, rather than logarithmic: struct str_is_any_of { string pat; str_is_any_of(string pat_): pat(pat_) {} bool operator()(char c) { return pat.find(c) != string::npos; } }; Relative performance is 3.1. Using strchr on pat.c_str() is a bit faster at 3.4. For larger search sets another possibility, for 8-bit characters, is to use a bit-set: struct bitset_is_any_of { uint32_t bits[8]; bitset_is_any_of(string pat) { fill(bits,bits+8,0); for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) { char c = *i; bits[c>>5] |= (1<<(c&0x1f)); } } bool operator()(char c) { return bits[c>>5] & (1<<(c&0x1f)); } }; In practice this seems to work well even for small search sets like my aeiou; its relative performance is 4.0. Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing. In each case, I've used the predicate with std::find_if. But if I instead use C's strcspn() to do the whole job, I see a relative performance for this example of 20. I find it depressing that the C++ code doesn't come close to that. I think that every time I've used is_any_of or similar features the set of characters has been a compile-time constant. So is it possible to make the characters a template parameter, e.g. is_any_of<"ABC">? Sadly it seems that strings can't be template parameters (though apparently pointers can be; what is that used for?). But you can write this: template <char c0> bool is_any_of(char c) { return (c==c0); } template <char c0, char c1> bool is_any_of(char c) { return (c==c0) || is_any_of<c1>(c); } template <char c0, char c1, char c2> bool is_any_of(char c) { return (c==c0) || is_any_of<c1,c2>(c); } // etc. if (is_any_of<'a','b'>(ch)) .... Presumably C++0x varadic templates make that simpler. Using this sort of technique with std::find_if I get a relative performance for my benchmark of 28 - finally better than C. These numbers are all with gcc 4.1 -O3 on Linux. I hope the above is of interest to someone; having done the work it would a shame not to share the results. Joe, I guess that in your case the overhead of copying the predicate and constructing the result sequence will dominate over the performance of the predicate itself. Cheers, Phil.