
On Thu, Feb 14, 2008 at 6:50 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set.
It would be nice if things like this could be at least as efficient as the equivalent C functionality. I've just done some quick benchmarks of possible is_any_of implementations. My test splits my /usr/share/dict/words at every lower-case vowel; it's about a megabyte and splits into 300,000 pieces.
First using std::set: [...]
Relative performance is 0.6.
For small search sets, just storing the string and looking though it each time should be faster. This is of course linear in the search set size, rather than logarithmic: [...] Relative performance is 3.1. Using strchr on pat.c_str() is a bit faster at 3.4.
For larger search sets another possibility, for 8-bit characters, is to use a bit-set: [...] In practice this seems to work well even for small search sets like my aeiou; its relative performance is 4.0.
Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing.
Is it legal to specialize an std class for a non UDT?
In each case, I've used the predicate with std::find_if. But if I instead use C's strcspn() to do the whole job, I see a relative performance for this example of 20. I find it depressing that the C++ code doesn't come close to that.
Well, it is not really a fair comparison, the 'C' function is likely implemented in hand optimized assembler, (http://tinyurl.com/yv8s7g) and for such a small function, humans are still better than compilers. So it isn't really C++ versus C. Also the C version is not generic at all.
I think that every time I've used is_any_of or similar features the set of characters has been a compile-time constant. So is it possible to make the characters a template parameter, e.g. is_any_of<"ABC">? Sadly it seems that strings can't be template parameters (though apparently pointers can be; what is that used for?). But you can write this: [...] Presumably C++0x varadic templates make that simpler.
Using this sort of technique with std::find_if I get a relative performance for my benchmark of 28 - finally better than C.
Nice! -- gpd