[String algorithm] is_any_of has inefficient implementation

I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set. There are two problems with this. First, since the order of the characters does not matter, std::unordered_set is probably faster. Second, the contained set has to be copied each time the object is used in, for instance, a split() statement. Since split can reuse a predicate literally thousands of times when splitting a large string, this is a lot of unnecessary copying. Joe Gottman

Hi, Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set. There are two problems with this. First, since the order of the characters does not matter, std::unordered_set is probably faster.
TR1 was not available at the time this class was implemented. So couldn't have used it. Thanks for suggestion.
Second, the contained set has to be copied each time the object is used in, for instance, a split() statement. Since split can reuse a predicate literally thousands of times when splitting a large string, this is a lot of unnecessary copying.
I'm aware of this. I'm working on am optimization for passing functors around the string_algo algorithms, but I'm not finished with it yet. Once finished, the copying will be mostly eliminated. Regards, Pavol.

Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set.
It would be nice if things like this could be at least as efficient as the equivalent C functionality. I've just done some quick benchmarks of possible is_any_of implementations. My test splits my /usr/share/dict/words at every lower-case vowel; it's about a megabyte and splits into 300,000 pieces. First using std::set: struct stdset_is_any_of { std::set<char> s; stdset_is_any_of(string pat) { for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) { s.insert(*i); } } bool operator()(char c) { return s.find(c) != s.end(); } }; Relative performance is 0.6. For small search sets, just storing the string and looking though it each time should be faster. This is of course linear in the search set size, rather than logarithmic: struct str_is_any_of { string pat; str_is_any_of(string pat_): pat(pat_) {} bool operator()(char c) { return pat.find(c) != string::npos; } }; Relative performance is 3.1. Using strchr on pat.c_str() is a bit faster at 3.4. For larger search sets another possibility, for 8-bit characters, is to use a bit-set: struct bitset_is_any_of { uint32_t bits[8]; bitset_is_any_of(string pat) { fill(bits,bits+8,0); for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) { char c = *i; bits[c>>5] |= (1<<(c&0x1f)); } } bool operator()(char c) { return bits[c>>5] & (1<<(c&0x1f)); } }; In practice this seems to work well even for small search sets like my aeiou; its relative performance is 4.0. Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing. In each case, I've used the predicate with std::find_if. But if I instead use C's strcspn() to do the whole job, I see a relative performance for this example of 20. I find it depressing that the C++ code doesn't come close to that. I think that every time I've used is_any_of or similar features the set of characters has been a compile-time constant. So is it possible to make the characters a template parameter, e.g. is_any_of<"ABC">? Sadly it seems that strings can't be template parameters (though apparently pointers can be; what is that used for?). But you can write this: template <char c0> bool is_any_of(char c) { return (c==c0); } template <char c0, char c1> bool is_any_of(char c) { return (c==c0) || is_any_of<c1>(c); } template <char c0, char c1, char c2> bool is_any_of(char c) { return (c==c0) || is_any_of<c1,c2>(c); } // etc. if (is_any_of<'a','b'>(ch)) .... Presumably C++0x varadic templates make that simpler. Using this sort of technique with std::find_if I get a relative performance for my benchmark of 28 - finally better than C. These numbers are all with gcc 4.1 -O3 on Linux. I hope the above is of interest to someone; having done the work it would a shame not to share the results. Joe, I guess that in your case the overhead of copying the predicate and constructing the result sequence will dominate over the performance of the predicate itself. Cheers, Phil.

Phil Endecott wrote:
Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing.
You can't specialize std::set<char> this way. It loses ordering. Sebastian Redl

----Original Message---- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Sebastian Redl Sent: 14 February 2008 17:59 To: boost@lists.boost.org Subject: Re: [boost] [String algorithm] is_any_of hasinefficient implementation
Phil Endecott wrote:
Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing.
You can't specialize std::set<char> this way. It loses ordering.
Why does it lose ordering? If I add 'b' and 'a' and then iterate over the set, I would expect the specialization to return 'a' and then 'b' (because it would find the bit corresponding to 'a' set before the bit corresponding to 'b'). Of course WE can't specialize std::set<char> this way, but an implementor could. -- Martin Bonner Senior Software Engineer/Team Leader PI SHURLOK LTD Telephone: +44 1223 441434 / 203894 (direct) Fax: +44 1223 203999 Email: martin.bonner@pi-shurlok.com www.pi-shurlok.com disclaimer

Martin Bonner wrote:
Why does it lose ordering? If I add 'b' and 'a' and then iterate over the set, I would expect the specialization to return 'a' and then 'b' (because it would find the bit corresponding to 'a' set before the bit corresponding to 'b').
Ah, never mind. My brain was scrambled.

on Thu Feb 14 2008, "Martin Bonner" <Martin.Bonner-AT-pi-shurlok.com> wrote:
----Original Message---- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Sebastian Redl Sent: 14 February 2008 17:59 To: boost@lists.boost.org Subject: Re: [boost] [String algorithm] is_any_of hasinefficient implementation
Phil Endecott wrote:
Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing.
You can't specialize std::set<char> this way. It loses ordering.
Why does it lose ordering? If I add 'b' and 'a' and then iterate over the set, I would expect the specialization to return 'a' and then 'b' (because it would find the bit corresponding to 'a' set before the bit corresponding to 'b').
Of course WE can't specialize std::set<char> this way, but an implementor could.
Assuming you're willing to play fast-and-loose with the big-O of iterator increment (I am). -- Dave Abrahams Boost Consulting http://boost-consulting.com

David Abrahams wrote:
Phil Endecott wrote:
Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing.
Assuming you're willing to play fast-and-loose with the big-O of iterator increment (I am).
It's still O(1) - its execution time is bounded - so there's nothing to worry about even if you did care. Sketch of iterator-increment code: // the bit-set: uint64_t bits[4]; unsigned char inc_iterator(unsigned char c) { // Find the next set bit in bits after c. // libc defines ffs and fls to find the first and last set bit in a word. // At least ffs should be a single instruction on many machines (e.g. // the bsfl instruction on x86). // gcc has __builtin_ffs that should generate this. It also has // __builtin_ffsll for 64-bit words. Here I'm using flsll; if that // doesn't exist you can reverse the order of the bits - except that // you'll still want the other one for decrementing, hmmm. // Note that they return the bit number plus 1, or 0 if none is set. // The lowest interesting bit: int l = c+1; // Find any next bit in the same word as l: uint64_t w = bits[l>>6]>>(l&63); if (w) return flsll(w)-1+l; // Look at the remaining words: switch (l>>6) { case 0: w = bits[1]; if (w) return flsll(w)+63; // fall through case 1: w = bits[2]; if (w) return flsll(w)+127; // fall through case 2: w = bits[3]; if (w) return flsll(w)+191; // fall through } return end(); } Phil.

Phil Endecott skrev:
Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set.
It would be nice if things like this could be at least as efficient as the equivalent C functionality. I've just done some quick benchmarks of possible is_any_of implementations. My test splits my /usr/share/dict/words at every lower-case vowel; it's about a megabyte and splits into 300,000 pieces.
First using std::set:
struct stdset_is_any_of { std::set<char> s; stdset_is_any_of(string pat) { for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) { s.insert(*i); } } bool operator()(char c) { return s.find(c) != s.end(); } };
Relative performance is 0.6.
Fon something as light as char, I suspect std::set<char> or tr1::unordered_set<char> is not an optimal choice. Could you try this test with boost::interprocess::flat_set<char> ? (It's in <boost/interprocess/containers/flat_set.hpp>) -Thorsten

Thorsten Ottosen wrote:
Phil Endecott skrev:
Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set.
It would be nice if things like this could be at least as efficient as the equivalent C functionality. I've just done some quick benchmarks of possible is_any_of implementations. My test splits my /usr/share/dict/words at every lower-case vowel; it's about a megabyte and splits into 300,000 pieces.
First using std::set:
struct stdset_is_any_of { std::set<char> s; stdset_is_any_of(string pat) { for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) { s.insert(*i); } } bool operator()(char c) { return s.find(c) != s.end(); } };
Relative performance is 0.6.
Fon something as light as char, I suspect std::set<char> or tr1::unordered_set<char> is not an optimal choice.
Could you try this test with boost::interprocess::flat_set<char> ? (It's in <boost/interprocess/containers/flat_set.hpp>)
Hi Thorsten, I can't easily test that code, but I have tried this which I think is essentially the same algorithm: struct strbin_is_any_of { string pat; strbin_is_any_of(string pat_): pat(pat_) { sort(pat.begin(),pat.end()); } bool operator()(char c) { return binary_search(pat.begin(),pat.end(),c); } }; With the same test as before, the performance of this is 1.5; that's between the std::set (0.6) and the bit-set (4.0) implementations, as you would expect. Comparing it with the code that does a linear search through an unsorted pattern string, the linear search remained faster in all of my (small set of) tests. This binary search may do better for perhaps 100 elements in the search set, but since the bit-set needs only 32 bytes of storage and is faster, I would switch over to that implementation well before then. Phil.

On Thu, Feb 14, 2008 at 6:50 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Joe Gottman wrote:
I just discovered that the string algorithm predicate is_any_of has an inefficient implementation. Internally it uses a std::set.
It would be nice if things like this could be at least as efficient as the equivalent C functionality. I've just done some quick benchmarks of possible is_any_of implementations. My test splits my /usr/share/dict/words at every lower-case vowel; it's about a megabyte and splits into 300,000 pieces.
First using std::set: [...]
Relative performance is 0.6.
For small search sets, just storing the string and looking though it each time should be faster. This is of course linear in the search set size, rather than logarithmic: [...] Relative performance is 3.1. Using strchr on pat.c_str() is a bit faster at 3.4.
For larger search sets another possibility, for 8-bit characters, is to use a bit-set: [...] In practice this seems to work well even for small search sets like my aeiou; its relative performance is 4.0.
Further investigation with other workloads suggests that this bitset really works very well, and I note that it would be possible to write a specialisation of std::set<char> using it. This is probably worth pursuing.
Is it legal to specialize an std class for a non UDT?
In each case, I've used the predicate with std::find_if. But if I instead use C's strcspn() to do the whole job, I see a relative performance for this example of 20. I find it depressing that the C++ code doesn't come close to that.
Well, it is not really a fair comparison, the 'C' function is likely implemented in hand optimized assembler, (http://tinyurl.com/yv8s7g) and for such a small function, humans are still better than compilers. So it isn't really C++ versus C. Also the C version is not generic at all.
I think that every time I've used is_any_of or similar features the set of characters has been a compile-time constant. So is it possible to make the characters a template parameter, e.g. is_any_of<"ABC">? Sadly it seems that strings can't be template parameters (though apparently pointers can be; what is that used for?). But you can write this: [...] Presumably C++0x varadic templates make that simpler.
Using this sort of technique with std::find_if I get a relative performance for my benchmark of 28 - finally better than C.
Nice! -- gpd

Giovanni Piero Deretta wrote:
On Thu, Feb 14, 2008 at 6:50 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
In each case, I've used the predicate with std::find_if. But if I instead use C's strcspn() to do the whole job, I see a relative performance for this example of 20. I find it depressing that the C++ code doesn't come close to that.
Well, it is not really a fair comparison, the 'C' function is likely implemented in hand optimized assembler, (http://tinyurl.com/yv8s7g) and for such a small function, humans are still better than compilers. So it isn't really C++ versus C. Also the C version is not generic at all.
Hi Giovanni, I can't agree with that. There's no reason why the C++ implementation couldn't be written in hand-optimised assembler too. Or it could simply call the fast C function: template <> const char* std::find(const char* first, const char* last, char value) <const char*> { return ::memchr(first,value,last-first); } Specialisation allows C++ libraries to provide functions that are both generic and optimised for special cases. Anyway, rather than complaining I should be writing code. This week I have been writing some UTF-8 encoding and decoding and Unicode<->iso8859 conversion algorithms. They seem to be faster than the libc implementations which is satisfying especially as I haven't even started on the serious optimisations yet. This will be part of the strings-tagged-with-character-sets stuff that I have described before. Anyone interested? Phil.

On Fri, Feb 15, 2008 at 6:54 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Giovanni Piero Deretta wrote:
On Thu, Feb 14, 2008 at 6:50 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
In each case, I've used the predicate with std::find_if. But if I instead use C's strcspn() to do the whole job, I see a relative performance for this example of 20. I find it depressing that the C++ code doesn't come close to that.
Well, it is not really a fair comparison, the 'C' function is likely implemented in hand optimized assembler, (http://tinyurl.com/yv8s7g) and for such a small function, humans are still better than compilers. So it isn't really C++ versus C. Also the C version is not generic at all.
Hi Giovanni,
I can't agree with that. There's no reason why the C++ implementation couldn't be written in hand-optimised assembler too. Or it could simply call the fast C function:
Actually we are in agreement :). Of course the C++ version (or specializations of that) can be implemented in hand tuned assembler, I was just pointing out that the difference in speed wasn't due to language 'power', but just implementation details. There is no reason that the C++ interface to be slower to the C one (in fact you have shown that, by using compile time knowledge, a plain C++ version can even beat the assembler code). -- gpd

Phil Endecott wrote:
Anyway, rather than complaining I should be writing code. This week I have been writing some UTF-8 encoding and decoding and Unicode<->iso8859 conversion algorithms. They seem to be faster than the libc implementations which is satisfying especially as I haven't even started on the serious optimisations yet. This will be part of the strings-tagged-with-character-sets stuff that I have described before. Anyone interested?
Always. Sebastian

Hi Phil, On 15/02/2008, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote: ...snip... Anyway, rather than complaining I should be writing code. This week I
have been writing some UTF-8 encoding and decoding and Unicode<->iso8859 conversion algorithms. They seem to be faster than the libc implementations which is satisfying especially as I haven't even started on the serious optimisations yet. This will be part of the strings-tagged-with-character-sets stuff that I have described before. Anyone interested?
Yep, definitely. The Fast/S/CGI code from last year's GSoC needs to be Unicode/multi-charset-friendly. I've no experience here unfortunately, but I've been wondering recently how it can/should be supported. It'd be excellent having access to some ready-made wrappers (+ algorithms) for this. Very interested! Regards, Darren

On Fri, Feb 15, 2008 at 3:54 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
[snip]
Anyway, rather than complaining I should be writing code. This week I have been writing some UTF-8 encoding and decoding and Unicode<->iso8859 conversion algorithms. They seem to be faster than the libc implementations which is satisfying especially as I haven't even started on the serious optimisations yet. This will be part of the strings-tagged-with-character-sets stuff that I have described before. Anyone interested?
Sure. Though I'm most interested in all charset conversions. But the most usual is enough to speed up my application *a lot*.
Phil.
Thanks and best regards, -- Felipe Magno de Almeida
participants (10)
-
Darren Garvey
-
David Abrahams
-
Felipe Magno de Almeida
-
Giovanni Piero Deretta
-
Joe Gottman
-
Martin Bonner
-
Pavol Droba
-
Phil Endecott
-
Sebastian Redl
-
Thorsten Ottosen