[uuids] On boost::uuids::uuid operator == performance.

Hello. I would like to discuss a performance issue with boost::uuids::uuid operator==. Consider the following program:

#include <boost/uuid/uuid.hpp>
#include <boost/uuid/uuid_generators.hpp>

int main() {
    boost::uuids::uuid id1((boost::uuids::random_generator()())),
                       id2((boost::uuids::random_generator()()));
    return id1 == id2; // we will consider this line
}

It is compiled with maximum optimizations using Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86:

cl -EHsc -Ox -GL -Ob2 -Oi -Fa -I..\3dparty\boost_1_48_0 m051.cpp

Boost uses the following comparison operator:

inline bool operator==(uuid const& lhs, uuid const& rhs) /* throw() */
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin());
}

which ends up as the following code:

lea edx, DWORD PTR _id2$[esp+92]
lea ecx, DWORD PTR _id1$[esp+108]
lea eax, DWORD PTR _id1$[esp+92]
call ?_Equal@std@@YA_NPBE00@Z ; std::_Equal

The whole story looks as follows (the code above plus the std::_Equal call expansion):

0117C6AB lea edx,[esp+24h]
0117C6AF lea ecx,[esp+48h]
0117C6B3 lea eax,[esp+38h]
0117C6B7 call 01171000
01171000 push esi
01171001 mov esi,eax
01171003 sub ecx,esi
01171005 push edi
01171006 cmp ecx,4
01171009 jb 01171024
0117100B jmp 01171010
0117100D lea ecx,[ecx]
01171010 mov eax,dword ptr [esi]
01171012 cmp eax,dword ptr [edx]
01171014 jne 01171028
01171016 sub ecx,4
01171019 add edx,4
0117101C add esi,4
0117101F cmp ecx,4
01171022 jae 01171010
01171024 test ecx,ecx
01171026 je 01171073
01171028 movzx eax,byte ptr [esi]
0117102B movzx edi,byte ptr [edx]
0117102E sub eax,edi
01171030 jne 01171063
01171032 cmp ecx,1
01171035 jbe 01171073
01171037 movzx eax,byte ptr [esi+1]
0117103B movzx edi,byte ptr [edx+1]
0117103F sub eax,edi
01171041 jne 01171063
01171043 cmp ecx,2
01171046 jbe 01171073
01171048 movzx eax,byte ptr [esi+2]
0117104C movzx edi,byte ptr [edx+2]
01171050 sub eax,edi
01171052 jne 01171063
01171054 cmp ecx,3
01171057 jbe 01171073
01171059 movzx eax,byte ptr [esi+3]
0117105D movzx ecx,byte ptr [edx+3]
01171061 sub eax,ecx
01171063 sar eax,1Fh
01171066 or eax,1
01171069 xor edx,edx
0117106B test eax,eax
0117106D pop edi
0117106E sete al
01171071 pop esi
01171072 ret
01171073 xor eax,eax
01171075 xor edx,edx
01171077 test eax,eax
01171079 pop edi
0117107A sete al
0117107D pop esi
0117107E ret

I understand that the boost::uuids::uuid implementation is conceptually correct and embodies best programming practices. But the following equivalent is much faster (you may want to ensure proper alignment for some processor architectures, or write an even faster 64-bit version):

inline bool comp(const boost::uuids::uuid& l, const boost::uuids::uuid& r) {
    return *reinterpret_cast<const uint32_t*>(l.data)    == *reinterpret_cast<const uint32_t*>(r.data)
        && *reinterpret_cast<const uint32_t*>(l.data+4)  == *reinterpret_cast<const uint32_t*>(r.data+4)
        && *reinterpret_cast<const uint32_t*>(l.data+8)  == *reinterpret_cast<const uint32_t*>(r.data+8)
        && *reinterpret_cast<const uint32_t*>(l.data+12) == *reinterpret_cast<const uint32_t*>(r.data+12);
}

The assembler output is:

mov ecx, DWORD PTR _id1$[esp+92]
cmp ecx, DWORD PTR _id2$[esp+92]
jne SHORT $LN37@main
mov edx, DWORD PTR _id1$[esp+96]
cmp edx, DWORD PTR _id2$[esp+96]
jne SHORT $LN37@main
mov eax, DWORD PTR _id1$[esp+100]
cmp eax, DWORD PTR _id2$[esp+100]
jne SHORT $LN37@main
mov ecx, DWORD PTR _id1$[esp+104]
cmp ecx, DWORD PTR _id2$[esp+104]
jne SHORT $LN37@main
mov eax, 1
jmp SHORT $LN40@main
$LN37@main:
xor eax, eax
$LN40@main:
movzx eax, al

So the question arises: is Boost supposed to be used in commercial applications, or is it just a testing area for some concepts? You have probably never considered the fact that Debug (unoptimized) configurations are sometimes simply unusable with such an approach on real (big) data.

If you are interested, you may compile the program above this way and see the resulting code:

cl -EHsc -Zi -Fa -I..\3dparty\boost_1_48_0 m051.cpp

Thank you in advance for your consideration.

-- Michael Kochetkov

Michael Kochetkov wrote:
inline bool comp(const boost::uuids::uuid& l, const boost::uuids::uuid& r) { return *reinterpret_cast<const uint32_t*>(l.data) ==
... I could've sworn that memcmp is smart enough on MSVC to do these unrolled DWORD compares for small sizes, but it isn't. I wonder why I got that impression.

On 16/04/12 01:10, Michael Kochetkov wrote:
boost uses the following comparison operator:

inline bool operator==(uuid const& lhs, uuid const& rhs) /* throw() */
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin());
}

which ends up as the following code:

lea edx, DWORD PTR _id2$[esp+92]
lea ecx, DWORD PTR _id1$[esp+108]
lea eax, DWORD PTR _id1$[esp+92]
call ?_Equal@std@@YA_NPBE00@Z ; std::_Equal
A good implementation of std::equal would call memcmp here, which would already be optimized. That doesn't seem to be the case. Can you try replacing this with memcmp(lhs.begin(), rhs.begin(), lhs.size())?
inline bool comp(const boost::uuids::uuid& l, const boost::uuids::uuid& r) {
    return *reinterpret_cast<const uint32_t*>(l.data)    == *reinterpret_cast<const uint32_t*>(r.data)
        && *reinterpret_cast<const uint32_t*>(l.data+4)  == *reinterpret_cast<const uint32_t*>(r.data+4)
        && *reinterpret_cast<const uint32_t*>(l.data+8)  == *reinterpret_cast<const uint32_t*>(r.data+8)
        && *reinterpret_cast<const uint32_t*>(l.data+12) == *reinterpret_cast<const uint32_t*>(r.data+12);
}
This code is specific to 32-bit architectures and also breaks the strict-aliasing rule.

Mathias Gaunard wrote:
On 16/04/12 01:10, Michael Kochetkov wrote:
boost uses the following comparison operator:

inline bool operator==(uuid const& lhs, uuid const& rhs) /* throw() */
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin());
}

which ends up as the following code:

lea edx, DWORD PTR _id2$[esp+92]
lea ecx, DWORD PTR _id1$[esp+108]
lea eax, DWORD PTR _id1$[esp+92]
call ?_Equal@std@@YA_NPBE00@Z ; std::_Equal
A good implementation of std::equal would call memcmp here, ...
It does.

On 04/16/2012 01:58 PM, Peter Dimov wrote:
Mathias Gaunard wrote:
On 16/04/12 01:10, Michael Kochetkov wrote:
boost uses the following comparison operator:

inline bool operator==(uuid const& lhs, uuid const& rhs) /* throw() */
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin());
}

which ends up as the following code:

lea edx, DWORD PTR _id2$[esp+92]
lea ecx, DWORD PTR _id1$[esp+108]
lea eax, DWORD PTR _id1$[esp+92]
call ?_Equal@std@@YA_NPBE00@Z ; std::_Equal
A good implementation of std::equal would call memcmp here, ...
It does.
The code above calls the std::_Equal symbol, not memcmp. The code given for std::_Equal also does not contain a jump to any other function.

Mathias Gaunard wrote:
The code above calls the std::_Equal symbol, not memcmp. The code given for std::_Equal also does not contain a jump to any other function.
memcmp is inlined into _Equal. Replacing std::equal with memcmp yields the same assembly code, this time inlined directly at the point where == is invoked.

on Mon Apr 16 2012, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:
Mathias Gaunard wrote:
The code above calls the std::_Equal symbol, not memcmp. The code given for std::_Equal also does not contain a jump to any other function.
memcmp is inlined into _Equal. Replacing std::equal with memcmp yields the same assembly code, this time inlined directly at the point where == is invoked.
Don't forget alignment. Those DWORD comparisons/copies may only work well for aligned memory but when you call memcmp, you erase alignment information and the compiler may assume it's only char-aligned. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Don't forget alignment. Those DWORD comparisons/copies may only work well for aligned memory but when you call memcmp, you erase alignment information and the compiler may assume it's only char-aligned.

I did not forget it. My words about alignment were simply skipped. No doubt you can align uuid::data[16] almost any suitable way, even for uint64_t. You can even start the comparison from the back, if you feel that your uuids tend to differ from the back, to speed the approach up even more.
I would also like to confirm and further elaborate the statement in the discussion (I will not quote it from another message) that memcmp is an intrinsic in MSVC for many cases (I have enforced it with the Oi compiler switch); it is aware of alignment, the underlying data type and array sizes. You may want to have a look at the following example:

#include <cstdint>
#include <cstring>

int main() {
    uint8_t d1[16], d2[16];
    return memcmp(d1,d2,16);
}

Here is the result:

$LL4@main:
mov esi, DWORD PTR [edx]
cmp esi, DWORD PTR [ecx]
jne SHORT $LN5@main
sub eax, 4
add ecx, 4
add edx, 4
cmp eax, 4
jae SHORT $LL4@main
xor eax, eax

Pretty neat. I wish Boost and the C++ standard library would develop the same way.

-- Michael Kochetkov

Michael Kochetkov wrote: ...
int main() {
    uint8_t d1[16], d2[16];
    return memcmp(d1,d2,16);
}

Here is the result.

$LL4@main:
mov esi, DWORD PTR [edx]
cmp esi, DWORD PTR [ecx]
jne SHORT $LN5@main
sub eax, 4
add ecx, 4
add edx, 4
cmp eax, 4
jae SHORT $LL4@main
xor eax, eax
Pretty neat.
That's not what I see for the following:

int main()
{
    boost::uuids::uuid id1, id2;
    return memcmp( id1.data, id2.data, 16 );
}

even though uuid::data is uint8_t[16]. I see an assembly listing that is virtually identical to the one you posted for _Equal, which is also just a memcmp call:

inline bool __CLRCALL_OR_CDECL _Equal(const unsigned char *_First1,
    const unsigned char *_Last1, const unsigned char *_First2,
    random_access_iterator_tag, _Range_checked_iterator_tag)
{   // compare [_First1, _Last1) to [First2, ...), for unsigned chars
#if _HAS_ITERATOR_DEBUGGING
    _DEBUG_RANGE(_First1, _Last1);
    if (_First1 != _Last1)
        _DEBUG_POINTER(_First2);
#endif /* _HAS_ITERATOR_DEBUGGING */
    return (::memcmp(_First1, _First2, _Last1 - _First1) == 0);
}

I'm using VC++2005 though. Although in both cases, the loop that is actually executed is identical to the one you give above.

even though uuid::data is uint8_t[16]. I see an assembly listing that is virtually identical to the one you posted for _Equal, which is also just a memcmp call:

You probably want to make sure the Oi switch is present.
inline bool __CLRCALL_OR_CDECL _Equal(const unsigned char *_First1,
    const unsigned char *_Last1, const unsigned char *_First2,
    random_access_iterator_tag, _Range_checked_iterator_tag)
{   // compare [_First1, _Last1) to [First2, ...), for unsigned chars
#if _HAS_ITERATOR_DEBUGGING
    _DEBUG_RANGE(_First1, _Last1);
    if (_First1 != _Last1)
        _DEBUG_POINTER(_First2);
#endif /* _HAS_ITERATOR_DEBUGGING */
    return (::memcmp(_First1, _First2, _Last1 - _First1) == 0);
}

You have an indirection here, and the original code has two indirections -- iterators plus the _Equal call. As you may see, this is too complex for the MS optimizer -- it looks like it loses the data size and alignment information and gives up.
I'm using VC++2005 though.

I do remember that at least the memcpy intrinsic was pretty smart in that compiler. Though the optimizers of subsequent versions are no doubt better.
Although in both cases, the loop that is actually executed is identical to the one you give above.

You are right. And I was unable to make VC unroll it. That is why I believe a library that is intended to be used in commercial applications should do its best to optimize evident cases by hand. And I believe uint8_t data[16] is a crystal-clear case in all respects, including alignment issues.
-- Michael Kochetkov

Michael Kochetkov wrote:
I'm using VC++2005 though.
I do remember that at least memcpy intrinsic worked pretty smart in that compiler.
It does. I think I see what's going on. memcmp has to return -1, 0 or 1. This is why it does an initial DWORD loop to find the first mismatch, then does a byte compare to determine the return value.

Interestingly, if I try the following program, the "stupid" my_memcmp produces the fastest result.

#include <boost/uuid/uuid.hpp>
#include <iostream>
#include <windows.h>
#include <mmsystem.h>

typedef unsigned uint32_t;

inline bool Eq16( unsigned char const * p, unsigned char const * q )
{
    return *reinterpret_cast<const uint32_t*>( p )    == *reinterpret_cast<const uint32_t*>( q )
        && *reinterpret_cast<const uint32_t*>( p+4 )  == *reinterpret_cast<const uint32_t*>( q+4 )
        && *reinterpret_cast<const uint32_t*>( p+8 )  == *reinterpret_cast<const uint32_t*>( q+8 )
        && *reinterpret_cast<const uint32_t*>( p+12 ) == *reinterpret_cast<const uint32_t*>( q+12 );
}

inline bool my_memcmp( unsigned char const * p, unsigned char const * q, size_t n )
{
    for( size_t i = 0; i < n; ++i )
    {
        if( p[i] != q[i] ) return false;
    }

    return true;
}

int main()
{
    boost::uuids::uuid id1 = {}, id2 = {};

    int const N = 100000000;

    DWORD t1 = timeGetTime();

    int s = 0;

    for( int i = 0; i < N; ++i )
    {
        //s += ( id1 == id2 );
        //s += Eq16( id1.data, id2.data );
        //s += memcmp( id1.data, id2.data, 16 );
        s += my_memcmp( id1.data, id2.data, 16 );

        id2.data[ i % 16 ] += i & 0xFF;
    }

    DWORD t2 = timeGetTime();

    std::cout << s << ": " << t2 - t1 << std::endl;
}

Interestingly, if I try the following program, the "stupid" my_memcmp produces the fastest result.

My results for your code are:

Eq16        -- 100%
my_memcmp   -- 119.46 %
memcmp      -- 153.94 %
operator == -- 181.03 %
I cannot see how it may be otherwise in your case, unless:

1. You are wrong about your results.
2. VS2005 has a very small default alignment.

For the second case you may add

#pragma pack(push,16)
#pragma pack(pop)

in the proper #if defined(_MSC_VER) sections in uuid.hpp.

BTW, if you are interested in the final result, you do need to initialize your uuids with some random values. VC initializes all bytes with zeros and is smart enough to remember that id1 does not change:

xor edi, edi
mov DWORD PTR _id2$[esp+40], eax
mov DWORD PTR _id2$[esp+44], eax
mov DWORD PTR _id2$[esp+48], eax
mov DWORD PTR _id2$[esp+52], eax
....
$LN3@main:
; Line 39
cmp edi, DWORD PTR _id2$[esp+40]
jne SHORT $LN8@main
cmp edi, DWORD PTR _id2$[esp+44]
jne SHORT $LN8@main
cmp edi, DWORD PTR _id2$[esp+48]
jne SHORT $LN8@main
cmp edi, DWORD PTR _id2$[esp+52]
jne SHORT $LN8@main
mov ecx, 1
jmp SHORT $LN9@main
$LN8@main:
xor ecx, ecx
$LN9@main:
movzx ecx, cl
; Line 42
mov edx, eax
add esi, ecx
and edx, -2147483633 ; 8000000fH
jns SHORT $LN38@main
dec edx
or edx, -16 ; fffffff0H
inc edx
$LN38@main:
add BYTE PTR _id2$[esp+edx+40], al
lea ecx, DWORD PTR _id2$[esp+edx+40]
inc eax
cmp eax, 100000000 ; 05f5e100H
jl SHORT $LN3@main

As you may see, edi never changes, and VC does not even bother to allocate memory for id1. All id1 bytes are represented with edi.

-- Michael Kochetkov
#include <boost/uuid/uuid.hpp>
#include <iostream>
#include <windows.h>
#include <mmsystem.h>

typedef unsigned uint32_t;

inline bool Eq16( unsigned char const * p, unsigned char const * q )
{
    return *reinterpret_cast<const uint32_t*>( p )    == *reinterpret_cast<const uint32_t*>( q )
        && *reinterpret_cast<const uint32_t*>( p+4 )  == *reinterpret_cast<const uint32_t*>( q+4 )
        && *reinterpret_cast<const uint32_t*>( p+8 )  == *reinterpret_cast<const uint32_t*>( q+8 )
        && *reinterpret_cast<const uint32_t*>( p+12 ) == *reinterpret_cast<const uint32_t*>( q+12 );
}

inline bool my_memcmp( unsigned char const * p, unsigned char const * q, size_t n )
{
    for( size_t i = 0; i < n; ++i )
    {
        if( p[i] != q[i] ) return false;
    }

    return true;
}

int main()
{
    boost::uuids::uuid id1 = {}, id2 = {};

    int const N = 100000000;

    DWORD t1 = timeGetTime();

    int s = 0;

    for( int i = 0; i < N; ++i )
    {
        //s += ( id1 == id2 );
        //s += Eq16( id1.data, id2.data );
        //s += memcmp( id1.data, id2.data, 16 );
        s += my_memcmp( id1.data, id2.data, 16 );

        id2.data[ i % 16 ] += i & 0xFF;
    }

    DWORD t2 = timeGetTime();

    std::cout << s << ": " << t2 - t1 << std::endl;
}

BTW if you are interested in the final result you do need to initialize your uuids with some random values. VC initializes all bytes with zeros and is smart enough to remember that id1 does not change:

Ok. I just cannot help it. I have corrected the id declarations myself the following way:

int main() {
    boost::uuids::uuid id1((boost::uuids::random_generator()())),
                       id2((boost::uuids::random_generator()()));
    ...
and my results are as follows:

Eq16        -- 100%
my_memcmp   -- 138.89 %
memcmp      -- 169.17 %
operator == -- 199.72 %

Eq16 became even faster. This is most likely because the first comparison now fires much more often. Anyway, with proper alignment (and I am dead sure every compiler that is sensitive to alignment has a proper knob for it), Eq16 (it should really read Eq32) shall be the fastest way to compare a uint8_t[16] array. Unless you are going to implement an Eq64 for x64 platforms...

BTW, for unoptimized (Debug) mode:

Eq16 (optimized)          -- 100 %
operator == (optimized)   -- 199.72 %
Eq16 (unoptimized)        -- 247.5 %
operator == (unoptimized) -- 1419.16 %

No comments.

-- Michael Kochetkov

Michael Kochetkov wrote:
boost::uuids::uuid id1((boost::uuids::random_generator()())), id2((boost::uuids::random_generator()()));
I prefer to leave id2 identical to id1 at the start, because otherwise the two are never equal.

Core i7-920, VC++ 2005, 32 bit:

operator== 48830: 1087
Eq16       48830: 311
memcmp    -53027366: 560
my_memcmp  48830: 281

64 bit:

operator== 48830: 674
Eq16       48830: 314
memcmp    -2978538: 773
my_memcmp  48830: 264

Alignment is not the problem. id1 and id2 are aligned at a DWORD boundary (actually, on a 16 byte boundary under 64 bit).

boost::uuids::uuid id1((boost::uuids::random_generator()())), id2((boost::uuids::random_generator()()));
I prefer to leave id2 identical to id1 at the start, because otherwise the two are never equal. Ok.
Core i7-920, VC++ 2005 32 bit:

I have a 6-year-old Xeon 5130, and for your original code (with zeros):
operator== 48830: 1087 735
Eq16 48830: 311 406
memcmp -53027366: 560 625
my_memcmp: 48830: 281 485
I can see VC 2010 is probably better at optimization. Anyway, the real problems are:

1. Boost's implementation of uuid is at least weird, in the sense that there are much better approaches that cost nothing to implement but are ignored.

2. The investigation of the opinions of Boost people (I have made another inquiry in the "[boost] [function] The cost of boost::function" thread) shows that Boost is still not ready for production usage. It requires too high a qualification from real-life programmers to slalom through the dark corners of Boost and C++ itself, so it puts the success of a project at high risk. And I have some real-life examples (you do not think I study these cases with the debugger on purpose, do you? :) )

3. The words at the top of the boost.org page -- ""...one of the most highly regarded and expertly designed C++ library projects in the world." - Herb Sutter and Andrei Alexandrescu, C++ Coding Standards" -- really need some grounds.

I thank you all for your opinions and help.

-- Michael Kochetkov

Michael Kochetkov wrote:
I have 6 years old Xeon 5130 and for your original code (with zeros):
The numbers I cited are for id1 random, id2 equal to id1.

boost::uuids::uuid id1 = boost::uuids::random_generator()(), id2 = id1;

The random id1 avoids the smartness to keep a zero in a register, and id2 being equal to it at the start leads to a predictable nonzero number of equalities (48830 out of the 100M total).

This is what I get with 2010:

operator== 48830: 410
Eq16       48830: 299
memcmp     48830: 400

(I changed it to s += memcmp( id1.data, id2.data, 16 ) == 0; to match the others. The original gave something: 356. 2010 is smart enough here to avoid the trailing byte loop; it sees that the length is divisible by 16 and only does DWORD compares.)

my_memcmp  48830: 272

There you go. CPUs are odd beasts. I tried, just for the fun of it, the totally unoptimized by hand

template<class It1, class It2>
inline bool my_equal( It1 first, It1 last, It2 first2 )
{
    for( ; first != last; ++first, ++first2 )
    {
        if( *first != *first2 ) return false;
    }

    return true;
}

and then

s += my_equal( id1.begin(), id1.end(), id2.begin() );

and what do you think? 48830: 274. Go figure.
2. the investigation of opinions of boost people (I have made another inquiry in "[boost] [function] The cost of boost::function" thread) shows that boost still is not ready for production usage.
Don't use it in production then. I do, and it works well for me. Already has for ten years or so.

Appendix A, program:

#include <boost/uuid/uuid.hpp>
#include <boost/uuid/random_generator.hpp>
#include <iostream>
#include <windows.h>
#include <mmsystem.h>

#pragma comment( lib, "winmm.lib" )

typedef unsigned uint32_t;

inline bool Eq16( unsigned char const * p, unsigned char const * q )
{
    return *reinterpret_cast<const uint32_t*>( p )    == *reinterpret_cast<const uint32_t*>( q )
        && *reinterpret_cast<const uint32_t*>( p+4 )  == *reinterpret_cast<const uint32_t*>( q+4 )
        && *reinterpret_cast<const uint32_t*>( p+8 )  == *reinterpret_cast<const uint32_t*>( q+8 )
        && *reinterpret_cast<const uint32_t*>( p+12 ) == *reinterpret_cast<const uint32_t*>( q+12 );
}

inline bool my_memcmp( unsigned char const * p, unsigned char const * q, size_t n )
{
    for( size_t i = 0; i < n; ++i )
    {
        if( p[i] != q[i] ) return false;
    }

    return true;
}

template<class It1, class It2>
inline bool my_equal( It1 first, It1 last, It2 first2 )
{
    for( ; first != last; ++first, ++first2 )
    {
        if( *first != *first2 ) return false;
    }

    return true;
}

int main()
{
    boost::uuids::uuid id1 = boost::uuids::random_generator()(), id2 = id1;

    int const N = 100000000;

    DWORD t1 = timeGetTime();

    int s = 0;

    for( int i = 0; i < N; ++i )
    {
        //s += ( id1 == id2 );
        //s += Eq16( id1.data, id2.data );
        //s += memcmp( id1.data, id2.data, 16 ) == 0;
        //s += my_memcmp( id1.data, id2.data, 16 );
        s += my_equal( id1.begin(), id1.end(), id2.begin() );

        id2.data[ i % 16 ] += i & 0xFF;
    }

    DWORD t2 = timeGetTime();

    std::cout << s << ": " << t2 - t1 << std::endl;
}

On 17/04/12 12:52, Michael Kochetkov wrote:
Anyway the real problems are that 1. boost's implementation of uuid is at least weird in terms there are much better approaches that cost nothing in implementation but are ignored.
Comparing 16 bytes is hardly the important part in a UUID library. Boost.UUID is using the correct construct for this; it is sad some compilers cannot optimize it properly, but it's still the ideal solution from a C++ point of view.

The approach you suggest is not well-formed C++ and is specific to your particular configuration; it might fare worse on others. So it's not really "better", nor does it "cost nothing".

An option is to simply fix the problem in your local copy of Boost. That being said, I'm sure the library author will consider fixing it if you file a bug for it.
2. the investigation of opinions of boost people (I have made another inquiry in "[boost] [function] The cost of boost::function" thread)
boost::function is heavyweight and has significant costs. If you don't need the features boost::function provides, simply don't use it and avoid the costs associated with it. The same thing can be said about any feature in C++.
shows that boost still is not ready for production usage.
You realize Boost has been used in many production environments, some very high-profile, for more than a decade now?
It requires too high qualification of the real life programmers to be able to slalom in dark corners of boost and C++ itself so it puts the success of a project at high risk. And I have some real life examples (you do not think I study the cases with the debugger on purpose, do you? :) )
This is understandable. Using any external library is a risk to take.
3. The words on the top of the boost.org page ""...one of the most highly regarded and expertly designed C++ library projects in the world." - Herb Sutter and Andrei Alexandrescu, C++ Coding Standards" really need some grounds.
Those are citations from well-known people in the C++ community. They themselves are grounds.

The approach you suggest is not well-formed C++ and is specific to your particular configuration; it might fare worst on others.

Why is it not well-formed C++?
2. the investigation of opinions of boost people (I have made another inquiry in "[boost] [function] The cost of boost::function" thread)
boost::function is heavyweight and has significant costs. If you don't need the features boost::function provides, simply don't use it and avoid the costs associated with it.

The people I deal with in the real world do not think about its weight. They just cannot memorize boost::_bi::bind_t<int,int (*)(int),boost::_bi::list1<boost::arg<1> > > and type it correctly. So they use boost::function just for convenience.
3. The words on the top of the boost.org page ""...one of the most highly regarded and expertly designed C++ library projects in the world." - Herb Sutter and Andrei Alexandrescu, C++ Coding Standards" really need some grounds.
Those are citations from well-known people in the C++ community. They themselves are grounds.

They have probably never thought of the problem above. Or they do not care. I do.
-- Michael Kochetkov

On 18/04/12 11:09, Michael Kochetkov wrote:
The approach you suggest is not well-formed C++ and is specific to your particular configuration; it might fare worst on others.

Why is it not well-formed C++?
I've already pointed out why.
The people I deal with in the real world do not think about its weight. They just cannot memorize boost::_bi::bind_t<int,int (*)(int),boost::_bi::list1<boost::arg<1> > > and type it correctly. So they use boost::function just for convenience.
The good thing to do, as with all similar mechanisms (lambda, phoenix), is to avoid naming the result entirely or using auto/decltype/typeof facilities.

Mathias Gaunard wrote:
Anyway the real problems are that 1. boost's implementation of uuid is at least weird in terms there are much better approaches that cost nothing in implementation but are ignored.
Comparing 16 bytes is hardly the important part in a UUID library.

I use Boost UUIDs as globally unique identifiers, and I do compare them in the vast majority of usages. I admit I may be misusing the UUID concept, so I am just curious: what have the Boost people intended boost::uuid to be used for?..
-- Michael Kochetkov

Hi, Am 19. April 2012 18:58 schrieb Michael Kochetkov <michael.kv@gmail.com>:
[...] I use boost UUID as global unique identifiers and I do compare them in the total majority of usages. I do admit I can misuse the UUIDs concept so I am just curious what have the boost people intended the boost::uuid to be used for?..
I planned to use it as a key in a multi-index container that uses shared-memory allocators, or even as a kind of property id for a tree-like recursive property-value db. But I dropped that idea, and now I am looking into something like boost::flyweight' (where ' stands for the ability to store the map and values inside a certain shared memory segment) as a fast comparable identifier type for the property-value db.

regards
Andreas

On Tue, Apr 17, 2012, at 09:48 PM, Mathias Gaunard wrote:
On 17/04/12 12:52, Michael Kochetkov wrote:
Anyway the real problems are that 1. boost's implementation of uuid is at least weird in terms there are much better approaches that cost nothing in implementation but are ignored.
Comparing 16 bytes is hardly the important part in a UUID library. Boost.UUID is using the correct construct for this, it is sad some compilers cannot optimize it properly, but it's still the ideal solution from a C++ point of view.
The approach you suggest is not well-formed C++ and is specific to your particular configuration; it might fare worse on others. So it's not really "better", nor does it "cost nothing".
An option is to simply fix the problem in your local copy of Boost.
That being said, I'm sure the library author will consider fixing it if you file a bug for it.
Absolutely, I am always happy to review submitted bugs/patches.

< snip >

Regards,
Andy Tompkins
boost.uuid author

I do remember that at least memcpy intrinsic worked pretty smart in that compiler.
It does.
I think I see what's going on. memcmp has to return -1, 0 or 1. This is why it does an initial DWORD loop to find the first mismatch, then does a byte compare to determine the return value.

Thank you for pointing it out. Frankly speaking, being busy finding out the attitude of the Boost people towards the problem, I really skipped the rest of the generated code, which has probably made the MS compiler look a little bit smarter than it really is.
-- Michael Kochetkov

On 16/04/12 19:44, Michael Kochetkov wrote:
I do remember that at least memcpy intrinsic worked pretty smart in that compiler.
It does.
I think I see what's going on. memcmp has to return -1, 0 or 1. This is why it does an initial DWORD loop to find the first mismatch, then does a byte compare to determine the return value.

Thank you for pointing it out. Frankly speaking, being busy finding out the attitude of the Boost people towards the problem, I really skipped the rest of the generated code, which has probably made the MS compiler look a little bit smarter than it really is.
Boost being used on many operating systems, compilers, and architectures, generic solutions are usually preferred over ad-hoc fixes for particular compiler combinations. That being said, a MSVC-specific workaround is possible too, if no other satisfying solution is found.

on Mon Apr 16 2012, Michael Kochetkov <michael.kv-AT-gmail.com> wrote:
Don't forget alignment. Those DWORD comparisons/copies may only work well for aligned memory but when you call memcmp, you erase alignment information and the compiler may assume it's only char-aligned.
I did not forget it. My words about alignment was just skipped.
I didn't see your words about alignment, and I was responding to Peter.
No doubt you can align uuid::data[16] almost any suitable way even for uint64_t. You can even start the comparison from the back if you feel that your uuids start to differ from the back to speed up the approach even more.
I would also like to confirm and further elaborate the statement in the discussion (I will not quote it from another message) that memcmp is an intrinsic in MSVC for many cases (I have enforced it with the Oi compiler switch); it is aware of alignment, the underlying data type and array sizes. You may want to have a look at the following example:

#include <cstdint>
#include <cstring>

int main() {
    uint8_t d1[16], d2[16];
    return memcmp(d1,d2,16);
}

Here is the result.

$LL4@main:
mov esi, DWORD PTR [edx]
cmp esi, DWORD PTR [ecx]
jne SHORT $LN5@main
sub eax, 4
add ecx, 4
add edx, 4
cmp eax, 4
jae SHORT $LL4@main
xor eax, eax
Pretty neat. I wish the boost and C++ standard library to develop the same way.
Yeah, that's great. My point is that it may well be that

int main() {
    uint8_t d1[16], d2[16];
    return memcmp((char*)d1,(char*)d2,16);
}

(or something only a little more complex, like a call through another layer of inline function) is enough to foil that optimization.

-- Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Michael Kochetkov:

int main() {
    uint8_t d1[16], d2[16];
    return memcmp(d1,d2,16);
}

Here is the result.

$LL4@main:
mov esi, DWORD PTR [edx]
cmp esi, DWORD PTR [ecx]
jne SHORT $LN5@main
sub eax, 4
add ecx, 4
add edx, 4
cmp eax, 4
jae SHORT $LL4@main
xor eax, eax

--------

That's not really a fair test since it's all in one compilation scope. I try to just make an obj which has a function that takes the stuff as parameters, like:

void foo(boost::uuids::uuid&, boost::uuids::uuid&)

Then the compiler can't know where it aligned the thing on the stack or do any other quick optimizations (beyond the alignment of the type).

The fastest thing on x86 to do this might be an MMX instruction. I remember gcc doing MOVDQA to memset() a 16 byte structure. Though I remember it because it falsely assumed 16 byte alignment and crashed.

Is your beef with the compiler?

Looking at the code, the performance would probably be better if a larger integer type were used and alignment > 1.

Chris

That's not really a fair test since it's all in one compilation scope. I try to just make an obj which has function which takes the stuff as a parameters like: void foo(boost::uuids::uuid&, boost::uuids::uuid&)
Then the compiler can't know where it aligned the thing on the stack or do any other quick optimizations (beyond the alignment of the type).

It is fair, as the data type is aligned regardless of where it is placed: on the heap or on the stack. If you really want to foil the case, you need to get rid of the data type information (as was proposed with the conversion to a pointer to char) and of the information about the data size.
The fastest thing on x86 to do this might be an MMX instruction. I remember gcc doing MOVDQA to memset() a 16 byte structure. Though I remember it because it falsely assumed 16 byte alignment and crashed.
Is your beef with the compiler?
Looking at the code the performance would probably be better if a larger integer type were used and alignment>1.
I have no doubt that one can spoil even the most brilliant idea. And that is what I see in Boost: with iterators and an intermediate feng-shui call, it leaves the stupid MS compiler no chance to deduce the original data type, its size and alignment. Being the person who is called when a program crashes or works very slowly, I am pretty happy with the current situation. But while hacking on another semi-failed project I was asked to join a company and start a completely new project. So I am now deciding whether Boost is suitable for early prototyping only, or may be used in production too. And I do not care much about defects like this one -- they are trivial to find and fix. I am more interested in people's attitude towards problems like this one, because the modern C++ approach of "a compiler will take care of it" does not work for me. And as you may see, I can prove my words.

-- Michael Kochetkov
participants (7)
- Andreas Pokorny
- Andy Tompkins
- Dave Abrahams
- Hite, Christopher
- Mathias Gaunard
- Michael Kochetkov
- Peter Dimov