
On Friday 28 October 2011 17:43:43 Andrey Semashev wrote:
On Friday, October 28, 2011 17:12:55 Domagoj Saric wrote:
On 21.10.2011. 13:06, Tim Blechmann wrote:
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture. e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used until a CPUID check has confirmed that they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();

    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and probably only a minor performance hit ;)
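
A minimal sketch of what such a check could look like, assuming GCC or Clang on x86 (the `<cpuid.h>` header and `__get_cpuid` are theirs). CPUID leaf 1 reports CMPXCHG16B in ECX bit 13. The name `query_cpuid_for_cmpxchg16b` is taken from the snippet above; `dcas_path` and `select_dcas_path` are hypothetical names used only for illustration, and on any other compiler or architecture the sketch simply reports the fallback.

```cpp
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// True if CPUID leaf 1 reports CMPXCHG16B (ECX bit 13).
// Assumption: GCC/Clang on x86; everywhere else we conservatively say "no".
inline bool query_cpuid_for_cmpxchg16b() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;  // leaf 1 not supported
    return (ecx & (1u << 13)) != 0;
#else
    return false;
#endif
}

// Hypothetical tag for the two code paths discussed in the thread.
enum class dcas_path { cmpxchg16b, fallback };

inline dcas_path select_dcas_path() {
    // Queried once; later calls only test the cached flag.
    static const bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    return has_cmpxchg16b ? dcas_path::cmpxchg16b : dcas_path::fallback;
}
```

Note that a function-local static carries its own initialization guard in C++11, so a namespace-scope flag set during startup may be cheaper on the hot path.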
cmpxchg8b is available since the original Pentium. Preferably, dynamic support for such ancient hardware, if offered at all, should not be on by default (by forcing dynamic dispatching on everyone).
considering the cost of cmpxchg8b itself, the cost of a branch -- if done correctly [1] -- is most likely immeasurable
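
One possible reading of "done correctly" (see footnote [1] in this message) is sketched below: the fallback is marked cold and non-inlined, and the test is hinted as unlikely, so the native compare-and-swap is the fall-through path. This assumes GCC/Clang attributes and `__builtin_expect`; the names `g_lacks_cmpxchg8b`, `cas64` and `cas64_fallback` are hypothetical, and `std::atomic` stands in for the raw instruction.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

#if defined(__GNUC__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#define SKETCH_COLD __attribute__((noinline, cold))
#else
#define SKETCH_UNLIKELY(x) (x)
#define SKETCH_COLD
#endif

// Hypothetical flag; a real library would set it once from a CPUID check.
static bool g_lacks_cmpxchg8b = false;

// Lock-based fallback, kept out of line so it does not sit in the hot path.
SKETCH_COLD bool cas64_fallback(std::atomic<std::uint64_t>* p,
                                std::uint64_t& expected,
                                std::uint64_t desired) {
    static std::mutex lock;
    std::lock_guard<std::mutex> guard(lock);
    const std::uint64_t current = p->load(std::memory_order_relaxed);
    if (current != expected) {
        expected = current;  // report the observed value, like a real CAS
        return false;
    }
    p->store(desired, std::memory_order_relaxed);
    return true;
}

inline bool cas64(std::atomic<std::uint64_t>* p,
                  std::uint64_t& expected,
                  std::uint64_t desired) {
    if (SKETCH_UNLIKELY(g_lacks_cmpxchg8b))
        return cas64_fallback(p, expected, desired);
    // Fall-through: the native compare-and-swap.
    return p->compare_exchange_strong(expected, desired);
}
```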
Unfortunately, cmpxchg16b is not as common as cmpxchg8b, so a dynamic check would be desirable. However, I would prefer that there were no ifs like the one above. Perhaps a global table of pointers to the actual function implementations would be better. Initially, the pointers should point to functions that perform cpuid, initialize this table, and then call the real functions for the detected hardware. This way we eliminate almost all overhead in the long run, including call_once.
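
The table idea above could be sketched roughly as follows. Everything here is illustrative: `storage`, `op_table`, `g_ops`, the `do_*` wrappers and the `*_native` / `*_fallback` functions are hypothetical names, a `long` stands in for the 128-bit value, and the CPUID query assumes GCC/Clang on x86. Each slot starts out pointing at a stub that fills in the whole table and then forwards the call, so later calls go straight to the selected implementation.

```cpp
#include <atomic>
#include <mutex>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

struct storage { std::atomic<long> value{0}; };  // stand-in for a 128-bit cell

using fetch_add_fn = long (*)(storage*, long);
using exchange_fn  = long (*)(storage*, long);

long fetch_add_stub(storage* s, long d);
long exchange_stub(storage* s, long v);

// One slot per operation; every slot initially points at its resolver stub.
struct op_table {
    std::atomic<fetch_add_fn> fetch_add{&fetch_add_stub};
    std::atomic<exchange_fn>  exchange{&exchange_stub};
};
op_table g_ops;

bool cpu_has_cmpxchg16b() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned a = 0, b = 0, c = 0, d = 0;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & (1u << 13)) != 0;
#else
    return false;
#endif
}

// "Native" path: std::atomic stands in for the lock-free instruction.
long fetch_add_native(storage* s, long d) { return s->value.fetch_add(d); }
long exchange_native(storage* s, long v)  { return s->value.exchange(v); }

// Fallback path: one lock acquisition per operation.
std::mutex g_fallback_lock;
long fetch_add_fallback(storage* s, long d) {
    std::lock_guard<std::mutex> guard(g_fallback_lock);
    long old = s->value.load(std::memory_order_relaxed);
    s->value.store(old + d, std::memory_order_relaxed);
    return old;
}
long exchange_fallback(storage* s, long v) {
    std::lock_guard<std::mutex> guard(g_fallback_lock);
    long old = s->value.load(std::memory_order_relaxed);
    s->value.store(v, std::memory_order_relaxed);
    return old;
}

// Idempotent, so concurrent first calls may race here harmlessly.
void resolve_table() {
    const bool native = cpu_has_cmpxchg16b();
    g_ops.fetch_add.store(native ? &fetch_add_native : &fetch_add_fallback,
                          std::memory_order_relaxed);
    g_ops.exchange.store(native ? &exchange_native : &exchange_fallback,
                         std::memory_order_relaxed);
}

long fetch_add_stub(storage* s, long d) {
    resolve_table();
    return g_ops.fetch_add.load(std::memory_order_relaxed)(s, d);
}
long exchange_stub(storage* s, long v) {
    resolve_table();
    return g_ops.exchange.load(std::memory_order_relaxed)(s, v);
}

// Public entry points: one relaxed load plus an indirect call.
inline long do_fetch_add(storage* s, long d) {
    return g_ops.fetch_add.load(std::memory_order_relaxed)(s, d);
}
inline long do_exchange(storage* s, long v) {
    return g_ops.exchange.load(std::memory_order_relaxed)(s, v);
}
```

Giving each operation its own slot keeps the fallback at one lock/unlock cycle per call. Whether the indirect call is actually cheaper than a well-predicted direct branch would have to be measured.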
the processor most likely has more difficulties correctly predicting the code flow through a register-indirect branch than a static one, so I am not really sure this is cheaper, but it is in any case worth trying out

also, this would not be a "single" function pointer but a whole bunch of them to cover the different atomic operations (reducing everything to CAS generates more lock/unlock cycles in the fallback path otherwise)

[1] I'm thinking of forcing the fallback path out-of-line such that cmpxchg8b is fall-through

Best regards
Helge