
On Monday 31 October 2011 19:29:35 Andrey Semashev wrote:
considering the cost of cmpxchg8b itself, the cost of a branch -- if done correctly [1] -- is most likely immeasurable
Probably. But I'm a perfectionist. :)
me too, but if it does not have a measurable detriment, I consider it perfect :)
Unfortunately, cmpxchg16b is not as common as cmpxchg8b, so a dynamic check would be desirable. However, I would prefer that there were no ifs like the one above. Perhaps a global table of pointers to the actual function implementations would be better. Initially, the pointers should point to functions that perform cpuid, initialize this table, and then call the real functions for the detected hardware. This way we eliminate almost all overhead in the long run, including call_once.
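The self-initializing pointer table described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not boost.atomic code: every name (dispatch_sketch, cpu_has_native_ops, fill_table, and so on) is hypothetical, the cpuid probe is replaced by a stub, the "native" path uses the GCC/Clang __atomic builtins as a placeholder, and the pointers are held in std::atomic only so that the sketch itself is free of data races.

```cpp
#include <atomic>
#include <cstdint>

namespace dispatch_sketch {  // hypothetical namespace, for illustration only

typedef std::uint64_t u64;
typedef bool (*cas_fn)(u64* p, u64* expected, u64 desired);
typedef u64 (*fetch_add_fn)(u64* p, u64 delta);

// Stand-in for the cpuid probe; a real version would execute cpuid and
// test the relevant feature bit instead of returning a constant.
inline bool cpu_has_native_ops() { return true; }

// "Native" implementations (placeholder: GCC/Clang __atomic builtins).
inline bool native_cas(u64* p, u64* expected, u64 desired) {
    return __atomic_compare_exchange_n(p, expected, desired, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
inline u64 native_fetch_add(u64* p, u64 delta) {
    return __atomic_fetch_add(p, delta, __ATOMIC_SEQ_CST);
}

// Fallback implementations: a single spinlock guards every operation.
std::atomic_flag fallback_lock = ATOMIC_FLAG_INIT;
struct spin_guard {
    spin_guard() { while (fallback_lock.test_and_set(std::memory_order_acquire)) {} }
    ~spin_guard() { fallback_lock.clear(std::memory_order_release); }
};
inline bool locked_cas(u64* p, u64* expected, u64 desired) {
    spin_guard g;
    if (*p == *expected) { *p = desired; return true; }
    *expected = *p;  // report the value actually seen
    return false;
}
inline u64 locked_fetch_add(u64* p, u64 delta) {
    spin_guard g;
    u64 old = *p;
    *p = old + delta;
    return old;
}

// Bootstrap stubs: the table starts out pointing at these.
bool init_cas(u64* p, u64* expected, u64 desired);
u64 init_fetch_add(u64* p, u64 delta);

std::atomic<cas_fn> cas_ptr(&init_cas);
std::atomic<fetch_add_fn> fetch_add_ptr(&init_fetch_add);
std::atomic<int> detect_calls(0);  // demonstration only: counts probes

// Runs the probe and overwrites every table entry. Concurrent callers
// store identical values, so repeating the work is harmless.
void fill_table() {
    detect_calls.fetch_add(1, std::memory_order_relaxed);
    const bool native = cpu_has_native_ops();
    cas_ptr.store(native ? &native_cas : &locked_cas, std::memory_order_relaxed);
    fetch_add_ptr.store(native ? &native_fetch_add : &locked_fetch_add,
                        std::memory_order_relaxed);
}

bool init_cas(u64* p, u64* expected, u64 desired) {
    fill_table();
    return cas_ptr.load(std::memory_order_relaxed)(p, expected, desired);
}
u64 init_fetch_add(u64* p, u64 delta) {
    fill_table();
    return fetch_add_ptr.load(std::memory_order_relaxed)(p, delta);
}

// Entry points: one pointer load plus a register-indirect call, no "if".
inline bool cas(u64* p, u64* expected, u64 desired) {
    return cas_ptr.load(std::memory_order_relaxed)(p, expected, desired);
}
inline u64 fetch_add(u64* p, u64 delta) {
    return fetch_add_ptr.load(std::memory_order_relaxed)(p, delta);
}

}  // namespace dispatch_sketch
```

The first call through either entry point lands in a bootstrap stub, which fills in the whole table and forwards the call; every later call goes straight to the selected implementation. Whether the indirect call is actually cheaper than a conditional branch is the open question in this thread and would need measuring.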
the processor most likely has more difficulties correctly predicting the code flow through a register-indirect branch than a static one, so I am not really sure this is cheaper, but it is in any case worth trying out
Yes, this needs testing; however, I hope that an unconditional jump would be quite predictable.
it's only predictable as long as it is in the BTB; as soon as it gets flushed -- out of luck

a branch to a static address, on the other hand, arranged as an out-of-line forward branch so that it hits the "predict not taken" default assumption on a cold cache, is still essentially free
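For comparison, the static-branch alternative might look roughly like the sketch below. It is an illustration only: the names (static_branch_sketch, have_native, fetch_add_fallback) are hypothetical, the feature flag is assumed to be set once before any threads start, and __builtin_expect and the noinline/cold attributes are GCC/Clang extensions that merely hint at the desired layout rather than guarantee it.

```cpp
#include <atomic>
#include <cstdint>

namespace static_branch_sketch {  // hypothetical namespace, for illustration only

typedef std::uint64_t u64;

// Hypothetical feature flag. Assumed to be written once by a cpuid-style
// probe at startup, before any other thread reads it.
bool have_native = true;

std::atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

// Rare path: kept out of line and marked cold so the compiler can place it
// away from the hot code, leaving the hot path as a straight fall-through.
__attribute__((noinline, cold))
u64 fetch_add_fallback(u64* p, u64 delta) {
    while (fallback_lock.test_and_set(std::memory_order_acquire)) {}
    u64 old = *p;
    *p = old + delta;
    fallback_lock.clear(std::memory_order_release);
    return old;
}

inline u64 fetch_add(u64* p, u64 delta) {
    // Hint that the fallback is unlikely; the intent is a forward branch
    // to a static address that is normally not taken.
    if (__builtin_expect(!have_native, 0))
        return fetch_add_fallback(p, delta);
    // Hot path (placeholder: GCC/Clang __atomic builtin).
    return __atomic_fetch_add(p, delta, __ATOMIC_SEQ_CST);
}

}  // namespace static_branch_sketch
```

Here the check costs one load and one conditional branch per call, but no branch target has to be resident in the BTB for the common case to be predicted correctly.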
also, this would not be a "single" function pointer but a whole bunch of them to cover the different atomic operations (reducing everything to CAS generates more lock/unlock cycles in the fallback path otherwise)
Sure, like I said - a table of pointers.
since boost.atomic is (supposed) to stay a header-only library, there are cases where these will be instantiated multiple times -- the many different pointers may pressure the BTB unduly

Best regards
Helge