
On Monday 08 August 2011 10:59:47 Grund, Holger wrote:
> Agreed, this is not impossible, but I still tend to think we should
> strive for a more efficient implementation if at all possible.
Where do you see room for improvement? It is a fallacy to assume that "most efficient implementation" always means "there is a machine instruction providing a 1:1 translation of my high-level construct". Look at this from the POV of cache synchronisation cost (which is the real cost, not the number of instructions), and you will realize that there is not much you can do (assuming you can squeeze the data copies as well as the sequence counter into the same cacheline).
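Just to make that concrete, here is a minimal sketch of the kind of layout I mean (all names are invented for illustration, and the ordering is deliberately simplified: it leans on x86's strong store ordering rather than being a portable, memory-model-clean implementation):

#include <stdint.h>
#include <stdatomic.h>

/* Illustrative layout: sequence counter plus both 64-bit copies packed
 * into a single 64-byte cache line. */
struct split_u64 {
    _Atomic uint32_t seq;        /* low bit selects the currently valid copy */
    uint64_t copy[2];            /* two copies of the 64-bit value           */
} __attribute__((aligned(64)));

/* Single writer: fill the inactive copy, then flip the counter to publish.
 * The release store keeps the data write ahead of the counter update;
 * anything beyond that is simplified and relies on x86 store ordering. */
static void split_u64_store(struct split_u64 *s, uint64_t value)
{
    uint32_t seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    s->copy[(seq & 1) ^ 1] = value;                 /* write the inactive copy */
    atomic_store_explicit(&s->seq, seq + 1,
                          memory_order_release);    /* make it the active one  */
}

The whole struct is well under 64 bytes, so the writer and the readers only ever touch that single cache line.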
> This approach BTW is already way faster than e.g. using a 64-bit mmx register and paying the cost of mmx->gpr transfers on x86.
That doesn't match my experience. Even in the noncontended case, I would be very surprised to see anything "way faster".
mmx -> gpr transfers have quite significant latency; don't forget that you need to shuffle around a fair bit, plus the cost of the eventual emms. Also don't forget that moving mmx -> gpr defeats a large portion of the CPU's out-of-order and speculative execution capability: CPUs cannot in general track dependencies across different register classes.
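For comparison, this is roughly what the MMX route looks like as GCC-style inline assembly on 32-bit x86 (purely illustrative, not taken from the code under discussion): the 64-bit load itself is a single movq, but getting the value into general-purpose registers takes two movd transfers plus a shift, and then the emms.

#include <stdint.h>

/* Illustrative only: atomic 64-bit load on 32-bit x86 via an MMX register.
 * Note the extra work to move the value from mm0 into general registers,
 * plus the trailing emms. */
static inline uint64_t mmx_load64(const uint64_t *p)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "movq  (%2), %%mm0\n\t"   /* single atomic 64-bit load              */
        "movd  %%mm0, %0\n\t"     /* low 32 bits: mmx -> gpr transfer       */
        "psrlq $32, %%mm0\n\t"    /* shift the high half down               */
        "movd  %%mm0, %1\n\t"     /* high 32 bits: another mmx -> gpr move  */
        "emms"                    /* leave MMX state -- part of the cost    */
        : "=r" (lo), "=r" (hi)
        : "r" (p)
        : "mm0", "memory");
    return ((uint64_t)hi << 32) | lo;
}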
> However, under any kind of contention I do expect the MMX MOVQ version to be significantly faster.
Assuming that you manage to put everything into a single cache line, I doubt that you will see any difference at all under contention: the real cost is the cache line transfer, and transferring a single line has a latency of ~150 cycles (and that is not going to decrease with modern CPUs). After that, the number of bytes read out of the cache line is basically not measurable anymore.
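To spell out the read side of the sketch from above (same invented names, same x86-centric caveats, and the plain 64-bit read is the usual seqlock-style shortcut rather than something the C11 memory model strictly blesses): the reader loads the counter, the active copy and the counter again, all from that one cache line, and only retries if a writer completed an update in between.

#include <stdint.h>
#include <stdatomic.h>

/* Same illustrative layout as in the earlier sketch. */
struct split_u64 {
    _Atomic uint32_t seq;
    uint64_t copy[2];
} __attribute__((aligned(64)));

/* Reader: 32-bit load of the counter, 64-bit load of the active copy,
 * 32-bit re-load of the counter.  Everything lives in one cache line,
 * and the loop only repeats if a writer finished an update in between. */
static uint64_t split_u64_load(struct split_u64 *s)
{
    uint32_t seq1, seq2;
    uint64_t v;

    do {
        seq1 = atomic_load_explicit(&s->seq, memory_order_acquire);
        v = s->copy[seq1 & 1];                       /* active copy only       */
        atomic_thread_fence(memory_order_acquire);   /* order read vs re-check */
        seq2 = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while (seq1 != seq2);

    return v;
}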
> And of course, 64 bits is less than 32 + 2 * 64.
It would rather be 32 + 64 + 32 (there is no point in reading the "inactive" copy, but the processor can certainly read it speculatively).

Helge