
Quoting "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com>:
Well, even critical sections, Windows's fastest mutex primitive, are much slower in the uncontended case than a spinlock. A two-stage method is needed to match the performance of the present spinlock: a lightweight atomic operation followed by a heavyweight mutex if the lock is contended. This is why I mentioned that 8 bytes (one word for the critical section, one word for the atomic operation) would be necessary.
I'm quite surprised by this claim. What you describe is precisely how WIN32 critical sections work. If your measurements show them to be slower, there must be some other reason for it. WIN32 also provides InitializeCriticalSectionAndSpinCount, which makes a contended thread busy-wait for a configurable number of spin iterations before resorting to waiting on the kernel lock.

/Mattias
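
For reference, the two-stage scheme under discussion (an atomic fast path, a bounded busy-wait, then a blocking wait) can be sketched in portable C++ roughly as follows. This is only an illustration of the general shape, not the actual WIN32 critical section implementation. The names TwoStageLock, spin_limit and run_demo are hypothetical, and std::mutex plus std::condition_variable merely stand in for the kernel wait object.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical two-stage lock: atomic fast path, blocking slow path.
// Illustrative only; not how CRITICAL_SECTION is actually implemented.
class TwoStageLock {
public:
    explicit TwoStageLock(int spin_limit = 1000) : spin_limit_(spin_limit) {}

    // Stage 1 primitive: a single atomic compare-and-swap.
    bool try_lock() {
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true);
    }

    void lock() {
        // Bounded busy-wait, in the spirit of a spin count.
        for (int i = 0; i < spin_limit_; ++i)
            if (try_lock()) return;

        // Stage 2: register as a waiter and block until woken.
        std::unique_lock<std::mutex> guard(slow_path_);
        ++waiters_;
        while (!try_lock()) wake_.wait(guard);
        --waiters_;
    }

    void unlock() {
        locked_.store(false);
        // Only touch the heavyweight path if someone is blocked.
        if (waiters_.load() > 0) {
            std::lock_guard<std::mutex> guard(slow_path_);
            wake_.notify_one();
        }
    }

private:
    std::atomic<bool> locked_{false};
    std::atomic<int> waiters_{0};
    std::mutex slow_path_;            // assumed stand-in for the kernel lock
    std::condition_variable wake_;
    int spin_limit_;
};

// Hypothetical helper: several threads increment a shared counter
// under the lock and the final count is returned.
long run_demo(int threads, int iterations) {
    TwoStageLock lock(16);  // small spin limit so both stages get exercised
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

In the uncontended case lock() costs one compare-and-swap and unlock() one store plus one load, so the heavyweight path is only entered when a thread actually has to wait. Whether a given implementation matches a plain spinlock in practice is something only measurement can show.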