
Anthony Williams wrote:
it looks like you've opted for a check/sleep/check/sleep loop for threads that are waiting for another thread to finish running the routine. This is a bad idea. Blocking of this nature should be done by waiting on an OS primitive rather than with a wait loop.
Why is it that bad? This approach is safer, since there is no opportunity for an error when constructing the threading primitive, it doesn't use system resources like kernel objects, and it avoids the fundamental problems of creating and destroying those primitives at run time. And it will only be run once after all, so performance is not an issue.
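
[For reference, the kind of check/sleep loop under discussion looks roughly like this. It is only a minimal sketch in C++11 terms, not Boost's actual code; the names g_state and call_once_polling are made up for illustration.]

    #include <atomic>
    #include <chrono>
    #include <thread>

    std::atomic<int> g_state(0);   // 0 = not started, 1 = running, 2 = done

    template <typename F>
    void call_once_polling(F f)
    {
        int expected = 0;
        if (g_state.compare_exchange_strong(expected, 1)) {
            f();                   // we won the race: run the once routine
            g_state.store(2);      // publish completion
        } else {
            // Another thread is running f(); poll until it reports completion.
            while (g_state.load() != 2)
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }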
I think that performance *is* an issue, even though this will only be run once per thread.
A check/sleep polling loop is a bad idea, as it consumes CPU time that could be spent actually running the once routine (or another thread that doesn't need to wait). By waiting on an OS primitive, the OS can take the thread off the scheduler's run queue until the primitive is ready to be acquired.
Not only that, but a check/sleep loop forces a latency of at least the specified sleep time on the waiting thread. If the initialization being waited for only takes a few microseconds (or less --- if it's just a simple initialization it might take only a few nanoseconds), then waiting a whole millisecond is an unnecessary delay.
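
[To make that concrete, here is a hedged sketch of the same wait expressed with an OS primitive, a mutex plus condition variable in C++11 terms; again this is illustrative, not Boost's code, and call_once_blocking and the globals are invented names. The waiting thread is descheduled by the OS and woken as soon as the initializer finishes, rather than after the next sleep tick.]

    #include <condition_variable>
    #include <mutex>

    std::mutex g_m;
    std::condition_variable g_cv;
    bool g_started = false;
    bool g_done = false;

    template <typename F>
    void call_once_blocking(F f)
    {
        std::unique_lock<std::mutex> lk(g_m);
        if (!g_started) {
            g_started = true;
            lk.unlock();
            f();                                  // run the once routine outside the lock
            lk.lock();
            g_done = true;
            g_cv.notify_all();                    // wake every blocked waiter at once
        } else {
            g_cv.wait(lk, []{ return g_done; });  // OS blocks this thread; no polling
        }
    }

[notify_all wakes all waiters together, so the latency a waiter sees is bounded by the wakeup cost rather than by a fixed sleep interval.]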
POSIX provides pthread_once. We should use it.
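
[For completeness, pthread_once usage is about as small as it gets; this is just a sketch, with init_routine and ensure_initialised as placeholder names.]

    #include <pthread.h>

    static pthread_once_t once_control = PTHREAD_ONCE_INIT;

    static void init_routine()
    {
        // one-time initialisation goes here
    }

    void ensure_initialised()
    {
        // Safe to call from any number of threads; a thread that loses the
        // race blocks in the OS until init_routine has completed, and all
        // later calls return immediately.
        pthread_once(&once_control, init_routine);
    }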
Do have a look at the analysis that I did for my ARM atomic shared_ptr code: http://thread.gmane.org/gmane.comp.lib.boost.devel/164564/focus=164893

If the probability of contention is very low, then on average adding even one instruction to the non-contended case, or occupying more icache space with yield() calls, may slow the program down more than yielding on contention would speed it up. The probability of contention depends crucially on the duration of the critical section, and I imagine that this could vary enormously for "once" functions, i.e. anything from a couple of instructions to seconds. So it might be worthwhile having different types of "once" for these different cases - and the same could also be said of mutexes.

Take care with the pthreads option. I spent a while trying to understand what the Linux pthreads implementation (in glibc) does (for ARM), and it eventually boils down to much the same as I had written. However it's almost an order of magnitude slower, and I believe that's because it involves a couple of function calls while mine is inline. Since pthreads is a C API, I think that the function call overhead is inevitable. So I have put investigating replacing the pthreads mutexes used by boost.threads with asm on my to-do list (though it may never reach the top).

Having said all that, does anyone really worry much about "once" performance? It's not like shared_ptr, where code that uses it may be doing atomic reference count changes fairly continuously.

Regards, Phil.
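
[The shape Phil describes, a single inline atomic check on the non-contended path with the blocking machinery only on the slow path, could be sketched roughly like this in C++11 terms; this is not his ARM asm or the glibc code, and fast_once and g_initialised are illustrative names.]

    #include <atomic>
    #include <mutex>

    std::atomic<bool> g_initialised(false);
    std::mutex g_slow_mutex;

    template <typename F>
    inline void fast_once(F f)
    {
        // Hot path: a single inline atomic load, no out-of-line call once done.
        if (g_initialised.load(std::memory_order_acquire))
            return;

        // Slow path: the first call (or a rare race) falls back to a blocking primitive.
        std::lock_guard<std::mutex> lk(g_slow_mutex);
        if (!g_initialised.load(std::memory_order_relaxed)) {
            f();
            g_initialised.store(true, std::memory_order_release);
        }
    }

[The already-initialised path is a load plus a branch, which is the property being contrasted above with the out-of-line pthreads function calls.]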