[thread] boost::call_once() optimization idea on Win32

Hello, here's a small optimization idea for boost::call_once() on Win32. The code currently looks like this: ---------------------------------------------- template<typename Function> void call_once(once_flag& flag,Function f) { // Try for a quick win: if the procedure has already been called // just skip through: long const function_complete_flag_value=0xc15730e2; if(::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_value) { void* const mutex_handle(::boost::detail::create_once_mutex(&flag)); BOOST_ASSERT(mutex_handle); detail::win32::handle_manager const closer(mutex_handle); detail::win32_mutex_scoped_lock const lock(mutex_handle); if(flag!=function_complete_flag_value) { f(); BOOST_INTERLOCKED_EXCHANGE(&flag,function_complete_flag_value); } } } ---------------------------------------------- The initial flag value of zero (=BOOST_ONCE_INIT) is set at load time via static initialization. It should be safe to do a non-interlocked read of the flag like this: ---------------------------------------------- if(flag!=function_complete_flag_value && ::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_value) ---------------------------------------------- If our non-interlocked read would see something different than "function_complete_flag_value", we still do the interlocked read, too. For normal operation inside a thread safe singleton, this saves us from always issuing a memory barrier. Any flaws with this approach? Also one more (silly?) question: "flag" is not a volatile variable. Does boost::detail::interlocked_read_acquire() make sure the value doesn't get cached inside a register? IMHO we lock the mutex and we might still hold a cached, old value of "flag" in a register. -> Do we need an interlocked read here, too? Or mark the flag type "volatile"? (http://msdn.microsoft.com/en-us/magazine/cc163405.aspx#S3 look at figure 7 and the text below) Best regards, Thomas Jarosch

On Wed, Nov 17, 2010 at 11:15 AM, Thomas Jarosch
Hello,
here's a small optimization idea for boost::call_once() on Win32.
It should be safe to do a non-interlocked read of the flag like this: ---------------------------------------------- if(flag!=function_complete_flag_value && ::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_value) ----------------------------------------------
If our non-interlocked read would see something different than "function_complete_flag_value", we still do the interlocked read, too. For normal operation inside a thread safe singleton, this saves us from always issuing a memory barrier.
Any flaws with this approach?
Yes. Your thread will see flag correctly, but the "important data" being protected/initted by the once, might not bee seen correctly. Typically the code essentially boils down to (if you imagine the call_once being "inlned", since to the CPU, it is all "inline"): First Thread in ("wins", thus does the init) important_data = new ...; // the init call, protected by call_once() flag = function_complete_flag_value; Second Thread: if (flag == function_complete_flag_value) // inside call_once() use_important_data(important_data); // outside call_once() The CPU (not just the compiler, but the CPU, memory subsystem, etc) can reorder that as: First Thread: flag = function_complete_flag_value; important_data = new ...; Second Thread: register temp = important_data; // start memory read early, so that we don't wait for reads to complete if (flag == function_complete_flag_value) use_important_data(temp); See the problem? important_data may be read before it is ready. You need a memory barrier *in each thread* to ensure neither reordering happens. In particular, the memory barrier in the interlocked_read_acquire(&flag) prevents the reordering in the Second Thread.
Also one more (silly?) question: "flag" is not a volatile variable. Does boost::detail::interlocked_read_acquire() make sure the value doesn't get cached inside a register? IMHO we lock the mutex and we might still hold a cached, old value of "flag" in a register. -> Do we need an interlocked read here, too? Or mark the flag type "volatile"?
volatile is almost useless is threaded programming. It is typically both insufficient for threads (ie no memory barrier) and at the same time superfluous - when inside a mutex - as the mutex handles the barrier for you. volatile only helps with the compiler - it doesn't control what the CPU might then do to your instructions (like reorder them, do speculative execution, etc), so it doesn't help much with threads running on separate CPUs. And MS's version of volatile (which _does_ enforce memory barriers) is non-standard. So don't use it in portable code. ie don't use it at all.
(http://msdn.microsoft.com/en-us/magazine/cc163405.aspx#S3 look at figure 7 and the text below)
Best regards, Thomas Jarosch
Tony P.S. I will hopefully be doing another talk on this stuff at BoostCon in May - you should go!

Hello Tony, On Thursday, 18. November 2010 05:13:03 Gottlob Frege wrote:
The CPU (not just the compiler, but the CPU, memory subsystem, etc) can reorder that as:
First Thread: flag = function_complete_flag_value; important_data = new ...;
Second Thread: register temp = important_data; // start memory read early, so that we don't wait for reads to complete if (flag == function_complete_flag_value) use_important_data(temp);
See the problem? important_data may be read before it is ready.
Thanks for the detailed explanation! As you pointed out, the second thread will mess up without the barrier. Time to fix my code :) One more small thing: IMHO the write reordering in the first thread can't happen because of the memory barrier created by interlockedExchange(). The first thread translates to this: important_data = new ...; // the init call lock // create memory barrier: // - Don't allow reordering // from below the barrier // - Finish all outstanding writes flag = function_complete_flag_value;
Also one more (silly?) question: "flag" is not a volatile variable. Does boost::detail::interlocked_read_acquire() make sure the value doesn't get cached inside a register? IMHO we lock the mutex and we might still hold a cached, old value of "flag" in a register. -> Do we need an interlocked read here, too? Or mark the flag type "volatile"?
volatile is almost useless is threaded programming. It is typically both insufficient for threads (ie no memory barrier) and at the same time superfluous - when inside a mutex - as the mutex handles the barrier for you.
True that.
volatile only helps with the compiler - it doesn't control what the CPU might then do to your instructions (like reorder them, do speculative execution, etc), so it doesn't help much with threads running on separate CPUs. And MS's version of volatile (which _does_ enforce memory barriers) is non-standard. So don't use it in portable code. ie don't use it at all.
Let me illustrate it a bit more: - register temp = interlocked_read(&flag) // fetch from mem location xyz -> Init flag not set, so execute init code: - create mutex (also creates memory barrier) - Another thread 2 already entered the mutex, executes the init code and does an interlocked write of "flag". Then it leaves the mutex so thread 1 can continue. - Thread 1 re-reads the flag without interlocked read or volatile: The compiler recognizes it's the same memory location xyz and uses the cached value from "register temp". So we would redo the initialization. Don't we need an interlocked read here, too? Or does the mutex/memory barrier ensure the compiler isn't allowed to do register caching?
P.S. I will hopefully be doing another talk on this stuff at BoostCon in May - you should go!
Nice! Too bad "www.boostcon.com" currently issues a "500 - Internal Server Error" :o) Best regards, Thomas Jarosch

On Thursday, 18. November 2010 11:50:55 you wrote:
Or does the mutex/memory barrier ensure the compiler isn't allowed to do register caching?
Ok, quickly checked this with gcc 4.4.4 on linux: -test.c------------------ int *foobar = 0x1234; int a; int main(void) { __sync_synchronize(); a = *foobar; __sync_synchronize(); return *foobar; } ------------------------- gcc -O4 -S test.c: main: .LFB0: .cfi_startproc mfence movq foobar(%rip), %rax movl (%rax), %eax movl %eax, a(%rip) mfence movq foobar(%rip), %rax movl (%rax), %eax ret .cfi_endproc -> The memory barrier invalidates the register cache. Without the second memory barrier in the C source, the compiler uses the register cached value of *foobar. Thanks for your help on this again, Tony. Cheers, Thomas

On Thu, Nov 18, 2010 at 5:50 AM, Thomas Jarosch
Hello Tony,
Thanks for the detailed explanation! As you pointed out, the second thread will mess up without the barrier. Time to fix my code :)
One more small thing: IMHO the write reordering in the first thread can't happen because of the memory barrier created by interlockedExchange(). The first thread translates to this:
Yes, of course. Sorry, I removed that interlock as well in my example, since that question also often comes up, so I was basically just giving my standard explanation. I should have been more clear.
Let me illustrate it a bit more:
..
Don't we need an interlocked read here, too?
Or does the mutex/memory barrier ensure the compiler isn't allowed to do register caching?
Technically, since C++ still doesn't know anything about threads, there could be some register caching. But there isn't, because the compiler writers and thread packages work this all out for us, using whatever non-standard calls or extensions that are required. There are some papers out there such as "Threads cannot be done in a library" by Boehm http://www.stanford.edu/class/cs240/readings/p261-boehm.pdf going into more detail about this. In particular, exotic cases such as: if (some_condition) lock(mutex); for (i = 0; i < N; i++) { do_stuff(); if (some_condition) x++; } if (some_condition) unlock(mutex); So the programmer's intent is that x is only updated sometimes, and that happens to be the only time that the mutex is needed. This is a case that, without being careful, compilers can optimize incorrectly. C++0x will take care of these cases. But beyond some exotic cases with aggresive optimizations, normal mutexes with non-volatiles just works. And I don't think volatiles fixes the cases listed in the paper.
Nice! Too bad "www.boostcon.com" currently issues a "500 - Internal Server Error" :o)
!

On Thu, Nov 18, 2010 at 5:50 AM, Thomas Jarosch
Nice! Too bad "www.boostcon.com" currently issues a "500 - Internal Server Error" :o)
Fixed, thanks. Next time please make a louder stink on this list, though (like in a subject line); I could have easily missed this. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, Nov 17, 2010 at 11:13 PM, Gottlob Frege
Any flaws with this approach?
Yes.
Your thread will see flag correctly, but the "important data" being protected/initted by the once, might not bee seen correctly.
<snip nice long explanation> Another way to say it is that you are essentially trying to do the classic "double-checked locking" (DCL) pattern, which is well-known not to work and you can find plenty of explanations for. In fact, the BOOST_THREAD_ONCE* stuff is there to solve exactly that problem. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
participants (3)
-
Dave Abrahams
-
Gottlob Frege
-
Thomas Jarosch