
I don't regularly read the Boost mailing list, so please reply directly to me or Mike.
Sean
Begin forwarded message:
From: "Mike Schuster" <schuster@adobe.com>
Date: September 15, 2005 3:42:26 PM PDT
Subject: Boost thread library bugs
Here is a summary of several bugs I've discovered over the past few months in the Boost thread library (version 1_32_0). Sean, please forward this email to the Boost thread developers. Thanks.
1) On the PowerPC, the sequence of memory write operations executed by one processor may be seen by another processor or device in a different order. This weak write ordering property implies that when modifying a shared resource, the modifying processor must execute a sync instruction to make these modifications visible to all other processors before releasing the lock. I discovered several situations in the Boost thread library where a sync call is missing.
call_once: Immediately after the client function returns, a lock variable is set to one. Other processors may see this lock equal to one before all memory write operations performed by the client function are completed. A call to __sync() should be made immediately prior to setting the lock to one.
synchronization class constructors (mutex, read_write_mutex, condition, etc.): Once the class constructor returns, Boost provides an API where other threads are free to call the synchronization member functions. However, the memory write operations performed by the constructor may not have been completed when the member functions are executed by a different processor. So a call to __sync() should be made immediately prior to returning from the constructor.
Note that a similar situation occurs between member function calls. However, the MacOS synchronization primitives used by Boost do perform a sync, so correct operation is guaranteed implicitly as long as the last operation performed by a member function is a call to an OS synchronization primitive. This appears to be the situation in many places, but there may be places in the Boost library where this requirement is not met. So someone needs to review all of the source code for problems of this sort.
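For illustration only, here is a minimal sketch of the ordering requirement described in item 1, written with C++11 atomics rather than the __sync() call used on the PowerPC. Boost 1_32_0 predates std::atomic, so this is not how the library is written, and the names Shared, publish, consume, and run_publish_demo are hypothetical. The release store plays the role of "sync, then set the lock to one", and the acquire load is the matching read on the other processor.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names; none of this is taken from the Boost sources.
struct Shared {
    int payload = 0;            // ordinary data written before publication
    std::atomic<int> ready{0};  // plays the role of the lock/flag variable
};

// Writer: complete all ordinary writes, then set the flag with release
// ordering so the earlier writes cannot become visible after the flag.
inline void publish(Shared& s, int value) {
    s.payload = value;
    s.ready.store(1, std::memory_order_release);
}

// Reader: wait until the flag is observed with acquire ordering, then
// read the data that was written before the flag was set.
inline int consume(const Shared& s) {
    while (s.ready.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    return s.payload;
}

// Runs one writer and one reader thread; returns the value the reader saw.
inline int run_publish_demo(int value) {
    Shared s;
    int seen = -1;
    std::thread reader([&] { seen = consume(s); });
    std::thread writer([&] { publish(s, value); });
    writer.join();
    reader.join();
    assert(s.ready.load() == 1);  // the flag was published
    return seen;
}
```

Calling run_publish_demo(42) should return 42. If the store to ready used relaxed ordering instead, a weakly ordered processor would be allowed to make the flag visible before payload, which is the failure mode described above.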
2) On the PowerPC, I have seen situations where call_once deadlocks in the MPRemoteCall function. I have not been able to diagnose the problem. Deadlocks occur when call_once is executed by non-main threads. I believe I have a solution to the problem which uses a completely different implementation similar to that of the Win32 version and avoids all calls to MPRemoteCall. Maybe I should submit this solution to the Boost developers for consideration.
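The proposed replacement is not included in this email. Purely as an illustration of a once routine that needs no remote call, here is a generic mutex-guarded, double-checked sketch in C++11. It is an assumption about the general shape of such a design, not the actual solution and not the Win32 code in Boost; OnceFlag, call_once_sketch, and run_once_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical once-flag; not Boost's once_flag.
struct OnceFlag {
    std::atomic<bool> done{false};
    std::mutex guard;
};

// Runs fn exactly once across all threads that share the same flag.
template <typename F>
void call_once_sketch(OnceFlag& flag, F&& fn) {
    // Fast path: the acquire load pairs with the release store below, so a
    // thread that sees done == true also sees everything fn wrote.
    if (flag.done.load(std::memory_order_acquire)) {
        return;
    }
    std::lock_guard<std::mutex> lock(flag.guard);
    // Re-check under the mutex: another thread may have finished first.
    if (!flag.done.load(std::memory_order_relaxed)) {
        fn();
        // Publish completion only after fn has returned.
        flag.done.store(true, std::memory_order_release);
    }
}

// Starts num_threads threads that all race to run the same initializer;
// returns how many times the initializer actually ran.
inline int run_once_demo(int num_threads) {
    OnceFlag flag;
    int calls = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&] {
            call_once_sketch(flag, [&] { ++calls; });
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    assert(flag.done.load());
    return calls;
}
```

run_once_demo(8) should return 1. Every caller runs the initializer check on its own thread, so there is no dependence on the main thread, and the release store on done gives the ordering that item 1 asks for.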
3) I discovered a deadlock in read_write_mutex. If either of the alternating scheduling policies is used, the implementation will deadlock the first reader to arrive when no writers are active. The deadlock occurs in the function void read_write_mutex_impl<Mutex>::do_read_lock. If m_state == 0 and m_num_readers_to_wait == 0 (this holds immediately on construction), then an arriving reader will hang indefinitely on m_waiting_readers. There are other related situations where the BOOST_ASSERT on loop_count fails.
Note: I am concerned that such a blatant flaw is present in the library. It implies that the library has not been very well tested. This is especially worrisome for a thread library, where threading bugs can be extremely frustrating and hard for users of the library to reproduce and diagnose.
-Mike Schuster