A fast-pathed rw-mutex algorithm for Boost...

Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly: http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f... It can use DWCAS on 32-bit systems, and CAS on 64-bit systems, so it's fairly portable... I understand that Boost has a problem with its current rw-mutex? This one should blow it away wrt performance. I believe that the older Boost implementation made use of condition variables and mutexes, which are not needed... You can use sem_t or SysV semaphores for the slow paths... Any thoughts?
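For illustration only, here is a minimal sketch of the packing idea behind "DWCAS on 32-bit systems, CAS on 64-bit systems": if the whole reader/writer state fits in one 64-bit word, a 64-bit target can update it with an ordinary CAS, while IA-32 needs a cmpxchg8b-style double-word CAS for the same 8-byte exchange. The field names and widths below are invented for the example and are not taken from Chris's pseudo-code.

#include <atomic>
#include <cstdint>

// Hypothetical 8-byte state word; the exact fields in Chris's design differ.
struct rw_state
{
    std::uint32_t reads;    // active readers
    std::uint16_t writes;   // active/pending writers
    std::uint16_t waiters;  // blocked threads
};

static_assert( sizeof( rw_state ) == 8, "state must fit in one double-word" );

// std::atomic over the trivially copyable 8-byte struct provides the
// compare-exchange; on typical 64-bit targets this lowers to a plain CAS,
// on IA-32 to a cmpxchg8b-style double-word CAS.
std::atomic<rw_state> g_state{ rw_state{} };

bool try_add_reader()
{
    rw_state cur = g_state.load();
    rw_state next = cur;
    if( next.writes != 0 ) return false;  // a writer is active or pending
    ++next.reads;
    return g_state.compare_exchange_strong( cur, next );
}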

Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work. Don't you only need one bit for writes though? I have one at http://pdimov.com/cpp/rw_mutex.cpp but your version is better. I see no need for assembly however.

"Peter Dimov" <pdimov@mmltd.net> writes:
Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work. Don't you only need one bit for writes though?
It's essentially the same algorithm I've used in my latest offering on the thread_rewrite branch. http://boost.cvs.sourceforge.net/boost/boost/boost/thread/win32/read_write_mutex.hpp?revision=1.1.2.8&view=markup&pathrev=thread_rewrite I use 3 bits for the actual state (1 bit for currently shared, 1 for currently exclusive, and 1 for shared with upgradable), a 14 bit count of waiting writers, and a 15 bit count of waiting/running readers, so it all fits in a 32-bit word, and I can use plain CAS rather than DWCAS. I've also extended the algorithm to cover upgradeable locks. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
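As a reading aid, the packing Anthony describes can be pictured as a 32-bit bit-field; only the widths come from his description, the field names here are made up:

struct thread_rewrite_state
{
    unsigned shared            : 1;   // currently locked shared
    unsigned exclusive         : 1;   // currently locked exclusive
    unsigned shared_upgradable : 1;   // locked shared with upgradable
    unsigned waiting_writers   : 14;  // count of waiting writers
    unsigned readers           : 15;  // count of waiting/running readers
};
// 1 + 1 + 1 + 14 + 15 = 32 bits, so the whole state fits in one machine
// word and can be updated with a plain 32-bit CAS instead of DWCAS.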

Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work. Don't you only need one bit for writes though?
It's essentially the same algorithm I've used in my latest offering on the thread_rewrite branch.
Do you plan to document the algorithm? I'm interested in the following scenario:
- K active readers are holding the lock
- a writer is waiting for the above
- L readers are waiting for the writer
Since you seem to only store K+L (Chris's algorithm has separate counts), how do you know when to signal the writer?

"Peter Dimov" <pdimov@mmltd.net> writes:
Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work. Don't you only need one bit for writes though?
It's essentially the same algorithm I've used in my latest offering on the thread_rewrite branch.
Do you plan to document the algorithm?
Yes, later. Thanks for reviewing, Peter. The final Boost code can only be better because of it. Thanks also to Chris for his pseudo-code.
I'm interested in the following scenario:
- K active readers are holding the lock
- a writer is waiting for the above
- L readers are waiting for the writer
Since you seem to only store K+L (Chris's algorithm has separate counts), how do you know when to signal the writer?
Maybe it's not /quite/ the same algorithm, but it's certainly very similar. The answer to your question is "when all K+L readers have finished", i.e., my algorithm gives readers priority.

Maybe this isn't such a good thing; in the "common" scenario of one writer thread and many readers, a waiting writer needs to block waiting readers, otherwise the writer might starve if the readers overlap each other. Chris's algorithm goes to the other extreme --- writers always take priority, and no readers will take the lock if there are any waiting writers. If there is more than one writer thread, then the readers might starve.

The way I see it, the ideal is that a waiting writer will block any fresh readers until all the current readers have gone away, but once that happens, and the lock is released, the waiting writer and the waiting readers will compete equally. To do that, you should be able to get away with a single bit for waiting writers, too --- either there are waiting writers, or there aren't. Likewise for waiting readers. When the last reader, or the only writer, unlocks, it clears both the waiting readers and waiting writers flags, and unblocks all waiting threads (signals an event). The waiting threads then compete. If a reader gets there first, then subsequent readers are allowed in until a writer thread wakes up. Once a writer either gets the lock, or flags that it's waiting, then the remaining readers are blocked until the next wakeup.

Thoughts?

Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
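The policy described above can be illustrated with a plain mutex and condition variable, ignoring the lock-free fast path the thread is really after. This is only a sketch of the described behaviour (a waiting writer blocks fresh readers; on release everybody is woken and competes), not the thread_rewrite code; all names are invented.

#include <condition_variable>
#include <mutex>

class rw_policy_sketch
{
    std::mutex m_;
    std::condition_variable cv_;
    unsigned readers_ = 0;         // threads currently holding the lock shared
    bool writer_ = false;          // a thread currently holds the lock exclusive
    bool writer_waiting_ = false;  // "waiting writers" flag: blocks fresh readers

public:
    void lock_shared()
    {
        std::unique_lock<std::mutex> lk( m_ );
        cv_.wait( lk, [&]{ return !writer_ && !writer_waiting_; } );
        ++readers_;
    }

    void unlock_shared()
    {
        std::unique_lock<std::mutex> lk( m_ );
        if( --readers_ == 0 )
        {
            writer_waiting_ = false;  // clear the flag; woken threads compete afresh
            cv_.notify_all();
        }
    }

    void lock()
    {
        std::unique_lock<std::mutex> lk( m_ );
        for( ;; )
        {
            writer_waiting_ = true;   // re-register on every wakeup, blocking fresh readers
            if( !writer_ && readers_ == 0 ) break;
            cv_.wait( lk );
        }
        writer_waiting_ = false;
        writer_ = true;
    }

    void unlock()
    {
        std::unique_lock<std::mutex> lk( m_ );
        writer_ = false;
        writer_waiting_ = false;      // clear the flag; woken threads compete afresh
        cv_.notify_all();
    }
};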

"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
"Peter Dimov" <pdimov@mmltd.net> writes:
Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work. Don't you only need one bit for writes though?
It's essentially the same algorithm I've used in my latest offering on the thread_rewrite branch.
[...]
The answer to your question is "when all K+L readers have finished". i.e., my algorithm gives readers priority.
Maybe this isn't such a good thing; in the "common" scenario of one writer thread and many readers, a waiting writer needs to block waiting readers, otherwise the writer might starve if the readers overlap each other.
Chris's algorithm goes to the other extreme --- writers always take priority, and no readers will take the lock if there are any waiting writers. If there is more than one writer thread, then the readers might starve.
The reason I give writer priority is that I anticipated there will be a heck of a lot more readers than writers throughout the entire lifetime of the lock... I wanted to be able to get what few writes there may be through during periods of heavy and persistent reader activity; lots of readers should be implied. If the writes only happen once in a while, then they should be given priority over the stampeding herds of reader threads... I can't really imagine a practical situation in which periods of moderate/heavy persistent writer activity could starve readers... IMHO, that could only mean that the rw-mutex is being used improperly; it should only be used for "mostly-read" based algorithms...
The way I see it, the ideal is that a waiting writer will block any fresh readers until all the current readers have gone away, but once that happens, and the lock is released, the waiting writer and the waiting readers will compete equally.
[...]

Well, IMHO: A rw-mutex usually makes readers block/wait for all "prior writes" and the "last write broadcasts". A write usually blocks/waits for all "prior reads/writes", and the "last read signals/last write broadcasts"...

Anyway, I believe that it is sufficient to prevent the starvation of writers, because a high number of readers is simply implied wrt rw-mutexes... So, it naturally seems that writes would be the only thing that could get subjected to any kind of starvation...

Again, if your rw-mutex is getting hit with enough writes to even begin to start to starve the readers, then the user of the rw-mutex is using it improperly... IMHO of course...

"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
The way I see it, the ideal is that a waiting writer will block any fresh readers until all the current readers have gone away, but once that happens, and the lock is released, the waiting writer and the waiting readers will compete equally.
[...]
Well, IMHO:
A rw-mutex usually makes readers block/wait for all "prior writes" and the "last write broadcasts". A write usually blocks/waits for all "prior reads/writes", and the "last read signals/last write broadcasts"...
Anyway, I believe that it is sufficient to prevent the starvation of writers because a high number of readers is simply implied wrt rw-mutexes... So, it naturally seems that writes would be the only thing that could get subjected to any kind of starvation...
Again, if your rw-mutex is getting hit with enough writes to even begin to start to starve the readers, then the user of the rw-mutex is using it improperly... IMHO of course...
I agree that preventing writers from starving should be sufficient in most cases. I just happen to think that it's less than ideal.

I was imagining a system with several hundred readers, and a couple of writers. If the writers take sufficiently long, you might find that you have one waiting on the other. The readers then get starved if you give writers priority, just as the writers get starved if you give readers priority.

If, when the first writer unlocks, the readers compete with the waiting writer, then most of the time some readers will get in before the writer re-registers that it's waiting, and blocks the rest. You could just unblock one thread, and if it's a reader then it unblocks all currently-waiting readers. Rather like a condition variable broadcast....

Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:zmbl66cr.fsf@yahoo.com...
"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
The way I see it, the ideal is that a waiting writer will block any fresh readers until all the current readers have gone away, but once that happens, and the lock is released, the waiting writer and the waiting readers will compete equally.
[...]
Again, if your rw-mutex is getting hit with enough writes to even begin to start to starve the readers, then the user of the rw-mutex is using it improperly... IMHO of course...
I agree that preventing writers from starving should be sufficient in most cases. I just happen to think that it's less than ideal.
I was imagining a system with several hundred readers, and a couple of writers. If the writers take sufficiently long, you might find that you have one waiting on the other. The readers then get starved if you give writers priority, just as the writers get starved if you give readers priority.
Well, if you are using hundreds of reader threads, then basically any rw-mutex is less than ideal in most cases, IMO: http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce Humm...

"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:zmbl66cr.fsf@yahoo.com...
"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
[...]
If, when the first writer unlocks, the readers compete with the waiting writer, then most of the time some readers will get in before the writer re-registers that it's waiting, and blocks the rest. You could just unblock one thread, and if it's a reader then it unblocks all currently-waiting readers. Rather like a condition variable broadcast....
I could adjust the logic of my algorithm so that it behaves exactly like that...

Chris Thomasson wrote:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:zmbl66cr.fsf@yahoo.com...
"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
[...]
If, when the first writer unlocks, the readers compete with the waiting writer, then most of the time some readers will get in before the writer re-registers that it's waiting, and blocks the rest. You could just unblock one thread, and if it's a reader then it unblocks all currently-waiting readers. Rather like a condition variable broadcast....
I could adjust the logic of my algorithm so that it behaves exactly like that...
I implemented a variation of your algorithm that has
- active reader count
- pending/active writer bit (+semaphore)
- blocked reader+writer count (+semaphore)
The last active reader unblocks the pending writer, if there is one, else the rest; the writer unblocks the rest on unlock ("thundering herd"). However, I'm getting worse performance out of it compared to my other prototype on a dual-core Pentium D on Windows. I also tried to separate the blocked readers and blocked writers and only unblock one writer at a time. No change in performance. Maybe I've made a mistake somewhere. :-)

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:00d401c6f7b8$4930dce0$6607a8c0@pdimov2...
Chris Thomasson wrote:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:zmbl66cr.fsf@yahoo.com...
"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
[...]
I could adjust the logic of my algorithm so that it behaves exactly like that...
I implemented a variation of your algorithm
[...]
However I'm getting worse performance out of it compared to my other prototype on a dual core Pentium D on Windows. I also tried to separate the blocked readers and blocked writers and only unblock one writer at a time. No change in performance.
How does the original algorithm compare to yours? Also, how many reader/writer threads are you using, and how many writes do you perform on average, 'per-writer-thread', let's say every 3 or 4 seconds?
Maybe I've made a mistake somewhere. :-)
http://groups.google.com/group/comp.programming.threads/msg/747ea4666afc6826 http://groups.google.com/group/comp.programming.threads/msg/b32bd6781aa6108b ;^)

Chris Thomasson wrote:
How does the original algorithm compare to yours?
I haven't tried it yet. Will do.
Also, how many reader/writer threads are you using, and how many writes do you perform on average, 'per-writer-thread', let's say every 3 or 4 seconds?
Various. Here's the test:

#include <boost/thread/thread.hpp>
#include <boost/bind.hpp>
#include <vector>
#include <iostream>
#include <ctime>

int const N = 1048576;
int const n = 1000; // vector size

rw_mutex mtx;
std::vector<int> v( n );

boost::mutex cmx;

//

void reader( int r )
{
    for( int i = 0; i < N; ++i )
    {
        rw_mutex::scoped_read_lock lock( mtx );

        int m = v.front();

        for( std::vector<int>::const_iterator i = v.begin(), end = v.end(); i != end; ++i )
        {
            assert( *i == m );
        }
    }

    boost::mutex::scoped_lock lock( cmx );
    std::cout << " R" << r;
}

void writer( int w )
{
    for( int i = 0; i < N; ++i )
    {
        rw_mutex::scoped_write_lock lock( mtx );

        int m = v.front();

        for( std::vector<int>::iterator i = v.begin(), end = v.end(); i != end; ++i )
        {
            ++*i;
            assert( *i == m + 1 );
        }
    }

    boost::mutex::scoped_lock lock( cmx );
    std::cout << " W" << w;
}

void test( int nr, int nw )
{
    try
    {
        std::cout << nr << "R + " << nw << "W:";

        std::time_t tm = std::time( 0 );

        boost::thread_group group;

        int i = 0;

        for( ; i < nr && i < nw; ++i )
        {
            group.create_thread( boost::bind( writer, i ) );
            group.create_thread( boost::bind( reader, i ) );
        }

        for( ; i < nr; ++i )
        {
            group.create_thread( boost::bind( reader, i ) );
        }

        for( ; i < nw; ++i )
        {
            group.create_thread( boost::bind( writer, i ) );
        }

        group.join_all();

        std::cout << ": " << std::time( 0 ) - tm << std::endl;
    }
    catch( std::exception const & x )
    {
        std::cout << x.what() << std::endl;
    }
}

int main()
{
    test( 0, 1 ); test( 0, 4 );
    test( 1, 0 ); test( 1, 1 ); test( 1, 4 );
    test( 2, 0 ); test( 2, 1 ); test( 2, 4 );
    test( 4, 0 ); test( 4, 1 ); test( 4, 4 );
    test( 16, 0 ); test( 16, 1 ); test( 16, 4 );
    test( 64, 0 ); test( 64, 1 ); test( 64, 4 );
}

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
How does the original algorithm compare to yours?
I haven't tried it yet. Will do.
Humm... I would also be interested in seeing a comparison between your implementation of the algorithm and the prototype I am currently whipping up in assembly... Humm... Calls to a DWCAS function in a loop, in C/C++, can be fairly expensive when compared to a tight cmpxchg8b loop in assembly that makes no function calls and has no volatile/compiler-barrier constraints per loop... Perhaps 2-4 instructions per loop... [...]
Various. Here's the test:
I need to look at this more closely...

Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
How does the original algorithm compare to yours?
I haven't tried it yet. Will do.
Humm...
Hmmm. Doesn't seem to work. What does your dwcas return on success in &cmp? Leaves it unchanged, perhaps? I always return the current value of &mtx there and it seems that your code depends on &cmp only changing if the dwcas fails.

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:002b01c6f7c2$17116590$6607a8c0@pdimov2...
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
How does the original algorithm compare to yours?
I haven't tried it yet. Will do.
Humm...
Hmmm. Doesn't seem to work. What does your dwcas return on success in &cmp? Leaves it unchanged, perhaps? I always return the current value of &mtx there and it seems that your code depends on &cmp only changing if the dwcas fails.
I am using an IBM-style CAS: http://groups.google.com/group/comp.programming.threads/msg/bf68fe51225f1cbb http://groups.google.com/group/comp.programming.threads/msg/b64b2419a6b05bb6 It updates the comparand and returns non-zero on failure. Here is exactly how it's implemented; look at the very first function in this file: http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_mingw_asm.html or MASM: http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_masm_asm.html Does that clear some things up?
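For readers following the exchange: the point of the IBM-style interface is that a failed CAS hands back the freshly observed value through the comparand, so retry loops never reload the target by hand. Below is a minimal, self-contained illustration of that convention, shrunk to a plain 32-bit word and using only the Windows API; it is not Chris's AppCore code.

#include <windows.h>

// Returns true on *failure*, after refreshing the comparand with the value
// actually observed at 'dest'; returns false on success. Same convention as
// the dwcas discussed above, but for a single 32-bit word.
inline bool cas_failed( long volatile * dest, long * comparand, long exchange )
{
    long const observed = InterlockedCompareExchange( dest, exchange, *comparand );
    if( observed == *comparand ) return false; // success
    *comparand = observed;                     // failure: comparand now holds the current value
    return true;
}

inline void atomic_add( long volatile * dest, long value )
{
    long cmp = *dest;
    while( cas_failed( dest, &cmp, cmp + value ) )
    {
        // cmp was refreshed by the failed CAS; just compute the new value and retry
    }
}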

Peter Dimov wrote:
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
How does the original algorithm compare to yours?
I haven't tried it yet. Will do.
Humm...
Hmmm. Doesn't seem to work.
Works now.

The performance is somewhat better than my other semaphore-based attempts in most common scenarios. But it still doesn't beat the "naive" implementation at http://pdimov.com/cpp/rw_mutex.cpp

Here's the entire header:

#ifndef BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED
#define BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED

// MS compatible compilers support #pragma once

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

// Copyright (c) 2005 Chris Thomasson
// Copyright (c) 2006 Peter Dimov
//
// Distributed under the Boost Software License, Version 1.0.
// See accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)

#include <boost/detail/interlocked.hpp>
#include <cassert>
#include <windows.h>

bool dwcas( void * d, void * c, const void * x )
{
    long ov = *(long*)c;
    long nv = BOOST_INTERLOCKED_COMPARE_EXCHANGE( (long*)d, *(long const*)x, ov );

    if( ov == nv )
    {
        return false;
    }
    else
    {
        *(long*)c = nv;
        return true;
    }
}

class rw_mutex
{
private:

    rw_mutex( rw_mutex const & );
    rw_mutex & operator=( rw_mutex const & );

private:

    struct state
    {
        unsigned reads: 10;
        unsigned writes: 2;
        unsigned r_waits: 10;
        unsigned w_waits: 10;
    };

    state mtx_;

    HANDLE sema_r_;
    HANDLE sema_w_;

public:

    rw_mutex()
    {
        state st = { 0 };
        mtx_ = st;

        sema_r_ = CreateSemaphore( 0, 0, 1023, 0 );
        sema_w_ = CreateSemaphore( 0, 0, 1023, 0 );
    }

    ~rw_mutex()
    {
        CloseHandle( sema_r_ );
        CloseHandle( sema_w_ );
    }

    void rdlock()
    {
        for( ;; )
        {
            state cmp = mtx_, xchg;

            do
            {
                xchg = cmp;

                if( !cmp.writes && !cmp.w_waits ) { ++xchg.reads; }
                else { ++xchg.r_waits; }
            }
            while( dwcas( &mtx_, &cmp, &xchg ) );

            if( !cmp.writes && !cmp.w_waits ) { break; }

            WaitForSingleObject( sema_r_, INFINITE );
        }
    }

    void lock()
    {
        for( ;; )
        {
            state cmp = mtx_, xchg;

            do
            {
                xchg = cmp;

                if( !cmp.reads && !cmp.writes ) { ++xchg.writes; }
                else { ++xchg.w_waits; }
            }
            while( dwcas( &mtx_, &cmp, &xchg ) );

            if( !cmp.reads && !cmp.writes ) { break; }

            WaitForSingleObject( sema_w_, INFINITE );
        }
    }

    void unlock()
    {
        state cmp = mtx_, xchg;

        do
        {
            assert( cmp.writes == 1 );
            assert( cmp.reads == 0 );

            xchg = cmp;
            --xchg.writes;

            if( cmp.w_waits ) { --xchg.w_waits; }
            else if( cmp.r_waits ) { xchg.r_waits = 0; }
        }
        while( dwcas( &mtx_, &cmp, &xchg ) );

        if( cmp.w_waits ) { ReleaseSemaphore( sema_w_, 1, 0 ); }
        else if( cmp.r_waits ) { ReleaseSemaphore( sema_r_, cmp.r_waits, 0 ); }
    }

    void rdunlock()
    {
        state cmp = mtx_, xchg;

        do
        {
            assert( cmp.writes == 0 );
            assert( cmp.reads > 0 );

            xchg = cmp;
            --xchg.reads;

            if( !xchg.reads && cmp.w_waits ) { --xchg.w_waits; }
            else if( cmp.r_waits ) { xchg.r_waits = 0; }
        }
        while( dwcas( &mtx_, &cmp, &xchg ) );

        if( !xchg.reads && cmp.w_waits ) { ReleaseSemaphore( sema_w_, 1, 0 ); }
        else if( cmp.r_waits ) { ReleaseSemaphore( sema_r_, cmp.r_waits, 0 ); }
    }

public: // lock classes

    class scoped_read_lock
    {
    private:

        rw_mutex & mx_;

        scoped_read_lock( scoped_read_lock const & );
        scoped_read_lock & operator=( scoped_read_lock const & );

    public:

        scoped_read_lock( rw_mutex & mx ): mx_( mx ) { mx_.rdlock(); }
        ~scoped_read_lock() { mx_.rdunlock(); }
    };

    class scoped_write_lock
    {
    private:

        rw_mutex & mx_;

        scoped_write_lock( scoped_write_lock const & );
        scoped_write_lock & operator=( scoped_write_lock const & );

    public:

        scoped_write_lock( rw_mutex & mx ): mx_( mx ) { mx_.lock(); }
        ~scoped_write_lock() { mx_.unlock(); }
    };
};

#endif // #ifndef BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:004001c6f7c5$4749e180$6607a8c0@pdimov2...
Peter Dimov wrote:
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
[...]
Works now.
The performance is somewhat better than my other semaphore-based attempts in most common scenarios. But it still doesn't beat the "naive" implementation at
Have you tried making use of DWCAS instead of CAS? Now I am really interested in whether my assembly implementation can beat your 'naive' version... I will post it in the next couple of days...

I am going to implement your algorithm and my algorithm on my UltraSPARC T1 (aka Niagara), crank up the contention level, and just let it pound away. Could have different results... I will make sure to post the assembly and source C applications that will be compatible with POSIX Threads and SPARC V9 64-bit Assembly Language... Could be an interesting little project? ;^) This should be complete within a week...

"Chris Thomasson" <cristom@comcast.net> wrote in message news:ehmbfs$ukj$1@sea.gmane.org...
I am going to implement your algorithm and my algorithm on my UltraSPARC T1(aka, Niagara), crank up the contention level, and just let it pound away. Could have different results...
I will make sure to post the assembly and source C applications that will be compatible with POSIX Threads and SPARC V9 64-bit Assembly Language...
Could be interesting little project?
I don't think I need to do this now... I think I know why your algorithm is getting better numbers on your test...

Now that I look at your algorithm from another point of view, I can see some possible advantages... I am still refining my analysis...

"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
How does the original algorithm compare to yours?
I haven't tried it yet. Will do.
Humm...
Hmmm. Doesn't seem to work.
Works now.
The performance is somewhat better than my other semaphore-based attempts in most common scenarios. But it still doesn't beat the "naive" implementation
Hi,

I have run a test against Peter's "naive" implementation, Peter's implementation of Chris's algorithm, my algorithm from the thread_rewrite branch (v1.1.2.8), my old condition-based algorithm from the thread_rewrite branch (v1.1.2.4), and a new variation of my latest offering that has a "waiting exclusive" bit, and blocks readers if that bit is set, in line with my last post on the subject. The test is essentially Peter's test, with a mod to count the max time any reader or any writer spends waiting, in order to check for starvation.

The tests were run on a Pentium D dual core machine. The output format is

num readers + num writers: : time-in-seconds max-reader-wait-in-milliseconds max-writer-wait-in-milliseconds

The total count N in Peter's test was 65536, in order to get runtimes I could handle.

The first thing to note is that my current V1.1.2.8 deadlocks, so I got the algorithm wrong :-(

The next thing to note is that the total runtimes of the remainder are all comparable. I would have to run longer tests to really tell them apart.

The next thing to note is that Peter's implementation of Chris's algorithm has some REALLY long wait times for readers --- 12 seconds! --- which follows from the "writer priority" nature of the algorithm.

Anthony

AAJW thread_rewrite v1.1.2.8
DEADLOCK!!!!!

AAJW condition based v1.1.2.4
0R + 1W: : 3 max r wait: 0, max w wait: 0
0R + 4W: : 14 max r wait: 0, max w wait: 297
1R + 0W: : 2 max r wait: 0, max w wait: 0
1R + 1W: : 5 max r wait: 46, max w wait: 32
1R + 4W: : 16 max r wait: 265, max w wait: 375
2R + 0W: : 2 max r wait: 16, max w wait: 0
2R + 1W: : 5 max r wait: 46, max w wait: 406
2R + 4W: : 16 max r wait: 94, max w wait: 296
4R + 0W: : 4 max r wait: 16, max w wait: 0
4R + 1W: : 8 max r wait: 63, max w wait: 360
4R + 4W: : 18 max r wait: 250, max w wait: 578
16R + 0W: : 17 max r wait: 31, max w wait: 0
16R + 1W: : 20 max r wait: 141, max w wait: 609
16R + 4W: : 33 max r wait: 188, max w wait: 1422
64R + 0W: : 67 max r wait: 47, max w wait: 0
64R + 1W: : 72 max r wait: 281, max w wait: 562
64R + 4W: : 86 max r wait: 468, max w wait: 984

AAJW new algo with waiting writer flag
0R + 1W: : 3 max r wait: 0, max w wait: 16
0R + 4W: : 14 max r wait: 0, max w wait: 187
1R + 0W: : 2 max r wait: 16, max w wait: 0
1R + 1W: : 5 max r wait: 47, max w wait: 16
1R + 4W: : 16 max r wait: 140, max w wait: 94
2R + 0W: : 2 max r wait: 15, max w wait: 0
2R + 1W: : 6 max r wait: 32, max w wait: 32
2R + 4W: : 16 max r wait: 125, max w wait: 109
4R + 0W: : 5 max r wait: 15, max w wait: 0
4R + 1W: : 7 max r wait: 47, max w wait: 32
4R + 4W: : 18 max r wait: 63, max w wait: 140
16R + 0W: : 16 max r wait: 16, max w wait: 0
16R + 1W: : 20 max r wait: 484, max w wait: 594
16R + 4W: : 30 max r wait: 703, max w wait: 750
64R + 0W: : 64 max r wait: 16, max w wait: 0
64R + 1W: : 72 max r wait: 406, max w wait: 1172
64R + 4W: : 77 max r wait: 563, max w wait: 875

PD CT algo
0R + 1W: : 3 max r wait: 0, max w wait: 0
0R + 4W: : 13 max r wait: 0, max w wait: 359
1R + 0W: : 2 max r wait: 0, max w wait: 0
1R + 1W: : 6 max r wait: 47, max w wait: 32
1R + 4W: : 15 max r wait: 9375, max w wait: 282
2R + 0W: : 2 max r wait: 0, max w wait: 0
2R + 1W: : 6 max r wait: 62, max w wait: 32
2R + 4W: : 16 max r wait: 11422, max w wait: 343
4R + 0W: : 4 max r wait: 16, max w wait: 0
4R + 1W: : 7 max r wait: 78, max w wait: 110
4R + 4W: : 18 max r wait: 12296, max w wait: 531
16R + 0W: : 16 max r wait: 16, max w wait: 0
16R + 1W: : 20 max r wait: 125, max w wait: 141
16R + 4W: : 29 max r wait: 12625, max w wait: 438
64R + 0W: : 66 max r wait: 16, max w wait: 0
64R + 1W: : 69 max r wait: 250, max w wait: 141
64R + 4W: : 78 max r wait: 2110, max w wait: 844

PD naive
0R + 1W: : 4 max r wait: 0, max w wait: 0
0R + 4W: : 13 max r wait: 0, max w wait: 438
1R + 0W: : 2 max r wait: 0, max w wait: 0
1R + 1W: : 5 max r wait: 47, max w wait: 16
1R + 4W: : 16 max r wait: 453, max w wait: 297
2R + 0W: : 2 max r wait: 15, max w wait: 0
2R + 1W: : 5 max r wait: 47, max w wait: 16
2R + 4W: : 16 max r wait: 328, max w wait: 313
4R + 0W: : 4 max r wait: 16, max w wait: 0
4R + 1W: : 8 max r wait: 63, max w wait: 16
4R + 4W: : 17 max r wait: 453, max w wait: 437
16R + 0W: : 16 max r wait: 16, max w wait: 0
16R + 1W: : 20 max r wait: 218, max w wait: 125
16R + 4W: : 29 max r wait: 953, max w wait: 734
64R + 0W: : 65 max r wait: 16, max w wait: 0
64R + 1W: : 68 max r wait: 343, max w wait: 343
64R + 4W: : 77 max r wait: 1515, max w wait: 1156

Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:wt6ok7ow.fsf@yahoo.com...
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
[...]
Hi,
I have run a test against Peter's "naive" implementation, Peter's implementation of Chris's algorithm, my algorithm from the thread_rewrite branch (v1.1.2.8), my old condition-based algorithm from the thread_rewrite branch (v1.1.2.4), and a new variation of my latest offering that has a "waiting exclusive" bit, and blocks readers if that bit is set, in line with my last post on the subject. The test is essentially Peter's test, with a mod to count the max time any reader or any writer spends waiting, in order to check for starvation.
The tests were run on a Pentium D dual core machine. The output format is
num readers + num writers: : time-in-seconds max-reader-wait-in-milliseconds max-writer-wait-in-milliseconds
The total count N in Peter's algorithm was 65536, in order to get runtimes I could handle.
The first thing to note is that my current V1.1.2.8 deadlocks, so I got the algorithm wrong :-(
;^(...
The next thing to note is that the total runtimes of the remainder are all comparable. I would have to run longer tests to really tell them apart.
The next thing to note is that Peter's implementation of Chris's algorithm has some REALLY long wait times for readers --- 12 seconds! --- which follows from the "writer priority" nature of the algorithm.
12 Seconds? Something is off kilter wrt the writer frequency per-writer-thread and/or the critical-section the writes make use of... IMHO, this would prompt me to revise my application's synchronization scheme... Again, IMHO of course, anytime you are getting enough writes to tank a "rw-mutex w/ strict writer priority", then you are probably better off using PDR instead of rw-mutexes.

Okay, FWIW... Here is some pseudo-code for an altered version of my algorithm that gives strict priority to the readers... This "newly augmented algorithm" should do better on Peter's test... Hopefully!!! lol. ;^)

// write lock
rw_mutex_t cmp, xchg;

for ( ;; )
{
  cmp = mtx;

  do
  {
    xchg = cmp;

    if ( ! cmp.reads && ! cmp.writes ) { ++xchg.writes; }
    else { ++xchg.w_waits; }
  }
  while ( dwcas( &mtx, &cmp, &xchg ) );

  if ( ! cmp.reads && ! cmp.writes ) { break; }

  sem_wait( &w_wset );
}

// read lock
rw_mutex_t cmp, xchg;

for ( ;; )
{
  cmp = mtx;

  do
  {
    xchg = cmp;

    if ( ! cmp.writes ) { ++xchg.reads; }
    else { ++xchg.r_waits; }
  }
  while ( dwcas( &mtx, &cmp, &xchg ) );

  if ( ! cmp.writes ) { break; }

  sem_wait( &r_wset );
}

// write unlock
rw_mutex_t cmp = mtx, xchg;

do
{
  xchg = cmp;
  --xchg.writes;

  if ( cmp.w_waits && ! cmp.r_waits ) { --xchg.w_waits; }
  else if ( cmp.r_waits ) { xchg.r_waits = 0; }
}
while ( dwcas( &mtx, &cmp, &xchg ) );

if ( cmp.w_waits && ! cmp.r_waits ) { sem_post( &w_wset ); }
else if ( cmp.r_waits ) { sem_post_multiple( &r_wset, cmp.r_waits ); }

// read unlock
rw_mutex_t cmp = mtx, xchg;

do
{
  xchg = cmp;
  --xchg.reads;

  if ( ! cmp.r_waits && cmp.w_waits ) { --xchg.w_waits; }
  else if ( cmp.r_waits ) { xchg.r_waits = 0; }
}
while ( dwcas( &mtx, &cmp, &xchg ) );

if ( ! cmp.r_waits && cmp.w_waits ) { sem_post( &w_wset ); }
else if ( cmp.r_waits ) { sem_post_multiple( &r_wset, cmp.r_waits ); }

Try that on for size, everybody! However, beware: this algorithm can potentially starve writers!! Enjoy... ;^)

Humm. My algorithm is turning out to be fairly flexible wrt tailoring it for different types of rw-priority schemes... Humm...

"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:wt6ok7ow.fsf@yahoo.com...
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2...
Chris Thomasson wrote:
[...]
Hi,
I have run a test against Peter's "naive" implementation, Peter's implementation of Chris's algorithm, my algorithm from the thread_rewrite branch (v1.1.2.8), my old condition-based algorithm from the thread_rewrite branch (v1.1.2.4), and a new variation of my latest offering that has a "waiting exclusive" bit, and blocks readers if that bit is set, in line with my last post on the subject. The test is essentially Peter's test, with a mod to count the max time any reader or any writer spends waiting, in order to check for starvation.
The next thing to note is that the total runtimes of the remainder are all comparable. I would have to run longer tests to really tell them apart.
I'm running some longer tests now, to see how it comes out.
The next thing to note is that Peter's implementation of Chris's algorithm has some REALLY long wait times for readers --- 12 seconds! --- which follows from the "writer priority" nature of the algorithm.
12 Seconds? Something is off kilter wrt the writer frequency per-writer-thread and/or the critical-section the writes make use of... IMHO, this would prompt me to revise my application's synchronization scheme...
It's just a consequence of writer priority when there's more than one writer. The particular trouble spots were when there were 4 writer threads, so there's pretty much always a writer waiting. I would revise the rwmutex to be "fairer" as a first port of call.
Okay, FWIW... Here is some pseudo-code for an altered version of my algorithm that gives strict priority to the readers...
I thought we all agreed that was a bad plan. That's why I'm trying to revise my algorithm to not do this. Incidentally, I've just been reviewing the read-write lock algorithm in Butenhof's "Programming with POSIX Threads", and he uses a reader-priority algorithm. He does say that it might not be suitable for all uses, however, and shows how to change it to writer-priority.
This "newly augmented algorithm" should do better on Peters test... Hopefully!!!
I'll add it to the list. The test I'm running does cover a wide variety of scenarios, with 0, 1, 2 or 4 writers, each matched against 0, 1, 2, 4, 16 or 64 readers, so it's not biased --- a "reader priority" algorithm should give high "write wait" times when there's lots of readers, just as a "writer priority" algorithm will give high "read wait" times when there's more than one writer. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

On 10/26/06, Anthony Williams <anthony_w.geo@yahoo.com> wrote:
"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:wt6ok7ow.fsf@yahoo.com...
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Chris Thomasson wrote:
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001801c6f7bd$c32f62a0$6607a8c0@pdimov2... > Chris Thomasson wrote:
[...]
Hi,
I have run a test against Peter's "naive" implementation, Peter's implementation of Chris's algorithm, my algorithm from the thread_rewrite branch (v1.1.2.8), my old condition-based algorithm from the thread_rewrite branch (v1.1.2.4), and a new variation of my latest offering that has a "waiting exclusive" bit, and blocks readers if that bit is set, in line with my last post on the subject. The test is essentially Peter's test, with a mod to count the max time any reader or any writer spends waiting, in order to check for starvation.
The next thing to note is that the total runtimes of the remainder are all comparable. I would have to run longer tests to really tell them apart.
I'm running some longer tests now, to see how it comes out.
The next thing to note is that Peter's implementation of Chris's algorithm has some REALLY long wait times for readers --- 12 seconds! --- which follows from the "writer priority" nature of the algorithm.
12 Seconds? Something is off kilter wrt the writer frequency per-writer-thread and/or the critical-section the writes make use of... IMHO, this would prompt me to revise my application's synchronization scheme...
It's just a consequence of writer priority when there's more than one writer. The particular trouble spots were when there were 4 writer threads, so there's pretty much always a writer waiting. I would revise the rwmutex to be "fairer" as a first port of call.
Okay, FWIW... Here is some pseudo-code for an altered version of my algorithm that gives strict priority to the readers...
I thought we all agreed that was a bad plan. That's why I'm trying to revise my algorithm to not do this.
Incidentally, I've just been reviewing the read-write lock algorithm in Butenhof's "Programming with POSIX Threads", and he uses a reader-priority algorithm. He does say that it might not be suitable for all uses, however, and shows how to change it to writer-priority.
Is there a reason for wanting a priority? Why not make the lock fair. I'd rather have locks take out a ticket so everything is processed in a FIFO order, with no starvation.
This "newly augmented algorithm" should do better on Peters test... Hopefully!!!
I'll add it to the list.
The test I'm running does cover a wide variety of scenarios, with 0, 1, 2 or 4 writers, each matched against 0, 1, 2, 4, 16 or 64 readers, so it's not biased --- a "reader priority" algorithm should give high "write wait" times when there's lots of readers, just as a "writer priority" algorithm will give high "read wait" times when there's more than one writer.
Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Cory Nelson http://www.int64.org

"Cory Nelson" <phrosty@gmail.com> writes:
Is there a reason for wanting a priority? Why not make the lock fair. I'd rather have locks take out a ticket so everything is processed in a FIFO order, with no starvation.
I think it would be better with no priority, too. I'm working on an algorithm for that. FIFO means that the OS has no say in the wakeup order; it'd be better if all the waiting threads could block on the same OS sync object, so the OS gets to choose which to wake. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
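For concreteness, the "take a ticket" idea reduces to the classic ticket lock below when stripped down to plain mutual exclusion: arrivals are served strictly in FIFO order, so nothing starves. Extending it to readers and writers (for example, letting a run of consecutive reader tickets enter together) is the harder part, and, as Anthony notes, it also hard-wires the wakeup order instead of leaving it to the OS. This class is an illustration, not proposed Boost code.

#include <atomic>
#include <thread>

class ticket_lock
{
    std::atomic<unsigned> next_{ 0 };     // next ticket to hand out
    std::atomic<unsigned> serving_{ 0 };  // ticket currently allowed in

public:
    void lock()
    {
        unsigned const my_ticket = next_.fetch_add( 1 );
        while( serving_.load() != my_ticket )
        {
            std::this_thread::yield();    // spin politely until it's our turn
        }
    }

    void unlock()
    {
        serving_.fetch_add( 1 );          // hand the lock to the next ticket in line
    }
};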

On Oct 26, 2006, at 6:04 AM, Anthony Williams wrote:
"Cory Nelson" <phrosty@gmail.com> writes:
Is there a reason for wanting a priority? Why not make the lock fair. I'd rather have locks take out a ticket so everything is processed in a FIFO order, with no starvation.
I think it would be better with no priority, too. I'm working on an algorithm for that.
FIFO means that the OS has no say in the wakeup order; it'd be better if all the waiting threads could block on the same OS sync object, so the OS gets to choose which to wake.
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.

http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html

If you don't like the "upgradable" part, it strips out cleanly from the code. Just delete everything that refers to "upgradable" and you've got a read/write mutex using Terekhov's algorithm (and rename the mutex).

Algorithm in a nutshell: There are two gates. Everyone (readers and writers) waits at gate 1 for entry. When a reader gets into gate 1, he has read ownership. When a writer gets into gate 1, he waits at gate 2 before getting write ownership. While a writer is inside of gate 1, no readers or writers are allowed to pass gate 1.

The above algorithm is shown implemented with just mutex and condition variables. The same algorithm is of course implementable using atomics for the "quick path" and I've done that on PPC (but I don't own that code).

-Howard
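Below is a stripped-down sketch of the two-gate scheme described above, using just a mutex and two condition variables and leaving out the upgradable state and the atomic fast path. It is an illustration of the description, not Howard's upgrade_mutex code; the names are invented.

#include <condition_variable>
#include <mutex>

class two_gate_rw_mutex
{
    std::mutex              m_;
    std::condition_variable gate1_;  // everyone queues here for entry
    std::condition_variable gate2_;  // a writer waits here for readers to drain
    unsigned                readers_        = 0;
    bool                    writer_entered_ = false;  // a writer is past gate 1

public:
    void lock_shared()
    {
        std::unique_lock<std::mutex> lk( m_ );
        gate1_.wait( lk, [&]{ return !writer_entered_; } );
        ++readers_;                       // past gate 1: shared ownership
    }

    void unlock_shared()
    {
        std::unique_lock<std::mutex> lk( m_ );
        --readers_;
        if( writer_entered_ && readers_ == 0 )
        {
            gate2_.notify_one();          // last reader lets the writer through gate 2
        }
    }

    void lock()
    {
        std::unique_lock<std::mutex> lk( m_ );
        gate1_.wait( lk, [&]{ return !writer_entered_; } );
        writer_entered_ = true;           // gate 1 is now closed to everyone
        gate2_.wait( lk, [&]{ return readers_ == 0; } );  // wait for readers to drain
    }

    void unlock()
    {
        std::unique_lock<std::mutex> lk( m_ );
        writer_entered_ = false;
        gate1_.notify_all();              // reopen gate 1; the OS picks who gets in next
    }
};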

Howard Hinnant wrote:
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
There are two interesting decisions in a read/write mutex.

The first is what happens when a writer is blocked in wrlock() waiting for readers and another reader tries to rdlock(). POSIX leaves this implementation defined in the general case; however, if priority scheduling is supported, the reader "shall" obtain a lock if and only if it's of a higher priority compared to the waiting writer (or all writers, if there's more than one). If we assume equal priority for simplicity, this means that rdlock "shall" block. This is detectable by a conformance test, by the way, since tryrdlock fails if and only if rdlock blocks.

This is the "gate 1" part, and all of the algorithms I've been experimenting with also have this property. If the 'writer pending' bit is set, no readers are allowed to obtain a lock. However if another, higher-priority writer arrives after the pending bit is already set, I still grant write access to the first writer, which is probably a violation of POSIX [TPS]. :-)

The other interesting decision is what happens on unlock (two subparts, last reader unlocks, or writer unlocks.) Your implementation seems to use the "thundering herd" approach of waking all. This is fair, but may be wasteful if you wake up several writers, or a writer and several readers and the writer manages to pass through gate 1 (everyone else must block again). (This redundant wakeup may not be visible in a benchmark, but it can steal CPU from other busy non-contending threads in a real application.)

(I see a significant amount of CAS failures even when waking up only readers; again, this doesn't affect the timing of the specific benchmark, and I'm not sure that it's actually a problem.)

Another approach is to wake up one writer in rdunlock and all readers in wrunlock (alternating between readers and writers). I also have a third option in mind that I haven't tested yet.

"Peter Dimov" <pdimov@mmltd.net> writes:
Howard Hinnant wrote:
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
There are two interesting decisions in a read/write mutex. The first is what happens when a writer is blocked in wrlock() waiting for readers and another reader tries to rdlock(). POSIX leaves this implementation defined in the general case; however, if priority scheduling is supported, the reader "shall" obtain a lock if and only if it's of a higher priority compared to the waiting writer (or all writers, if there's more than one).
If we assume equal priority for simplicity, this means that rdlock "shall" block. This is detectable by a conformance test, by the way, since tryrdlock fails if and only if rdlock blocks.
Agreed.
This is the "gate 1" part, and all of the algorithms I've been experimenting with also have this property. If the 'writer pending' bit is set, no readers are allowed to obtain a lock. However if another, higher-priority writer arrives after the pending bit is already set, I still grant write access to the first writer, which is probably a violation of POSIX [TPS]. :-)
I'm experimenting with having the pending readers and pending writers blocking on the same sync object; then the OS gets to choose which to wake, and it can order by priority. Readers still block if there's pending writers, regardless of priority, though.
The other interesting decision is what happens on unlock (two subparts, last reader unlocks, or writer unlocks.) Your implementation seems to use the "thundering herd" approach of waking all. This is fair, but may be wasteful if you wake up several writers, or a writer and several readers and the writer manages to pass through gate 1 (everyone else must block again).
Another approach is to wake up one writer in rdunlock and all readers in wrunlock (alternating between readers and writers). I also have a third option in mind that I haven't tested yet.
I'm experimenting with ways of waking only one writer, but potentially all readers. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

On Oct 26, 2006, at 12:39 PM, Peter Dimov wrote:
Howard Hinnant wrote:
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
There are two interesting decisions in a read/write mutex. The first is what happens when a writer is blocked in wrlock() waiting for readers and another reader tries to rdlock(). POSIX leaves this implementation defined in the general case; however, if priority scheduling is supported, the reader "shall" obtain a lock if and only if it's of a higher priority compared to the waiting writer (or all writers, if there's more than one).
If we assume equal priority for simplicity, this means that rdlock "shall" block. This is detectable by a conformance test, by the way, since tryrdlock fails if and only if rdlock blocks.
This is the "gate 1" part, and all of the algorithms I've been experimenting with also have this property. If the 'writer pending' bit is set, no readers are allowed to obtain a lock. However if another, higher-priority writer arrives after the pending bit is already set, I still grant write access to the first writer, which is probably a violation of POSIX [TPS]. :-)
The other interesting decision is what happens on unlock (two subparts, last reader unlocks, or writer unlocks.) Your implementation seems to use the "thundering herd" approach of waking all. This is fair, but may be wasteful if you wake up several writers, or a writer and several readers and the writer manages to pass through gate 1 (everyone else must block again). (This redundant wakeup may not be visible in a benchmark, but it can steal CPU from other busy non-contending threads in a real application.)
(I see a significant amount of CAS failures even when waking up only readers; again, this doesn't affect the timing of the specific benchmark, and I'm not sure that it's actually a problem.)
Another approach is to wake up one writer in rdunlock and all readers in wrunlock (alternating between readers and writers). I also have a third option in mind that I haven't tested yet.
Very nice summary and astute observations. I must admit to really liking the "thundering herd" approach for the interesting unlock cases you cite. This is where you're really turning things over to the OS and saying "you decide". And I just flat haven't experimented with priority and consider myself currently unqualified to comment in that area. Which I guess is another reason for me to prefer to let the OS decide such matters (it presumably knows and picks threads with higher priority to wake).

Partial rationale:

Last reader unlocks: In this case, if there is a writer waiting (at gate 2), then (imho) there's no decision but to give the writer the go-ahead, and this is assuming equal priority as you stated. If there is not a writer waiting at gate 2, and this is the last reader, then my implementation does absolutely nothing. A single reader inside of gate 1 blocks no one else from passing through gate 1. There is no thundering herd for last reader unlock. (see unlock_sharable()).

Writer unlocks: By definition no one is waiting at gate 2. There may be waiters, both read and write (and upgradable if you're using that) at gate 1. If you choose at this point, a reader or a writer, you are, by definition, giving one of the two priority, at least for this next round. One might have the write unlock alternate its choice, but I'm not seeing much performance gain here. You either wake up 1 writer, or you wake up many readers (on the reader path it might actually be significantly more expensive to make num_reader calls to notify_one() than just do a notify_all()).

Let's look at a typical (anticipated) use case under high contention (we're not even discussing or worried about low contention use cases for this discussion): Few (nw) writers blocked, many (nr) readers blocked; writer unlocks, and issues notify_all(). Chances are good that one of the readers will get past gate 1 before one of the writers gets past gate 1, just because there are so many of them compared to the number of writers (strength in numbers).

Chance first thread past gate 1 is a reader: nr/(nr+nw)

The above formula rapidly approaches 100% as nw/nr approaches 0.

nw/nr   Chances of reader being first
0       100%  // no writers
.1       91%
.2       83%
.3       77%
.4       71%
.5       67%
...
1.0      50%  // same number of writers as readers

Indeed it is likely that several, or even many, readers will get past gate 1 before one of the few writers is chosen to try. When one of the few writers is finally chosen, it too will get past gate 1 and block at gate 2. At this point you've got many readers inside of gate 1 getting work done. You have only a few writers (nw-1) that woke up for nothing and now have to go back to sleep. And (on average) you'll have relatively few readers that woke up for nothing and must sleep again (just how many, statistically speaking, drops with the nw/nr ratio -- probability formula left as an exercise to the reader :-)).

Of course there will be cases where one of the few writers makes it past gate 1 at or near the beginning of the thundering herd, in which case many threads will be woken up for nothing. However, statistically speaking, and assuming all equal probability for any thread getting woken up, this scenario should happen infrequently, and thus not be a performance problem.

If you have an atypical scenario: many writers, few readers, then I agree that the "thundering herd" approach may exhibit performance problems, as it is then likely that a writer is first, and every other thread is woken up for nothing. And in this scenario, I suggest that a read/write mutex is probably not a good design choice for the application.

There's a big gray area in between where nr approximately equals nw. It would be interesting to see a better analysis of that scenario, and discussion of the applicability of read/write mutexes there.

Another thing I like about this approach is that it is relatively simple to add the upgradable concept to it. And there is no extra cost for supporting upgradable. One gets it practically for free.

-Howard

(This redundant wakeup may not be visible in a benchmark, but it can steal CPU from other busy non-contending threads in a real application.)
To test this, I added "free" threads to the benchmark. These compete on a different mutex. With my latest experiment, this reveals odd/intriguing patterns of the form

2R+4W+0F: 29
2R+4W+4F: 18

The additional four free threads improve performance!

Test code:

// Copyright (c) 2005, 2006 Peter Dimov
//
// Distributed under the Boost Software License, Version 1.0.
// See accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)

#undef NDEBUG

#include "rw_mutex_.hpp"

#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/bind.hpp>
#include <vector>
#include <iostream>
#include <ctime>

int const N = 1048576;
int const n = 1000; // vector size

rw_mutex mtx;
std::vector<int> v( n );

boost::mutex mx2;
std::vector<int> v2( n );

boost::mutex cmx;

//

void reader( int r )
{
    for( int i = 0; i < N; ++i )
    {
        rw_mutex::scoped_read_lock lock( mtx );

        int m = v.front();

        for( std::vector<int>::const_iterator j = v.begin(), end = v.end(); j != end; ++j )
        {
            assert( *j == m );
        }

        if( i == N / 4 || i == N / 2 || i == N / 2 + N / 4 )
        {
            boost::mutex::scoped_lock lock( cmx );
            std::cout << " r" << r;
        }
    }

    boost::mutex::scoped_lock lock( cmx );
    std::cout << " R" << r;
}

void writer( int w )
{
    for( int i = 0; i < N; ++i )
    {
        rw_mutex::scoped_write_lock lock( mtx );

        int m = v.front();

        for( std::vector<int>::iterator j = v.begin(), end = v.end(); j != end; ++j )
        {
            ++*j;
            assert( *j == m + 1 );
        }

        if( i == N / 4 || i == N / 2 || i == N / 2 + N / 4 )
        {
            boost::mutex::scoped_lock lock( cmx );
            std::cout << " w" << w;
        }
    }

    boost::mutex::scoped_lock lock( cmx );
    std::cout << " W" << w;
}

void free_( int f )
{
    for( int i = 0; i < N; ++i )
    {
        boost::mutex::scoped_lock lock( mx2 );

        int m = v2.front();

        for( std::vector<int>::iterator j = v2.begin(), end = v2.end(); j != end; ++j )
        {
            ++*j;
            assert( *j == m + 1 );
        }

        if( i == N / 4 || i == N / 2 || i == N / 2 + N / 4 )
        {
            boost::mutex::scoped_lock lock( cmx );
            std::cout << " f" << f;
        }
    }

    boost::mutex::scoped_lock lock( cmx );
    std::cout << " F" << f;
}

void test( int nr, int nw, int nf )
{
    try
    {
        std::cout << nr << "R + " << nw << "W + " << nf << "F:";

        std::time_t tm = std::time( 0 );

        boost::thread_group group;

        for( int i = 0; i < nr || i < nw || i < nf; ++i )
        {
            if( i < nw ) { group.create_thread( boost::bind( writer, i ) ); }
            if( i < nr ) { group.create_thread( boost::bind( reader, i ) ); }
            if( i < nf ) { group.create_thread( boost::bind( free_, i ) ); }
        }

        group.join_all();

        std::cout << ": " << std::time( 0 ) - tm << std::endl;

        //for( int i = 0; i < 6; ++i )
        //{
        //    std::cout << mtx.retries( i ) << ' ';
        //}

        std::cout << std::endl;
    }
    catch( std::exception const & x )
    {
        std::cout << x.what() << std::endl;
    }
}

int main()
{
    test( 0, 1, 0 ); test( 0, 1, 1 );
    test( 0, 4, 0 ); test( 0, 4, 4 );
    test( 1, 0, 0 ); test( 1, 0, 1 );
    test( 1, 1, 0 ); test( 1, 1, 1 );
    test( 1, 4, 0 ); test( 1, 4, 4 );
    test( 2, 0, 0 ); test( 2, 0, 2 );
    test( 2, 1, 0 ); test( 2, 1, 1 );
    test( 2, 2, 0 ); test( 2, 2, 2 );
    test( 2, 4, 0 ); test( 2, 4, 4 );
    test( 4, 0, 0 ); test( 4, 0, 4 );
    test( 4, 1, 0 ); test( 4, 1, 1 ); test( 4, 1, 4 );
    test( 4, 4, 0 ); test( 4, 4, 4 );
    test( 16, 0, 0 ); test( 16, 0, 4 );
    test( 16, 1, 0 ); test( 16, 1, 1 ); test( 16, 1, 4 );
    test( 16, 4, 0 ); test( 16, 4, 4 );

    //test( 64, 0 );
    //test( 64, 1 );
    //test( 64, 4 );
}

Peter Dimov wrote:
(This redundant wakeup may not be visible in a benchmark, but it can steal CPU from other busy non-contending threads in a real application.)
To test this, I added "free" threads to the benchmark. These compete on a different mutex. With my latest experiment, this reveals odd/intriguing patterns of the form
2R+4W+0F: 29
2R+4W+4F: 18
The additional four free threads improve performance!
That was because I had an inefficiency in the algorithm, of course. Still, the free threads do seem to help the scheduler sometimes.

I finally managed to match the "naive" implementation with a "fancy" semaphore-based scheme. The basic algorithm is that each unlock unblocks one waiting thread; the twist is that if the thread that wakes up is a reader, it then unblocks all waiters. When a writer wakes up, it doesn't unblock anyone, since it is on its way to exclusive access anyway. I haven't reviewed the code for subtle correctness issues, though; no time for that now. :-/

Here's the code:

#ifndef BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED
#define BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED

// MS compatible compilers support #pragma once

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

// Copyright (c) 2005, 2006 Peter Dimov
//
// Distributed under the Boost Software License, Version 1.0.
// See accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)

#include <algorithm>
#include <cassert>
#include <limits.h>
#include <windows.h>

inline bool atomic_compare_exchange( long * destination, long * comparand, long exchange )
{
    long ov = *comparand;
    long nv = InterlockedCompareExchange( destination, exchange, ov );

    if( ov == nv )
    {
        return true;
    }
    else
    {
        *comparand = nv;
        return false;
    }
}

inline bool atomic_compare_exchange_void( void * destination, void * comparand, void const * exchange )
{
    return atomic_compare_exchange( (long*)destination, (long*)comparand, *(long const*)exchange );
}

class rw_mutex
{
private:

    rw_mutex( rw_mutex const & );
    rw_mutex & operator=( rw_mutex const & );

private:

    struct state
    {
        unsigned writer: 1;
        unsigned readers: 31;
    };

    state state_;

    HANDLE sema_wp_; // writer pending
    HANDLE sema_rw_; // readers+writers blocked

    long retries_[ 6 ];
    long blocked_;

public:

    rw_mutex(): blocked_( 0 )
    {
        state st = { 0 };
        state_ = st;

        sema_wp_ = CreateSemaphore( 0, 0, LONG_MAX, 0 );
        sema_rw_ = CreateSemaphore( 0, 0, LONG_MAX, 0 );

        std::fill( retries_, retries_ + 6, 0 );
    }

    ~rw_mutex()
    {
        CloseHandle( sema_wp_ );
        CloseHandle( sema_rw_ );
    }

    void unblock_one()
    {
        if( (volatile const long&)blocked_ > 0 )
        {
            long k = InterlockedExchangeAdd( &blocked_, -1 );

            if( k > 0 )
            {
                ReleaseSemaphore( sema_rw_, 1, 0 );
            }
            else
            {
                // undo
                InterlockedIncrement( &blocked_ );
            }
        }
    }

    void unblock_all()
    {
        if( (volatile const long&)blocked_ > 0 )
        {
            long k = InterlockedExchange( &blocked_, 0 );

            if( k > 0 )
            {
                ReleaseSemaphore( sema_rw_, 1, 0 );
            }
        }
    }

    void rdlock()
    {
        state st = state_;

        for( ;; )
        {
            state xchg = st;

            if( st.writer )
            {
                InterlockedIncrement( &blocked_ );
                WaitForSingleObject( sema_rw_, INFINITE );
                unblock_all();
                st = state_;
            }
            else
            {
                xchg.readers = st.readers + 1;

                if( atomic_compare_exchange_void( &state_, &st, &xchg ) )
                {
                    return; // got read lock
                }
                else
                {
                    InterlockedIncrement( &retries_[ 0 ] );
                }
            }
        }
    }

    void lock()
    {
        state st = state_;

        for( ;; )
        {
            state xchg = st;

            if( st.writer )
            {
                InterlockedIncrement( &blocked_ );
                WaitForSingleObject( sema_rw_, INFINITE );
                st = state_;
            }
            else if( st.readers )
            {
                xchg.writer = 1;

                if( atomic_compare_exchange_void( &state_, &st, &xchg ) )
                {
                    WaitForSingleObject( sema_wp_, INFINITE );
                    return;
                }
                else
                {
                    InterlockedIncrement( &retries_[ 2 ] );
                }
            }
            else // free
            {
                xchg.writer = 1;

                if( atomic_compare_exchange_void( &state_, &st, &xchg ) )
                {
                    return;
                }
                else
                {
                    InterlockedIncrement( &retries_[ 3 ] );
                }
            }
        }
    }

    void rdunlock()
    {
        state st = state_;

        for( ;; )
        {
            state xchg = st;

            assert( st.readers > 0 );
            --xchg.readers;

            if( atomic_compare_exchange_void( &state_, &st, &xchg ) )
            {
                break;
            }
            else
            {
                InterlockedIncrement( &retries_[ 4 ] );
            }
        }

        if( st.writer && st.readers == 1 ) // we were the last reader and a writer is waiting
        {
            ReleaseSemaphore( sema_wp_, 1, 0 );
        }
        else
        {
            unblock_one();
        }
    }

    void unlock()
    {
        state st = state_;

        for( ;; )
        {
            state xchg = st;

            assert( st.readers == 0 );
            xchg.writer = 0;
            //xchg.blocked = 0;

            if( atomic_compare_exchange_void( &state_, &st, &xchg ) )
            {
                break;
            }
            else
            {
                InterlockedIncrement( &retries_[ 5 ] );
            }
        }

        unblock_one();
    }

public:

    long retries( int ix )
    {
        return InterlockedExchange( &retries_[ ix ], 0 );
    }

public: // lock classes

    class scoped_read_lock
    {
    private:

        rw_mutex & mx_;

        scoped_read_lock( scoped_read_lock const & );
        scoped_read_lock & operator=( scoped_read_lock const & );

    public:

        scoped_read_lock( rw_mutex & mx ): mx_( mx ) { mx_.rdlock(); }
        ~scoped_read_lock() { mx_.rdunlock(); }
    };

    class scoped_write_lock
    {
    private:

        rw_mutex & mx_;

        scoped_write_lock( scoped_write_lock const & );
        scoped_write_lock & operator=( scoped_write_lock const & );

    public:

        scoped_write_lock( rw_mutex & mx ): mx_( mx ) { mx_.lock(); }
        ~scoped_write_lock() { mx_.unlock(); }
    };
};

#endif // #ifndef BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED

Peter Dimov wrote:
void unblock_all()
{
    if( (volatile const long&)blocked_ > 0 )
    {
        long k = InterlockedExchange( &blocked_, 0 );

        if( k > 0 )
        {
            ReleaseSemaphore( sema_rw_, 1, 0 );

This obviously should be ReleaseSemaphore( sema_rw_, k, 0 ); I'm not sure how this managed to pass the test at all.

        }
    }
}
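With that fix applied, the whole function reads (same member names as the code above):

void unblock_all()
{
    if( (volatile const long&)blocked_ > 0 )
    {
        long k = InterlockedExchange( &blocked_, 0 );

        if( k > 0 )
        {
            // release one semaphore count per blocked thread, so every waiter
            // recorded in blocked_ gets woken, not just the first one
            ReleaseSemaphore( sema_rw_, k, 0 );
        }
    }
}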

"Peter Dimov" <pdimov@mmltd.net> writes:
I finally managed to match the "naive" implementation with a "fancy" semaphore-based scheme. The basic algorithm is that each unlock unblocks one waiting thread; the twist is that if the thread that wakes up is a reader, it then unblocks all waiters. When a writer wakes up, it doesn't unblock anyone, since it is on its way to exclusive access anyway. I haven't reviewed the code for subtle correctness issues, though; no time for that now. :-/
Sorry Peter, but this deadlocks in my tests. I have a new algorithm, posted below, that wakes all readers and one writer when either the last reader or the only writer releases the lock. I can't seem to get consistency in my tests, even when nothing else is running. However, it seems like my algorithm has a lower max wait time, but higher average wait time compared to yours/Chris' earlier algorithms. It performs better than my earlier attempts on all counts. I'll see if I can cut down that average. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk #ifndef BOOST_THREAD_WIN32_READ_WRITE_MUTEX_HPP #define BOOST_THREAD_WIN32_READ_WRITE_MUTEX_HPP #include <boost/assert.hpp> #include <boost/detail/interlocked.hpp> #include <boost/thread/win32/thread_primitives.hpp> #include <boost/thread/win32/interlocked_read.hpp> #include <boost/static_assert.hpp> #include <limits.h> namespace boost { class read_write_mutex { private: struct state_data { unsigned shared_count:10; unsigned shared_waiting:10; unsigned exclusive:1; unsigned exclusive_waiting:10; unsigned exclusive_waiting_blocked:1; friend bool operator==(state_data const& lhs,state_data const& rhs) { return *reinterpret_cast<unsigned const*>(&lhs)==*reinterpret_cast<unsigned const*>(&rhs); } }; template<typename T> T interlocked_compare_exchange(T* target,T new_value,T comparand) { BOOST_STATIC_ASSERT(sizeof(T)==sizeof(long)); long const res=BOOST_INTERLOCKED_COMPARE_EXCHANGE(reinterpret_cast<long*>(target), *reinterpret_cast<long*>(&new_value), *reinterpret_cast<long*>(&comparand)); return *reinterpret_cast<T const*>(&res); } state_data state; void* unlock_event; void* exclusive_event; void release_waiters(state_data old_state) { if(old_state.exclusive_waiting) { bool const success=::boost::detail::ReleaseSemaphore(exclusive_event,1,NULL)!=0; BOOST_ASSERT(success); } if(old_state.shared_waiting || old_state.exclusive_waiting) { bool const success=::boost::detail::ReleaseSemaphore(unlock_event,old_state.shared_waiting + old_state.exclusive_waiting,NULL)!=0; BOOST_ASSERT(success); } } public: read_write_mutex(): unlock_event(::boost::detail::CreateSemaphoreA(NULL,0,LONG_MAX,NULL)), exclusive_event(::boost::detail::CreateSemaphoreA(NULL,0,LONG_MAX,NULL)) { state_data state_={0}; state=state_; } ~read_write_mutex() { ::boost::detail::CloseHandle(unlock_event); ::boost::detail::CloseHandle(exclusive_event); } void lock_shareable() { while(true) { state_data old_state=state; do { state_data new_state=old_state; if(new_state.exclusive || new_state.exclusive_waiting_blocked) { ++new_state.shared_waiting; } else { ++new_state.shared_count; } state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { break; } old_state=current_state; } while(true); if(!(old_state.exclusive| old_state.exclusive_waiting_blocked)) { return; } unsigned long const res=::boost::detail::WaitForSingleObject(unlock_event,BOOST_INFINITE); BOOST_ASSERT(res==0); } } void unlock_shareable() { state_data old_state=state; do { state_data new_state=old_state; bool const last_reader=!--new_state.shared_count; if(last_reader) { if(new_state.exclusive_waiting) { --new_state.exclusive_waiting; new_state.exclusive_waiting_blocked=false; } new_state.shared_waiting=0; } state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { if(last_reader) { release_waiters(old_state); } break; } old_state=current_state; } 
while(true); } void lock() { while(true) { state_data old_state=state; do { state_data new_state=old_state; if(new_state.shared_count || new_state.exclusive) { ++new_state.exclusive_waiting; new_state.exclusive_waiting_blocked=true; } else { new_state.exclusive=true; } state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { break; } old_state=current_state; } while(true); if(!old_state.shared_count && !old_state.exclusive) { break; } void* const handles[2]={exclusive_event,unlock_event}; bool const success2=::boost::detail::WaitForMultipleObjects(2,handles,true,BOOST_INFINITE)<2; BOOST_ASSERT(success2); } } void unlock() { state_data old_state=state; do { state_data new_state=old_state; new_state.exclusive=false; if(new_state.exclusive_waiting) { --new_state.exclusive_waiting; new_state.exclusive_waiting_blocked=false; } new_state.shared_waiting=0; state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { break; } old_state=current_state; } while(true); release_waiters(old_state); } class scoped_read_lock { read_write_mutex& m; public: scoped_read_lock(read_write_mutex& m_): m(m_) { m.lock_shareable(); } ~scoped_read_lock() { m.unlock_shareable(); } }; class scoped_write_lock { read_write_mutex& m; bool locked; public: scoped_write_lock(read_write_mutex& m_): m(m_),locked(false) { lock(); } void lock() { m.lock(); locked=true; } void unlock() { m.unlock(); locked=false; } ~scoped_write_lock() { if(locked) { unlock(); } } }; }; } #endif

Anthony Williams <anthony_w.geo@yahoo.com> writes:
I have a new algorithm, posted below, that wakes all readers and one writer when either the last reader or the only writer releases the lock.
I can't seem to get consistency in my tests, even when nothing else is running. However, it seems like my algorithm has a lower max wait time, but higher average wait time compared to yours/Chris' earlier algorithms. It performs better than my earlier attempts on all counts.
I'll see if I can cut down that average.
I was releasing the shared_semaphore too many times: (waiting_reader+waiting_writers) instead of (waiting_reader+1), which meant that there were extra spurious wake-ups waiting for readers if there were any writers waiting. Now the total time is comparable to the best of the rest, and the max wait is lower. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk #ifndef BOOST_THREAD_WIN32_READ_WRITE_MUTEX_HPP #define BOOST_THREAD_WIN32_READ_WRITE_MUTEX_HPP #include <boost/assert.hpp> #include <boost/detail/interlocked.hpp> #include <boost/thread/win32/thread_primitives.hpp> #include <boost/thread/win32/interlocked_read.hpp> #include <boost/static_assert.hpp> #include <limits.h> namespace boost { class read_write_mutex { private: struct state_data { unsigned shared_count:10; unsigned shared_waiting:10; unsigned exclusive:1; unsigned exclusive_waiting:10; unsigned exclusive_waiting_blocked:1; friend bool operator==(state_data const& lhs,state_data const& rhs) { return *reinterpret_cast<unsigned const*>(&lhs)==*reinterpret_cast<unsigned const*>(&rhs); } }; template<typename T> T interlocked_compare_exchange(T* target,T new_value,T comparand) { BOOST_STATIC_ASSERT(sizeof(T)==sizeof(long)); long const res=BOOST_INTERLOCKED_COMPARE_EXCHANGE(reinterpret_cast<long*>(target), *reinterpret_cast<long*>(&new_value), *reinterpret_cast<long*>(&comparand)); return *reinterpret_cast<T const*>(&res); } state_data state; void* semaphores[2]; void* &unlock_sem; void* &exclusive_sem; void release_waiters(state_data old_state) { if(old_state.exclusive_waiting) { bool const success=::boost::detail::ReleaseSemaphore(exclusive_sem,1,NULL)!=0; BOOST_ASSERT(success); } if(old_state.shared_waiting || old_state.exclusive_waiting) { bool const success=::boost::detail::ReleaseSemaphore(unlock_sem,old_state.shared_waiting + (old_state.exclusive_waiting?1:0),NULL)!=0; BOOST_ASSERT(success); } } public: read_write_mutex(): unlock_sem(semaphores[0]), exclusive_sem(semaphores[1]) { unlock_sem=::boost::detail::CreateSemaphoreA(NULL,0,LONG_MAX,NULL); exclusive_sem=::boost::detail::CreateSemaphoreA(NULL,0,LONG_MAX,NULL); state_data state_={0}; state=state_; } ~read_write_mutex() { ::boost::detail::CloseHandle(unlock_sem); ::boost::detail::CloseHandle(exclusive_sem); } void lock_shareable() { while(true) { state_data old_state=state; do { state_data new_state=old_state; if(new_state.exclusive || new_state.exclusive_waiting_blocked) { ++new_state.shared_waiting; } else { ++new_state.shared_count; } state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { break; } old_state=current_state; } while(true); if(!(old_state.exclusive| old_state.exclusive_waiting_blocked)) { return; } unsigned long const res=::boost::detail::WaitForSingleObject(unlock_sem,BOOST_INFINITE); BOOST_ASSERT(res==0); } } void unlock_shareable() { state_data old_state=state; do { state_data new_state=old_state; bool const last_reader=!--new_state.shared_count; if(last_reader) { if(new_state.exclusive_waiting) { --new_state.exclusive_waiting; new_state.exclusive_waiting_blocked=false; } new_state.shared_waiting=0; } state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { if(last_reader) { release_waiters(old_state); } break; } old_state=current_state; } while(true); } void lock() { while(true) { state_data old_state=state; do { state_data new_state=old_state; 
if(new_state.shared_count || new_state.exclusive) { ++new_state.exclusive_waiting; new_state.exclusive_waiting_blocked=true; } else { new_state.exclusive=true; } state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { break; } old_state=current_state; } while(true); if(!old_state.shared_count && !old_state.exclusive) { break; } bool const success2=::boost::detail::WaitForMultipleObjects(2,semaphores,true,BOOST_INFINITE)<2; BOOST_ASSERT(success2); } } void unlock() { state_data old_state=state; do { state_data new_state=old_state; new_state.exclusive=false; if(new_state.exclusive_waiting) { --new_state.exclusive_waiting; new_state.exclusive_waiting_blocked=false; } new_state.shared_waiting=0; state_data const current_state=interlocked_compare_exchange(&state,new_state,old_state); if(current_state==old_state) { break; } old_state=current_state; } while(true); release_waiters(old_state); } class scoped_read_lock { read_write_mutex& m; public: scoped_read_lock(read_write_mutex& m_): m(m_) { m.lock_shareable(); } ~scoped_read_lock() { m.unlock_shareable(); } }; class scoped_write_lock { read_write_mutex& m; bool locked; public: scoped_write_lock(read_write_mutex& m_): m(m_),locked(false) { lock(); } void lock() { m.lock(); locked=true; } void unlock() { m.unlock(); locked=false; } ~scoped_write_lock() { if(locked) { unlock(); } } }; }; } #endif

Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
If you don't like the "upgradable" part, it strips out cleanly from the code. Just delete everything that refers to "upgradable" and you've got a read/write mutex using Terekhov's algorithm (and rename the mutex).
Just two comments for boosters:

-> The Boost.Interprocess upgradable_mutex is the same as Howard's algorithm, because I wrote it based on Howard's thorough explanations. So if you want to use this algorithm, use Howard's code, since it's surely better tested than my implementation. I plan to use Howard's algorithm in my next version.

-> Now that there is interest in implementing these mutex types, and the committee is considering adding synchronization primitives to the standard or TR2, I think we should implement synchronization algorithms in Boost.Thread that can be reused by other libraries, like Boost.Interprocess. I think that Howard's algorithm can be templatized so that everyone can plug their own mutexes/condition variables into it and doesn't have to reinvent the wheel. Even if we add atomic operations to that algorithm, it will still be valid for shared memory synchronization primitives.

My 2 cents,

Ion

Howard Hinnant wrote:
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
I'm having trouble eliminating the notify_all from the unlock() "fast path". Under Windows, it is ridiculously expensive; all other algorithms run circles around it, even those that had errors in them. :-) Did you have an unconditional notify_all in your "real" unlock?

On Oct 26, 2006, at 9:20 PM, Peter Dimov wrote:
Howard Hinnant wrote:
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
I'm having trouble eliminating the notify_all from the unlock() "fast path". Under Windows, it is ridiculously expensive; all other algorithms run circles around it, even those that had errors in them. :-) Did you have an unconditional notify_all in your "real" unlock?
No. I had extra 1-bit flags that let me know if there was someone waiting on gates 1 or 2:

static const unsigned write_entered_ = 1U << (sizeof(unsigned) *_STD::__char<>::bits-1);
static const unsigned upgradable_entered_ = write_entered_ >> 1;
static const unsigned entry_sleeping_ = upgradable_entered_ >> 1;
static const unsigned fw_sleeping_ = entry_sleeping_ >> 1;
static const unsigned max_readers_ = ~(write_entered_ | upgradable_entered_ | entry_sleeping_ | fw_sleeping_);

So the fast path would check these bits and avoid the notify_all if they weren't set.

Btw, cool suggestion on the twist in the rdlock. I got half way through implementing it, discovered a fatal flaw, got half way through writing up why it's a fatal flaw and convinced myself it wasn't a fatal flaw. :-) Still tinkering... Looks promising! :-)

-Howard
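To make that concrete, here is a rough sketch of the kind of check the unlock fast path can do. The flag names are the ones above; old_state, mut_ and gate1_ are hypothetical placeholders, not Howard's actual members:

// fast path of unlock(): only pay for notify_all if one of the
// "someone is sleeping" bits was set when ownership was released
if( ( old_state & ( entry_sleeping_ | fw_sleeping_ ) ) == 0 )
    return;                      // nobody is parked on gate 1 or gate 2

{
    boost::mutex::scoped_lock lock( mut_ );   // hypothetical mutex guarding the gates
}
gate1_.notify_all();                           // hypothetical condition variable for gate 1

Here old_state stands for the value of the state word as it was read (or atomically cleared) during the unlock itself.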

Howard Hinnant wrote:
On Oct 26, 2006, at 9:20 PM, Peter Dimov wrote:
Howard Hinnant wrote:
Here is my implementation of Terekhov's *fair* read/write algorithm, with upgradable thrown in. This algorithm is just as you say: The OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
I'm having trouble eliminating the notify_all from the unlock() "fast path". Under Windows, it is ridiculously expensive; all other algorithms run circles around it, even those that had errors in them. :-) Did you have an unconditional notify_all in your "real" unlock?
No. I had extra 1-bit flags that let me know if there was someone waiting on gates 1 or 2:
static const unsigned write_entered_ = 1U << (sizeof(unsigned) *_STD::__char<>::bits-1); static const unsigned upgradable_entered_ = write_entered_ >> 1; static const unsigned entry_sleeping_ = upgradable_entered_ >> 1; static const unsigned fw_sleeping_ = entry_sleeping_ >> 1; static const unsigned max_readers_ = ~(write_entered_ | upgradable_entered_ | entry_sleeping_ | fw_sleeping_);
So the fast path would check these bits and avoid the notify_all if they weren't set.
Thanks, I'll try this tomorrow. Later today, I mean. BTW... why do you count the upgradable among the readers? It seems a bit simpler not to include it in the reader count; it already has a bit all for itself.

On Oct 26, 2006, at 9:52 PM, Peter Dimov wrote:
Thanks, I'll try this tomorrow. Later today, I mean.
I'm an old man. I can't deal with your hours. :-)
BTW... why do you count the upgradable among the readers? It seems a bit simpler not to include it in the reader count; it already has a bit all for itself.
I found it simpler to include the upgradable in the reader count. It has been a year or two since I made that decision so, you know, times change, sometimes we get smarter (sometimes not). At the time: an upgradable lock is a reader. It doesn't become a writer until someone asks it to, at which time it decrements the reader count and waits at gate 2. YMMV. -Howard
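In rough code, the upgrade path described above might look like this. The flag constants are the ones from the earlier post; mut_, gate2_ and the function name are hypothetical placeholders, and the real implementation is at the URL above:

void unlock_upgradable_and_lock()   // hypothetical name for the upgrade operation
{
    boost::mutex::scoped_lock lock( mut_ );

    // stop being counted as a reader...
    state_ -= 1;

    // ...and become the (single) pending writer instead
    state_ &= ~upgradable_entered_;
    state_ |= write_entered_;

    // wait at gate 2 until the remaining readers have drained
    while( state_ & max_readers_ )
        gate2_.wait( lock );
}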

Peter Dimov wrote:
Howard Hinnant wrote:
So the fast path would check these bits and avoid the notify_all if they weren't set.
Thanks, I'll try this tomorrow. Later today, I mean.
Works splendidly on uniprocessor, but displays the same odd behavior I've been seeing with the other algorithms on my Pentium D (with XP64): 0R+4W+0F: 46 0R+0W+4F: 14 0R+4W+4F: 10 (!) The Windows scheduler doesn't like us. Or I've made a mistake somewhere. Even if I didn't, it seems to me that the bit doesn't help in the 4W case since there's always a writer waiting, so we hit the notify_all. #ifndef BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED #define BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED // MS compatible compilers support #pragma once #if defined(_MSC_VER) && (_MSC_VER >= 1020) # pragma once #endif // Copyright (c) 2005, 2006 Peter Dimov // // Distributed under the Boost Software License, Version 1.0. // See accompanying file LICENSE_1_0.txt or copy at // http://www.boost.org/LICENSE_1_0.txt) #define RW_MUTEX_NAME "Terekhov/Hinnant, lock-free" #include <algorithm> #include <cassert> #include <boost/thread/mutex.hpp> #include <boost/thread/condition.hpp> #include <boost/detail/interlocked.hpp> template<class T> inline T atomic_load_none( T const * pt ) { return *static_cast<T const volatile *>( pt ); } template<class T> inline void atomic_store_rel( T * pt, T v ) { // x86-specific implementation *static_cast<T volatile *>( pt ) = v; } inline bool atomic_compare_exchange( long * destination, long * comparand, long exchange ) { long ov = *comparand; long nv = BOOST_INTERLOCKED_COMPARE_EXCHANGE( destination, exchange, ov ); if( ov == nv ) { return true; } else { *comparand = nv; return false; } } class rw_mutex { private: rw_mutex( rw_mutex const & ); rw_mutex & operator=( rw_mutex const & ); private: boost::mutex mx_; boost::condition cn_r_; boost::condition cn_w_; enum { state_wp_mask = 0x80000000ul, // writer pending/active state_cw_mask = 0x40000000ul, // someone waiting on cn_w_ state_rd_mask = 0x3FFFFFFFul, // reader count }; long state_; public: rw_mutex(): state_( 0 ) { } ~rw_mutex() { } void rdlock() { long st = atomic_load_none( &state_ ); for( ;; ) { if( st & state_wp_mask ) { boost::mutex::scoped_lock lock( mx_ ); if( atomic_compare_exchange( &state_, &st, st | state_cw_mask ) ) { cn_w_.wait( lock ); st = state_; } } else { if( atomic_compare_exchange( &state_, &st, st + 1 ) ) return; // got read lock } } } void lock() { long st = atomic_load_none( &state_ ); for( ;; ) { if( st & state_wp_mask ) { boost::mutex::scoped_lock lock( mx_ ); if( atomic_compare_exchange( &state_, &st, st | state_cw_mask ) ) { cn_w_.wait( lock ); st = state_; } } else { if( atomic_compare_exchange( &state_, &st, st | state_wp_mask ) ) { if( st & state_rd_mask ) { boost::mutex::scoped_lock lock( mx_ ); while( state_ & state_rd_mask ) { cn_r_.wait( lock ); } } return; } } } } void rdunlock() { long st = BOOST_INTERLOCKED_EXCHANGE_ADD( &state_, -1 ); if( ( st & ~state_cw_mask ) == ( state_wp_mask | 1 ) ) // we were the last reader and a writer is waiting { { boost::mutex::scoped_lock lock( mx_ ); } cn_r_.notify_one(); } } void unlock() { long st = BOOST_INTERLOCKED_EXCHANGE( &state_, 0 ); if( st & state_cw_mask ) { { boost::mutex::scoped_lock lock( mx_ ); } cn_w_.notify_all(); } } public: //long retries( int ix ) //{ // return InterlockedExchange( &retries_[ ix ], 0 ); //} public: // lock classes class scoped_read_lock { private: rw_mutex & mx_; scoped_read_lock( scoped_read_lock const & ); scoped_read_lock & operator=( scoped_read_lock const & ); public: scoped_read_lock( rw_mutex & mx ): mx_( mx ) { mx_.rdlock(); } ~scoped_read_lock() { mx_.rdunlock(); } }; class scoped_write_lock { private: rw_mutex & mx_; 
scoped_write_lock( scoped_write_lock const & ); scoped_write_lock & operator=( scoped_write_lock const & ); public: scoped_write_lock( rw_mutex & mx ): mx_( mx ) { mx_.lock(); } ~scoped_write_lock() { mx_.unlock(); } }; }; #endif // #ifndef BOOST_DETAIL_RW_MUTEX_HPP_INCLUDED

"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Howard Hinnant wrote:
So the fast path would check these bits and avoid the notify_all if they weren't set.
Thanks, I'll try this tomorrow. Later today, I mean.
Works splendidly on uniprocessor, but displays the same odd behavior I've been seeing with the other algorithms on my Pentium D (with XP64):
0R+4W+0F: 46 0R+0W+4F: 14 0R+4W+4F: 10 (!)
The Windows scheduler doesn't like us. Or I've made a mistake somewhere. Even if I didn't, it seems to me that the bit doesn't help in the 4W case since there's always a writer waiting, so we hit the notify_all.
Which mutex/condition implementation are you using? RC-1.34/HEAD/1.33.1 or thread_rewrite? You might get different timings with each, as the underlying mechanisms are quite different. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Howard Hinnant wrote:
So the fast path would check these bits and avoid the notify_all if they weren't set.
Thanks, I'll try this tomorrow. Later today, I mean.
Works splendidly on uniprocessor, but displays the same odd behavior I've been seeing with the other algorithms on my Pentium D (with XP64):
0R+4W+0F: 46 0R+0W+4F: 14 0R+4W+4F: 10 (!)
The Windows scheduler doesn't like us. Or I've made a mistake somewhere. Even if I didn't, it seems to me that the bit doesn't help in the 4W case since there's always a writer waiting, so we hit the notify_all.
Which mutex/condition implementation are you using? RC-1.34/HEAD/1.33.1 or thread_rewrite?
HEAD, but the phenomenon above is not synchronization primitive-specific. I'm seeing it with semaphore-based implementations as well.

Peter Dimov wrote:
Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Howard Hinnant wrote:
So the fast path would check these bits and avoid the notify_all if they weren't set.
Thanks, I'll try this tomorrow. Later today, I mean.
Works splendidly on uniprocessor, but displays the same odd behavior I've been seeing with the other algorithms on my Pentium D (with XP64):
0R+4W+0F: 46 0R+0W+4F: 14 0R+4W+4F: 10 (!)
The Windows scheduler doesn't like us. Or I've made a mistake somewhere. Even if I didn't, it seems to me that the bit doesn't help in the 4W case since there's always a writer waiting, so we hit the notify_all.
Which mutex/condition implementation are you using? RC-1.34/HEAD/1.33.1 or thread_rewrite?
HEAD, but the phenomenon above is not synchronization primitive-specific. I'm seeing it with semaphore-based implementations as well.
Your last algorithm also suffers from it, not as much, though:

0R+4W+0F: 26
0R+0W+4F: 13
0R+4W+4F: 11

My "naive" algorithm (slightly enhanced to use an event instead of mutex+cv for waiting) gave

0R+4W+0F: 14
0R+0W+4F: 14
0R+4W+4F: 11

which I explain by it being "honest" to the kernel and using a mutex for the writers instead of building its own. :-)

This is probably very much OS/kernel/CPU/workload specific, though. Your algorithm definitely holds its own in all realistic scenarios.
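For reference, this is not the exact code behind the numbers above, just a minimal sketch of the general shape of such a "naive" scheme (Win32 calls only; all names are made up): writers serialize on a real mutex that they hold for the whole write, readers count themselves in and out, and the last reader signals an event the writer waits on.

#include <windows.h>

class naive_rw_mutex
{
    CRITICAL_SECTION writer_mx_;   // the "honest" mutex: held for the whole write, briefly by entering readers
    CRITICAL_SECTION count_mx_;    // protects readers_ and the event transitions
    HANDLE no_readers_;            // manual-reset event, signaled while readers_ == 0
    long readers_;

public:
    naive_rw_mutex(): readers_( 0 )
    {
        InitializeCriticalSection( &writer_mx_ );
        InitializeCriticalSection( &count_mx_ );
        no_readers_ = CreateEvent( 0, TRUE, TRUE, 0 );   // manual reset, initially signaled
    }

    ~naive_rw_mutex()
    {
        CloseHandle( no_readers_ );
        DeleteCriticalSection( &count_mx_ );
        DeleteCriticalSection( &writer_mx_ );
    }

    void rdlock()
    {
        EnterCriticalSection( &writer_mx_ );   // a pending/active writer blocks fresh readers here

        EnterCriticalSection( &count_mx_ );
        if( ++readers_ == 1 ) ResetEvent( no_readers_ );
        LeaveCriticalSection( &count_mx_ );

        LeaveCriticalSection( &writer_mx_ );
    }

    void rdunlock()
    {
        EnterCriticalSection( &count_mx_ );
        if( --readers_ == 0 ) SetEvent( no_readers_ );
        LeaveCriticalSection( &count_mx_ );
    }

    void lock()
    {
        EnterCriticalSection( &writer_mx_ );            // serializes writers and gates new readers
        WaitForSingleObject( no_readers_, INFINITE );   // wait for the current readers to drain
    }

    void unlock()
    {
        LeaveCriticalSection( &writer_mx_ );
    }
};

The point of the sketch is only to show why this variant is "honest" to the kernel: the writer-vs-writer and writer-vs-new-reader arbitration is done entirely by the OS mutex rather than by hand-rolled wait/wake logic.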

"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Howard Hinnant wrote:
So the fast path would check these bits and avoid the notify_all if they weren't set.
Thanks, I'll try this tomorrow. Later today, I mean.
Works splendidly on uniprocessor, but displays the same odd behavior I've been seeing with the other algorithms on my Pentium D (with XP64):
0R+4W+0F: 46 0R+0W+4F: 14 0R+4W+4F: 10 (!)
The Windows scheduler doesn't like us. Or I've made a mistake somewhere. Even if I didn't, it seems to me that the bit doesn't help in the 4W case since there's always a writer waiting, so we hit the notify_all.
Which mutex/condition implementation are you using? RC-1.34/HEAD/1.33.1 or thread_rewrite?
HEAD, but the phenomenon above is not synchronization primitive-specific. I'm seeing it with semaphore-based implementations as well.
Your last algorithm also suffers from it, not as much, though:
0R+4W+0F: 26 0R+0W+4F: 13 0R+4W+4F: 11
My "naive" algorithm (slightly enhanced to use an event instead of mutex+cv for waiting) gave
0R+4W+0F: 14 0R+0W+4F: 14 0R+4W+4F: 11
which I explain by it being "honest" to the kernel and using a mutex for the writers instead of building its own. :-)
This is probably very much OS/kernel/CPU/workload specific, though.
I guess it's caused by active contention. If there are other threads doing stuff, then the threads competing on the lock don't hit it so hard, and don't have to retry the CAS so many times. I expect they also hit the fast path more often, too.
Your algorithm definitely holds its own in all realistic scenarios.
Thanks. I'll commit it to thread_rewrite, as it's definitely better than what's there at the moment (since that version deadlocks). If you have any more thoughts on an improvement/new algorithm, I'd be glad to see it. Your input is much appreciated. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:00d101c6f9d7$8a44b000$6607a8c0@pdimov2...
Peter Dimov wrote:
Anthony Williams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Peter Dimov wrote:
Howard Hinnant wrote:
[...]
Your last algorithm also suffers from it, not as much, though:
[...]
This is probably very much OS/kernel/CPU/workload specific, though. Your algorithm definitely holds its own in all realistic scenarios.
I just remembered that Joe Seigh posted a couple of 'ad-hoc' rw-mutex implementations over on c.p.t, back when I was SenderX... Joe was using them as something to compare his atomic_ptr proxy collector against... Here is the link:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b5...

It might be worth taking a look, and possibly including them in your test application(s)...

On Oct 24, 2006, at 6:42 PM, Peter Dimov wrote:
Various. Here's the test:
So this is a great test. But the one thing it doesn't test is starvation. And I'm not sure how to quantify that in an automatic test either. I'd love to see this test augmented along the lines of:

void clearscr()
{
    std::cerr << "\033[2J";
}

void gotoxy(int row, int col)
{
    std::ostringstream out;
    out << "\033[" << row << ';' << col << 'H';
    std::cerr << out.str();
}

void reader( int r )
{
    for( int i = 0; i < N; ++i )
    {
        if ((i+1) % (n/100) == 0)
        {
            boost::mutex::scoped_lock lock( cmx );
            gotoxy(r, 0);
            std::cerr << "read thread " << r << " at " << (float)(i+1)/n * 100 << '%';
            gotoxy(30,0);
        }

        rw_mutex::scoped_read_lock lock( mtx );

        int m = v.front();

        for( std::vector<int>::const_iterator i = v.begin(), end = v.end(); i != end; ++i )
        {
            assert( *i == m );
        }
    }

    boost::mutex::scoped_lock lock( cmx );
    std::cout << " R" << r;
}

I don't know how to quantify it, but there is nothing quite like the warm fuzzy of seeing this on your screen:

writ thread 1 at 62%
writ thread 2 at 63%
writ thread 3 at 59%
writ thread 4 at 58%
read thread 5 at 62%
read thread 6 at 55%
read thread 7 at 64%
read thread 8 at 75%

I.e. No thread is getting starved. All are successfully competing for their fair share. I'd love to see this type of measure quantified. If all writer threads run before all reader threads, or vice-versa, I don't care how fast it runs, it's not good.

-Howard

Howard Hinnant wrote:
I don't know how to quantify it, but there is nothing quite like the warm fuzzy of seeing this on your screen:
writ thread 1 at 62% writ thread 2 at 63% writ thread 3 at 59% writ thread 4 at 58% read thread 5 at 62% read thread 6 at 55% read thread 7 at 64% read thread 8 at 75%
You are right in principle, but the above isn't a good result for a multi-core machine. The read threads should have an advantage proportional to the number of cores, since they can run in parallel. I'll try to embellish my test with something that measures starvation.
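One simple way to quantify it might be to have each thread count the loop iterations it completes within a fixed time window and then report the min/max spread; a ratio near 1 means no thread is being starved. A sketch (the per-thread counters are assumed to exist; they are not part of the current test):

#include <algorithm>
#include <iostream>
#include <vector>

// counts[i] = iterations completed by thread i within the measurement window
void report_fairness( std::vector<long> const & counts )
{
    long mn = *std::min_element( counts.begin(), counts.end() );
    long mx = *std::max_element( counts.begin(), counts.end() );

    // 1.0 means perfectly even progress; near 0 means at least one thread is starved
    double fairness = mx > 0 ? double( mn ) / double( mx ) : 0.0;

    std::cout << "min/max iterations: " << mn << " / " << mx
              << ", fairness ratio: " << fairness << std::endl;
}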

On Oct 24, 2006, at 9:29 PM, Peter Dimov wrote:
Howard Hinnant wrote:
I don't know how to quantify it, but there is nothing quite like the warm fuzzy of seeing this on your screen:
writ thread 1 at 62% writ thread 2 at 63% writ thread 3 at 59% writ thread 4 at 58% read thread 5 at 62% read thread 6 at 55% read thread 7 at 64% read thread 8 at 75%
You are right in principle but the above isn't a good result for a multi-core. The read threads should have an advantage proportional to the number of cores since they can run in parallel. I'll try to embelish my test with something that measures starvation.
<nod> Indeed the above snapshot was taken from a single-processor machine. My memory tells me I've had a similar experience with a dual-core machine; maybe there was the advantage you speak of for readers, but I'm unable to verify that at the moment. Many thanks for your efforts in this area. -Howard

"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:zmbl66cr.fsf@yahoo.com...
"Chris Thomasson" <cristom@comcast.net> writes:
"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:d58h7nlv.fsf@yahoo.com...
[...]
I was imagining a system with several hundred readers, and a couple of writers. If the writers take sufficiently long,
IMHO, this would be a flaw in the application logic. You're not supposed to hold locks for very long at all... If you're making expensive calls in the context of a critical section, then you're asking for poor performance anyway...

Anthony Williams wrote:
I'm interested in the following scenario:
- K active readers are holding the lock - a writer is waiting for the above - L readers are wating for the writer
Since you seem to only store K+L (Chris's algorithm has separate counts), how do you know when to signal the writer?
Maybe it's not /quite/ the same algorithm, but it's certainly very similar.
The answer to your question is "when all K+L readers have finished". i.e., my algorithm gives readers priority.
Maybe this isn't such a good thing; in the "common" scenario of one writer thread and many readers, a waiting writer needs to block waiting readers, otherwise the writer might starve if the readers overlap each other.
Not good. :-)
Chris's algorithm goes to the other extreme --- writers always take priority, and no readers will take the lock if there are any waiting writers. If there is more than one writer thread, then the readers might starve.
Not good either. :-)
The way I see it, the ideal is that a waiting writer will block any fresh readers until all the current readers have gone away, but once that happens, and the lock is released, the waiting writer and the waiting readers will compete equally.
I agree that this (or a variation of it) is the best policy. I've tried to implement it in my prototype. Unfortunately, it has a different problem; the L readers (and potentially M writers) that are blocked by the pending/active writer are then serialized by the mutex mx_. Ideally, there would be an M+L (or M+1, or M+kL, where 0<k<1) tie after the writer unlocks, and if a reader wins it, all readers should proceed. There's also an implementation by Ion Gaztañaga: http://boost.cvs.sourceforge.net/boost/boost/boost/interprocess/sync/interprocess_upgradable_mutex.hpp?revision=1.4&view=markup in which he implements yet another Terekhov algorithm (AFAIK). Not lock-free enough, though.

On Oct 24, 2006, at 12:15 PM, Peter Dimov wrote:
There's also an implementation by Ion Gaztañaga:
http://boost.cvs.sourceforge.net/boost/boost/boost/interprocess/ sync/interprocess_upgradable_mutex.hpp?revision=1.4&view=markup
in which he implements yet another Terekhov algorithm (AFAIK).
Imho Terekhov's algorithm is the way to go regarding reader/writer priority. I've beat on this algorithm fairly heavily in my own implementation of it, and it really performs nicely. Take a look at this test case originally inspired by Peter: http://home.twcny.rr.com/hinnant/cpp_extensions/rw_perf.html You can ignore the move-aspects of it, and even the upgradable aspect of it if you want. But try to get the I/O part of it working on your system (gotoxy). Then crank up the number of readers/writers and play with the protected section cost (vec_len). Then observe the progress of the different kinds of threads. In my experiments no thread is ever shut out, no matter how high the contention gets. This is a very desirable property. -Howard

"Anthony Williams" <anthony_w.geo@yahoo.com> wrote in message news:y7r53mxs.fsf@yahoo.com...
"Peter Dimov" <pdimov@mmltd.net> writes:
Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work. Don't you only need one bit for writes though?
It's essentially the same algorithm I've used in my latest offering on the thread_rewrite branch.
I use 3 bits for the actual state (1 bit for currently shared, 1 for currently exclusive, and 1 for shared with upgradable), a 14 bit count of waiting writers, and a 15 bit count of waiting/running readers, so it all fits in a 32-bit word, and I can use plain CAS rather than DWCAS. I've also extended the algorithm to cover upgradeable locks.
I only make use of DWCAS on 32-bit systems; CAS is 100% compatible with my algorithm on 64-bit systems... BTW, is yours starvation-free?

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:006001c6f767$1f350b60$6607a8c0@pdimov2...
Chris Thomasson wrote:
Here is the initial experimental pseudo-code that I am thinking about prototyping in IA-32 Assembly:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5f...
Seems like it can work.
It sure does.
Don't you only need one bit for writes though?
Yup. That would free up a bunch of extra space... I guess I could use it to hold more readers.
I have one at
http://pdimov.com/cpp/rw_mutex.cpp
but your version is better. I see no need for assembly however.
I kind of like prototyping lock-free algorithms in assembly language because it helps to address stuff like this:

http://groups.google.com/group/comp.programming.threads/msg/a8d7067bc1425ae1
http://groups.google.com/group/comp.programming.threads/msg/0afc1109d18c2991
http://groups.google.com/group/comp.programming.threads/msg/fd3b4b5a5cd7841e

I get a little paranoid about that sometimes... ;^)

I think I will go ahead and fully implement this and post it here shortly... Thanks for your comments.

Chris Thomasson wrote:
... I understand that Boost has a problem with it's current rw-mutex?
While I am very glad that a lot of valuable discussion is taking place to provide a working rw-mutex, I am missing an answer to an urgent question: what shall be done about 1.34? Whichever rw-mutex we adopt, it likely will not happen before 1.35. Since I did not have the time to study the proposed algorithms, I am currently only in the role of the cleaning crew, getting Boost.Thread ready for 1.34. Since I did not find a summary and final outcome of the discussion, I'll restate my question (as asked by the release manager): shall we drop the rw-mutex (including its docs) from 1.34? This was already done in 1.33, to avoid exposing users to a known broken implementation. As an option we could:

1) leave the broken rw-mutex in 1.34, but put a big fat warning on it
2) provide a new algorithm, also with a warning

1) doesn't sound very attractive, and 2) will possibly delay the 1.34 release date even further. Please give me your feedback.

Roland (aka. speedsnail)
participants (7)
- Anthony Williams
- Chris Thomasson
- Cory Nelson
- Howard Hinnant
- Ion Gaztañaga
- Peter Dimov
- Roland Schwarz