Re: [boost] A fast-pathed rw-mutex algorithm for Boost...

26 Oct 2006

      On Oct 26, 2006, at 12:39 PM, Peter Dimov wrote:
...
Howard Hinnant wrote:
...
Here is my implementation of Terekhov's *fair* read/write algorithm,
with upgradable thrown in.  This algorithm is just as you say:  The
OS alone decides who gets to run next.
http://home.twcny.rr.com/hinnant/cpp_extensions/upgrade_mutex.html
There are two interesting decisions in a read/write mutex. The  
first is what
happens when a writer is blocked in wrlock() waiting for readers  
and another
reader tries to rdlock(). POSIX leaves this implementation defined  
in the
general case; however, if priority scheduling is supported, the reader
"shall" obtain a lock if and only if it's of a higher priority  
compared to
the waiting writer (or all writers, if there's more than one).
If we assume equal priority for simplicity, this means that rdlock  
"shall"
block. This is detectable by a conformance test, by the way, since  
tryrdlock
fails if and only if rdlock blocks.
This is the "gate 1" part, and all of the algorithms I've been  
experimenting
with also have this property. If the 'writer pending' bit is set,  
no readers
are allowed to obtain a lock. However if another, higher-priority  
writer
arrives after the pending bit is already set, I still grant write  
access to
the first writer, which is probably a violation of POSIX [TPS]. :-)
The other interesting decision is what happens on unlock (two  
subparts, last
reader unlocks, or writer unlocks.) Your implementation seems to  
use the
"thundering herd" approach of waking all. This is fair, but may be  
wasteful
if you wake up several writers, or a writer and several readers and  
the
writer manages to pass through gate 1 (everyone else must block  
again).
(This redundant wakeup may not be visible in a benchmark, but it  
can steal
CPU from other busy non-contending threads in a real application.)
(I see a significant amount of CAS failures even when waking up only
readers; again, this doesn't affect the timing of the specific  
benchmark,
and I'm not sure that it's actually a problem.)
Another approach is to wake up one writer in rdunlock and all  
readers in
wrunlock (alternating between readers and writers). I also have a  
third
option in mind that I haven't tested yet.
Very nice summary and astute observations.

I must admit to really liking the "thundering herd" approach for the  
interesting unlock cases you site.  This is where you're really  
turning things over to the OS and saying "you decide".  And I just  
flat haven't experimented with priority and consider myself currently  
unqualified to comment in that area.  Which I guess is another reason  
for me to prefer to let the OS decide such matters (it presumably  
knows and picks threads with higher priority to wake).

Partial rationale:

Last reader unlocks:

In this case if there is a writer waiting (at gate 2), then (imho)  
there's no decision but to give the writer the ago, and this is  
assuming equal priority as you stated.

If there is not a writer waiting at gate 2, and this is the last  
reader, then my implementation does absolutely nothing.  A single  
reader inside of gate 1 blocks no one else from passing through gate  
1.  There is no thundering herd for last reader unlock. (see  
unlock_sharable()).

Writer unlocks:

By definition no one is waiting at gate 2.  There may be waiters,  
both read and write (and upgradable if you're using that) at gate 1.   
If you choose at this point, a reader or a writer, you are, by  
definition, giving one of the two priority, at least for this next  
round.  One might have the write unlock alternate its choice, but I'm  
not seeing much performance gain here.  You either wake up 1 writer,  
or you wake up many readers (on the reader path it might actually be  
significantly more expensive to make num_reader calls to notify_one()  
than just do a notify_all()).

Let's look at a typical (anticipated) use case under high contention  
(we're not even discussing or worried about low contention use cases  
for this discussion):  Few (nw) writers blocked, many (nr) readers  
blocked; writer unlocks, and issues notify_all().

Chances are good that one of the readers will get past gate 1 before  
one of the writers gets past gate 1, just because there are so many  
of them compared to the number of writers (strength in numbers).

Chance first thread past gate 1 is a reader:  nr/(nr+nw)

The above formula rapidly approaches 100% as nw/nr approaches 0.

nw/nr   Chances of reader being first
0             100%   // no writers
.1             91%
.2             83%
.3             77%
.4             71%
.5             67%
...
1.0            50%  // same number of writers as readers

Indeed it is likely that several, or even many readers will get past  
gate 1 before one of the few writers is chosen to try.  When one of  
the few writers is finally chosen, it too will get past gate 1 and  
block at gate 2.  At this point you've got many readers inside of  
gate 1 getting work done.  You have only a few writers (nw-1) that  
woke up for nothing and now have to go back to sleep.  And (on  
average) you'll have relatively few readers that woke up for nothing  
and must sleep again (just how many, statistically speaking, drops  
with the nw/nr ratio -- probability formula left as an exercise to  
the reader :-)).

Of course there will be cases where one of the few writers makes it  
past gate 1 at or near the beginning of the thundering herd, in which  
case many threads will be woken up for nothing.  However,  
statistically speaking, and assuming all equal probability for any  
thread getting woken up, this scenario should happen infrequently,  
and thus not be a performance problem.

If you have an atypical scenario:  Many writers, few readers, then I  
agree that the "thundering herd" approach may exhibit performance  
problems, as it is then likely that a writer is first, and every  
other thread is woken up for nothing.  And in this scenario, I  
suggest that a read/write mutex is probably not a good design choice  
for the application.

There's a big gray area inbetween where nr approximately equals nw.   
It would be interesting to see a better analysis of that scenario,  
and discussion of the applicability of read/write mutexes there.

Another thing I like about this approach is that it is relatively  
simple to add the upgradable concept to it.  And there is no extra  
cost for supporting upgradable.  One gets it practically for free.

-Howard