
Aaron W. LaFramboise wrote:
I am not sure what the status of these is. Are other components supposed to be using these?
I came upon them when I was checking out the smart pointers implementation. Grepping the source seems to indicate 'lightweight_mutex.hpp' is only used by the smart pointers (only included by 'quick_allocator.hpp' and 'shared_count.hpp') under multithreaded conditions.
In addition to what you mentioned, they're broken in other ways:
If a low-priority thread is preempted by a high-priority thread immediately before __atomic_add, and the high-priority thread is attempting to grab the spinlock, deadlock is achieved, despite the sched_yield().
I missed that one. You're right though, and not only that, but the sched_yield manpage states (I'm assuming this applies to threads -- threads in Linux are actually processes -- right?): "Note: If the current process is the only process in the highest priority list at that time, this process will continue to run after a call to sched_yield."

The implementer seemed to be aware there could be priority problems, though. The relevant comments in 'lightweight_mutex.hpp' are:

    // * Used by the smart pointer library
    // * Performance oriented
    // * Header-only implementation
    // * Small memory footprint
    // * Not a general purpose mutex, use boost::mutex, CRITICAL_SECTION or
    //   pthread_mutex instead.
    // * Never spin in a tight lock/do-something/unlock loop, since
    //   lightweight_mutex does not guarantee fairness.
    // * Never keep a lightweight_mutex locked for long periods.
    //
    // The current implementation can use a pthread_mutex, a CRITICAL_SECTION,
    // or a platform-specific spinlock.
    //
    // You can force a particular implementation by defining
    // BOOST_LWM_USE_PTHREADS,
    // BOOST_LWM_USE_CRITICAL_SECTION, or
    // BOOST_LWM_USE_SPINLOCK.
    //
    // If neither macro has been defined, the default is to use a spinlock on
    // Win32, and a pthread_mutex otherwise.
    //
    // Note that a spinlock is not a general synchronization primitive. In
    // particular, it is not guaranteed to be a memory barrier, and it is
    // possible to "livelock" if a lower-priority thread has acquired the
    // spinlock but a higher-priority thread is spinning trying to acquire the
    // same lock.
    //
    // For these reasons, spinlocks have been disabled by default except on
    // Windows, where a spinlock can be several orders of magnitude faster than a
    // CRITICAL_SECTION.

The current Win32 implementation doesn't have the counter problems, and claims to be able to yield to lower-priority threads, so it should be okay. The relevant lock routine is (m_.l_ is initialized to 0):

    while( InterlockedExchange(&m_.l_, 1) )
    {
        // Note: changed to Sleep(1) from Sleep(0).
        // According to MSDN, Sleep(0) will never yield
        // to a lower-priority thread, whereas Sleep(1)
        // will. Performance seems not to be affected.
        Sleep(1);
    }

The unlock routine is:

    InterlockedExchange(&m_.l_, 0);

-T

Tyson Whitehead wrote:
The current Win32 implementation doesn't have the counter problems, and claims to be able to yield to lower-priority threads, so it should be okay. The relevant lock routine is (m_.l_ is initialized to 0):
    while( InterlockedExchange(&m_.l_, 1) )
    {
        // Note: changed to Sleep(1) from Sleep(0).
        // According to MSDN, Sleep(0) will never yield
        // to a lower-priority thread, whereas Sleep(1)
        // will. Performance seems not to be affected.
        Sleep(1);
    }
For what it's worth, while it avoids the deadlocking problems, this code is also wrong. Synchronization code should never sleep. The code above will actually sleep for the minimum amount of time Windows allows, which is usually 10ms. I measured the performance of this spinlock a while back when improvements to GCC's Windows mutexes were being discussed on the MinGW lists, and in some cases where there is substantial contention over short-lived locks, the performance degradation is quite unacceptable. Certain naïve benchmarks might not notice, though, because during the period the waiters are sleeping, the lock will be accessible.

I think the code above could lead to severe slowdowns in some usage cases, and could lead to difficult-to-diagnose intermittent 'blips' of bad performance in others.

In any case, my point is that spinlocks are always going to be wrong in these sorts of situations unless coupled with some other primitive. In fact, spinning at all is usually the wrong thing to do. On single-CPU systems, it is always a waste of precious time. On multi-CPU systems, it can cause very undesirable memory bus slowdowns.

The best general locking strategy I have seen -- which is used by Linux futexes and by the new locks in GCC for Windows -- is one where a normal scheduler locking primitive is wrapped by a quick userspace atomic operation, to avoid locking in the noncontended case. The downside is that, on Windows for example, this will require 8 bytes of storage. I'm not sure how shared_ptr uses mutexes, but this might not be acceptable.

These spinlocks need to be fixed. Is anyone working on this?

Aaron W. LaFramboise
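For illustration, here is a minimal sketch of the kind of two-stage lock described above, in the style of a Win32 "benaphore": an interlocked fast path in user space, backed by a kernel semaphore that is only touched under contention. The class name and layout are illustrative, not anything from Boost, and error handling is omitted.

    #include <windows.h>
    #include <limits.h>

    // Sketch only: one word for the atomic fast path, one handle for the
    // kernel object used when the lock is contended.
    class two_stage_mutex
    {
    public:
        two_stage_mutex(): contenders_(0)
        {
            sem_ = CreateSemaphore(NULL, 0, LONG_MAX, NULL); // initially empty
        }

        ~two_stage_mutex()
        {
            CloseHandle(sem_);
        }

        void lock()
        {
            // Uncontended case: a single interlocked increment, no kernel call.
            if (InterlockedIncrement(&contenders_) > 1)
                WaitForSingleObject(sem_, INFINITE); // contended: block in the scheduler
        }

        void unlock()
        {
            // If another thread has queued up, wake exactly one of them.
            if (InterlockedDecrement(&contenders_) > 0)
                ReleaseSemaphore(sem_, 1, NULL);
        }

    private:
        volatile LONG contenders_; // atomic fast-path word
        HANDLE        sem_;        // kernel object, used only under contention
    };

A waiter never sleeps for a fixed interval; it blocks in the scheduler and is woken as soon as the lock is released, which avoids both the Sleep(1) latency and the wasted spinning discussed above.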

Aaron,

Before taking your word for it, I would like to hear more about how your benchmarks relate to smart pointers. It sounds reasonable that the kind of lock described will perform badly in high-contention situations. But what evidence do you have that there is any probability of high contention, especially when, as you say, the locks are short-lived?

These locks aren't taken every time the smart pointer is accessed; they're only taken when a copy is made. So to get the high contention you speak of, you would have to have two threads repeatedly making copies of the "same" smart pointer. From where I stand, this seems sufficiently rare that the extra performance gain in the common cases is worth it.

/Mattias

Quoting "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com>:
Tyson Whitehead wrote:
    while( InterlockedExchange(&m_.l_, 1) )
    {
        // Note: changed to Sleep(1) from Sleep(0).
        // According to MSDN, Sleep(0) will never yield
        // to a lower-priority thread, whereas Sleep(1)
        // will. Performance seems not to be affected.
        Sleep(1);
    }
For what it's worth, while it avoids the deadlocking problems, this code is also wrong.
Synchronization code should never sleep. The code above will actually sleep for the minimum amount of time Windows allows, which is usually 10ms. I measured the performance of this spinlock a while back when improvements to GCC's Windows mutexes were being discussed on the MinGW lists, and in some cases where there is substantial contention over short-lived locks, the performance degradation is quite unacceptable. Certain naïve benchmarks might not notice, though, because during the period the waiters are sleeping, the lock will be accessible.
I think the code above could lead to severe slowdowns in some usage cases, and could lead to difficult-to-diagnose intermittent 'blips' of bad performance in others.
In any case, my point is that spinlocks are always going to be wrong in these sorts of situations unless coupled with some other primitive. In fact, spinning at all is usually the wrong thing to do. On single-CPU systems, it is always a waste of precious time. On multi-CPU systems, it can cause very undesirable memory bus slowdowns.
These spinlocks need to be fixed. Is anyone working on this?

flodin@cs.umu.se wrote:
It sounds reasonable that the kind of lock described will perform badly in high-contention situations. But what evidence do you have that there is any probability of high contention, especially when, as you say, the locks are short-lived? These locks aren't taken every time the smart pointer is accessed; they're only taken when a copy is made. So to get the high contention you speak of, you would have to have two threads repeatedly making copies of the "same" smart pointer. From where I stand, this seems sufficiently rare that the extra performance gain in the common cases is worth it.
[these comments are specific to the Win32 spinlock discussed]

I have two specific concerns.

1) Personally, I value predictable performance. I would prefer slight performance degradation (Note, however, that I am not examining the actual numbers.) to the case where I may have an occasional very long wait. (Specific example: If a graphically intensive program is aiming for 30 frames per second, that is only about 30ms per frame. It would be very unfortunate if, even occasionally, a single smart pointer copy accounted for 1/3 of the time it took to render a frame.)

2) In "thundering herd" situations, contention is unavoidable, and performance may be paramount. The possibility of a single thread halting many other threads for 10ms on a machine with many processors is disturbing. (Specific example: In a case where a thread is producing work for many consumer threads on a multiprocessor machine, and obtaining the work involves copying a smart pointer, it is possible that some number of threads will be forced to sleep every single time, which might be a substantial percentage of total job wall clock time.)

In any case, I feel fairly strongly that sleeping is not the right thing to do. Despite the weakness in appeal to authority, I will note that I have seen Butenhof make the same assertion on Usenet, surely in a manner more eloquent than mine.

I previously neglected to suggest alternatives.

1) A generally sound 8-byte mutex could be used. I am not sure how bad this would be in terms of storage efficiency. It may be possible to implement it using less storage, but I have no specific information on this.

2) I've seen a shared_ptr count implementation that used InterlockedIncrement instead of the lwm code. I have not examined this in detail; however, it seems that an approach like this is better in all respects than trying to manage locks if the only need is to maintain a count (which is what many mutex primitives are doing anyway). Is there some reason this cannot be used? (A rough sketch of the idea follows below.)

What do you think?

Aaron W. LaFramboise
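For concreteness, here is a rough sketch of what alternative 2 amounts to, assuming the Win32 Interlocked primitives; the class and member names are illustrative, and this is not the actual shared_count code.

    #include <windows.h>

    // A lock-free use count: no mutex at all, just atomic increments/decrements.
    class atomic_use_count
    {
    public:
        explicit atomic_use_count(LONG initial = 1): count_(initial) {}

        void add_ref()
        {
            InterlockedIncrement(&count_);             // atomic ++count_
        }

        // Returns true when the last reference is gone, i.e. the caller
        // should now dispose of the managed object.
        bool release()
        {
            return InterlockedDecrement(&count_) == 0; // atomic --count_
        }

        LONG use_count() const { return count_; }      // approximate under concurrency

    private:
        volatile LONG count_;
    };

The non-obvious part, and presumably part of why shared_ptr does not simply do this, is that shared_count also maintains a weak count and has to keep the two counts consistent without a lock; that is the problem the proof-of-concept implementation linked later in this thread addresses.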

Quoting "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com>:
1) Personally, I value predictable performance. I would prefer slight performance degradation (Note, however, that I am not examining the actual numbers.) to the case where I may have an occasional very long wait. (Specific example: If a graphically intensive program is aiming for 30 frames per second, that is only about 30ms per frame. It would be very unfortunate if, even occasionally, a single smart pointer copy accounted for 1/3 of the time it took to render a frame.)
2) In "thundering herd" situations, contention is unavoidable, and performance may be paramount. The possibility of a single thread halting many other threads for 10ms on a machine with many processors is disturbing. (Specific example: In a case where thread is producing work for many consumer threads on a multiprocessor machine, and obtaining the work involves copying a smart pointer, it is possible that some amount of threads will forced to sleep every single time, which might be a substantial percentage of total job wall clock time.)
Thanks.
In any case, I feel fairly strongly that sleeping is not the right thing to do. Despite the weakness in appeal to authority, I will note that I have seen Butenhof make the same assertion on Usenet, surely in a manner more eloquent than mine.
What I'm thinking here is that smart pointers are a special case, and what may apply in the general case may not apply given that constraint. The more specific information you have about the task, the more you can optimize. However, despite your examples being slightly far-fetched (Sleep(1) would wait for ~1ms, not 10ms, and a consumer/producer system would most likely have a synchronized queue which would avoid any congestion around the smart pointers), they are a good enough hint to convince me that there are real-world applications that would suffer from the problem. As you say, stable performance is more important for a generic implementation.
I previously neglected to suggest alternatives.
1) A generally sound 8-byte mutex could be used. I am not sure how bad this would be in terms of storage efficiency. It may be possible to implement it using less storage, but I have no specific information on this.
2) I've seen a shared_ptr count implementation that used InterlockedIncrement instead of the lwm code. I have not examined this in detail; however, it seems that an approach like this is better in all respects than trying to manage locks if the only need is to maintain a count (which is what many mutex primitives are doing anyway). Is there some reason this cannot be used?
I was a bit surprised by the use of a lock as well, and given that InterlockedIncrement is an obvious solution for this kind of thing, I assumed there were non-obvious reasons that it couldn't be used. My guess is exception safety, but I would like to hear from the original authors (or anybody else in the know) about this. Perhaps explaining rationale in the documentation would be in order.
What do you think?
Alternative 2 would be a superior solution if it's feasible.

Alternative 1 is not bad performance-wise if implemented using CRITICAL_SECTION. My only worry is about resource usage, since mutexes are kernel objects. I can imagine that situations where hundreds of thousands of smart pointers are used may end up having an impact on overall performance. In some cases, kernel memory usage is restricted. I'm not sure if mutexes belong to that category. The number of outstanding overlapped file operations is an example that does (on the order of 1000 outstanding operations from my measurements, on a machine with 512 MB of RAM).

I believe the majority of threaded applications do not need to share smart pointers between threads to any great extent. Unfortunately the choice to avoid a policy-based design implies that optional thread safety might add something like three extra smart pointer classes to the six already existing ones.

/Mattias

Mattias Flodin wrote:
Quoting "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com>:
However, despite your examples being slightly far-fetched (Sleep(1) would wait for ~1ms, not 10ms, and a consumer/producer system would most likely have a synchronized queue which would avoid any congestion around the smart pointers), they are a good enough hint to convince me that there are real-world applications that would suffer from the problem. As you say, stable performance is more important for a generic implementation.
Sorry. I was mistakenly thinking that it was not possible to sleep for less than the system clock granularity (10ms), but now that I have tested it, this assertion appears not to be true. Maybe that was only true for older systems. I will investigate further how bad the degenerate performance might actually be with smart_ptr.
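For anyone who wants to check this on their own machine, here is a quick measurement of what Sleep(1) actually costs (just a measurement aid, not code from the library; the result depends on the system timer resolution):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);

        const int iterations = 100;
        QueryPerformanceCounter(&start);
        for (int i = 0; i < iterations; ++i)
            Sleep(1);                 // the call used in the spinlock's wait loop
        QueryPerformanceCounter(&stop);

        double avg_ms = 1000.0 * (stop.QuadPart - start.QuadPart)
                      / freq.QuadPart / iterations;
        printf("average Sleep(1): %.2f ms\n", avg_ms);
        return 0;
    }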
2) I've seen a shared_ptr count implementation that used InterlockedIncrement instead of the lwm code. I have not examined this in detail; however, it seems that an approach like this is better in all respects than trying to manage locks if the only need is to maintain a count (which is what many mutex primitives are doing anyway). Is there some reason this cannot be used?
I was a bit surprised by the use of a lock as well, and given that InterlockedIncrement is an obvious solution for this kind of thing, I assumed there were non-obvious reasons that it couldn't be used. My guess is exception safety, but I would like to hear from the original authors (or anybody else in the know) about this. Perhaps explaining rationale in the documentation would be in order.
Well, here is the implementation that I have seen:

http://www.pdimov.com/cpp/shared_count_x86_exp2.hpp

This implementation seems like it would be faster than what is presently being used, in all cases. I do not know why it is not used presently. Hopefully Peter Dimov will comment.
Alternative 2 would be a superior solution if it's feasible. Alternative 1 is not bad performance-wise if implemented using CRITICAL_SECTION. My only worry is about resource usage, since mutexes are kernel objects. I can imagine that situations where hundreds of thousands of smart pointers are used may end up having an impact on overall performance. In some cases, kernel memory usage is restricted. I'm not sure if mutexes belong to that category. The number of outstanding overlapped file operations is an example that does (on the order of 1000 outstanding operations from my measurements, on a machine with 512 MB of RAM).
Well, even critical sections, Windows's fastest mutex primitive, are much slower in the noncontended case than a spinlock. A two-stage method is needed to match the performance of the present spinlock: a lightweight atomic operation followed by a heavyweight mutex if the lock is contended. This is why I mentioned that 8 bytes (one word for the critical section, one word for the atomic operation) would be necessary.
I believe the majority of threaded applications do not need to share smart pointers between threads to any great extent. Unfortunately the choice to avoid a policy-based design implies that optional thread safety might add something like three extra smart pointer classes to the six already existing ones.
Ideally, the smart pointer classes would work for more than the majority; they would work for everyone. I also agree that it is unfortunate that users who do not need threads might have to pay for threads anyway, or for a more complicated smart pointer library.

Aaron W. LaFramboise

Quoting "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com>:
Well, even critical sections, Windows's fastest mutex primitive, are much slower in the noncontended case than a spinlock. A two-stage method is needed to match the performance of the present spinlock: a lightweight atomic operation followed by a heavyweight mutex if the lock is contended. This is why I mentioned that 8 bytes (one word for the critical section, one word for the atomic operation) would be necessary.
I'm quite surprised by this claim. What you describe is precisely how Win32 critical sections work. If your measurements show them to be slower, there must be some other reason for it. Win32 also provides InitializeCriticalSectionAndSpinCount, which will cause a busy-wait for a few cycles before resorting to waiting on the kernel lock.

/Mattias
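For reference, a minimal sketch of the spin-count variant mentioned above; the spin count of 4000 is only an illustrative value and should be tuned for the workload.

    #include <windows.h>

    CRITICAL_SECTION cs;

    void init()
    {
        // Spin briefly in user space before falling back to a kernel wait.
        InitializeCriticalSectionAndSpinCount(&cs, 4000);
    }

    void do_locked_work()
    {
        EnterCriticalSection(&cs);
        // ... short critical region, e.g. adjusting a reference count ...
        LeaveCriticalSection(&cs);
    }

    void cleanup()
    {
        DeleteCriticalSection(&cs);
    }

On a single-processor machine the spin count is ignored and the thread blocks immediately, so the tuning only matters on multiprocessor systems.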

Mattias Flodin wrote:
Quoting "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com>:
Well, even critical sections, Windows's fastest mutex primitive, are much slower in the noncontended case than a spinlock. A two-stage method is needed to match the performance of the present spinlock: a lightweight atomic operation followed by a heavyweight mutex if the lock is contended. This is why I mentioned that 8 bytes (one word for the critical section, one word for the atomic operation) would be necessary.
I'm quite surprised by this claim. What you describe is precisely how Win32 critical sections work. If your measurements show them to be slower, there must be some other reason for it. Win32 also provides InitializeCriticalSectionAndSpinCount, which will cause a busy-wait for a few cycles before resorting to waiting on the kernel lock.
In fact, you are quite right. I just tested, and performance for the two methods was identical. I had a piece of apparently stale knowledge in my brain telling me otherwise; I don't remember why I thought that.

Well, I see that there is a critical-section version of the lwm. Is there some reason the spinlock is being used by default instead? Perhaps the solution here should just be to define BOOST_LWM_USE_CRITICAL_SECTION and be done with it.

Aaron W. LaFramboise
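If that route is taken, forcing the critical-section implementation is just a matter of defining the configuration macro quoted from the header earlier in this thread before any Boost header is included, for example:

    // Force lightweight_mutex to use a CRITICAL_SECTION instead of the spinlock.
    // Equivalent to passing -DBOOST_LWM_USE_CRITICAL_SECTION on the command line.
    #define BOOST_LWM_USE_CRITICAL_SECTION
    #include <boost/shared_ptr.hpp>

Since the macro can affect the layout of the mutex embedded in the count object, it needs to be defined consistently in every translation unit that uses shared_ptr, which is why a command-line define is the safer route.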

Aaron W. LaFramboise wrote:
Well, here is the implementation that I have seen:

http://www.pdimov.com/cpp/shared_count_x86_exp2.hpp

This implementation seems like it would be faster than what is presently being used, in all cases.
I do not know why it is not used presently. Hopefully Peter Dimov will comment.
This proof of concept implementation (of Alexander Terekhov's algorithm) only works on MSVC/Windows. I'll try to portabilify and integrate it in the main trunk after we branch for release. It's too late for 1.32.
participants (5)
- Aaron W. LaFramboise
- flodin@cs.umu.se
- Mattias Flodin
- Peter Dimov
- Tyson Whitehead