Performance of boost::mutex and boost::mutex::scoped_lock

I have been doing some profiling with VTune recently to investigate the cause of some underperforming multi-threaded code, and it's possible that I'm not interpreting these results correctly, but if I am they're a little surprising. This isn't the best multi-threaded design in the world, but it's basically a naive implementation of a producer/consumer that uses a deque as the production queue, a boost::mutex as the lock, and a boost::condition as a signal to wake up when there's stuff in the queue (cue comment about boost::circular_buffer, which this code really should be using instead).

There is LOTS of context switching going on: basically I read a 4KB chunk of data from the filesystem, put it in the queue, the other thread takes the 4KB of data and does something with it, and this repeats for a really long time. Since it doesn't take much time to read 4KB of data from the disk, you can tell there are going to be zillions of context switches and lots of contention.

So on to the point. Obviously most of the time is spent in boost::detail::basic_timed_mutex::lock, but what's surprising is the breakdown of time inside the lock function. (Keep in mind this is not timed over a terribly long sample, hence the small numbers, but it is over thousands of actual locks, and ultimately this locking alone makes up a noticeable portion of the sample duration.)

  boost::detail::basic_timed_mutex::lock
    Total Time: 1,032,486 microseconds (1.032 seconds)
    Total Wait Time: 505,884 microseconds (0.505 seconds)

So only about half the time is actually spent waiting on the lock, which is pretty crazy. The rest is spent doing computation, which doesn't seem right for a lock. I drilled down further and came up with this:

  boost::detail::get_system_time_sentinel
    Total Time: 266,657 microseconds
    Total Wait Time: 264 microseconds

That accounts for about half of the computational time. The rest is in a Windows system call somewhere that I was not able to drill into.
I don't know the details of the date_time library, or much about the details of the thread library either for that matter, but why is a time necessary here at all? This eventually resolves down to a boost::detail::basic_timed_mutex, but there's nothing timed about the operations I've done. I simply did the following:

  boost::mutex m;
  boost::mutex::scoped_lock lock(m);
  //...

This should ultimately result in an infinite timeout being specified to the system call, so all of this computation seems completely unnecessary. It seems like a scoped_lock should completely bypass anything having to do with date_time, and perhaps even normal timed locks should come up with a more efficient strategy. Thoughts?

On Wed, Jun 10, 2009 at 9:19 AM, Zachary Turner<divisortheory@gmail.com> wrote:
This isn't the best multi-threaded design in the world, but it's basically a naive implementation of a producer/consumer that uses a deque as the production queue, a boost::mutex as the lock, and a boost::condition as a signal to wake up when there's stuff in the queue (cue comment about boost::circular_buffer, which this code really should be using instead).
Are you using the "swap trick" on the queue, or synchronizing each push/pop? I find that keeping a local queue in the thread that processes the shared queue, and just swapping it with the shared queue when the local copy is empty, eliminates most of the synchronization on the queue-processing side. The threads pushing work onto the shared queue still have to synchronize, but you can't really get around that... even with shared circular buffers, AFAIK. Jon

On Wed, Jun 10, 2009 at 11:34 AM, Jonathan Franklin < franklin.jonathan@gmail.com> wrote:
On Wed, Jun 10, 2009 at 9:19 AM, Zachary Turner<divisortheory@gmail.com> wrote:
This isn't the best multi-threaded design in the world, but it's basically a naive implementation of a producer/consumer that uses a deque as the production queue, a boost::mutex as the lock, and a boost::condition as a signal to wake up when there's stuff in the queue (cue comment about boost::circular_buffer, which this code really should be using instead).
Are you using the "swap trick" on the queue, or synchronizing each push/pop?
I find that keeping a local queue in the thread that processes the shared queue, and just swapping it with the shared queue when the local copy is empty, eliminates most of the synchronization on the queue-processing side. The threads pushing work onto the shared queue still have to synchronize, but you can't really get around that... even with shared circular buffers, AFAIK.
I can try that, but it's not going to be simple, because the producer/consumer code is part of a 3rd-party library whose source we can't modify, so basically it means I'd have to write my own. Which isn't that bad, but it still sucks. Even then, I still don't think it makes sense that a call sequence which attempts to perform a lock with an infinite timeout should have any reason to touch the date_time library, especially since it apparently isn't all that fast.

This should ultimately result in an infinite timeout being specified to the system call, so all of this computation seems completely unnecessary. It seems like a scoped_lock should completely bypass anything having to do with date_time, and perhaps even normal timed locks should come up with a more efficient strategy.
Thoughts?
I agree with this. I haven't actually used the timed locks so far, and I would assume that a non-timed lock is the standard case -- and that case should be optimized for. Optimally, if no timeout is specified, I would expect boost::mutex to be equivalent to calling ::WaitForSingleObject(..., INFINITE) on Windows, or ::EnterCriticalSection, whichever is faster. Cheers, Anteru

I have been doing some profiling with VTune recently to investigate the cause of some underperforming multi-threaded code, and it's possible that I'm not interpreting these results correctly, but if I am they're a little surprising. This isn't the best multi-threaded design in the world, but it's basically a naive implementation of a producer/consumer that uses a deque as the production queue, a boost::mutex as the lock, and a boost::condition as a signal to wake up when there's stuff in the queue (cue comment about boost::circular_buffer, which this code really should be using instead). There is LOTS of context switching going on: basically I read a 4KB chunk of data from the filesystem, put it in the queue, the other thread takes the 4KB of data and does something with it, and this repeats for a really long time. Since it doesn't take much time to read 4KB of data from the disk, you can tell there are going to be zillions of context switches and lots of contention.
i would be curious, did you have a look at my lock-free data structures [1]? its main purpose is to provide a boost-style lock-free fifo, supporting multiple producer and consumer threads ... i am not sure whether it is feasible for your use case, but it should reduce the context switches caused by the queue guards ... best, tim [1] http://tim.klingt.org/git?p=boost_lockfree.git;a=summary -- tim@klingt.org http://tim.klingt.org A year ago, six months ago, I thought that I was an artist. I no longer think about it, I am. Henry Miller
participants (4)
-
Anteru
-
Jonathan Franklin
-
Tim Blechmann
-
Zachary Turner