[Boost.Interprocess] Chronic performance on Win7

Hi,

We're using the Boost.Interprocess layer for some shared-memory-based IPC and, after getting awful performance characteristics out of the synchronization primitives (in particular the condition variables, but similar results with semaphores), we decided to do some profiling. Our platform is a dual-core Intel Core 2 Duo running 64-bit Windows 7, with a 64-bit build environment in MSVC 10.

One thing we wished to measure was the cost of process hand-offs, i.e. getting one process to block as it unblocked a second, symbiotic process and vice versa. To this end we made a tight loop looking like:

    boost::interprocess::named_mutex m(boost::interprocess::open_or_create, "ProfileCache-Mutex");
    boost::interprocess::named_condition c(boost::interprocess::open_or_create, "ProfileCache-Condition");
    scoped_lock<named_mutex> l(m);
    for (int i = 0; i < 1024; ++i)
    {
        c.notify_all();
        c.wait(l);
    }

and another, running in another process, with the wait() and notify_all() reversed. We checked that the hand-offs were happening properly (no spurious wakeups etc.).

I don't have any graphs to hand (though they would be quite easy to create), but as a rough figure, in a relatively unfettered execution environment (no browsers running or music playing) we were getting run times of about 45 seconds. That's approximately 40ms per round trip between the processes, or 20ms per forced context switch. I suspect it's no mere coincidence that the Windows scheduling timeslice happens to be 20ms: on Windows, sched_yield is implemented as Sleep(1), which, on Windows 7 at least, has the effect of putting the thread to sleep for the rest of its timeslice. Not ideal for low-latency inter-process hand-offs.

The obvious way to gain far higher performance is to Sleep(0). This has had the meaning since Vista of "run any other thread that's ready to run" (it used to mean any other thread of equal priority that's ready to run, hence priority inversion could ensue and people favoured the more aggressively passive Sleep(1)). Changing to a Sleep(0)-based sched_yield gave appropriately speedy results; of the order of 600us for the 1024 roundtrips, about 300ns per forced context switch.

Though a loop around Sleep(0) is a nice, yielding spin loop, if all processes are waiting (perhaps on an external event) then it will needlessly use 100% of each of the cores involved. Also far from ideal.

My proposed solution, for which I have attached a draft patch, introduces an argument to sched_yield ("_iteration") to allow sched_yield to customise its behaviour depending on how long the caller has been waiting so far. All loops around sched_yield() (yield-loops) have been changed to track the number of iterations and pass it as an argument to sched_yield(). Ideally we would Sleep(0) whenever the yield-loop's exit condition will be satisfied within the next 20ms and Sleep(1) in all other cases. With a completely unknown prior, we can model the remaining wait as an exponential PDF whose 0.5 point occurs at the duration we have already been waiting. Optimising for this model, we switch to Sleep(1) once we have already been waiting 20ms, which works out at around 50000-100000 Sleep(0) calls. This was tested empirically and gave performance comparable to the simple Sleep(0) yield, while reducing CPU consumption to near zero on indefinite wait()s. We also tested that performance wasn't hindered when the threads had unequal priorities; it wasn't.

Comments?

Regards,

Gav

--
Save Ferris!
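[For illustration, a minimal sketch of the shape of the change, not the attached patch itself. The threshold constant is an assumption, picked from the 50000-100000 Sleep(0) calls that roughly corresponded to 20ms on our test machine, and try_lock() stands in for whatever condition the surrounding yield-loop actually polls.]

    #include <windows.h>

    // Adaptive yield: spin with Sleep(0) while the expected remaining wait
    // is short, switch to Sleep(1) once we have probably already been
    // waiting for more than a timeslice.
    inline void sched_yield(unsigned int _iteration)
    {
        // Assumed value: ~50000-100000 Sleep(0) calls correspond to ~20ms
        // on the test machine; this is processor-speed dependent.
        const unsigned int spin_threshold = 65536;
        if (_iteration < spin_threshold)
            Sleep(0);   // hand the core to any ready thread, else return at once
        else
            Sleep(1);   // a long wait is likely: give up the rest of the timeslice
    }

    // Yield-loops then track their own iteration count, e.g.:
    //     for (unsigned int it = 0; !try_lock(); ++it)
    //         sched_yield(it);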

On 08/19/2011 12:52 PM, Gav Wood wrote:
The obvious way to gain far higher performance is to Sleep(0). This has had the meaning since Vista of "run any other thread that's ready to run" (it used to mean any other thread of equal priority that's ready to run, hence priority inversion could ensue and people favoured the more aggressively passive Sleep(1)). Changing to a Sleep(0)-based sched_yield gave appropriately speedy results; of the order of 600us for the 1024 roundtrips, about 300ns per forced context switch.
why not instead
if(!SwitchToThread()) Sleep(1);

On 19/08/2011 22:39, Mathias Gaunard wrote:
On 08/19/2011 12:52 PM, Gav Wood wrote:
The obvious way to gain far higher performance is to Sleep(0). This has had the meaning since Vista of "run any other thread that's ready to run" (it used to mean any other thread of equal priority that's ready to run, hence priority inversion could ensue and people favoured the more aggressively passive Sleep(1)). Changing to a Sleep(0)-based sched_yield gave appropriately speedy results; of the order of 600us for the 1024 roundtrips, about 300ns per forced context switch.
why not instead
if(!SwitchToThread()) Sleep(1);
Thanks for the hint. Ion

On 19/08/2011 12:52, Gav Wood wrote:
Comments?
Thanks for the report. I must admit that the process-shared synchronization implementation in Windows is awful at best. I have some ideas for implementing it using named mutexes or events, but the fact is that I don't know of any open-source code that emulates PTHREAD_PROCESS_SHARED properly.

Sleep(0) is a problem because emulated mutexes can't implement priority inheritance (only kernel sync mechanisms can do that), whereas Sleep(1) gives a chance to lower-priority threads. The spin count until 20ms depends on the processor speed, which is not nice.

Gav Wood (thanks!) has proposed SwitchToThread, but I read here: http://msdn.microsoft.com/en-us/library/ms686352%28v=vs.85%29.aspx

"Windows 2003 warning! If you plan to use SwitchToThread inside a function that is going to be called heavily, and your program is going to run on Windows 2003 server (whatever flavor), you better be careful. SwitchToThread executes painfully slow on w2003. My benchmarks show that a program running under Windows XP can call a function that performs a useful job and has SwitchToThread in it, 40 million times in less than 2 minutes, while the same process (running on the same hardware) will take almost 40 minutes on Windows 2003."

I don't know if this also applies to other Windows versions. Could you measure the SwitchToThread version?

Best,

Ion
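[For reference, a minimal sketch of the "SwitchToThread version" being asked about, i.e. Mathias's one-liner wrapped into a yield function; plain Win32, no priority handling assumed.]

    #include <windows.h>

    inline void sched_yield()
    {
        // SwitchToThread() returns nonzero if it actually yielded to another
        // ready thread; if nothing could be scheduled, fall back to Sleep(1)
        // and give up the rest of the current timeslice.
        if (!SwitchToThread())
            Sleep(1);
    }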

Hi again,

I've done some graphs for your perusal.

The reason an adaptive strategy (such as the one I already proffered) must be used is quite simple: there's no way of knowing, in advance, how long a thread must send itself to sleep for. Windows provides only two choices for yielding:

- Return immediately if nothing else is ready to run (Sleep(0)/SwitchToThread()).
- Return in 20ms, even if another process was scheduled, ran, and potentially freed us in the meantime (Sleep(1)).

For forced process switches spaced much more than 20ms apart, the second (Sleep(1)) is an acceptable yield strategy, made ever-so-slightly better by first checking whether another process can be scheduled and exiting early if so (the "Check" strategy). It works effectively when there are fewer cores than processes with work to do.

Unfortunately, this becomes far from optimal for forced process switches spaced less than 20ms apart, especially when the number of cores becomes >= the number of processes with work to do. This is because a process, having determined that it cannot continue, must decide how to yield. Assuming its symbiotic counterpart is running on a different core and that no other work needs to be done on the system, Sleep(0) will return immediately and SwitchToThread will return immediately, reporting that no other thread could be scheduled; Sleep(1) will return only after 20ms. If the symbiotic counterpart completes after, say, 200us, then the slowdown factor will be around 99% for any strategy involving Sleep(1) (the thread sleeps ~20ms for every ~200us of useful work); therefore such a call should be made only once it is probable that the symbiotic counterpart will take more than another 20ms to complete. My proposal attempts to implement such a strategy.

As well as the attached graph, I have attached the numbers from my benchmarking; the benchmarking code is attached too. The job was an iterative summation of random numbers. These numbers are not extremely precise, as I wasn't in a pristine environment, but they serve to illustrate.

Gav
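[Illustrative only, not the attached benchmark: a rough sketch of the kind of hand-off measurement described above, with an arbitrary job of 1000 random-number additions between hand-offs. The counterpart process would run the same loop with the wait() before the notify_all().]

    #include <boost/interprocess/sync/named_mutex.hpp>
    #include <boost/interprocess/sync/named_condition.hpp>
    #include <boost/interprocess/sync/scoped_lock.hpp>
    #include <boost/date_time/posix_time/posix_time.hpp>
    #include <cstdlib>
    #include <iostream>

    int main()
    {
        using namespace boost::interprocess;
        named_mutex m(open_or_create, "ProfileCache-Mutex");
        named_condition c(open_or_create, "ProfileCache-Condition");
        scoped_lock<named_mutex> l(m);

        boost::posix_time::ptime start =
            boost::posix_time::microsec_clock::universal_time();
        volatile unsigned sum = 0;
        for (int i = 0; i < 1024; ++i)
        {
            for (int j = 0; j < 1000; ++j)   // the "job": sum random numbers
                sum += std::rand();
            c.notify_all();                  // unblock the symbiotic process...
            c.wait(l);                       // ...and block until it hands back
                                             // (the reversed-order process must
                                             // be started first, already waiting)
        }
        boost::posix_time::time_duration t =
            boost::posix_time::microsec_clock::universal_time() - start;
        std::cout << t.total_microseconds() / 1024.0 << " us per round trip\n";
    }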
El 20/08/2011 1:39, Ion Gaztañaga escribió:
Gav Wood (thanks!) has proposed
Obviously, I wanted to write Mathias!
Best,
Ion

On Sat, 20 Aug 2011 00:39:34 +0100, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Thanks for the report. I must admit that the process-shared synchronization implementation in Windows is awful at best.
Huh? Are you sure it's not just that you're trying to emulate a POSIX API and didn't really abstract it away very well in the API design? <shrug>

On 24/08/2011 7:21, James Mansion wrote:
On Sat, 20 Aug 2011 00:39:34 +0100, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Thanks for the report. I must admit that the process-shared synchronization implementation in Windows is awful at best.
Huh? Are you sure it's not just that you're trying to emulate a POSIX API and didn't really abstract it away very well in the API design?
Trying to emulate the Windows API with POSIX is even worse ;-)

Best,

Ion
participants (4):
- Gav Wood
- Ion Gaztañaga
- James Mansion
- Mathias Gaunard