[Boost.Interprocess] Chronic performance on Win7

Hi,

We're using the Boost.Interprocess layer for some shared-memory-based IPC and, after getting awful performance characteristics out of the synchronization primitives (in particular the condition variables, but similar results with semaphores), we decided to do some profiling. Our platform is a dual-core Intel Core 2 Duo running 64-bit Windows 7, with a 64-bit build environment in MSVC 10.

One thing we wished to measure was the cost of process hand-offs, i.e. getting one process to block as it unblocked a second, symbiotic process and vice versa. To this end we made a tight loop looking like:

    boost::interprocess::named_mutex m(boost::interprocess::open_or_create, "ProfileCache-Mutex");
    boost::interprocess::named_condition c(boost::interprocess::open_or_create, "ProfileCache-Condition");
    scoped_lock<named_mutex> l(m);
    for (int i = 0; i < 1024; ++i)
    {
        c.notify_all();
        c.wait(l);
    }

and another, running in another process, with the wait() and notify_all() reversed. We checked that the hand-offs were happening properly (no spurious wakeups etc.).

I don't have any graphs to hand (though they would be quite easy to create), but as a rough figure, in a relatively unfettered execution environment (no browsers running or music playing) we were getting run times of about 45 seconds. That's approximately 40ms per round trip between the processes, or 20ms per forced context switch. I suspect it's no mere coincidence that the Windows scheduling timeslice happens to be 20ms: on Windows, sched_yield is implemented as Sleep(1), which, on Windows 7 at least, has the effect of putting the thread to sleep for the rest of its timeslice. Not ideal for low-latency inter-process hand-offs.

The obvious way to gain far higher performance is to Sleep(0). This has had the meaning since Vista of "run any other thread that's ready to run" (it used to mean any other thread of equal priority that's ready to run, hence priority inversion could ensue and people favoured the more aggressively passive Sleep(1)). Changing to a Sleep(0)-based sched_yield gave appropriately speedy results; of the order of 600us for the 1024 roundtrips, about 300ns per forced context switch.

Though a loop around Sleep(0) is a nice, yielding spin loop, if all processes are waiting (perhaps on an external event) then it will needlessly use 100% of each of the cores involved. Also far from ideal.

My proposed solution, for which I have attached a draft patch, introduces an argument to sched_yield ("_iteration") to allow sched_yield to customise its behaviour depending on how long the caller has been waiting so far. All loops around sched_yield() (yield-loops) have been changed to track the number of iterations and pass it as an argument to sched_yield(). Ideally we would Sleep(0) whenever the yield-loop's exit condition will be satisfied within the next 20ms and Sleep(1) in all other cases. With a completely unknown prior, we can model the remaining wait as an exponential PDF whose 0.5 point occurs at the duration we have already been waiting. Optimising for this model, we switch to Sleep(1) once we have already been waiting 20ms, which works out at around 50000-100000 Sleep(0) calls. This was tested empirically and gave performance comparable to the simple Sleep(0) yield, while reducing CPU consumption to near zero on indefinite wait()s. We also tested that performance wasn't hindered when the threads had unequal priorities; it wasn't.

Comments?

Regards,

Gav

--
Save Ferris!
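[For illustration, a minimal sketch of the shape of the change, not the attached patch itself. The threshold constant is an assumption, picked from the 50000-100000 Sleep(0) calls that roughly corresponded to 20ms on our test machine, and try_lock() stands in for whatever condition the surrounding yield-loop actually polls.]

    #include <windows.h>

    // Adaptive yield: spin with Sleep(0) while the expected remaining wait
    // is short, switch to Sleep(1) once we have probably already been
    // waiting for more than a timeslice.
    inline void sched_yield(unsigned int _iteration)
    {
        // Assumed value: ~50000-100000 Sleep(0) calls correspond to ~20ms
        // on the test machine; this is processor-speed dependent.
        const unsigned int spin_threshold = 65536;
        if (_iteration < spin_threshold)
            Sleep(0);   // hand the core to any ready thread, else return at once
        else
            Sleep(1);   // a long wait is likely: give up the rest of the timeslice
    }

    // Yield-loops then track their own iteration count, e.g.:
    //     for (unsigned int it = 0; !try_lock(); ++it)
    //         sched_yield(it);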

On 08/19/2011 12:52 PM, Gav Wood wrote:
The obvious way to gain far higher performance is to Sleep(0). This has had the meaning since Vista of "run any other thread that's ready to run" (it used to mean any other thread of equal priority that's ready to run, hence priority inversion could ensue and people favoured the more aggressively passive Sleep(1)). Changing to a Sleep(0)-based sched_yield gave appropriately speedy results; of the order of 600us for the 1024 roundtrips, about 300ns per forced context switch.
why not instead
if(!SwitchToThread()) Sleep(1);

On 19/08/2011 22:39, Mathias Gaunard wrote:
On 08/19/2011 12:52 PM, Gav Wood wrote:
The obvious way to gain far higher performance is to Sleep(0). This has had the meaning since Vista of "run any other thread that's ready to run" (it used to mean any other thread of equal priority that's ready to run, hence priority inversion could ensue and people favoured the more aggressively passive Sleep(1)). Changing to a Sleep(0)-based sched_yield gave appropriately speedy results; of the order of 600us for the 1024 roundtrips, about 300ns per forced context switch.
why not instead
if(!SwitchToThread()) Sleep(1);
Thanks for the hint. Ion

On 19/08/2011 12:52, Gav Wood wrote:
Comments?
Thanks for the report. I must admit that the process-shared synchronization implementation in Windows is awful at best. I have some ideas for implementing it using named mutexes or events, but the fact is that I don't know of any open-source code that emulates PTHREAD_PROCESS_SHARED properly.

Sleep(0) is a problem because emulated mutexes can't implement priority inheritance (only kernel sync mechanisms can do that), whereas Sleep(1) gives a chance to lower-priority threads. The spin count until 20ms depends on the processor speed, which is not nice.

Gav Wood (thanks!) has proposed SwitchToThread, but I read here: http://msdn.microsoft.com/en-us/library/ms686352%28v=vs.85%29.aspx

"Windows 2003 warning! If you plan to use SwitchToThread inside a function that is going to be called heavily, and your program is going to run on Windows 2003 server (whatever flavor), you better be careful. SwitchToThread executes painfully slow on w2003. My benchmarks show that a program running under Windows XP can call a function that performs a useful job and has SwitchToThread in it, 40 million times in less than 2 minutes, while the same process (running on the same hardware) will take almost 40 minutes on Windows 2003."

I don't know if this also applies to other Windows versions. Could you measure the SwitchToThread version?

Best,

Ion
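[For reference, a minimal sketch of the "SwitchToThread version" being asked about, i.e. Mathias's one-liner wrapped into a yield function; plain Win32, no priority handling assumed.]

    #include <windows.h>

    inline void sched_yield()
    {
        // SwitchToThread() returns nonzero if it actually yielded to another
        // ready thread; if nothing could be scheduled, fall back to Sleep(1)
        // and give up the rest of the current timeslice.
        if (!SwitchToThread())
            Sleep(1);
    }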

Hi again,

I've done some graphs for your perusal.

The reason an adaptive strategy (such as the one I already proffered) must be used is quite simple: there's no way of knowing, in advance, how long a thread must send itself to sleep for. Windows provides only two choices for yielding:

- Return immediately if nothing else is ready to run (Sleep(0)/SwitchToThread()).
- Return in 20ms, even if another process was scheduled, ran, and potentially freed us in the meantime (Sleep(1)).

For forced process switches spaced much more than 20ms apart, the second (Sleep(1)) is an acceptable yield strategy, made ever-so-slightly better by first checking whether another process can be scheduled and exiting early if so (the "Check" strategy). It works effectively when there are fewer cores than processes with work to do.

Unfortunately, this becomes far from optimal for forced process switches spaced less than 20ms apart, especially when the number of cores becomes >= the number of processes with work to do. This is because a process, having determined that it cannot continue, must decide how to yield. Assuming its symbiotic counterpart is running on a different core and that no other work needs to be done on the system, Sleep(0) will return immediately and SwitchToThread will return immediately, reporting that no other thread could be scheduled; Sleep(1) will return only after 20ms. If the symbiotic counterpart completes after, say, 200us, then the slowdown factor will be around 99% for any strategy involving Sleep(1) (the thread sleeps ~20ms for every ~200us of useful work); therefore such a call should be made only once it is probable that the symbiotic counterpart will take more than another 20ms to complete. My proposal attempts to implement such a strategy.

As well as the attached graph, I have attached the numbers from my benchmarking; the benchmarking code is attached too. The job was an iterative summation of random numbers. These numbers are not extremely precise, as I wasn't in a pristine environment, but they serve to illustrate.

Gav
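[Illustrative only, not the attached benchmark: a rough sketch of the kind of hand-off measurement described above, with an arbitrary job of 1000 random-number additions between hand-offs. The counterpart process would run the same loop with the wait() before the notify_all().]

    #include <boost/interprocess/sync/named_mutex.hpp>
    #include <boost/interprocess/sync/named_condition.hpp>
    #include <boost/interprocess/sync/scoped_lock.hpp>
    #include <boost/date_time/posix_time/posix_time.hpp>
    #include <cstdlib>
    #include <iostream>

    int main()
    {
        using namespace boost::interprocess;
        named_mutex m(open_or_create, "ProfileCache-Mutex");
        named_condition c(open_or_create, "ProfileCache-Condition");
        scoped_lock<named_mutex> l(m);

        boost::posix_time::ptime start =
            boost::posix_time::microsec_clock::universal_time();
        volatile unsigned sum = 0;
        for (int i = 0; i < 1024; ++i)
        {
            for (int j = 0; j < 1000; ++j)   // the "job": sum random numbers
                sum += std::rand();
            c.notify_all();                  // unblock the symbiotic process...
            c.wait(l);                       // ...and block until it hands back
                                             // (the reversed-order process must
                                             // be started first, already waiting)
        }
        boost::posix_time::time_duration t =
            boost::posix_time::microsec_clock::universal_time() - start;
        std::cout << t.total_microseconds() / 1024.0 << " us per round trip\n";
    }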
El 20/08/2011 1:39, Ion Gaztañaga escribió:
Gav Wood (thanks!) has proposed
Obviously, I wanted to write Mathias!
Best,
Ion

On Sat, 20 Aug 2011 00:39:34 +0100, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Thanks for the report. I must admit that the process-shared synchronization implementation in Windows is awful at best.
Huh? Are you sure it's not just that you're trying to emulate a POSIX API and didn't really abstract it away very well in the API design? <shrug>

On 24/08/2011 7:21, James Mansion wrote:
On Sat, 20 Aug 2011 00:39:34 +0100, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Thanks for the report. I must admit that the process-shared synchronization implementation in Windows is awful at best.
Huh? Are you sure it's not just that you're trying to emulate a POSIX API and didn't really abstract it away very well in the API design?
Trying to emulate the Windows API with POSIX is even worse ;-)

Best,

Ion
participants (4):
- Gav Wood
- Ion Gaztañaga
- James Mansion
- Mathias Gaunard