
Thanks for doing some useful benchmarks!
The code size and speed can be improved further by using a simpler method that just calls sched_yield(). However, this method is less suitable for use as a default implementation because it will have worse complexity in applications with contention: futex() knows which thread to wake up next, while sched_yield() doesn't.
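For readers who want to see what the "just calls sched_yield()" approach looks like, here is a minimal sketch. It is not the benchmark author's code: the names yield_lock_t, yield_lock(), yield_unlock() and run_demo() are made up for illustration, and the thread and iteration counts are arbitrary. A waiter that fails to take the lock simply yields and retries, so nothing tells the kernel which thread should run next, which is the contention weakness described above.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical lock type for illustration; not from the original post. */
typedef struct { atomic_flag held; } yield_lock_t;

/* Try to take the flag; on failure give up the CPU and retry.
 * The kernel is never told who is waiting on what. */
static void yield_lock(yield_lock_t *l) {
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        sched_yield();
}

static void yield_unlock(yield_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Arbitrary sizes chosen for the demo. */
enum { NTHREADS = 4, ITERS = 10000 };

static yield_lock_t lock = { ATOMIC_FLAG_INIT };
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        yield_lock(&lock);
        counter++;              /* protected by the lock */
        yield_unlock(&lock);
    }
    return NULL;
}

/* Runs the contended increment loop and returns the final count. */
long run_demo(void) {
    pthread_t t[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Build with something like `cc -std=c11 -pthread`. The sketch only demonstrates correctness of the mutual exclusion; how long waiters spin, and which one gets the lock next, is left entirely to the scheduler.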
My (limited) understanding from recent Linux kernel mailing list discussions about supposed regressions in the scheduler is that sched_yield() does not have well-defined behaviour for anything other than SCHED_FIFO tasks, so its implementation and performance may vary greatly from one scheduler implementation to another. The recent regression noted for the network benchmark iperf is an example: iperf used sched_yield() and saw major changes in performance. The advice was not to rely on sched_yield() other than in simple 'don't care' apps (and that iperf wasn't ...).

Perhaps try your sched_yield() tests on the most recent kernel and play with the different scheduler sched_yield() strategies to see if that influences your non-contended/contended settings. You probably shouldn't end up relying on it anyway, so the point may be moot...

http://lkml.org/lkml/2007/10/1/279 might be a start.