
When you're talking "optimal," you're setting a pretty dang high bar.
I don't have to care about "optimal" if the difference between a suboptimal use of SIMD and not using it at all is an order of magnitude.
Indeed. However, in his GSoC project this year Gautam Sewani looked quite hard at optimising Boost.Math with SSE2/3 instructions and found it difficult to find *any* use cases where hand-written SSE2 code was better than compiler-generated code. One exception was the classic vectorised addition, but even then you're struggling to get a 2x improvement.
Of course, if the submitter can show that his code *is* faster than the alternatives then all this discussion is entirely moot, and strictly IMO we should stop discussing the bicycle shed colour and get on with it :-)
Just my 2c worth, John.