
I have a complete implementation that models std::valarray in every possible way (with compile-time known limitations), including expression-template support for the numeric ops. My testing indicates a decent performance increase over gcc4's valarray for small numbers of elements (<1000).
If anyone has any interest, I'll put together some docs and throw it in the sandbox.
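
For readers unfamiliar with the term, expression-template support for the numeric ops can be sketched roughly as below. This is only an illustrative sketch, not the implementation described above: the names FixedArray, Expr, Sum and sum_three_demo are hypothetical, only operator+ is shown, and C++11 is assumed. The idea is that a + b + c builds a lightweight expression object, and the element loop runs once, on assignment, with no temporary arrays.

```cpp
#include <cstddef>

// CRTP base that tags a type as an expression (hypothetical name).
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node representing l + r; it stores references and computes lazily.
template <class L, class R>
struct Sum : Expr<Sum<L, R>> {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

// Fixed-size array whose length N is known at compile time (hypothetical name).
template <std::size_t N>
struct FixedArray : Expr<FixedArray<N>> {
    double data[N];

    FixedArray() : data() {}  // zero-initialise all elements
    explicit FixedArray(double v) {
        for (std::size_t i = 0; i < N; ++i) data[i] = v;
    }

    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }

    // Assigning from any expression evaluates it in a single loop.
    template <class E>
    FixedArray& operator=(const Expr<E>& e) {
        const E& x = e.self();
        for (std::size_t i = 0; i < N; ++i) data[i] = x[i];
        return *this;
    }
};

// Builds an expression node instead of computing a temporary array.
template <class L, class R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

// Small usage example: three arrays summed in one pass.
inline double sum_three_demo() {
    FixedArray<3> a(1.0), b(2.0), c(3.0), d;
    d = a + b + c;  // one loop over 3 elements, no intermediate arrays
    return d[0];
}
```

Note that the expression nodes hold references, so an expression should be assigned within the same statement that builds it rather than stored for later.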
I've no immediate use, but it does sound interesting.
To get an idea of whether it's gcc's valarray implementation that is at fault, you could evaluate macstl (for Macs and PCs) at www.pixelglow.com. They cite significant speed improvements just from replacing gcc's valarray with their alternative (although they are concerned almost exclusively with vectorised operations). It has a slightly restrictive license, but provides valarrays and similar fixed-size alternatives. A quick benchmark of this might convince you of what is worth pursuing.

FWIW, wrapping simple C arrays and using 'restrict' seems to work as well in many optimising compilers I've looked at on the PC. Something like macstl wins when you have longer operations that can make use of expression templates.

I'd LOVE to see something like macstl developed for Boost. I haven't really been following MTL recently, but some smart people are working on it behind the scenes. I'm hoping that can be used as a framework for an alternative.

Paul
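
The "wrap a simple C array and use 'restrict'" approach mentioned above might look something like the following. This is a minimal sketch under stated assumptions: Vec, add_arrays and assign_sum are hypothetical names, and __restrict is a compiler extension (accepted by GCC, Clang and MSVC), since standard C++ has no restrict keyword. Whether it actually helps depends on the compiler and optimisation flags, so it is worth benchmarking.

```cpp
#include <cstddef>

// __restrict is a compiler extension, not standard C++. It promises the
// compiler that out, a and b do not overlap, which can make the loop
// easier to optimise and vectorise.
inline void add_arrays(double* __restrict out,
                       const double* __restrict a,
                       const double* __restrict b,
                       std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Thin wrapper around a plain C array of compile-time size N
// (hypothetical name, for illustration only).
template <std::size_t N>
struct Vec {
    double data[N];

    // Stores a + b into *this. The caller must not pass *this as a or b,
    // because that would break the no-aliasing promise made above.
    Vec& assign_sum(const Vec& a, const Vec& b) {
        add_arrays(data, a.data, b.data, N);
        return *this;
    }
};
```

Unlike expression templates, each call here handles a single operation, so a longer expression such as a + b + c needs two passes and an intermediate array.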