
On 11/06/2011 17:30, David A. Greene wrote:
The benchmarks show this. Just in the slides, there is a 4x improvement when using std::accumulate (which is still a pretty trivial algorithm) with pack<float> instead of float, and in both cases automatic vectorization was enabled.
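For concreteness, the comparison in the slides is roughly between the plain scalar accumulate and something like the following, where a hand-rolled 4-wide SSE wrapper stands in for pack<float> (pack4f and hsum are placeholder names for this sketch, not the library's API):

#include <immintrin.h>
#include <numeric>

// Minimal stand-in for a 4-wide float pack: one SSE register plus
// the operator+ that std::accumulate needs.
struct pack4f {
    __m128 v;
    pack4f() : v(_mm_setzero_ps()) {}
    explicit pack4f(__m128 x) : v(x) {}
    pack4f operator+(const pack4f& o) const { return pack4f(_mm_add_ps(v, o.v)); }
};

// Horizontal sum of the four lanes, done once at the end.
static inline float hsum(const pack4f& p) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, p.v);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

float sum(const pack4f* first, const pack4f* last) {
    // The algorithm is unchanged; each addition now processes four floats.
    pack4f partial = std::accumulate(first, last, pack4f());
    return hsum(partial);
}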
With which compilers?
Mainstream compilers with mainstream options: what people actually use. The BoostCon talk was kept pretty simple; it was not aimed at HPC experts, but rather tried to show that the library could be useful to everyone.
But how does the programmer know what is best? It changes from implementation to implementation.
The programmer is not supposed to know; the library does. The programmer writes a * b + c, and we generate an fma instruction if one is available, or a multiplication followed by an addition otherwise (a rough sketch of that dispatch is below). Now, it is true that there are some cases with several possible code sequences, where which one is fastest is not clear and may depend on the micro-architecture and not just the instruction set. We don't really generate different code depending on the micro-architecture, but I think the cases where it really matters should be rather few. When testing on some other micro-architectures, we noticed a few unexpected run times, but nothing that really justifies different codegen, at least at our level.

We're currently setting up a test farm, and we'll try to graph the run time in cycles of all of our functions on different architectures. Any recommendations on which micro-architectures to include for x86? We can't afford too many; we mostly work with Core and Nehalem.
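To make the a * b + c point concrete, at the intrinsics level the dispatch looks roughly like this (an illustration, not our actual implementation; it assumes an SSE target and the __FMA__ macro that GCC/Clang define under -mfma):

#include <immintrin.h>

// result = a * b + c, fused when the target has FMA, split otherwise.
static inline __m128 muladd(__m128 a, __m128 b, __m128 c)
{
#if defined(__FMA__)
    return _mm_fmadd_ps(a, b, c);              // single fused instruction
#else
    return _mm_add_ps(_mm_mul_ps(a, b), c);    // multiplication then addition
#endif
}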
That's overkill. Alignment often isn't required to vectorize.
It doesn't cost the user much to enforce it for new applications, and it saves us from having to worry about it.
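For example (assuming a 32-byte requirement just for illustration; the real figure depends on the target's vector width), enforcing it amounts to:

#include <xmmintrin.h>   // _mm_malloc / _mm_free

void example()
{
    // Static or automatic storage: alignment is a one-word annotation.
    alignas(32) float stack_buf[1024];
    (void)stack_buf;

    // Heap storage: request the alignment up front.
    float* heap_buf = static_cast<float*>(_mm_malloc(1024 * sizeof(float), 32));
    // ... use heap_buf ...
    _mm_free(heap_buf);
}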
With AMD and Intel's latest offerings alignment is much less of a performance issue.
What about portability?
32 is not always the right answer on SandyBridge/Bulldozer because 256 bit vectorization is not always the right answer.
If 256-bit vectorization is not what you want, then you have to specify the size explicitly. Otherwise we always prefer the size that allows the most parallelism.
What's the vector length of a pack<int> on Bulldozer?
256 bits, because __m256i exists and there are some instructions for those types, even if they are few. But I suppose 128 bits could also be an acceptable choice. I need to benchmark this, but I think the conversions between AVX and SSE are sufficiently fast to make 256 bits a good choice in general. It's only a default anyway; you can set the size you want.
That is in fact what is happening via autotuning. Yes, in some cases hand-tuned code outperforms the compiler. But in the vast majority of cases, the compiler wins.
That's not what I remember from my last discussions with people working on the polyhedral optimization model. They told me they came close, but still weren't as fast as state-of-the-art BLAS implementations.
My case comes from years of experience.
You certainly have experience in writing compilers, but do you have experience in writing SIMD versions of algorithms within applications? You don't seem to be familiar with SIMD-style branching or other popular techniques.
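By SIMD-style branching I mean replacing a per-element if/else with a lane mask and a select; a minimal SSE sketch, not tied to any particular library:

#include <immintrin.h>

// Per lane: result[i] = (x[i] > 0) ? 2 * x[i] : -x[i], with no branch.
static inline __m128 branchless(__m128 x)
{
    __m128 mask     = _mm_cmpgt_ps(x, _mm_setzero_ps());  // all-ones where x > 0
    __m128 if_true  = _mm_mul_ps(x, _mm_set1_ps(2.0f));
    __m128 if_false = _mm_sub_ps(_mm_setzero_ps(), x);
    // Select per lane: (mask & if_true) | (~mask & if_false).
    // On SSE4.1 this is a single _mm_blendv_ps.
    return _mm_or_ps(_mm_and_ps(mask, if_true),
                     _mm_andnot_ps(mask, if_false));
}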
It may be useful in some instances, but it is not a general solution to the problem of writing vector code. In the majority of cases, the programmer will want the compiler to do it because it does a very good job.
All tools are complementary and can thus coexist. What I didn't like about your original post is that you said compilers were the one and only solution to parallelization. At least we can agree on something now ;).