
We're rolling a beta release of Boost.SIMD along with a beta release of NT2. This version comes very late because we were busy with other things, but hopefully we'll now do rolling releases frequently. Documentation is quite lacking, but the code is relatively stable and works.

* Release files can be found here: http://nt2.metascale.org/downloads/
* Additional material about the library can be found here:
  - Boost.SIMD PACT paper: http://www.lri.fr/~falcou/pub/pact-2012.pdf
  - Boost.SIMD MeetingC++ slides: http://meetingcpp.com/tl_files/mcpp/slides/12/simd.pdf
* We welcome comments, questions and random requests/rants/bug reports to polish the library before actual submission to the Boost review. Such comments can be sent here, on our nt2-dev Google group or in the NT2 issue tracker.

Regards

On 20-12-2012 07:50, joel falcou wrote:
We're rolling a beta release of Boost.SIMD along with a beta release of NT2.
I looked through the slides. It all looks very impressive! Well done. I would love to use SIMD extensions for some of our mathematical tools. We often face some obstacles, though.

A. Programs are deployed at our customers' customers, meaning we have to be very conservative about which SSE extension to use. Currently, we disable SSE. Question: does your library allow one to customize which extension to generate code for?

B. Floating point precision is crucial in some of our software. Maybe you could provide kahan_sum and high-precision normalization functions? Do you know if there are any problems with precision in SIMD code?

kind regards Thorsten

On 20/12/2012 10:27, Thorsten Ottosen wrote:
On 20-12-2012 07:50, joel falcou wrote:
We're rolling a beta release of Boost.SIMD along with a beta release of NT2.
I looked through the slides. It all looks very impressive! Well done.
I would love to use SIMD extensions for some of our mathematical tools. We often face some obstacles though.
A. Programs are deployed at our customers' customers, meaning we have to be very conservative about which SSE extension to use. Currently, we disable SSE. Question: does your library allow one to customize which extension to generate code for?
You can choose which extension to generate code for at compile time. The approach we recommend is to compile various versions of your function with different settings, and then choose the right one at runtime depending on the host capabilities. We provide functions to easily check whether an extension is supported.
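A minimal sketch of that compile-several-variants-and-dispatch pattern, for illustration only: add_sse2/add_avx are hypothetical functions, each assumed to live in a translation unit built with the matching flags, and GCC's __builtin_cpu_supports stands in for the library's own detection functions.

#include <cstddef>

void add_sse2(float const* a, float const* b, float* d, std::size_t n); // TU built with -msse2
void add_avx (float const* a, float const* b, float* d, std::size_t n); // TU built with -mavx

void add(float const* a, float const* b, float* d, std::size_t n)
{
    // pick the widest implementation the host actually supports
    if (__builtin_cpu_supports("avx"))
        add_avx(a, b, d, n);
    else
        add_sse2(a, b, d, n);
}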
B. Floating point precision is crucial in some of our software. Maybe you could provide kahan_sum and high-precision normalization functions? Do you know if there are any problems with precision in SIMD code?
We took great care to guarantee high precision in all our functions, along with nan/inf support and, when reasonably supported by the hardware, denormal support. Some of these have performance costs and can be disabled. The main difference when writing SSE code is that float operations are really done in single precision and double operations in double precision, which is not the case when using the x87 FPU, for example, which may use up to 80 bits of precision for intermediate computations. We do not have a Kahan sum algorithm, but it would be fairly easy to add. Maybe the one from Boost.Accumulators could even work out of the box.
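For reference, this is the textbook scalar compensated (Kahan) summation alluded to above, not Boost.SIMD code; a SIMD variant would keep one compensation term per vector lane.

#include <cstddef>

float kahan_sum(float const* p, std::size_t n)
{
    float sum = 0.0f;
    float c   = 0.0f;            // running compensation for lost low-order bits
    for (std::size_t i = 0; i != n; ++i)
    {
        float y = p[i] - c;
        float t = sum + y;
        c   = (t - sum) - y;     // recovers what was rounded away in t
        sum = t;
    }
    return sum;
}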

On 20-12-2012 12:52, Mathias Gaunard wrote:
You can choose which extension to generate code for at compile time.
The approach we recommend is to compile various versions of your function with different settings, and then choose the right one at runtime depending on the host capabilities. We provide functions to easily check whether an extension is supported.
That seems cool. Can illegal instructions be in a binary as long as they are not executed?
B. Floating point precision is crucial in some of our software. Maybe you could provide kahan_sum and high-precision normalization functions? Do you know if there are any problems with precision in SIMD code?
We took great care of guaranteeing high precision with all our functions, along with nan/inf support and, when reasonably supported by the hardware, denormal support. Some of these have performance costs, and can be disabled.
The main difference when writing SSE code is that float operations are really done in single-precision and double in double-precision, which is not the case when using the x87 FPU for example, which may use up to 80 bits of precision for intermediate computations.
Sure, the 80 bits are one reason that a naive sum is often as precise as a Kahan sum on x87. Let me ask another way: compilers often provide flags that affect how code is generated for floating-point. For example, we use fp:precise on VS. Does fp:precise affect SIMD code? -Thorsten

On Thu, Dec 20, 2012 at 5:49 PM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
On 20-12-2012 12:52, Mathias Gaunard wrote:
You can choose which extension to generate code for at compile time.
The approach we recommend is to compile various versions of your function with different settings, and then choose the right one at runtime depending on the host capabilities. We provide functions to easily check whether an extension is supported.
That seems cool. Can illegal instructions be in a binary as long as they are not executed?
Yes, no problem with that. The suggested approach has a nasty potential problem, though. You have to be extra careful that no common inline functions are compiled in different translation units with different compiler settings. Otherwise you may have an ODR violation, and it is unspecified which version of such functions ends up in the compiled binary.
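A hypothetical illustration of the hazard (all names made up):

// common.hpp, included from every TU
inline float sum3(float const* p) { return p[0] + p[1] + p[2]; }

// a.cpp is compiled with -msse4.2: its copy of sum3 uses SSE4.2 instructions.
// b.cpp is compiled without SSE:   its copy of sum3 uses plain scalar code.
// The linker keeps only one of the two emitted bodies (which one is
// unspecified), so the SSE4.2 body may be reached on a CPU that was never
// checked for SSE4.2 support. Keeping the differently-compiled code behind
// distinct, non-inline entry points (e.g. sum3_sse42 / sum3_generic) avoids this.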

On 20-12-2012 15:09, Andrey Semashev wrote:
On Thu, Dec 20, 2012 at 5:49 PM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
That seems cool. Can illegal instructions be in a binary as long as they are not executed?
Yes, no problem with that.
Ok.
The suggested approach has a nasty potential problem, though. You have to be extra careful that no common inline functions are compiled in different translation units with different compiler settings. Otherwise you may have an ODR violation, and it is unspecified which version of such functions ends up in the compiled binary.
Well, is that different than other ODR violations? We use bjam, so the settings are easy to make consistent across projects. -Thorsten

On Thu, Dec 20, 2012 at 8:17 PM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
On 20-12-2012 15:09, Andrey Semashev wrote:
The suggested approach has a nasty potential problem, though. You have to be extra careful that no common inline functions are compiled in different translation units with different compiler settings. Otherwise you may have an ODR violation, and it is unspecified which version of such functions ends up in the compiled binary.
Well, is that different than other ODR violations?
Not really, except that it's much easier to get that problem.
We use bjam, so the settings are easy to make consistent across projects.
Umm, I'm not sure you understood. The point was to compile different TUs with different compiler options (some with SSE enabled and others without).

On 20-12-2012 17:35, Andrey Semashev wrote:
On Thu, Dec 20, 2012 at 8:17 PM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
On 20-12-2012 15:09, Andrey Semashev wrote:
The suggested approach has a nasty potential problem, though. You have to be extra careful that no common inline functions are compiled in different translation units with different compiler settings. Otherwise you may have an ODR violation, and it is unspecified which version of such functions ends up in the compiled binary.
Well, is that different than other ODR violations?
Not really, except that it's much easier to get that problem.
We use bjam, so the settings are easy to make consistent across projects.
Umm, I'm not sure you understood. The point was to compile different TUs with different compiler options (some with SSE enabled and others without).
Hm. I think I get it now. Sounds nasty indeed. Forget about using any templated container in such TUs. -Thorsten

Nice work. Just a quick question: in case my arch does not support an instruction, do you have a default implementation? Like a fallback?

On Thu, Dec 20, 2012 at 7:39 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Thu, Dec 20, 2012 at 5:49 PM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
On 20-12-2012 12:52, Mathias Gaunard wrote:
You can choose which extension to generate code for at compile time.
The approach we recommend is to compile various versions of your function with different settings, and then choose the right one at runtime depending on the host capabilities. We provide functions to easily check whether an extension is supported.
That seems cool. Can illegal instructions be in a binary as long as they are not executed?
Yes, no problem with that.
The suggested approach has a nasty potential problem, though. You have to be extra careful that no common inline functions are compiled in different translation units with different compiler settings. Otherwise you may have an ODR violation, and it is unspecified which version of such functions ends up in the compiled binary.

On 20/12/12 17:22, Shakti Misra wrote:
Nice work. Just a quick question: in case my arch does not support an instruction, do you have a default implementation? Like a fallback?
Yes. There is always a variant of each function written in normal C++. Some functions are also "natively" supported only with SSE4.2, but there can be an SSE2 or SSSE3 version.

On 20/12/12 14:49, Thorsten Ottosen wrote:
Sure, the 80 bits are one reason that a naive sum is often as precise as a Kahan sum on x87.
Let me ask another way: compilers often provide flags that affect how code is generated for floating-point. For example, we use fp:precise on VS. Does fp:precise affect SIMD code?
fp:precise is the normal operating mode for floating-point, i.e. the compiler is not allowed to do optimizations that would break your code. This is the only mode in which we guarantee the numerical accuracy of our results (we usually guarantee 1 ulp).

fp:fast considers, among other things, that floating-point operations are associative, which enables more transformations on your code and possible optimizations or automatic parallelizations (parallelization may require changing the order of operations). Since floating-point operations are not really associative, you could end up with different results (e.g. a Kahan sum will be considered equivalent to a normal sum).

Boost.SIMD can also be used in this mode, but whether the compiler will be able to use it with vector intrinsics and not just scalars depends on how good it is.
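A concrete sketch of that Kahan example: under fp:fast / -ffast-math the compiler may reassociate the compensation algebra away.

float compensated_add(float sum, float x, float& c)
{
    float y = x - c;
    float t = sum + y;
    c = (t - sum) - y;   // algebraically zero if FP were associative, so
                         // fp:fast may legally discard the compensation,
    return t;            // leaving a plain, uncompensated sum
}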

joel falcou wrote:
We're rolling a beta release of Boost.SIMD along with a beta release of NT2.
[...] Documentation is quite lacking but code is relatively stable and works.
* Additional material about the library can be found here : - Boost.SIMD MeetingC++ slides : http://meetingcpp.com/tl_files/mcpp/slides/12/simd.pdf
* We welcome comments, questions and random requests/rants/bug reports to polish the library before actual submission to the Boost review.
I viewed the slides out of general interest and I found them very convincing. I would like to review your library when the documentation is ready. -Julian

hi,
* We welcome comments, questions and random requests/rants/bug reports to polish the library before actual submission to the Boost review. Such comments can be sent here, on our nt2-dev Google group or in the NT2 issue tracker.
in a way it is hard to comment on the library, as the documentation does not really exist. still, some questions/thoughts; maybe i've missed something in the slides/paper ... didn't have a look at the code, yet:

* does boost.simd support horizontal operations (fold/reduce/accumulate)?

* is it possible to define simd kernels for both packs and scalars, e.g. to make use of branches in the scalar implementation used in preroll/postroll of unrolled loops?

* does boost.simd have any facility to do loop unrolling? in my library, i'm unrolling loops by the size of a cache line and try to avoid read-after-write hazards by pre-loading data.

* how does boost.simd deal with stack alignment? on win32 i ended up compiling with -mstackrealign, though it is said not to be the best solution (didn't have a closer look)

cheers, tim

On 20/12/2012 10:59, Tim Blechmann wrote:
hi,
* We welcome comments, questions and random requests/rants/bug reports to polish the library before actual submission to the Boost review. Such comments can be sent here, on our nt2-dev Google group or in the NT2 issue tracker. in a way it is hard to comment on the library, as the documentation does not really exist. still, some questions/thoughts; maybe i've missed something in the slides/paper ... didn't have a look at the code, yet:
We're working hard to get a doc rolling ASAP.
* does boost.simd support horizontal operations (fold/reduce/accumulate)?
The reduction toolbox provides vector horizontal operations. You can then run std::accumulate with a proper functor and have the simd_iterator kick in and vectorize your large data reduction.
* is it possible to define simd kernels for both packs and scalars e.g. to make use of branches in the scalar implementation used in preroll/postroll of unrolled loops?
All operations on pack are also defined on scalars, so if you write a polymorphic template functor, it will work on both. There is a simple example of this in the paper.
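A minimal sketch of such a functor (the name is made up): the single template body is instantiated with pack<float> in the vector body of a loop and with plain float in its prologue/epilogue.

struct axpy
{
    template<class T>
    T operator()(T const& a, T const& x, T const& y) const
    {
        return a * x + y;    // same code for float and for pack<float>
    }
};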
* does boost.simd have any facility to do loop unrolling? in my library, i'm unrolling loops by the size of a cache line and try to avoid read-after-write hazards by pre-loading data.
We have a primitive one but it is not yet fully promoted as part of the API.
* how does boost.simd deal with stack alignment? on win32 i ended up compiling with -mstackrealign, though it is said not to be the best solution (didn't have a closer look)
Good question. IIRC we tried to satisfy the ABI requirements of passing SIMD types in structures over to functions properly by using references at the upper level and then working with the register type as soon as possible. This is still suboptimal in some cases. Maybe Mathias can give you more insight on this subject.

* does boost.simd support horizontal operations (fold/reduce/accumulate)? The reduction toolbox provides vector horizontal operations. You can then run std::accumulate with a proper functor and have the simd_iterator kick in and vectorize your large data reduction.
ok, but you might be able to use native simd instructions to do horizontal operations. for sse, i'm using this for providing the horizontal minimum, for example:

--
__m128 data;                              /* [0, 1, 2, 3] */
__m128 low = _mm_movehl_ps(data, data);   /* [2, 3, 2, 3] */
__m128 low_accum = _mm_min_ps(low, data); /* [0|2, 1|3, 2|2, 3|3] */
__m128 elem1 = _mm_shuffle_ps(low_accum, low_accum,
                              _MM_SHUFFLE(1,1,1,1)); /* [1|3, 1|3, 1|3, 1|3] */
__m128 accum = _mm_min_ss(low_accum, elem1);
return _mm_cvtss_f32(accum);
--

also, sse3 can be used to compute a horizontal sum:

--
__m128 accum1 = _mm_hadd_ps(data_, data_); /* [0+1, 2+3, 0+1, 2+3] */
__m128 elem1 = _mm_shuffle_ps(accum1, accum1,
                              _MM_SHUFFLE(1, 1, 1, 1)); /* [2+3, 2+3, 2+3, 2+3] */
__m128 result = _mm_add_ps(accum1, elem1);
return _mm_cvtss_f32(result);
--
* how does boost.simd deal with stack alignment? on win32 i ended up compiling with -mstackrealign, though it is said not to be the best solution (didn't have a closer look) Good question. IIRC we tried to satisfy the ABI requirements of passing SIMD types in structures over to functions properly by using references at the upper level and then working with the register type as soon as possible. This is still suboptimal in some cases. Maybe Mathias can give you more insight on this subject.
once the compiler decides to allocate the register type to the stack, one needs to care about stack alignment (compare [1]). i was hoping that the compiler would try to ensure the alignment automatically, but unfortunately it doesn't :/

cheers, tim

[1] http://stackoverflow.com/questions/2386408/qt-gcc-sse-and-stack-alignment

On 20/12/2012 11:33, Tim Blechmann wrote:
* does boost.simd support horizontal operations (fold/reduce/accumulate)? The reduction toolbox provides vector horizontal operations. You can then run std::accumulate with a proper functor and have the simd_iterator kick in and vectorize your large data reduction.
ok, but you might be able to use native simd instructions to do horizontal operations. for sse, i'm using this for providing the horizontal minimum for example:
We do have such horizontal operations: sum, prod, minimum, maximum, all, any, none, etc.
once the compiler decides to allocate the register type to the stack, one needs to care about stack alignment (compare [1]). i was hoping that the compiler will try to ensure the alignment automatically, but unfortunately it doesn't :/
Putting things on the stack with arbitrary alignment seems to work (at least according to our tests). Passing the thing in question to a function by value doesn't, however.
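The practical consequence for interfaces, as a hedged sketch: prefer passing packs by const reference, since a by-value pack parameter may end up at a stack address the ABI does not align sufficiently.

void process(pack<float> const& v);   // safe: no aligned by-value copy needed
// void process(pack<float> v);       // may break where the ABI under-aligns the stack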

On 20/12/2012 11:05, Joel Falcou wrote:
* is it possible to define simd kernels for both packs and scalars e.g. to make use of branches in the scalar implementation used in preroll/postroll of unrolled loops?
All operations on pack are also defined on scalars, so if you write a polymorphic template functor, it will work on both. There is a simple example of this in the paper.
While this works, if you do this you won't be able to use lazy branch evaluation in the scalar version.

If you write

c = if_else(a != 0, b/a, b);

this is equivalent in scalar to

c = (a != 0) ? b/a : b;

except that in the case above you will always compute the division.

* is it possible to define simd kernels for both packs and scalars e.g. to make use of branches in the scalar implementation used in preroll/postroll of unrolled loops?
All operations on pack are also defined on scalars, so if you write a polymorphic template functor, it will work on both. There is a simple example of this in the paper.
While this works, if you do this you won't be able to use lazy branch evaluation in the scalar version.
If you write
c = if_else(a != 0, b/a, b);
this is equivalent in scalar to
c = (a != 0) ? b/a : b;
except that in the case above you will always compute the division.
would it be possible to simply provide an overloaded operator() for scalar types? cheers, tim

On 20/12/12 17:35, Tim Blechmann wrote:
* is it possible to define simd kernels for both packs and scalars e.g. to make use of branches in the scalar implementation used in preroll/postroll of unrolled loops?
All operations on pack are also defined on scalars, so if you write a polymorphic template functor, it will work on both. There is a simple example of this in the paper.
While this works, if you do this you won't be able to use lazy branch evaluation in the scalar version.
If you write
c = if_else(a != 0, b/a, b);
this is equivalent in scalar to
c = (a != 0) ? b/a : b;
except that in the case above you will always compute the division.
would it be possible to simply provide an overloaded operator() for scalar types?
Sure, you can have separate scalar and simd versions, with the caveat that quite often you'll write nearly the same code twice.
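A sketch of that two-overload shape, reusing the if_else example from earlier in the thread (the functor name is made up); the near-duplication Mathias mentions is visible here:

struct safe_div
{
    template<class T>
    pack<T> operator()(pack<T> const& a, pack<T> const& b) const
    {
        return if_else(a != 0, b / a, b);   // SIMD: both branches computed
    }

    template<class T>
    T operator()(T a, T b) const
    {
        return a != 0 ? b / a : b;          // scalar: division only when taken
    }
};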

"Mathias Gaunard" wrote in message news:50D2F8D1.2030900@ens-lyon.org...
While this works, if you do this you won't be able to use lazy branch evaluation in the scalar version.
If you write
c = if_else(a != 0, b/a, b);
this is equivalent in scalar to
c = (a != 0) ? b/a : b;
except that in the case above you will always compute the division.
As mentioned here https://github.com/MetaScale/nt2/issues/379, it should be (re)considered whether this scalar-if special handling actually brings any benefit (or it might actually even be a pessimization in the majority of cases)... -- "What Huxley teaches is that in the age of advanced technology, spiritual devastation is more likely to come from an enemy with a smiling face than from one whose countenance exudes suspicion and hate." Neil Postman

Great work, SIMD is an all-time favorite of mine. Thanks for extending SIMD support to the Boost library. ~ BR On Thu, Dec 20, 2012 at 3:29 PM, Tim Blechmann <tim@klingt.org> wrote:
hi,
* We welcome comments, questions and random requests/rants/bug reports to polish the library before actual submission to the Boost review. Such comments can be sent here, on our nt2-dev Google group or in the NT2 issue tracker.
in a way it is hard to comment on the library, as the documentation does not really exist. still, some questions/thoughts; maybe i've missed something in the slides/paper ... didn't have a look at the code, yet:
* does boost.simd support horizontal operations (fold/reduce/accumulate)?
* is it possible to define simd kernels for both packs and scalars e.g. to make use of branches in the scalar implementation used in preroll/postroll of unrolled loops?
* does boost.simd have any facility to do loop unrolling? in my library, i'm unrolling loops by the size of a cache line and try to avoid read-after-write hazards by pre-loading data.
* how does boost.simd deal with stack alignment? on win32 i ended up compiling with -mstackrealign, though it is said not to be the best solution (didn't have a closer look)
cheers, tim

We're rolling a beta release of Boost.SIMD along with a beta release of NT2.
... What is the recommended Boost.SIMD way to write a function like void add_n( float const * s, float const * s2, float * d, size_t n ); // d[i] = s[i] + s2[i] where none of s, s2, d are guaranteed to be aligned?

Le 20/12/2012 18:34, Peter Dimov a écrit :
What is the recommended Boost.SIMD way to write a function like
void add_n( float const * s, float const * s2, float * d, size_t n ); // d[i] = s[i] + s2[i]
where none of s, s2, d are guaranteed to be aligned?
You should align them ;)

More seriously, you can run a loop using pack and unaligned_load/store:

void add_n( float const * s, float const * s2, float * d, size_t n )
{
  size_t c  = pack<float>::static_size;
  size_t vn = n / c * c;
  size_t sn = n % c;

  for(size_t i=0; i<vn; i+=c, d+=c, s+=c, s2+=c)
    store( unaligned_load< pack<float> >(s) + unaligned_load< pack<float> >(s2), d );

  for(size_t i=0; i<sn; i++, d++, s++, s2++)
    *d = *s + *s2;
}

This also takes into account the fact that n may not be a multiple of the pack cardinal. If the pointers are aligned you can actually just use simd::input_iterator and feed this to std::transform. Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.

on Thu Dec 20 2012, Joel Falcou <joel.falcou-AT-gmail.com> wrote:
Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
It's probably a bit late to bring this up, but did you consider a design that uses something like Matt Austern's segmented iterator concept (http://lafstern.org/matt/segmented.pdf) and imposes a hierarchical view over regular memory that segments on aligned memory regions? -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

Le 20/12/2012 19:12, Dave Abrahams a écrit :
on Thu Dec 20 2012, Joel Falcou <joel.falcou-AT-gmail.com> wrote:
Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
It's probably a bit late to bring this up, but did you consider a design that uses something like Matt Austern's segmented iterator concept (http://lafstern.org/matt/segmented.pdf) and imposes a hierarchical view over regular memory that segments on aligned memory regions?
We did, but we never had time to implement something like that. The best we had was a simd::transform that handled the prologue/epilogue. But it's maybe the best stuff to have, yes.

On 20/12/12 19:12, Dave Abrahams wrote:
on Thu Dec 20 2012, Joel Falcou <joel.falcou-AT-gmail.com> wrote:
Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
It's probably a bit late to bring this up, but did you consider a design that uses something like Matt Austern's segmented iterator concept (http://lafstern.org/matt/segmented.pdf) and imposes a hierarchical view over regular memory that segments on aligned memory regions?
The code needs to be the same for all segments in that model, unless I've missed something.

on Thu Dec 20 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 20/12/12 19:12, Dave Abrahams wrote:
on Thu Dec 20 2012, Joel Falcou <joel.falcou-AT-gmail.com> wrote:
Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
It's probably a bit late to bring this up, but did you consider a design that uses something like Matt Austern's segmented iterator concept (http://lafstern.org/matt/segmented.pdf) and imposes a hierarchical view over regular memory that segments on aligned memory regions?
The code needs to be the same for all segments in that model, unless I've missed something.
I don't understand exactly what you mean, nor what problem you see with that model. -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

On 20/12/12 19:42, Dave Abrahams wrote:
on Thu Dec 20 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 20/12/12 19:12, Dave Abrahams wrote:
on Thu Dec 20 2012, Joel Falcou <joel.falcou-AT-gmail.com> wrote:
Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
It's probably a bit late to bring this up, but did you consider a design that uses something like Matt Austern's segmented iterator concept (http://lafstern.org/matt/segmented.pdf) and imposes a hierarchical view over regular memory that segments on aligned memory regions?
The code needs to be the same for all segments in that model, unless I've missed something.
I don't understand exactly what you mean, nor what problem you see with that model.
This model is good for B-trees and other similar things. I don't see how this can apply here at all. We need to iterate, load and store our data differently depending on the segment. We even have different types to manipulate on each of them. In the segmented iterator model, all sibling iterators must be the same type, and therefore have the same code. The only reasonable thing to do would be to return a fusion sequence of ranges, and have algorithms deal with that. Even this still has a bit of overhead over a hand-written loop structure.

on Thu Dec 20 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 20/12/12 19:42, Dave Abrahams wrote:
on Thu Dec 20 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 20/12/12 19:12, Dave Abrahams wrote:
on Thu Dec 20 2012, Joel Falcou <joel.falcou-AT-gmail.com> wrote:
Maybe we should have a simd::unaligned_input_iterator and/or make simd::transform accept unaligned data. Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
It's probably a bit late to bring this up, but did you consider a design that uses something like Matt Austern's segmented iterator concept (http://lafstern.org/matt/segmented.pdf) and imposes a hierarchical view over regular memory that segments on aligned memory regions?
The code needs to be the same for all segments in that model, unless I've missed something.
I don't understand exactly what you mean, nor what problem you see with that model.
This model is good for B-trees and other similar things.
...like deques, circular buffers, blocked arrays, etc.
I don't see how this can apply here at all.
We need to iterate, load and store our data differently depending on the segment.
Yes, incomplete end segments are dealt with differently. Did you have something else in mind?
We even have different types to manipulate on each of them. In the segmented iterator model, all sibling iterators must be the same type, and therefore have the same code.
Yes; there's a runtime test.
The only reasonable thing to do would be to return a fusion sequence of ranges, and have algorithms deal with that. Even this still has a bit of overhead over a hand-written loop structure.
I'm afraid you have not been specific enough about the problem you claim to see for me to be able to evaluate whether it exists or not. -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

On 20/12/2012 20:34, Dave Abrahams wrote:
Yes; there's a runtime test.
Let's put a runtime test at every step of our critical loop, sounds like a good idea. I actually somewhat looked into this, and in this case segmented iterators don't allow any optimization compared to normal iterators. Also, since this clearly cannot work with heterogeneous scalar/simd sequences, the iterator would need to fill the remaining data with dummy values to fill vectors on the boundaries, dummy value which must change according to what the user needs. Polymorphic algorithms ended up being a significantly more interesting approach.
I'm afraid you have not been specific enough about the problem you claim to see for me to be able to evaluate whether it exists or not.
I'm afraid you're the one trying to make everything fit into this model you've somehow taken a liking to. If you'd like to claim that it can apply, then feel free to show how.

on Fri Dec 21 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 20/12/2012 20:34, Dave Abrahams wrote:
Yes; there's a runtime test.
Let's put a runtime test at every step of our critical loop, sounds like a good idea.
Uh, no. You do a runtime test at the beginning of the algorithm. But if you read Matt's paper, I presume you know that already. So either you're just taking the opportunity to be snarky, or you don't understand what I'm suggesting, or you're holding onto some important insight that I'd need in order to understand something you already do.
I actually somewhat looked into this, and in this case segmented iterators don't allow any optimization compared to normal iterators.
Don't worry, I hear you loud and clear saying that it doesn't work. You haven't, though, done anything to help me understand why you're saying that.
Also, since this clearly cannot work with heterogeneous scalar/simd sequences, the iterator would need to fill the remaining data with dummy values to fill vectors on the boundaries, dummy value which must change according to what the user needs.
I don't know what you mean.
Polymorphic algorithms ended up being a significantly more interesting approach.
Nor here.
I'm afraid you have not been specific enough about the problem you claim to see for me to be able to evaluate whether it exists or not.
I'm afraid you're the one trying to make everything fit into this model you've somehow taken a liking to.
Wow, strong words. I'm not that invested in this debate. It just seems like an obvious fit to me, but I'm hardly a SIMD expert.
If you'd like to claim that it can apply, then feel free to show how.
I don't have time to write the code out ATM. It seems to me that once you impose a hierarchical "SIMD" view over the sequence, you can specialize the algorithm for the aligned part, to use SIMD instructions. Maybe I'm overlooking something, but so far you haven't been very helpful in the understanding-what-I'm-missing department. -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

On 21/12/12 19:48, Dave Abrahams wrote:
Don't worry, I hear you loud and clear saying that it doesn't work. You haven't, though, done anything to help me understand why you're saying that.
I have done so repeatedly (this is not the first time we're discussing this), yet you don't seem to understand what we're actually doing and what is needed. There are both semantical and technical problems with trying to use segmented iterators, both of which I already explained to you. Let me repeat everything I've already told you in one single message. This will be my last attempt to explain it to you. It is already quite ironic that I spent more time explaining to you why segmented iterators do not bring anything to help with this problem than the time it took me to figure it out.

SIMD instructions perform the same operation on a vector of N values. For simplicity's sake, we'll ignore alignment concerns in what follows, since it's only something you need to take care of if you want to do additional optimizations.

If you want to perform a function like plus on n elements, and n is not necessarily a multiple of N, you can do a certain number of operations with vectors of N elements and then you need to take care of the trailing data that doesn't fully fill a vector differently. If you want to add two sequences of 10 floats each, storing the result in a third sequence, and you have 4xfloat vectors, you do 4 4xfloat loads, 2 4xfloat additions, 2 4xfloat stores and then 4 normal 1xfloat loads, 2 1xfloat additions and 2 1xfloat stores (which are all different functions than their 4xfloat counterparts).

Question: would iterating the sequence with a segmented iterator allow us to do this?

Semantical problem: an iterator, be it segmented or otherwise, must provide the same element type for all its elements (i.e. sequences are homogeneous). That means we cannot have a 4xfloat body and a trailing 1xfloat epilogue. We need to have 4xfloat everywhere, which, while possible in some cases, would be limiting in others, needing a case-by-case workaround depending on the operation being done (like conditional stores or filling with neutral data).

Technical problem: for each segment, the type of the iterator needs to be the same. The code for incrementing or dereferencing the iterator is therefore the same. This gives us two solutions:
- Do a branch in the dereferencing function of the iterator to load normal data and trailing data differently (at every iteration, not ok -- this approach also works with normal iterators)
- Pre-emptively build the last vector of the data and copy it into the iterator (which implies unnecessary loads, stores and cache misses and makes the iterator expensive to copy), and consider that as an additional segment of contiguous memory.

Answer: Not in a very satisfying manner. The added generality of those concepts does not justify the runtime overhead and the fact that each user would still need to deal with how to reduce trailing data to a vector.
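A raw-intrinsics rendering of the 10-float example above, as a sketch (alignment ignored, as in the text), making the operation counts visible:

#include <xmmintrin.h>
#include <cstddef>

void add10(float const* a, float const* b, float* d)
{
    std::size_t i = 0;
    for (; i + 4 <= 10; i += 4)              // 2 iterations: 4 4xfloat loads,
    {                                        // 2 additions, 2 stores in total
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(d + i, _mm_add_ps(va, vb));
    }
    for (; i != 10; ++i)                     // trailing 2 elements: 4 1xfloat
        d[i] = a[i] + b[i];                  // loads, 2 additions, 2 stores
}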
I don't know what you mean.
Since you don't know, why do you insist on saying that it is "obvious"?
Polymorphic algorithms ended up being a significantly more interesting approach.
Nor here.
In this instance, polymorphic algorithms refer to algorithms that call a polymorphic function object with different argument types depending on which part of the sequence is being processed.
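A hedged sketch of the caller side of such an algorithm (load_pack and the pack width are hypothetical): the same function object f is invoked with pack arguments in the body and with scalar arguments for the tail.

template<class T, class F>
void polymorphic_for_each(T const* p, std::size_t n, F f)
{
    std::size_t const N = 4;        // assumed vector width for T
    std::size_t i = 0;
    for (; i + N <= n; i += N)
        f(load_pack(p + i));        // f(pack<T>) -- vector path
    for (; i != n; ++i)
        f(p[i]);                    // f(T) -- scalar path
}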
I don't have time to write the code out ATM. It seems to me that once you impose a hierarchical "SIMD" view over the sequence, you can specialize the algorithm for the aligned part, to use SIMD instructions. Maybe I'm overlooking something, but so far you haven't been very helpful in the understanding-what-I'm-missing department.
SIMD algorithms and segmented iterator algorithms were done for different purposes. Lifting them to the same general algorithm is just a bad idea.

On 23/12/12 13:41, Mathias Gaunard wrote:
Technical problem: for each segment, the type of the iterator needs to be the same. The code for incrementing or dereferencing the iterator is therefore the same. This gives us two solutions: - Do a branch in the dereferencing function of the iterator to load normal data and trailing data differently (at every iteration, not ok -- this approach also works with normal iterators) - Pre-emptively build the last vector of the data and copy it into the iterator (which implies unnecessary loads, stores and cache misses and makes the iterator expensive to copy), and consider that as an additional segment of contiguous memory.
The above only stands for input iterators; I forgot to mention output ones. The first solution still works for output iterators, but the second one clearly does not. The only way we would have to make efficient segmented output iterators for this case would be to allow iterators of different segments to be of different types, which would require making the number of segments per level a compile-time constant. This would make the segmented iterator concepts significantly less useful for the use cases they were designed for.

on Sun Dec 23 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 23/12/12 13:41, Mathias Gaunard wrote:
Technical problem: for each segment, the type of the iterator needs to be the same. The code for incrementing or dereferencing the iterator is therefore the same. This gives us two solutions: - Do a branch in the dereferencing function of the iterator to load normal data and trailing data differently (at every iteration, not ok -- this approach also works with normal iterators) - Pre-emptively build the last vector of the data and copy it into the iterator (which implies unnecessary loads, stores and cache misses and makes the iterator expensive to copy), and consider that as an additional segment of contiguous memory.
The above only stands for input iterators; I forgot to mention output ones. The first solution still works for output iterators, but the second one clearly does not.
The only way we would have to make efficient segmented output iterators for this case would be to allow iterators of different segments to be of different types, which would require making the number of segments per level a compile-time constant.
Certainly not. You can do one runtime dispatch per top-level algorithm call, depending on segment alignment, just as the Austern paper does... and just as your approach would have to do AFAICT. -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

on Mon Dec 24 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 24/12/12 03:55, Dave Abrahams wrote:
depending on segment alignment
I give up. If you want to discuss this with me, please at least take the time to read what I said.
I did read it. Quite carefully, I thought. I probably used the wrong words. What I meant by "segment alignment" in this case was "whether or not the start and end fall into the same segment," which according to what you wrote, could amount to "whether or not there is more than a SIMD vector's worth of data."
It is probably simpler to drop the subject.
Probably, but neither of us will learn anything from this discussion if we do. I'm still eager to understand if, and how, my theory is full of hooey. Nobody else is stepping forward to explain it, so I guess you're the only guy who can.
Merry Christmas.
hohoho-ly y'rs, -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

on Sun Dec 23 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 21/12/12 19:48, Dave Abrahams wrote:
Don't worry, I hear you loud and clear saying that it doesn't work. You haven't, though, done anything to help me understand why you're saying that.
I have done so repeatedly (this is not the first time we're discussing this), yet you don't seem to understand what we're actually doing and what is needed.
Perhaps not.
There are both semantical and technical problems with trying to use segmented iterators, both of which I already explained to you.
Let me repeat everything I've already told you in one single message. This will be my last attempt to explain it to you. It is already quite ironic that I spent more time explaining to you why segmented iterators do not bring anything to help with this problem than the time it took me to figure it out.
You seem to think that's because I'm being purposefully obtuse. I'm really just trying to learn something here and explore this design space. No matter how frustrating it may be to explain these things to me, I don't think I deserve this level of hostility.
SIMD instructions perform the same operation on a vector of N values.
Yes.
For simplicity's sake, we'll ignore alignment concerns in what follows, since it's only something you need to take care of if you want to do additional optimizations.
Oh, I was under the impression that in general you could only use SIMD instructions on aligned sequences. That actually seems to make segmented iterators more practical, because you don't (necessarily) need to solve the "misaligned segments" problem.
If you want to perform a function like plus on n elements, and n is not necessarily a multiple of N, you can do a certain number of operations with vectors of N elements and then you need to take care of the trailing data that doesn't fully fill a vector differently. If you want to add two sequences of 10 floats each, storing the result in a third sequence, and you have 4xfloat vectors, you do 4 4xfloat loads, 2 4xfloat additions, 2 4xfloat stores and then 4 normal 1xfloat loads, 2 1xfloat additions and 2 1xfloat stores (which are all different functions than their 4xfloat counterparts).
That is already consistent with what I already understand about SIMD.
Question: would iterating the sequence with a segmented iterator allow us to do this?
It would seem so.
Semantical problem: an iterator, be it segmented or otherwise, must provide the same element type for all its elements (i.e. sequences are homogeneous).
Well, that is one view. Segmented iterators, conceptually, provide a "flat" view of the elements that works with standard algorithms, and a "segmented" view that is useful for optimization.

I think I now see the problem to which you're referring. The hierarchical algorithms in exactly the form presented by Austern would certainly not work, because ultimately they present no type information to nested algorithm calls that would allow compile-time dispatching to a different implementation (one that used SIMD instructions). However, I think something *very* similar would work perfectly well. If you look at the Austern paper's "fill" example (and imagine there's a SIMD "fill" instruction), the key lines are marked below with "****".

#+begin_src c++
template <class SegIter, class T>
void fill(SegIter first, SegIter last, const T& x, true_type)
{
    typedef segmented_iterator_traits<SegIter> traits;

    typename traits::segment_iterator sfirst = traits::segment(first);
    typename traits::segment_iterator slast  = traits::segment(last);

    if (sfirst == slast)
        fill(traits::local(first), traits::local(last), x);
    else
    {
        fill(traits::local(first), traits::end(sfirst), x);
        for (++sfirst ; sfirst != slast ; ++sfirst)              // ****
            fill(traits::begin(sfirst), traits::end(sfirst), x); // ****
        fill(traits::begin(sfirst), traits::local(last), x);
    }
}

template <class ForwardIter, class T>
void fill(ForwardIter first, ForwardIter last, const T& x, false_type)
{
    for ( ; first != last; ++first)
        *first = x;
}

template <class Iter, class T>
inline void fill(Iter first, Iter last, const T& x)
{
    typedef segmented_iterator_traits<Iter> Traits;
    fill(first, last, x, typename Traits::is_segmented_iterator());
}
#+end_src

A simple change to that loop could allow a SIMD-specialized version of fill to kick in. For example, s/begin/seg_begin/ and s/end/seg_end/. OK, the names are not great, but the idea is that seg_begin would return a begin iterator that carries in its type the fact that it is on a segment boundary, and also, when the segments are contiguous and of SIMD vector size, that additional information. Then one could add an overload of fill that takes advantage of this information to use SIMD fill instructions rather than regular assignments.
That means we cannot have a 4xfloat body and a trailing 1xfloat epilogue. We need to have 4xfloat everywhere, which, while possible in some cases, would be limiting in others, needing a case-by-case workaround depending on the operation being done (like conditional stores or filling with neutral data).
I'm not sure what kinds of workarounds you have in mind. An example might help.
Technical problem: for each segment, the type of the iterator needs to be the same.
Only if you take the Austern paper *very* literally.
The code for incrementing or dereferencing the iterator is therefore the same. This gives us two solutions: - Do a branch in the dereferencing function of the iterator to load normal data and trailing data differently (at every iteration, not ok -- this approach also works with normal iterators) - Pre-emptively build the last vector of the data and copy it into the iterator (which implies unnecessary loads, stores and cache misses and makes the iterator expensive to copy), and consider that as an additional segment of contiguous memory.
Eeew. I would never suggest that.
Answer: Not in a very satisfying manner. The added generality of those concepts does not justify the runtime overhead and the fact that each user would still need to deal with how to reduce trailing data to a vector.
I don't know what you mean.
Since you don't know, why do you insist on saying that it is "obvious"?
I feel my words are being twisted here. The only time I used that word in this thread was in "It just seems like an obvious fit to me, but I'm hardly a SIMD expert." This was nothing more than an explanation of why I was pursuing this line of discussion. I certainly was not saying that your meaning was "obvious," as you seem to be implying.
Polymorphic algorithms ended up being a significantly more interesting approach.
Nor here.
In this instance, polymorphic algorithms refer to algorithms that call a polymorphic function object with different argument types depending on which part of the sequence is being processed.
I think that is exactly what I am proposing.
I don't have time to write the code out ATM. It seems to me that once you impose a hierarchical "SIMD" view over the sequence, you can specialize the algorithm for the aligned part, to use SIMD instructions. Maybe I'm overlooking something, but so far you haven't been very helpful in the understanding-what-I'm-missing department.
SIMD algorithms and segmented iterator algorithms were done for different purposes. Lifting them to the same general algorithm is just a bad idea.
Maybe. I'm still not convinced. It seems like a way to write generic algorithms that allows significant incremental optimization and code reuse. -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

On 24/12/12 03:54, Dave Abrahams wrote:
Well, that is one view.
It's not a matter of view. The definition of the concepts is not really subject to interpretation.
A simple change to that loop could allow a SIMD-specialized version of fill to kick in.
If we're going to specialize for SIMD anyway, then there is no point to the exercise at all. The solution you're proposing is more about extending our SIMD variants of the algorithms to work with segmented sequences, not re-using segmented iterator algorithms as defined in the paper to do SIMD. Those are entirely two different things. I can tell you the second is not really conceptually feasible, and in the cases where it is, it is not a good idea. The first one is trivial and just a matter of adding the code for it, but there isn't really any incentive to do so since there are no segmented iterators in existence anywhere.
That means we cannot have a 4xfloat body and a trailing 1xfloat epilogue. We need to have 4xfloat everywhere, which, while possible in some cases, would be limiting in others, needing a case-by-case workaround depending on the operation being done (like conditional stores or filling with neutral data).
I'm not sure what kinds of workarounds you have in mind. An example might help.
The examples of workarounds are in parentheses.

on Tue Dec 25 2012, Mathias Gaunard <mathias.gaunard-AT-ens-lyon.org> wrote:
On 24/12/12 03:54, Dave Abrahams wrote:
Well, that is one view.
It's not a matter of view. The definition of the concepts is not really subject to interpretation.
I am thinking about the abstract concepts that underly the generic programming "concepts" in that paper.
A simple change to that loop could allow a SIMD-specialized version of fill to kick in.
If we're going to specialize for SIMD anyway, then there is no point to the exercise at all.
The problem of using SIMD instructions for these operations *clearly* requires specialization no matter which approach you choose. I don't think that fact makes it a waste of time to explore different approaches to using such a specialization.
The solution you're proposing is more about extending our SIMD variants of the algorithms to work with segmented sequences, not re-using segmented iterator algorithms as defined in the paper to do SIMD.
That's a reasonable point of view.
Those are entirely two different things.
OK
I can tell you the second is not really conceptually feasible, and in the cases where it is, it is not a good idea. The first one is trivial and just a matter of adding the code for it, but there isn't really any incentive to do so since there are no segmented iterators in existence anywhere.
IMO given the importance of cache effects and hierarchical data structures, something like segmented iterators *should* be in existence. -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

On Sun, Dec 23, 2012 at 11:41 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
SIMD algorithms and segmented iterator algorithms were done for different purposes. Lifting them to the same general algorithm is just a bad idea.
Despite this conclusion I am still interested in segmented data structures, and I think that the question of how to use SIMD algorithms with or in segmented data structures will arise again and again. The reason is that there is a class of advanced data structures that offers much more efficient algorithms compared to basic data structures. In particular, an augmented array-based B+ tree supports summation and accumulation algorithms with logarithmic running time. The mean value can also be calculated in logarithmic time. Adding SIMD algorithms to such data structures should in theory deliver even better performance.

You've explained that segmented iterators do not work with SIMD algorithms. However, there should be some other options, since at the implementation level a segmented data structure uses arrays. This is why I am wondering if it is possible to:

1. provide variants of SIMD algorithms that can be applied by the implementation of a segmented data structure directly to arrays?
2. improve the generalization level of these algorithms through local iterators, such as a pointer-based iterator for an array?

Regards, Vadim Stadnik

On 24/12/12 05:07, Vadim Stadnik wrote:
You've explained that segmented iterators do not work with SIMD algorithms. However, there should be some other options, since at the implementation level a segmented data structure uses arrays. This is why I am wondering if it is possible to:
1. provide variants of SIMD algorithms that can be applied by the implementation of a segmented data structure directly to arrays? 2. improve the generalization level of these algorithms through local iterators, such as a pointer-based iterator for an array?
You could also simply iterate the segments and call the SIMD algorithm for each of them.
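As a sketch of that suggestion, assuming a hypothetical segments() interface exposing contiguous blocks and a flat simd_sum reduction over each:

template<class SegmentedSeq>
float total(SegmentedSeq const& seq)
{
    float sum = 0.0f;
    for (typename SegmentedSeq::segment_iterator s = seq.segments_begin();
         s != seq.segments_end(); ++s)
        sum += simd_sum(s->data(), s->size());   // flat SIMD loop per block
    return sum;
}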

Joel Falcou wrote:
Le 20/12/2012 18:34, Peter Dimov a écrit :
What is the recommended Boost.SIMD way to write a function like
void add_n( float const * s, float const * s2, float * d, size_t n ); // d[i] = s[i] + s2[i]
where none of s, s2, d are guaranteed to be aligned?
You should align them ;)
More seriously, you can run a loop using pack and unaligned_load/store:
void add_n( float const * s, float const * s2, float * d, size_t n )
{
  size_t c  = pack<float>::static_size;
  size_t vn = n / c * c;
  size_t sn = n % c;

  for(size_t i=0; i<vn; i+=c, d+=c, s+=c, s2+=c)
    store( unaligned_load< pack<float> >(s) + unaligned_load< pack<float> >(s2), d );

  for(size_t i=0; i<sn; i++, d++, s++, s2++)
    *d = *s + *s2;
}
... Note that on any pre-Nehalem CPU, the unaligned load will be horrendously slow.
Yes, and the right thing to do is to first check whether s and s2 are equally unaligned, and if so, have a prefix scalar loop that aligns them; if not, check whether s2 and d are equally unaligned, and align them; and finally, if neither of these is true, align s. (Although I'm not quite certain whether unaligned stores aren't costlier, in which case the order changes a bit.) Then proceed with the rest of your code above. This is tedious boilerplate, so I wondered whether you had already provided a solution. simd::transform seems the logical place to put it.
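A sketch of just the first case Peter describes, under the assumption of 16-byte SSE vectors (helper name made up; the vector body and the remaining cases are elided):

#include <cstddef>

inline std::size_t misalignment(void const* p, std::size_t a)
{
    return reinterpret_cast<std::size_t>(p) % a;
}

void add_n(float const* s, float const* s2, float* d, std::size_t n)
{
    std::size_t const A = 16;                       // SSE vector size in bytes
    if (misalignment(s, A) == misalignment(s2, A))  // inputs can be co-aligned
    {
        while (n != 0 && misalignment(s, A) != 0)   // scalar prefix loop
        {
            *d++ = *s++ + *s2++;
            --n;
        }
        // ... aligned loads from s/s2 in the vector body from here on ...
    }
    // ... the s2/d and lone-s cases, then the main loop as in the text ...
}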

Mathias Gaunard wrote:
On 20/12/12 19:49, Peter Dimov wrote:
This is tedious boilerplate so I wondered whether you had already provided a solution. simd::transform seems the logical place to put it.
simd::transform only has one input and one output. How would it work with multiple inputs?
I couldn't find the documentation of simd::transform in boost-simd-3.0b1.tgz, but assuming it's the same as std::transform, the same way as std::transform works with two inputs?

On 20/12/12 20:09, Peter Dimov wrote:
Mathias Gaunard wrote:
On 20/12/12 19:49, Peter Dimov wrote:
This is tedious boilerplate so I wondered whether you had already provided a solution. simd::transform seems the logical place to put it.
simd::transform only has one input and one output. How would it work with multiple inputs?
I couldn't find the documentation of simd::transform in boost-simd-3.0b1.tgz, but assuming it's the same as std::transform, the same way as std::transform works with two inputs?
Yes, that's supposed to be the same as std::transform. I had forgotten there was a variant of std::transform with two inputs too. But then there is not one for three inputs or more. The only way to use an arbitrary number of inputs is to use a zip_iterator.

Mathias Gaunard wrote:
I had forgotten there was a variant of std::transform with two inputs too.
But then there is not one for three inputs or more. The only way to use an arbitrary number of inputs is to use a zip_iterator
You can build three inputs over two (but not over one):

d = s + s2 + s3;

becomes

d = s + s2; d = d + s3;

It's not exactly the same, performance-wise, but the point is that two inputs are a necessary primitive, and three inputs are an optimization.

on Thu Dec 20 2012, "Peter Dimov" <lists-AT-pdimov.com> wrote:
You can build three inputs over two (but not over one):
d = s + s2 + s3;
becomes
d = s + s2; d = d + s3;
It's not exactly the same, performance-wise, but the point is that two inputs are a necessary primitive, and three inputs are an optimization.
Unless you can zip, in which case only one input is needed. Most languages only have the one-sequence version of "map" (a.k.a. transform). -- Dave Abrahams BoostPro Computing Software Development Training http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

"Peter Dimov" wrote in message news:EC9066EE5B5448A2B4BAF64570BE6BAD@pdimov5...
Yes, and the right thing to do is...
Actually there is a "significant"[1] portion of cases where alignment problems can be fixed more elegantly. If we realize that protected memory systems have page-size granularity, which is much larger than any conceivable SIMD vector size, it immediately follows that an "overread" (reading outside the specified range) of at most "SIMD-cardinal - 1" elements is always safe.

Furthermore, in the single-threaded case, or when you allocate your arrays with both their location and size aligned, overwrite is also always safe when you wrap the overwrite in a guard class that automatically aligns the array pointers/iterators and then simply restores any possibly overwritten elements in its destructor (NT2/Boost.SIMD devs might recall the discussion of EdgeRestoredOutputRange on the mailing list and/or IRC). This only requires that all the arrays be compatibly aligned (i.e. if one is off by 1 and the other by 2, this approach cannot be used), but it covers almost all of my/our own cases (usually it only fails to cover cases that arise when crossing 3rd-party library/legacy interfaces)...

@NT2 devs, I might have missed it but I didn't see you mention your shifted-iterator functionality in this context...

[1] I know, a weasel word... to make it less so, I'm referring to my experience with digital signal processing...

-- "What Huxley teaches is that in the age of advanced technology, spiritual devastation is more likely to come from an enemy with a smiling face than from one whose countenance exudes suspicion and hate." Neil Postman

On December 24, 2012 8:16:13 PM "Domagoj Saric" <dsaritz@gmail.com> wrote:
If we realize that protected-memory systems have page-size granularity, which is much larger than any conceivable SIMD vector size, it immediately follows that an "overread" (reading outside the specified range) of at most "SIMD cardinal - 1" elements is always safe. Furthermore, in the single-threaded case, or when you allocate your arrays with both their location and size aligned, overwrite is also always safe when you wrap the overwrite in a guard class that automatically aligns the array pointers/iterators and then simply restores any possibly overwritten elements in its destructor.
[snip] Call me old-fashioned, but this would be an abomination. Every memory-checking tool will righteously raise a red flag on such code. In some cases the overread garbage can also affect the processing result, so it cannot be used as a general solution. Even if you provide such behavior, please don't make it the default, and make it very explicit. I'm prepared to accept somewhat slower tail processing; it won't be a deal-breaking slowdown anyway. And if it is, I will probably overallocate memory, fill the tail with neutral data, and avoid the undefined behavior.
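One way to realize that last suggestion, as a sketch (make_padded is an invented helper; N is the SIMD cardinal, and a real version would also want an aligned allocator):

    #include <cstddef>
    #include <vector>

    // Rounds the size up to a whole number of vectors and pre-fills the
    // slack with a neutral element (0 for sums, 1 for products, ...), so
    // the vector loop can process full vectors without reading garbage.
    std::vector<float> make_padded(std::size_t n, std::size_t N, float neutral)
    {
        std::size_t const padded = (n + N - 1) / N * N; // round up to N
        std::vector<float> v(padded, neutral);
        return v; // the first n elements are the payload, the rest is slack
    }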

On 24/12/12 17:16, Domagoj Saric wrote:
"Peter Dimov" wrote in message news:EC9066EE5B5448A2B4BAF64570BE6BAD@pdimov5...
Yes, and the right thing to do is...
Actually, there is a "significant"[1] portion of cases where alignment problems can be fixed more elegantly. If we realize that protected-memory systems have page-size granularity, which is much larger than any conceivable SIMD vector size, it immediately follows that an "overread" (reading outside the specified range) of at most "SIMD cardinal - 1" elements is always safe.
It is safe indeed. But the value you read might not make sense for what you are doing. Consider the simple example of summing all values between two pointers: values beyond the last pointer should count as zero, not as whatever happens to lie there in memory. For writing, this is IMO way too hacky, since a wrong ordering of the code could easily end up with the wrong value.
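To make that concrete, a quick sketch in plain SSE (not Boost.SIMD code): the vector loop must stop at the last full vector, and the remainder gets a scalar epilogue so the slack never enters the sum.

    #include <cstddef>
    #include <xmmintrin.h>

    float sum(float const* p, std::size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        std::size_t i = 0;
        for(; i + 4 <= n; i += 4)            // full vectors only
            acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        float r = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for(; i < n; ++i)                    // scalar tail: never read past n
            r += p[i];
        return r;
    }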
@NT2 devs, I might have missed it but I didn't see you mention your shifted-iterator functionality in this context...
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory. To make the shifted load work with arbitrary alignment, you need to generate all variants and select the right one at runtime. In case I haven't been clear about that, the code for a unary transform could look like this (untested):

    template<class T, class F>
    T* transform(T const* begin, T const* end, T* out, F f)
    {
        typedef native<T, BOOST_SIMD_DEFAULT_EXTENSION> vT;
        static const size_t N = meta::cardinal_of<vT>::value;

        // scalar prologue until the output is aligned
        T* out_bnd = align_on(out, N);
        for(; begin != end && out != out_bnd; ++begin, ++out)
            *out = f(*begin);

        T const* end_bnd = begin + (end - begin) / N * N;
        size_t misalign = align_on(begin, N) - begin;
        if(begin != end_bnd)
            begin += misalign;

        // one aligned-load loop per possible input misalignment;
        // the switch runs once, outside the loops
        switch(misalign)
        {
    #define M0(z, n, t)                                       \
            case n:                                           \
                for(; begin != end_bnd; begin += N, out += N) \
                    store(f(load<vT, -n>(begin)), out);       \
                break;                                        \
            /**/
            BOOST_PP_REPEAT(BOOST_SIMD_BYTES, M0, ~)
    #undef M0
            default: BOOST_ASSERT_MSG(0, "unsupported alignment in transform");
        }

        // scalar epilogue
        for(; begin != end; ++begin, ++out)
            *out = f(*begin);
        return out;
    }

A binary transform would surely be quite a bit more complicated.

Mathias Gaunard wrote:
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.

On 25/12/2012 15:43, Peter Dimov wrote:
Mathias Gaunard wrote:
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.
Shifted load is a couple of aligned loads plus bit shuffling. This is a technique stemming from way back on AltiVec. Experiments done on 1D filtering using both show some benefit over unaligned loads on pre-Nehalem CPUs. It's usually better in this kind of kernel as we can reuse registers to save loads in the inner loops, thus in fact reducing the global number of loads for a given filter run. See Lacassagne et al. [1] for an actual description of the algorithm. Incidentally, such register-saving techniques are already built into the shifted_iterator implementation.
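On SSE, the technique looks roughly like this using SSSE3's palignr (a sketch of the idea only, not the actual shifted_iterator implementation; the byte offset must be a compile-time constant, which is why the runtime dispatch over all variants is needed):

    #include <tmmintrin.h> // SSSE3

    // Reads 16 bytes starting Offset bytes past a 16-byte-aligned address,
    // using two aligned loads plus a byte-wise shift across them.
    template<int Offset> // 0 < Offset < 16, known at compile time
    __m128i shifted_load(__m128i const* aligned)
    {
        __m128i lo = _mm_load_si128(aligned);      // bytes [0, 16)
        __m128i hi = _mm_load_si128(aligned + 1);  // bytes [16, 32)
        return _mm_alignr_epi8(hi, lo, Offset);    // bytes [Offset, Offset+16)
    }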

On Tue, Dec 25, 2012 at 7:10 PM, Joel Falcou <joel.falcou@gmail.com> wrote:
On 25/12/2012 15:43, Peter Dimov wrote:
Mathias Gaunard wrote:
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.
Shifted load is a couple of aligned loads plus bit shuffling. This is a technique stemming from way back on AltiVec. Experiments done on 1D filtering using both show some benefit over unaligned loads on pre-Nehalem CPUs.
AFAIK, even on post-Nehalem CPUs unaligned loads (and stores) are slower if the operation spans across the cache line boundary. I don't have the numbers though. Will shifted_iterator use palignr from SSSE3?

On 25/12/12 15:43, Peter Dimov wrote:
Mathias Gaunard wrote:
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.
The shifted load statically knows the alignment, unaligned loads do not. Note that the switch is outside of the loop, not for each load.

Mathias Gaunard wrote:
On 25/12/12 15:43, Peter Dimov wrote:
Mathias Gaunard wrote:
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.
The shifted load statically knows the alignment, unaligned loads do not. Note that the switch is outside of the loop, not for each load.
Ah, you're probably talking AltiVec, which probably doesn't have an unaligned load instruction. I was thinking SSE, which does.

On 25/12/12 16:28, Peter Dimov wrote:
Mathias Gaunard wrote:
On 25/12/12 15:43, Peter Dimov wrote:
Mathias Gaunard wrote:
The shifted iterator and the shifted load allow doing aligned loads if you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.
The shifted load statically knows the alignment, unaligned loads do not. Note that the switch is outside of the loop, not for each load.
Ah, you're probably talking AltiVec, which probably doesn't have an unaligned load instruction. I was thinking SSE, which does.
I'm not talking about any ISA in particular. Not all instructions have the same latency and bandwidth.

On 20/12/12 18:34, Peter Dimov wrote:
We're rolling a beta release of Boost.SIMD along with a beta release of NT2.
...
What is the recommended Boost.SIMD way to write a function like
void add_n( float const * s, float const * s2, float * d, size_t n ); // d[i] = s[i] + s2[i]
where none of s, s2, d are guaranteed to be aligned?
I don't think there is any easy way. Either you use unaligned loads/stores for all but one variable, or you generate all variants depending on how misaligned each pointer is relative to the others and select the optimal variant at runtime. I'd recommend going the unaligned_load + store route for simplicity.
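In plain SSE terms, that route is simply the following sketch (a Boost.SIMD version would use packs and its unaligned loads instead of raw intrinsics):

    #include <cstddef>
    #include <xmmintrin.h>

    // d[i] = s[i] + s2[i]; no alignment assumptions on s, s2 or d
    void add_n(float const* s, float const* s2, float* d, std::size_t n)
    {
        std::size_t i = 0;
        for(; i + 4 <= n; i += 4)
        {
            __m128 a = _mm_loadu_ps(s + i);
            __m128 b = _mm_loadu_ps(s2 + i);
            _mm_storeu_ps(d + i, _mm_add_ps(a, b));
        }
        for(; i < n; ++i) // scalar tail
            d[i] = s[i] + s2[i];
    }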

"joel falcou" wrote in message news:50D2B523.8080502@gmail.com...
* We welcome comments, questions and random requests/rants/bug reports as to polish the library before actual submission to the boost review
OK, I'll delay the big "Oh yes!!!1!!1! with a few buts" until then ;-D
participants (13):
- Andrey Semashev
- Dave Abrahams
- Domagoj Saric
- joel falcou
- Joel Falcou
- Julian Gonggrijp
- Mathias Gaunard
- Mukkaysh Srivastav
- Peter Dimov
- Shakti Misra
- Thorsten Ottosen
- Tim Blechmann
- Vadim Stadnik