Accelerating algorithms with SIMD - Segmented iterators and alternatives

Hi, I have, along with Joel Falcou, been looking a bit at how we can integrate SIMD into iterator-based algorithms to accelerate them. Unfortunately, there are a couple of issues.

The basic idea is to provide a range adaptor that adapts a range of PODs, where N consecutive PODs are stored contiguously, into a range of vectors (or "packs") of those PODs of size N. Therefore, instead of:

    std::vector<float> tab(20);
    float sum = boost::accumulate(tab, 0);

we can write something like:

    std::vector< float, simd::allocator<float> > tab(20);
    float sum = simd::sum( boost::accumulate( simd::packed_range<4>(tab), simd::zero_ ) );

which does the same but using SSE instructions, adding four floats at a time and typically providing a 4x speed-up. The same can be done for any binary operation that is commutative and associative (which unfortunately many operations are not, but it can still be useful).

Here, however, it works because 1) the memory is well aligned thanks to the use of our custom allocator, and 2) the size, 20, is a multiple of the pack size, 4. We'd be interested in providing a mechanism that could also support ranges of data aligned at an arbitrary address and of an arbitrary size. The SSE would only be used in the middle, the borders being handled element by element. A first solution consists in padding the beginning and the end with zeroes. Say you have 1 2 3 4 5 6 7, with only 3 being aligned well enough for SSE usage; we'd turn that into | 0 0 1 2 | 3 4 5 6 | 7 0 0 0 |. However, it is very important that the code dealing with the borders does not affect the efficiency of the code in the middle. This is not possible with the normal iterator/range design. Our data is clearly cut into three segments: the beginning that is not aligned, the data that is correctly aligned and sized, and the data at the end that is not big enough to fill a whole pack. Therefore, segmented iterators could provide us with a way to do this efficiently; but so could other solutions.

Wouldn't it be better to have a mechanism that lets you deal with the borders element by element while dealing with the middle bit vector by vector? With C++ iterators, you pull data; but if we could use an iteration paradigm where you push data instead, such as calling a function object for each element, we could call an overloaded function able to accept both a single element and a pack. Such a solution would also provide the speed benefits of segmented iterators, and all high-level standard algorithms can be expressed with this iteration paradigm just as well as with iterators. I have already suggested such an alternative to segmented iterators (where I had a few misconceptions about segmented iterators: I thought they broke compatibility, but they don't; they still maintain the legacy non-segmented interface) there: <http://thread.gmane.org/gmane.comp.lib.boost.user/61636/focus=61647>, with an example of how you could use it to iterate over a binary tree there: <http://www.boostpro.com/vault/index.php?action=downloadfile&filename=generator_iterator.zip&directory=>. Converting from those things to a pair of iterators is possible through the use of a coroutine, as demonstrated in the example.

I would like feedback on the general idea of providing tools to use SIMD with iterators and related algorithms, and on whether there is sufficient interest in the alternative to iterators for me to develop it.
The intent would be to submit both the alternative to iterators and the SIMD helpers to Boost.
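For illustration only, here is a minimal sketch of that head/middle/tail structure with the range machinery stripped away (plain SSE intrinsics from <xmmintrin.h>; the alignment arithmetic is simplified and the pack/range adaptors are omitted):

    #include <xmmintrin.h>
    #include <stdint.h>
    #include <cstddef>

    float sum(const float* data, std::size_t n)
    {
        const float* p = data;
        const float* end = data + n;
        float s = 0.f;

        // Scalar head: consume elements until p is 16-byte aligned.
        while (p != end && (uintptr_t)p % 16 != 0)
            s += *p++;

        // SIMD middle: add four floats at a time from the aligned region.
        __m128 acc = _mm_setzero_ps();
        for (; end - p >= 4; p += 4)
            acc = _mm_add_ps(acc, _mm_load_ps(p));

        // Horizontal reduction of the accumulator (the simd::sum step above).
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        s += lanes[0] + lanes[1] + lanes[2] + lanes[3];

        // Scalar tail: whatever is left does not fill a whole pack.
        while (p != end)
            s += *p++;

        return s;
    }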

At Mon, 11 Oct 2010 16:19:13 +0100, Mathias Gaunard wrote:
+1!
Now that your misconceptions about segmented iterators are cleared up, could you give a brief rundown of your alternative and its advantages? If all I have to do is read your earlier post and ignore claims of broken compatibility, just say so, but I don't want to waste time trying to decipher it if it's substantially wrong. Thanks! -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 11/10/2010 16:37, David Abrahams wrote:
Here, I will only be talking of the full generator I described in that other thread, not the step one. It's push instead of pull, basically.

- It's a very simple, easy-to-grasp concept: it basically just allows ranges to define their own for_each (I use an output iterator in my example, but that's not really different from a function object).

- It is very easy to write an iteration primitive with it. See the difference between binary_tree_iterator and binary_tree_generator. I don't think a segmented iterator would be that easy to define for that data structure.

- Iteration is governed by the data, rather than by the program that wants to use the data. The iteration can therefore have full knowledge of the data layout and doesn't need to do useless operations at every increment, since that information can be statically known from the place the iteration is at, allowing it to be as efficient as possible. This is also useful when you have to wait for data, such as when you want to iterate over data coming from a network, which is typically done with a reactive paradigm.

- It can be converted into a forward iterator through a "simple" coroutine application. I still need to benchmark this to see how much it costs, but I'm no good at benchmarking.

Of course, there is the drawback that it makes the code using the iteration primitive somewhat more complicated; but so do segmented iterators. There are also several other drawbacks, in particular tied to the fact that it cannot conceptually be much more than a forward iterator. There is also an extra advantage I hadn't considered, which I thought of when considering SIMD:

- It is possible to push different types of data and have them statically dispatched through overloading. Heterogeneous structures with a pull paradigm require a dynamic dispatch.

I was thinking in particular of using this latter advantage so that the iteration could dispatch its contents to the algorithm element by element at the borders and pack by pack for the rest. Of course, I didn't introduce this idea with SIMD as my main concern; I just saw a potentially useful application there. Now, I'm not saying we should embrace push instead of pull; I think both approaches should be allowed to co-exist, and we should probably try to standardize concepts and allow interoperability with other range and iterator algorithms and adaptors. But this raises several questions: How do we choose between the two? How do we integrate the approaches?
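As an illustrative sketch of the second point (the names here are hypothetical; the full example is in the archive linked earlier), a push-style traversal of a binary tree needs neither an explicit stack nor parent pointers:

    template <typename T>
    struct node
    {
        T value;
        node* left;
        node* right;
    };

    // The data structure drives the iteration and simply pushes each value
    // to the visitor; compare with the bookkeeping an equivalent
    // binary_tree_iterator would need to suspend and resume the traversal.
    template <typename T, typename F>
    void binary_tree_generate(const node<T>* n, F& f)
    {
        if (!n) return;
        binary_tree_generate(n->left, f);
        f(n->value);
        binary_tree_generate(n->right, f);
    }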

At Mon, 11 Oct 2010 22:50:29 +0100, Mathias Gaunard wrote:
Nice for very linear things. Not so great for sort, binary_search, reverse, rotate, ...
It may not matter that it isn't easy, if you really want generic capability.
Reading through http://github.com/boost-lib/parameter/blob/master/test/efficiency.cpp might help.
...nor even implement several of the algorithms that work with forward iterators.
I'm sorry, I can't understand what you're driving at here.
Although what you're describing is easier to make efficient for many (perhaps even the majority of) cases, it doesn't sound like it actually has the same level of generality as segmented iterators do. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

I find it interesting, but also a bit sad, that the thread on metaprogramming “plumbing” is very active, while this fundamental question of abstraction and generic programming has gone unaddressed. At Tue, 12 Oct 2010 04:06:02 -0400, David Abrahams wrote:
-- Dave Abrahams BoostPro Computing http://www.boostpro.com

... Says the guy who, quite literally, wrote the book on metaprogramming :) Sorry. Is it possible to distill a succinct phrasing of the question being posed? I think I'm having trouble reading it through the reply markers. I think I get the basic gist, but I'm not sure my understanding is complete enough to form an opinion. Maybe it would make a nice article for C++Next. Andrew

At Tue, 12 Oct 2010 15:40:38 -0500, Andrew Sutton wrote:
The irony is not lost on me. But I fear I've created a monster :(
The question has to do with what kind of abstraction should be used to expose element access in segmented structures. This short and very readable 1998 paper explains the problem space and offers one approach: http://lafstern.org/matt/segmented.pdf -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/12/2010 7:10 PM, David Abrahams wrote:
Jumping in the middle here so this may not be relevant.... Here is the beginning of a discussion on spirit-devel about segmented Fusion algorithms that sought to apply Matt Austern's formulation to Fusion's heterogeneous sequences: http://article.gmane.org/gmane.comp.parsers.spirit.devel/2765 This discussion eventually led to Fusion's current (incomplete, unmaintained) support for segmented sequences. (Although in Fusion, the goal is not improved runtime performance, but easier-to-implement segmented sequences and drastically lower compile-time overhead.) We found no fundamental problems in Austern's formulation of generic segmentation, or in our adaptation of it to meet Fusion's needs. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On Wed, Oct 13, 2010 at 3:10 AM, David Abrahams <dave@boostpro.com> wrote:
For the mere mortals among us, who are struggling to keep up with the metaprogramming astronauts, am I right in thinking that you are lamenting the emphasis on 'bells and whistles' to support SIMD iterators, while the more fundamental aspects of iterator segmentation are being ignored? And in your last remark are you saying you fear that metaprogramming and the Boost MPL are monsters!? Thx - Rob.

That's how I've read the discussion. The previous thread seems to have trailed off into a space that has more to do with particulars and little to do with the fundamental abstraction. On the topic of abstraction, I would say that Matt's solution seems to be the right approach, from an iterator perspective. He claims that the segmentation concepts are orthogonal to the standard iterator concepts. I don't especially like the implementation's requiring an explicit specialization of segmented_iterator_traits. I see nothing preventing the segmented iterator class from defining those associated types and operations and allowing the traits class to use them by default. The solution seems to fit the GP pattern quite well: specialize generic algorithms for orthogonal or refined concepts to get faster code for "free". I don't really know enough about Fusion to comment on the design, but on the surface it seems like an equivalent solution.
And in your last remark are you saying you fear that metaprogramming and the Boost MPL are monsters!?
My feeling on metaprogramming (and hence the MPL) is that it's a necessary evil for the style of generic programming going on here: necessary because it's the duct tape that holds generic libraries together, evil because it allows us to focus on really low-level details while doing some really clever programming. In case this may have read differently, I think that this style of generic programming is actually a good thing. Andrew

On 13.10.2010 16:15, Andrew Sutton wrote:
Easy enough. Your class just has to contain a type called is_segmented; its definition is irrelevant. But there is no way around having such a marker. You don't want to trigger off the classical iterator tag, since that would mean twice the number of tags.

    BOOST_MPL_HAS_XXX_TRAIT_DEF(is_segmented)

    template <typename T, bool Enable = has_is_segmented<T>::value>
    struct segmented_iterator_traits
    {
        typedef false_type is_segmented_iterator;
    };

    template <typename T>
    struct segmented_iterator_traits<T, true>
    {
        typedef true_type is_segmented_iterator;
        typedef T iterator;
        typedef typename T::segment_iterator segment_iterator;
        // ...
    };

Sebastian
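Under that scheme, a hypothetical iterator type would then opt in just by declaring the marker and its associated types (the names below are illustrative only):

    struct my_segmented_iterator
    {
        typedef void is_segmented;       // the marker; its definition is irrelevant
        typedef int** segment_iterator;  // iterates over the segments
        typedef int*  local_iterator;    // iterates within one segment
        // ... plus the usual forward iterator interface
    };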

At Wed, 13 Oct 2010 09:15:45 -0500, Andrew Sutton wrote:
I just worry that the focus on TMP is so intense that, potentially, design choices in the GP space that could vastly reduce the amount of TMP needed are missed. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/13/2010 11:24 PM, David Abrahams wrote:
+1 I share this sentiment. My focus on TMP has been so intense too that I am stepping back and reflecting on what I(/we) have done with Spirit, Fusion and Phoenix hoping to find an alternate path. My feeling is that I want to ease out a bit on TMP and focus more on GP. Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

On 10/13/2010 4:53 PM, Joel de Guzman wrote:
All libraries that, IMO, have well-defined and easy-to-grasp concepts (in the GP sense). I wonder why you feel this way. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On 10/14/2010 8:37 AM, Eric Niebler wrote:
I'm not sure, Eric. It's just a gut feeling at this point. The overall complexity makes me feel lost in a TMP forest, and I want to step out a bit to see the bigger picture. Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

At Wed, 13 Oct 2010 09:01:46 +0100, Robert Jones wrote:
You're in the ballpark. I'm lamenting the focus on tricky techniques (a catch-all bucket for metaprogramming and JIT and...) when there's a more fundamental question of abstraction to deal with. IMO, first you should get the abstractions right, and only then should you go back and apply tricky techniques to optimize if/where necessary.
And in your last remark are you saying you fear that metaprogramming and the Boost MPL are monsters!?
No, but I do fear that the Boost community has become so adept at fancy TMP that we have lost track of our roots in generic programming, per this thread: http://news.gmane.org/find-root.php?message_id=%3cm2wrq5zpxq.wl%25dave%40boo... -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tue, Oct 12, 2010 at 4:06 AM, David Abrahams <dave@boostpro.com> wrote:
Please please please, if you're going to attempt a better alternative for iterators, read "The essence of the Iterator pattern" [1] (or [2] for an updated version) and the important work on idioms[3] that it references. I wish I had more time to spend on this effort, but needless to say the tough semantic work has already been done! Thanks David A. for pointing out there was a semantics discussion going on here. David [1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.7480 [2] http://www.comlab.ox.ac.uk/jeremy.gibbons/publications/iterator.pdf [3] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.22 -- David Sankel Sankel Software www.sankelsoftware.com

At Tue, 12 Oct 2010 19:15:15 -0400, David Sankel wrote:
I don't know about that. This work seems (to me) fundamentally different in spirit from C++ iterators à la Stepanov and Musser. For one thing, Gibbons fails to even address random access. Staying close to the machine model and the ability to make the most generic algorithms as efficient as hand-coded C (sometimes even assembly!) are the lifeblood of generic programming, and I don't see how one can ignore those things in the realm of HPC.
Thanks David A. for pointing out there was a semantics discussion going on here.
Welcome. Sorry to be a contrarian :-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 13/10/10 00:15, David Sankel wrote:
This is interesting in essence, but I fear it abstracts too much into lambda land (to the point where I don't understand it, but that's another problem ;)). From a glimpse at your paper, the advantages of your approach over a monadic map (which I take to be the closest thing to a fold in C++, since C++ has side effects) are that it allows composition and doesn't need two symmetric definitions to traverse the data both ways. Is that correct? Unfortunately, I do not understand how one would use this in C++. The approach seems a bit similar to continuations, unless I'm completely out of it? I have already suggested using coroutines to make the iteration more flexible, but they come with a runtime cost (context switching).

On 12/10/10 09:06, David Abrahams wrote:
Nice for very linear things. Not so great for sort, binary_search, reverse, rotate, ...
Yes, since those require either bidirectional or random access iterators. Ability to stop the iteration could be a useful addition to make the concept have more applications. In functional languages, people seem happy to use exceptions for that; but I assume that certainly wouldn't be the case in C++. In any case, linear operations or filters represent in my opinion a large enough spectrum of what people do with iterable ranges of data that it could be worthwhile to have concepts that allow them to be efficient, even if it cannot apply to other cases.
It may not matter that it isn't easy, if you really want generic capability.
But do we want generic capability at the cost of complicated code which is hard to optimize?
Do you have some examples?
    struct my_iteration
    {
        void operator()(float f) { do_something_with_scalar(f); }
        void operator()(packed_view<float, 4> v) { do_something_with_vector(v); }
    };

    for_each(simd::packed_range(mystuff), my_iteration());

vs

    for(packed_view<float, 4> v : simd::packed_range(mystuff))
        do_something_with_vector(v);

Using a push solution for iteration, the former /could/ take care of the non-aligned cases at the beginning and at the end of the mystuff range through overloading.
Segmented iterators still allow bidirectional support, indeed (I think?). But I'm not sure they can do random access as well, if you consider recursively segmented local iterators. I believe writing a zip_iterator (or similar) would be no easier for segmented iterators than for generators, due to the arbitrary recursion.

On 10/13/2010 02:52 AM, Mathias Gaunard wrote:
I could see this. [...]
find_if? adjacent_find? adjacent_difference? parallel_for_each? I suppose (some of) these could be implemented using a for_each-style of iteration, but would require caching (of the previously visited element, in the case of adjacent_*) and exceptions (in the case of find_if/adjacent_find)...which would seem to negate the performance motivation. Or maybe I'm being too dense... [...]
One might describe this as expressing iteration in terms of the visitor pattern...? I honestly didn't understand what you meant between "push" and "pull" before this snippet. [...]
Seems to me segmented iterators and a visitation-based/push-based iteration address fundamentally different problems. At least, I don't see how segmented iterators help here. The problem that visitation-based iteration seems to try to solve is improving iteration efficiency over join(t)_range-like ranges...
I believe writing a zip_iterator (or similar) would be no easier for segmented iterators than for generators due to the arbitrary recursion.
What do you mean by generators in this context? - Jeff

On 13/10/10 11:44, Jeffrey Lee Hellrung, Jr. wrote:
Those return an iterator, so that makes it complicated to see what they would do. Assuming they would return a pointer to the value, you could do (ignoring const issues):

    template<typename Range, typename P>
    struct find_if_f
    {
        find_if_f(P p_) : p(p_) {}

        bool operator()(typename range_value<Range>::type& e)
        {
            if(p(e))
            {
                ptr = &e;
                return false;
            }
            return true;
        }

        P p;
        typename range_value<Range>::type* ptr;
    };

    template<typename Range, typename P>
    typename range_value<Range>::type* find_if(Range& range, P p)
    {
        find_if_f<Range, P> f(p);
        if(for_each_until(range, f)) // for_each, but stops when f returns false
            return f.ptr;
        return 0;
    }
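For completeness, a minimal sketch of the for_each_until primitive assumed above, for a flat range (hypothetical; a real implementation would also dispatch on segmentation). Note that f is taken by reference so that f.ptr survives the call:

    #include <boost/range/begin.hpp>
    #include <boost/range/end.hpp>
    #include <boost/range/iterator.hpp>

    // Hypothetical primitive: apply f to each element, stop as soon as f
    // returns false, and report whether we stopped early.
    template <typename Range, typename F>
    bool for_each_until(Range& range, F& f)
    {
        typedef typename boost::range_iterator<Range>::type iterator;
        for (iterator it = boost::begin(range); it != boost::end(range); ++it)
            if (!f(*it))
                return true;  // stopped early: the predicate matched
        return false;         // ran to completion
    }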
adjacent_difference?
You need to remember the previous element, but you don't have to copy it to do that; you could just keep a reference or pointer to it. I would think a sufficiently good optimizer should take care of inlining that.
parallel_for_each?
Good point. This requires the ability to subdivide the range, which would be feasible but inefficient, in particular due to the lack of random access.
I'm not certain those really hamper performance (assuming the use of another mechanism instead of exceptions to stop the loop).
This referenced the initial conversation I had with Dave in another thread. It's not very practical terminology.
Since those could be different segments, you wouldn't need to check which segment you are in at each iteration. Actually, the local iterators of a segmented iterator must be the same for all segments; but since a local iterator can itself be a segmented iterator, it can in turn have a local iterator, which can be different. By recursion, you can therefore use different iterator types for each segment, but that's somewhat convoluted.
That's what I called my alternative to segmented iterators in the other thread, because I made it use an output iterator instead of a function object. Any function that reads a range and passes each element to a user-supplied template object would fit the idea. I call it push because the data is being pushed onto the user-supplied object.

On 13/10/10 12:18, Mathias Gaunard wrote:
That's wrong, actually. Since all the local iterators are the same, their local iterators must be the same too. So segmented iterators wouldn't help to check for alignment only once it has been reached, I think. I would have to actually try to implement one, but it's not particularly practical to do. Anyway, my belief is that my alternative to segmented iterators is more interesting.

At Wed, 13 Oct 2010 15:52:47 +0100, Mathias Gaunard wrote:
I think the exercise would be worth your time
Anyway, my belief is that my alternative to segmented iterators is more interesting.
Not to me, until you can begin to address something like std::merge(). -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 13/10/10 16:07, David Abrahams wrote:
Well, let me rephrase the problem. Given the contiguous sequence of numbers 1 2 3 4 5 6 7 8 9 10 11, with the 3rd element (3) lying on an alignment boundary good enough to deploy SIMD instructions of cardinal 4, we want to adapt the range into this:

    | 0 0 1 2 | 3 4 5 6 | 7 8 9 10 | 11 0 0 0 |

The concern is that we want the iteration of the middle part, | 3 4 5 6 | 7 8 9 10 | (which in practice would contain many more elements than this), to be as fast as possible, i.e.
- increment is a ptr += 4
- dereference loads ptr into a SIMD register

This is not possible using normal iterators, since I would need a different behaviour (and thus a branch) in at least one of those two functions, depending on what state the iteration is at. As a result, a few bytes at the beginning and at the end of a long sequence slow down the whole iteration. I thought segmented iterators would solve this, since after all I've got three segments, and you've been repeatedly saying on this list that they are the solution to many iterator problems; but they don't. Segmented iterators are only useful if the iteration of all my segments is done with the exact same logic but at different locations, since they all must have the same type. They don't circumvent the problem I exposed with normal iterators at all. My alternative not only does, but it could also allow us not to pad the ranges with zeroes, which is potentially more interesting for writing SIMD-aware algorithms.
Yet you are convinced about segmented iterators without even knowing whether they can address something like std::merge (well, sure they can if you use the fallback flat iterator, but that's not what I mean).
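For concreteness, a sketch of the iterator the middle segment calls for (pack_iterator is hypothetical; __m128 and _mm_load_ps are the real SSE types/intrinsics), with no branch in either operation:

    #include <xmmintrin.h>

    // Hypothetical middle-segment iterator: increment is one pointer add,
    // dereference is one aligned SIMD load.
    struct pack_iterator
    {
        const float* ptr;

        pack_iterator& operator++() { ptr += 4; return *this; }
        __m128 operator*() const { return _mm_load_ps(ptr); }
        bool operator!=(pack_iterator other) const { return ptr != other.ptr; }
    };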

At Wed, 13 Oct 2010 17:38:56 +0100, Mathias Gaunard wrote:
I think you mean 1 2 | 3 4 5 6 | 7 8 9 10 | 11 right? Are you really intending to manufacture zeros for padding?
Yes, this is exactly the same problem as std::deque has, which segmented iterators were designed for.
Of course they do.
I think they do. However, you might find that you need some dynamic dispatching to get it to work. In other words, if (operating on complete segment) SIMD algorithm else elementwise algorithm
You don't need to pad with zeroes for segmented iterators.
I do know that it can address std::merge. It's not necessarily pretty, though. ;-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/13/2010 10:50 AM, David Abrahams wrote:
That doesn't seem any better than what I would imagine range_detail::join_iterator does. I.e., I don't see what segmented iteration buys us here over what I'd imagine Mathias is currently doing (which, I'd imagine, is some variant on a join_iterator that returns aligned packs). - Jeff

At Wed, 13 Oct 2010 12:18:13 -0700, Jeffrey Lee Hellrung, Jr. wrote:
where's that list of algorithms again? sort, transform, merge, binary_search ... -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/13/2010 01:19 PM, David Abrahams wrote:
Ummm...not sure what your point is. Are you saying a join_iterator (the iterator used in the result of boost::join) has issues with these algorithms? - Jeff

At Wed, 13 Oct 2010 13:39:29 -0700, Jeffrey Lee Hellrung, Jr. wrote:
As I understand it he's not doing anything like a classic STL iterator, at all. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 13/10/10 21:19, David Abrahams wrote:
where's that list of algorithms again? sort, transform, merge, binary_search ...
There is no problem implementing transform, binary_search, or even sort if you assume they work with forward iterators. And I still very much doubt you can get correct behaviour of std::merge for segmented iterators in the face of recursively segmented iterators.

On 13/10/10 18:50, David Abrahams wrote:
I think you mean
1 2 | 3 4 5 6 | 7 8 9 10 | 11
What is that nonsense supposed to be? All elements are of type pack<int, 4>, and therefore must have four values.
right? Are you really intending to manufacture zeros for padding?
Right, that was the very idea of the feature.
Yes, this is exactly the same problem as std::deque has
No it isn't. std::deque is not a different *behaviour* problem, it's a different *location* problem.
which segmented iterators were designed for.
They do not help *at all* in this case, since the iterator of each segment must have the same type, and therefore *the same behaviour*.
You just demonstrated a bit lower that they didn't.
Therefore they do not make it possible to avoid the inefficiency, and they are no different from standard iterators in that respect.
Yes, you do, since all elements must have the same type, a limitation that /could/ be lifted by using a for_each-like syntax. The value type is a pack of 4 integers loaded in a single SIMD register; you must give a value to all four of them.
I do know that it can address std::merge. It's not necessarily pretty, though. ;-)
I would quite like to see any reference on this.

Mathias Gaunard wrote:
Would it be easier to understand if written as * * 1 2 | 3 4 5 6 | 7 8 9 10 | 11 * * * or X X 1 2 | 3 4 5 6 | 7 8 9 10 | 11 X X X ?
All elements are of type pack<int, 4>, and therefore must have four values.
Yes, four values, and some values at the ends will be sort of "arbitrary". But "arbitrary" and "0" are not necessarily the same thing. I'm not expert enough in SIMD to know whether "0" has an advantage over "arbitrary" in that context. Regards, Thomas

At Thu, 14 Oct 2010 20:00:15 +0100, Mathias Gaunard wrote:
I fear this conversation is becoming uncivil.
That's what "if" statements are for.
Yes, they are different, because you don't need to do that test at the lowest level, for each element. That test would only be done at the segment level.
Unless you process the uneven end bits with non-SIMD code.
Don't have one handy, sorry. :-( -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 15/10/2010 04:48, David Abrahams wrote:
Sorry, I just couldn't make sense of what this is supposed to be, and used bad phrasing. What would the value type of that range be?
That's what "if" statements are for.
We want to avoid these.
How so? The way I see it, I would need that "if" at every increment.
Unless you process the uneven end bits with non-SIMD code.
I would quite like to provide a way to do this. A range, however, must be homogeneous and have all its elements be of the same type, so it's either scalar or vector; it can't be both.

At Sat, 16 Oct 2010 19:01:12 +0100, Mathias Gaunard wrote:
int
That's what "if" statements are for.
We want to avoid these.
Actually, you don't need them. This is really simple; it's just like processing a deque.
That's where your problem is; you can't look at the "vectors" as elements. They're just views used as an optimization technique for operating on subsequences of a particular number of scalars. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 16/10/10 20:20, David Abrahams wrote:
So how does that help using SIMD instructions at all? What do the | represent for you?
As I said, it isn't. Iterating all segments of a deque is done with the same code but at different locations. I need different code for iterating each segment.

At Mon, 18 Oct 2010 11:24:10 +0100, Mathias Gaunard wrote:
When a complete segment is processed, you can use SIMD. Elsewhere, not.
What do the | represent for you?
Segment boundaries
I don't mean processing a deque as done by algorithms today, I mean processing a deque with a hierarchical algorithm as described in the paper.
I need different code for iterating each segment.
No, you need different code for iterating the end bits. Look carefully at the hierarchical_fill implementation near the end of the paper. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 18.10.2010, at 16:34, David Abrahams wrote:
Segmented iterators as described in the paper have two limitations, as I see it:

1) The value type of the flat iterator and the local iterator have to be the same.
2) Segments must be uniform.

(2) means that you can't have one segment with one type and one kind of iterator, and the next with another type and another kind of iterator. Case in point: a joint_iterator over a fusion sequence of containers. You can't represent those with a segmented iterator, because the segments are different container types, and thus you cannot have a segment iterator, since it wouldn't have a single value type. It occurs to me that there probably would be a way to bring compile-time fixed-size heterogeneous iterators (well, fusion sequences) into a reusable concept world here too, but that's a lot of additional work on top of segmented iterators. (2) also means that the local iterators for all segments are the same, and therefore that the final value type of every segment has to be the same.

(1) is not immediately obvious. The segmented_iterator_traits do not actually make this requirement. However, it is a consequence of the intended use of segmented iterators: a transparent efficiency increase for the standard algorithms. fill() is a case in point. The flat version looks like this:

    template <typename ForwardIterator, typename T>
    void fill(ForwardIterator first, ForwardIterator last, const T &v)
    {
        for (; first != last; ++first) {
            *first = v;
        }
    }

Requirements: ForwardIterator is a forward iterator, and T is assignable to ForwardIterator::reference. If ForwardIterator is additionally a segmented iterator, the complicated hierarchical fill should be transparently invoked. This comes down to this (implementation switching omitted):

    template <typename SegIterator, typename T>
    void fill(SegIterator first, SegIterator last, const T &v)
    {
        typedef segmented_iterator_traits<SegIterator> Seg;
        if (Seg::segment(first) == Seg::segment(last)) {
            fill(Seg::local(first), Seg::local(last), v);
        } else {
            // omitted
        }
    }

But that nested fill has a different ForwardIterator, and thus introduces a new requirement that bubbles outward: T must be assignable to LocalIterator::reference. So the only way that segmented iterators stay a transparent optimization (that is, can be used in the standard algorithms without introducing additional requirements on the caller) is if the LocalIterator's reference guarantees that anything assignable to ForwardIterator's reference is also assignable to itself, which is pretty much the same as saying that the two have to be the same. Not quite, but almost.

Here's the problem when trying to use the segmented iterator idea to present a sequence of scalars as a sequence of SIMD packs. You can try to divide the sequence into three segments: leading unaligned values, aligned packs, and trailing unaligned values. However, this runs afoul of (2): the segments are heterogeneous. The first and third iterate over single values; the middle iterates over packs. The abstraction doesn't support it.

Alternatively, you could divide the sequence into N segments. The first is the leading unaligned values, segments two through N-1 are fixed-size segments of as many values as fit in a SIMD register, and segment N is the trailing unaligned values. But then you are not actually dealing in SIMD packs. Instead, you're dealing in very short loops that a compiler might find easier to auto-vectorize or at least unroll, but which will completely destroy your performance if it doesn't.
Finally, you can transform the entire sequence into a sequence of SIMD packs, but that's not segmented anymore. (Also, the implementation is incredibly ugly.) If you try to misuse the segmented iterator concept for this, (1) will bite you. Sebastian

At Mon, 18 Oct 2010 22:43:25 +0200, Sebastian Redl wrote:
Which is totally fine in this case
Also totally fine in this case. If you want to operate on a heterogeneous sequence, you'd need fusion-y segmented iterators of course, so you'd make the obvious corresponding extensions of fusion iterator concepts. But you don't need it for SIMD, AFAICT
They're already in that world, IIUC.
Also not a problem.
I don't understand what you're talking about in that section above, but I think you're making things too complicated because you're fixated on the idea of a heterogeneous sequence. Here's a simple extension to the hierarchical algorithm formulation that I think could accommodate SIMD. The key change is the segment_fill function, which gives us an additional customization point with which to capture SIMD segments:

    template <class SegIter, class T>
    void fill(SegIter first, SegIter last, const T& x, true_type)
    {
        typedef segmented_iterator_traits<SegIter> traits;
        typename traits::segment_iterator sfirst = traits::segment(first);
        typename traits::segment_iterator slast = traits::segment(last);

        if (sfirst == slast)
            fill(traits::local(first), traits::local(last), x);
        else
        {
            fill(traits::local(first), traits::end(sfirst), x);
            for (++sfirst; sfirst != slast; ++sfirst)
                segment_fill(sfirst, x, traits());
            fill(traits::begin(sfirst), traits::local(last), x);
        }
    }

    template <class SegmentForwardIterator, class T, class Traits>
    void segment_fill(SegmentForwardIterator s, T const& x, Traits traits)
    {
        fill(traits.begin(s), traits.end(s), x);
    }

    template <class ForwardIter, class T>
    void fill(ForwardIter first, ForwardIter last, const T& x, false_type)
    {
        for (; first != last; ++first)
            *first = x;
    }

    template <class Iter, class T>
    inline void fill(Iter first, Iter last, const T& x)
    {
        typedef segmented_iterator_traits<Iter> Traits;
        fill(first, last, x, typename Traits::is_segmented_iterator());
    }

    template <class U, class T, class Traits>
    void segment_fill(my_SIMD_iterator<U> s, T const& x, Traits traits)
    {
        // Do SIMD stuff on an entire segment
    }

Naturally, there are some things you'd want to rework about this. For example, you'd probably want to rethink the way the segmented traits are formulated and how the dispatching to segment_fill works. But Austern's paper gives the bones of a solution, I think. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
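For what it's worth, a hedged sketch of what the "SIMD stuff" body might look like for floats, assuming my_SIMD_iterator (Dave's placeholder) exposes its segment as raw, 16-byte-aligned float pointers whose length is a multiple of 4 (_mm_set1_ps and _mm_store_ps are the real SSE intrinsics):

    #include <xmmintrin.h>

    template <class Traits>
    void segment_fill(my_SIMD_iterator<float> s, float x, Traits traits)
    {
        __m128 v = _mm_set1_ps(x);              // broadcast x into all four lanes
        for (float* p = traits.begin(s); p != traits.end(s); p += 4)
            _mm_store_ps(p, v);                 // one aligned store per pack
    }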

At Wed, 13 Oct 2010 03:44:37 -0700, Jeffrey Lee Hellrung, Jr. wrote:
That's the same problem (in essence) that segmented iterators are solving. Did you read the Austern paper? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/13/2010 08:04 AM, David Abrahams wrote:
Yes. But I'm not convinced yet that iterators exposing the segmentation of the data structure are the right solution for presenting a flattened view of that data structure.

1) It doesn't play nicely with standard C++ looping constructs, specifically for-loops.

2) I have yet to see a benchmark comparison between segmented iterators and alternatives (other than a side-note claim in the paper). I'm open to looking at specific comparisons if someone has already done this, and this is something I'm interested enough in that I might try it myself.

Seems to me a more general solution is a "flat_multirange_iterator" templated on the outer range and the "segmentation depth"; i.e., a flat_multirange_iterator< vector< vector<T> >, 2 > would iterate over the T's. To address your claim that "that's the same problem ... that segmented iterators are solving": what I meant by a join_range or joint_range is, essentially, a flattened-out Boost.Fusion sequence of ranges of *differing* types (but with compatible value_types and references), e.g., the thing that boost::join [1] returns. Segmented iterators can't deal with such ranges (at least on a strict reading of Austern's paper), and I think such ranges are closer to what Mathias has in mind (to be verified...). But perhaps I'm just not seeing your generalization beyond the immediate scope of the Austern paper? - Jeff [1] http://www.boost.org/doc/libs/1_44_0/libs/range/doc/html/range/reference/uti...

At Wed, 13 Oct 2010 12:08:43 -0700, Jeffrey Lee Hellrung, Jr. wrote:
Good; that's not what they're for ;-)
1) It doesn't play nice with standard C++ looping constructs, specifically for loops.
The whole point of hierarchical algorithms is to avoid flat loops for segmented data structures.
That's gonna be slow. Lots of testing and branching. "Note that the joined range incurs a performance cost due to the need to check if the end of a range has been reached internally during traversal."
I don't see why you say that.
Looks like this doc is missing something about the value/reference type requirements, BTW. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/14/2010 1:23 AM, David Abrahams wrote:
I think I missed something, then. Are we not applying sequential algorithms (e.g., fill, for_each, ...) to hierarchical data structures?
It certainly avoids flat loops, to the point of outright precluding their use. I can add here a 1a) How would you go about iterating through segmented data structures in parallel?
First, that quote applies directly to a join_iterator, though I'll grant you that it *may* apply to my hypothetical flat_multirange_iterator. What I'm having trouble with is that it *looks* to me like you have the same tests and branches with a segmented iterator as with a flat_multirange_iterator. Comparing the nested for-loop on page 5 of the Austern paper (which is what a segmented iterator would give you) with the flattened while-loop on page 6 (let's replace the while-condition with true and insert a break statement when !(node != nodes.end())), they both have the same number of iterator increments, comparisons, and assignments. What am I missing? Are compilers just able to produce better code for nested for-loops than for flattened loops?
Show me what the equivalent nested for-loop looks like if a join_range can be "segmentedly" iterated over. If I tried, I'd be inclined to write a for-loop over a Boost.Fusion sequence. If you want to include a join_iterator within the family of segmented iterators, you'd have to enrich the interface of segmented iterators to include Boost.Fusion sequences. Is that desirable? Maybe? [...]
Yes :( - Jeff

On 10/14/2010 8:07 AM, Jeffrey Lee Hellrung, Jr. wrote:
Actually, I just realized that making those loop modifications is "cheating" :/ Unless you did some kind of Java-style iteration. Whoops, - Jeff

On 14/10/10 16:07, Jeffrey Lee Hellrung, Jr. wrote:
It seems useful, but it wouldn't be usable for the main reason segmented iterators were created, deque and trees, and that concept wouldn't really be interoperable with segmented iterators.

At Thu, 14 Oct 2010 08:07:47 -0700, Jeffrey Lee Hellrung, Jr. wrote:
Sorry, no, I was the one who missed something. Please allow me to correct what I said. Presenting a flattened view of a hierarchical data structure kills performance. Segmented iterators expose an additional non-flattened view so that appropriately constructed algorithms can avoid that performance loss...
Are we not applying sequential algorithms (e.g., fill, for_each, ...) to hierarchical data structures?
...where by "appropriately constructed," I mean algorithms that have been re-cast hierarchically. "Sequentialness" is not necessarily implied.
You mean two at once? I think you have to produce a new segmentation structure that bounds segments at the union of all segment boundaries in the two sequences. I would probably do that with a segmented zip view.
It's exactly the same problem that the iterators of std::deque has.
I acknowledge that you retracted that analysis in a later post.
Well, things get complicated when the element types or structure of a single sequence are heterogeneous.
If I tried, I'd be inclined to write a for-loop over a Boost.Fusion sequence.
Yeah, you need to do something like that.
IIUC, there already is support for segmented Boost.Fusion iterators somewhere. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/14/2010 8:41 PM, David Abrahams wrote:
If you can provide more direction here, I would appreciate it. There is boost::fusion::joint_view, but that only concatenates 2 Boost.Fusion sequences, not 2 ranges... By the way, I did run a comparison between a segmented iteration, with nested for-loops, and a flattened iteration. I was very much surprised by the runtime difference. It obviously depends on the operation you do per iteration, but for just incrementing a counter, the difference was 2 or 3 orders of magnitude. Needless to say, I'm more convinced of the value of exposing the segmentation of data structures now. Here's what I'm getting from this discussion: I see 2 viable approaches to iteration over non-linear data structures where efficiency is a major concern. One is an external, segmented iterator interface, possibly enriched to enable iteration over Boost.Fusion sequences at any depth. The other is the internal "push" or visitation iteration suggested originally by Mathias. Neither model seems to yield *convenient* implementations of *all* iteration algorithms in <algorithm>, but I can see enough utility that one model and/or the other would be worth investigating. I'm not sure yet whether either iteration model is more powerful than the other, though it seems there are some algorithms expressible via segmented iteration that would be challenging to express via push iteration. In any case, David, I appreciate your patience and participation. - Jeff

At Thu, 14 Oct 2010 23:27:21 -0700, Jeffrey Lee Hellrung, Jr. wrote:
Eric knows more about this than I do. Eric?
:-)
Some things definitely become significantly more complicated to implement.
Anything that requires bidirectional or random access, for example.
In any case, David, I appreciate your patience and participation.
np. Glad this topic is getting some much-needed exposure. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 10/15/2010 11:06 AM, Robert Ramey wrote:
No, it's not right; upon looking at the assembly output (which I should've done prior to making hasty conclusions), I think the original nested for-loop test I tried was too easy for the compiler to optimize. I changed the test to actually iterate over a segmented data structure (I was just doing the looping constructs originally, without any actual data structure), and the results are much less dramatic: the nested for-loop iteration is about 20% faster. I think this is in line with the Austern paper. Here is the code:

    #include <iostream>
    #include <vector>
    #include <boost/timer.hpp>

    int main(int argc, char* argv[])
    {
        typedef std::vector< int > v_type;
        typedef v_type::iterator v_it_type;
        typedef std::vector< v_type > vv_type;
        typedef vv_type::iterator vv_it_type;

        vv_type vv(1000);
        for(int i = 0; i != 1000; ++i)
            vv[i].resize(1000);

        {
            boost::timer timer;
            for(int i = 0; i != 1000; ++i)
                for(vv_it_type vv_it = vv.begin(), vv_end = vv.end(); vv_it != vv_end; ++vv_it) {
                    v_type& v = *vv_it;
                    for(v_it_type v_it = vv_it->begin(), v_end = vv_it->end(); v_it != v_end; ++v_it)
                        *v_it = 42;
                }
            const double elapsed = timer.elapsed();
            std::cout << elapsed << std::endl;
        }

        {
            boost::timer timer;
            for(int i = 0; i != 1000; ++i) {
                vv_it_type vv_it = vv.begin(), vv_end = vv.end();
                v_it_type v_it = vv_it->begin(), v_end = vv_it->end();
                while(vv_it != vv_end) {
                    *v_it = 42;
                    ++v_it;
                    if(v_it == v_end) {
                        ++vv_it;
                        if(vv_it != vv_end)
                            v_it = vv_it->begin(), v_end = vv_it->end();
                    }
                }
            }
            const double elapsed = timer.elapsed();
            std::cout << elapsed << std::endl;
        }

        return 0;
    }

Still faster, and if you replace the std::vector's with, e.g., boost::iterator_range< boost::counting_iterator< ... > >'s, you might get a better measure of the actual loop overhead, as you'd avoid the memory accesses (speculation). - Jeff

(Sorry for the delay.) On 10/15/2010 9:15 AM, David Abrahams wrote:
For a long time, Fusion has shipped with half-baked, undocumented, partial support for segmented Fusion sequences. Everything is under ext_ directories scattered throughout boost/fusion. There are two segmented algorithms: for_each_s and find_if_s. There is a segmented data structure: tree. There are primitives: is_segmented, and segments. There are iterators that make a segmented sequence look flat: segmented_iterator, and an iterator_range specialized on segmented_iterators that is itself segmented. The segmented_iterator and segmented iterator_range support are by far the most complicated---and most in dire need of a complete rewrite. But the implementation was so hair-pullingly horrific that I dare not tread there again. Apparently, they've been failing their regression tests for a while. There's a trac ticket, but I'm scared of it. Yes, really. It's a perpetual To-Do item. The benefit of segmented Fusion is that it makes the implementation of inherently segmented data structures (like trees and joins) embarrassingly trivial, and it makes iteration over them efficient at compile time. The trouble with segmented Fusion is that to reap those benefits, you need a segmented version of each algorithm, many of which are non-trivial to write. It adds a lot of complexity to Fusion, about which Joel has expressed concern (not unreasonably, I might add). So that's where we are. -- Eric Niebler BoostPro Computing http://www.boostpro.com

At Wed, 13 Oct 2010 10:52:36 +0100, Mathias Gaunard wrote:
Actually, all of those can be done with bidirectional iterators (although the standard over-constrains sort). binary_search can be done with forward iterators.
Nooo. Waaay too far from the machine model. If we're talking about HPC, let's figure out how to generate optimal (or near-optimal) code.
I believe that segmented iterators do allow them to be efficient.
I don't believe you have to choose.
binary_search (equal_range, lower_bound, upper_bound), rotate, equal, transform (2-input version), includes, merge... These were just the first ones I encountered in the algorithms section of http://www.sgi.com/tech/stl/stl_index_cat.html
Oh, I think I see what you're getting at. Solving that problem for segmented iterators would be interesting.
Just as with flat containers, some segmented iterators for hierarchical containers can do random-access and some can't. std::deque is an example that can. I don't think recursive segmentation is an obstacle as long as all the sizes are constant at the leaves.
I'm not sure what you're getting at here, again. Is it relevant? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 13/10/10 16:00, David Abrahams wrote:
I didn't know binary_search (and similar functions) worked with forward iterators. I have trouble seeing how they could work effectively at all.
Segmented iterators are neither practical to write nor to use.
rotate,
Interesting one, because it modifies the sequence itself. I'm not convinced it's impossible to write, though.
equal, transform (2-input version), includes, merge...
Indeed, there is a combination problem.
Well, the limitation is how you can combine those "generators" to iterate several at the same time, be it for merge or otherwise. I don't see a good way to do this with segmented iterators either.

On 13.10.2010, at 18:09, Mathias Gaunard wrote:
It's O(n) iterator movements, but only O(log n) comparisons and dereferences, which is valuable if comparison is expensive (e.g. long, similar strings). Sebastian
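For reference, this is essentially how the standard library implements lower_bound for forward iterators: std::advance supplies the O(n) movement, while the comparison count stays O(log n).

    #include <iterator>

    template <typename ForwardIt, typename T>
    ForwardIt lower_bound(ForwardIt first, ForwardIt last, const T& value)
    {
        typename std::iterator_traits<ForwardIt>::difference_type
            count = std::distance(first, last), step;
        while (count > 0) {
            ForwardIt it = first;
            step = count / 2;
            std::advance(it, step);   // O(step) moves for a forward iterator...
            if (*it < value) {        // ...but only O(log n) comparisons overall
                first = ++it;
                count -= step + 1;
            }
            else
                count = step;
        }
        return first;
    }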

At Wed, 13 Oct 2010 17:09:38 +0100, Mathias Gaunard wrote:
O(N) steps, O(log N) comparisons. Better than linear search if comparison is not cheap.
Why do you say that? I don't think anyone has made a serious effort.
I'd like to see that one.
Ah.
I don't see a good way to do this with segmented iterators either.
:-) zip_iterator -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, Oct 13, 2010 at 2:45 PM, David Abrahams <dave@boostpro.com> wrote:
At Wed, 13 Oct 2010 17:09:38 +0100,
[snip]
I've been watching this thread, and I've been meaning to share my experience trying to use segmented iterators. I am working on a project that has a hierarchical structure where an object can own other objects of the same type; it is a tree structure. This hierarchical structure might be accessible hierarchically or not. Conceptually it always is, but it might be accessible only through input iterators, which would flatten the hierarchy in an efficient way. But some other implementation might be able to access it hierarchically, which would allow some algorithms, e.g. find, to work in better-than-linear time. At first I thought that segmented iterators were what I needed. But then I tried to work with them. I found that segmented iterators differentiate segment iterators from local iterators in a way that doesn't work for a tree-like structure. The nodes of the tree had to be moved to be first in the local sequence, because only local iterators are dereferenced in segment-aware algorithms. This meant that segmented iterators weren't completely overhead-free. This doesn't mean segmented iterators aren't good, just that the use case I tried didn't seem to fit them. [snip]
Regards, -- Felipe Magno de Almeida

At Wed, 13 Oct 2010 17:34:57 -0300, Felipe Magno de Almeida wrote:
Yes, they are designed to deal with a segmentation depth that's known at compile-time.
Right. Trees have storage at internal nodes, which is another way they differ from the kinds of structures segmented iterators work on.
This doesn't mean segmented iterators aren't good, just that the use-case that I tried didn't seem to fit it.
Makes sense. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 13/10/10 18:45, David Abrahams wrote:
At Wed, 13 Oct 2010 17:09:38 +0100, Mathias Gaunard wrote:
zip_iterator doesn't work with segmented iterators. Well it does work in the sense that segmented iterators also provide a fallback flat iterator, but it doesn't work with the actual segmented one.

At Thu, 14 Oct 2010 19:26:00 +0100, Mathias Gaunard wrote:
It could be made to work. It just takes... work :-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Hi there. If you are focusing on a SIMD-aware - and more specifically x86 SIMD - implementation, _DON'T_ read further. Consider OpenCL as a general way to speed up computations (it uses SIMD as one of its backends, as well as GPU shader units, etc.). I think it will be much more general and useful. By the way, OpenCL claims to be portable. -- Pavel P.S. If you notice a grammar mistake or weird phrasing in my message, please point it out.

On 11/10/10 18:54, DE wrote:
Except not, as OpenCL requires a JIT... Portable SSEx/Altivec code is rather trivial to write properly. Note that I presented exactly that at BoostCon already, so before anything else is said: yes, it works.

Let's talk about x86 a little, since there is a wide variety of processors, from Celeron to i7, plus the ones from AMD, which may or may not support MMX, SSE1/2/3/4, AVX, or some other unintelligible weird thing, which in turn is known (I guess) only at run time. So my question is: do you expect code writers who want their binaries to work across all modern x86 processors (starting with the Pentium) to write a plain version along with the accelerated one? Or will your implementation fall back to, say, plain old x87 when SSE2 is not available, so the writer need only write one version of the source? I guess everything above applies to the Power and ARM platforms as well. -- Pavel P.S. If you notice a grammar mistake or weird phrasing in my message, please point it out.

DE wrote:
But how is this different from specifying appropriate compile options for the platforms you intend to support? Even the code generated by a compiler will only work on the processor architectures the compiler targeted. And if you tell the compiler to target the lowest common denominator, the resulting code will be quite slow. Regards, Thomas

On 12/10/10 19:31, Thomas Klimpel wrote:
When no SSEx support is enabled, pack<T,N> falls back to a boost::array-based emulation. If you want to ship code for various SSEx flavours, just build the various compilations of your kernel code, then dynamically load the correct one. There is no way you can check at runtime before calling EVERY function; that is far too costly.
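A hedged sketch of that fallback arrangement (pack is the library's name; the boost::array emulation and the SSE specialization shown here are illustrative, not the actual implementation):

    #include <cstddef>
    #include <boost/array.hpp>
    #ifdef __SSE__
    #include <xmmintrin.h>
    #endif

    // Generic fallback: plain array emulation.
    template <typename T, std::size_t N>
    struct pack
    {
        boost::array<T, N> data;
    };

    #ifdef __SSE__
    // SSE specialization: four floats live in a single register.
    template <>
    struct pack<float, 4>
    {
        __m128 data;
    };
    #endif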

on 12.10.2010 at 21:38 joel falcou wrote :
An idea immediately arises in my mind: what if it were possible to force a particular backend, e.g. SSE2, and make it propagate down the call stack? This way, you write generic code once and are able to (pre)compile it for the (expected) platforms/technologies; then you just switch once, depending on the circumstances, and call the appropriate instance (say, the SSE2-forced one). At least this way you write the source once and possibly don't rewrite it for every circumstance. -- Pavel P.S. If you notice a grammar mistake or weird phrasing in my message, please point it out.

on 12.10.2010 at 22:35 joel falcou wrote :
Cool. Forgive me for repeating myself, but do I understand correctly? Will it be something like the following?

    template</*...*/> void task(...);

    // somewhere
    if (supportsSSE2)
        task<SSE2>(/*...*/);
    else
        task<General>(/*...*/);

-- Pavel P.S. If you notice a grammar mistake or weird phrasing in my message, please point it out.
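Roughly along those lines, a compilable sketch of such a switch (the task() kernels are hypothetical; __builtin_cpu_supports is a real GCC builtin, though only in newer GCC, and other compilers would query CPUID directly):

    #include <cstdio>

    struct sse2_tag {};
    struct generic_tag {};

    // Hypothetical kernels; in practice each would be compiled separately
    // with the matching -msseX options, as Joel suggests.
    template <typename Arch> void task() { std::puts("generic kernel"); }
    template <> void task<sse2_tag>()    { std::puts("sse2 kernel"); }

    void run_task()
    {
    #if defined(__GNUC__)
        if (__builtin_cpu_supports("sse2"))  // runtime CPUID check
            task<sse2_tag>();
        else
    #endif
            task<generic_tag>();
    }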

on 12.10.2010 at 21:31 Thomas Klimpel wrote :
For example, the Intel compiler for Win32 can generate binaries with two versions from one source: plain x86 and, say, SSE2, switched at runtime depending on CPUID. So the question still stands. -- Pavel P.S. If you notice a grammar mistake or weird phrasing in my message, please point it out.

DE wrote:
Question to Joel: Will this compiler automatically do the correct thing with NT2 without any further user intervention? If I understood correctly, NT2 will know at compile time which SIMD instruction set is available and generate the appropriate code. Now, if the Intel compiler compiles the code multiple times for different target architectures, I imagine that NT2 would generate the appropriate code during the different compile runs for the different architectures, so that the resulting executable should work optimally right out of the box. Is this correct? Regards, Thomas

On 12/10/10 21:22, Thomas Klimpel wrote:
Question to Joel: Will this compiler automatically do the correct thing with NT2 without any further user intervention?
NT2 vectorization is triggered by the activation of the compiler options for SIMD extensions (-msse2, etc.), so yes.
If I understood correctly, NT2 will know at compile time which SIMD instruction set is available, and generate the appropriate code.
Yes
Now if the intel compiler compiles the code multiple times for different target architectures, I imagine that NT2 would generate the appropriate code during the different compile runs for the different architectures,
Yes
so that the resulting executable should work optimally right out of the box. Is this correct?
You lost me here. How do you link all your versions? Dynamically, or what?

On 11/10/2010 17:54, DE wrote:
OpenCL is a possible implementation, albeit we chose to call the various SIMD instructions ourselves to have more control over the toolchain and the end result. OpenCL will, however, be our main backend for the GPU targets we will be supporting in the future. Or maybe we will just target them through Clang and LLVM. NT2 (also known as the crazy Frenchman library), upon which this effort is based, only supports x86 (SSE, ..., SSE4, AVX) and PowerPC (AltiVec); ARM (NEON, VFP) is being added. An effort has been made in its design so that instructions can register themselves for a particular (type, cardinal) pair, all of which are ranked according to a category in order to select the best candidate. It heavily uses metaprogramming, including rewritten bits of MPL to improve its compile-time performance. It also uses expression templates with Proto to detect certain patterns, such as fused multiply-add, for which x86 is introducing new instructions soon. Therefore it is very tunable and extensible. The bit I wanted to discuss here, however, is not the implementation, but rather the interface that the library provides to the programmer. We aim to provide an interface in modern C++ that integrates well with the standard library and Boost, in order to allow developers to make use of SIMD in an easy and fairly high-level fashion, potentially using metaprogramming to write an algorithm with parametric register and cache sizes. OpenCL is not an interface that satisfies those criteria.

Mathias Gaunard wrote:
Have you seen the Ct thing coming from Intel? I just learned about it last week. http://software.intel.com/en-us/articles/intel-array-building-blocks/ It looks like a container library for vector processing in C++ with a JIT compilation runtime environment as part of the library. As a guy who appreciates compilers, languages and libraries, there is a lot to like there. This area is evolving very fast. Regards, Luke

Assuredly I missed some sort of presentation or discussion elsewhere, but why is a segmented-iterator/SIMD-oriented-iterator approach (or Ct) superior or inferior to a well-tuned valarray? Obviously, valarray has a uniquely sparse interface providing (less than) the bare bones needed for data-parallel programming, on top of which it uses a member-function strategy. However, it seems that "fixing" valarray might be a more robust start than immediately striking out on a different strategy.

On 11/10/2010 23:34, Simonson, Lucanus J wrote:
Have you seen the Ct thing coming from Intel? I just learned about it last week.
http://software.intel.com/en-us/articles/intel-array-building-blocks/
I haven't had a chance to look at it myself, but I've heard it is based on RapidMind. I would say NT2 is quite comparable to that product, although, since NT2 is a public research project without the full backing of Intel, it is probably inferior in some aspects, while still having good original ideas of its own. We are, however, only looking at submitting the SIMD part to Boost, rather than the whole of NT2, which also does threads, MPI, matrices, vectors, numerical stuff, etc. I believe it is important to be able to isolate and decouple components as much as possible; and I can indeed think of quite a few applications where I would want to use SIMD but not the whole artillery of different parallel technologies, nor make my application depend on a big runtime that does a lot behind the scenes. Being able to use it non-intrusively with my own types, or to extend it as needed, is a plus too. So for now, for Boost, we're just looking at providing nice ways to help programmers organize their code around packs of data (also known as vectors, but for various reasons we prefer a different name), each pack being the size of a SIMD register, and at ensuring that such code then generates efficient SIMD instructions.
It looks like a container library for vector processing in C++ with a JIT compilation runtime environment as part of the library. As a guy who appreciates compilers, languages and libraries, there is a lot to like there. This area is evolving very fast.
An interesting area to look into indeed. We have a few compilation-related projects around NT2 (some using Proto, some using an actual compiler) where we optimize array programming, but none uses a JIT. NT2 does all its fine-tuning (including for cache sizes, etc.) at compile time, so it cannot adapt to different CPUs without recompilation.

Recent versions of compilers are starting to support automatic loop vectorization, which can be utilized with minimal, if any, code changes. I imagine that as a library author you are more inclined to look for a library solution to this optimization problem, whereas compiler authors are more likely to try to implement it as part of the optimizer, but it seems to me that implementing it in the compiler is generally preferable.

On 12/10/2010 00:46, Jeremy Maitin-Shepard wrote:
Implementing automatic vectorization in compilers is certainly worthwhile. Other people are working on this; it is not really our area of expertise. We consider our approach complementary; the two benefit different applications. To produce good vectorized code, you need to design your code in a way that is vectorizable. By making the vectorization explicit, it becomes clear what you need to do to make it work, and what the problems preventing vectorization are. It can require concessions or reorganizing your code, something an optimizer will not do. Also, auto-vectorization technology is not at a level where it can compete with explicit vectorization. Even in an ideal setup where everything should be vectorizable, there is a very large gap between explicit vectorization and purely automatic vectorization, as demonstrated for example by this benchmark: <http://eigen.tuxfamily.org/index.php?title=Benchmark>

Jeremy Maitin-Shepard wrote:
Even a "good" vectorizing C++ compiler is pretty bad, practically worthless, and while research continues apace, the practical implications of the state of the art hasn't really changed for decades. The simple fact is that most of the time the compiler just can't make enough useful assumptions about general C code to vectorize it safely. People write SSE code by hand in assembly (calling intrisics is writing assembly by hand, but more verbose) or accept the scalar code the compiler generates. Fortran is a different story, it is a language designed to make life easy for the compiler. The compiler can hardly fail to vectorize fortran code, but other than calling fortran libraries from C interfaces that doesn't help us much. If you just defined an object model that faithfully duplicates semantics of fortran and let people program that in C++ you could auto-vectorize it in the C++ compiler if it became part of the standard, but that would take years and cost millions of lives, its much easier to do it with a JIT runtime environment in a library. Since we already have the compiler technology for languages that allow it to work well it seems best to implement such a language as a DSEL that allows us to use that compiler technology from C++ rather than try to implement a vector compiler in a template metaprogram or keep beating our heads against the vectorizing general C code brick wall. JIT runtime is an end run around standardization and is kind of like going through Belgium to invade France. You get to Paris without all the fuss. Regards, Luke

Agreed
Agreed
I have trouble seeing how JIT and high performance, or JIT and embedded platforms, can be usable together, hence our static solution. I urge people to take a look at our BoostCon slides. We are not speaking into thin air; we already have a working proof of concept, and even more than that.

On 12/10/2010 04:23, Simonson, Lucanus J wrote:
Since we already have the compiler technology for languages that allow it to work well it seems best to implement such a language as a DSEL that allows us to use that compiler technology from C++ rather than try to implement a vector compiler in a template metaprogram
I don't understand. Implementing a C++ DSEL means implementing a compiler as a C++ meta-program, since it's a domain-specific language embedded in the C++ language. Did you mean it without the E, i.e. a domain-specific language with a normal compiler infrastructure? We're doing that too, re-using the same technology.

Mathias Gaunard wrote:
We want something that can be called from C++ and extends C++. If we wanted a domain-specific language, there are CUDA, OpenCL and Fortran itself (if you care to think of Fortran as a high-performance-computing domain-specific language). Implementing a C++ DSEL does mean implementing the front end of the compiler as a meta-program if you want your errors at compile time instead of runtime, but the backend of the compiler can be deferred to runtime by compiling the expressions to expression data objects (some IR) that aren't compiled into executable code until the first time they are called. Having done expression templates and meta-programming (including expression templates for vector primitives), I can honestly say that I wouldn't want to implement an optimizing compiler as a template meta-program. It would be a very tedious and painful task, with the end result being something thoroughly impractical due to long compile times and the impossibility of maintaining it. It could not compete with the quality and speed of compilers programmed in C. Regards, Luke

On 12/10/10 18:40, Simonson, Lucanus J wrote:
Having done expression templates and meta programming (including expression templates for vector primitives) I can honestly say that I wouldn't want to implement an optimizing compiler as a template meta-program. It would be a very tedious and painful task with the end result being something that is thoroughly impractical due to long compile times and the impossibility of maintaining it.
Really? Try pushing Proto to its fullest: it's not that hard to maintain once you have a neat generic package around it and use it to generate a proper intermediate representation of your code fragment.
It could not compete with the quality and speed of compilers programmed in C.
Compile time is an expendable resource; runtime is not.

joel falcou wrote:
Both are expendable. Compile time goes to developer productivity. If the developer is more productive he will be able to either write more functionality, or better optimize the runtime performance of the same functionality with better algorithms and finer tuning. Before dismissing JIT, perhaps you should do some benchmark comparisons? There is a pretty big difference between parsing syntax into an expression template and doing minor operations on the expression, and implementing constant propagation and dead-code elimination. I know Proto can help with the former, but I'm not too keen on using it for the latter. I'm pretty sure we can optimize across expressions at runtime, while we can only optimize within expressions at compile time with template meta-programming. You're cutting off quite a bit of optimization opportunity by limiting the scope of optimizations to the expression level. Regards, Luke

While a little awkward to program with, the use of "auto" allows multi-statement optimizations. However, unless you want *very* awkward coding, the control flow will always be invisible to a C++ DSEL optimizing library. The hard part of vectorization (and what defeats most compilers) is the complexity inherent in handling control flow; I don't mean simple things like loop unrolling, I'm thinking of loop invariants, loop lifting (commutativity of operations across loops), loop pipelining (distributivity of operations between loops), etc. Even "well-behaved" codes for shaders in 3D graphics contexts are very branchy nowadays and require sophisticated compilers for kernels to fully utilize the hardware.

On 12/10/10 19:03, Simonson, Lucanus J wrote:
Both are expendable. Compile time goes to developer productivity. If the developer is more productive he will be able to either write more functionality or better optimize the runtime performance of the same functionality with better algorithms/finer tuning. Before dismissing JIT perhaps you should do some benchmark comparisons?
We did, and we won the benchmarks.
There is a pretty big difference between parsing syntax into an expression template and doing minor operations on the expression, and implementing constant propagation and dead-code elimination. I know Proto can help with the former, but I'm not too keen on using it for the latter. I'm pretty sure we can optimize across expressions at runtime, while we can only optimize within expressions at compile time with template meta-programming. You're cutting off quite a bit of optimization opportunity by limiting the scope of optimizations to the expression level.
Stop, you're trying to shoehorn in too much where it doesn't belong. We use ET and Proto to generate a fine layer of small code fragments to be compiled in the correct sequence, using the proper intrinsics and whatnot. Then all the low-level compilation is handled by the compiler. Render unto Caesar...

Could you share the results?
More than that, I'm eager to know how dynamically generating code (in a string? in a file? as some bytecode array?), THEN running a compiler, then executing the resulting binary can beat statically compiled code dealing with vector registers. Explain this to me and I'll convert.

I looked at Intel's Array Building Blocks and tried to understand how they were doing JIT. More here: http://nonchalantlytyped.net/blog/2010/09/26/deconstructing-intels-array-bui... Basically, they use a mix of operator overloading and some cleverly named macros to make C++ statements "generate" abstract syntax trees at run time, then JIT them to SSE, supported by Threading Building Blocks. Manjunath

On 12/10/10 20:32, Manjunath Kudlur wrote:
I can't see how it differs from ET, nor what the benefit of doing static code generation at runtime is... I must be really dense, but I can't see this winning any race...

On Tue, 12 Oct 2010, joel falcou wrote:
Presuming a sufficiently advanced JIT architecture:

Static code generation
- may not have access to custom opcodes or hardware on the user's computer
- may not be able to produce all optimal variants
- cannot change after delivery
- can only guess the actual runtime code paths

A JIT on the user's machine can support
- local benchmarking and profiling
- user-side optimization (including profile hints)
- use of an upgraded compiler
- saving compiled results
- ...

Basically, think of it as moving the compiler from the developer's box into the user's executable. Both are "the same compiler", but the user's copy presumably knows more about the actual input data and the hardware available. Knowing when and how to invoke the JIT is an art (Java uses it too much, C/C++ too little), but there are certain patterns which are a clear win. For example, a JIT can inline several functions in a user-selected computation pipeline (examples: DSP blocks, or high/medium/low special-effects options); this eliminates branches, temporaries, etc. (hence its use in current GPU shader languages). The LLVM project supports turning the full compiler into a JIT (hint: there is no essential need for on-disk object files). With the Clang frontend, it should even be possible to compile Boost code into a running application... Later, Daniel

Daniel Herring wrote:
I want to underscore the similarity between shaders in graphics and vector kernels in HPC applications. As we offload to the GPU, the distinction gets even fuzzier. When your game loads it tells you it's compiling the shaders. That's a pretty crummy user experience in and of itself, but we don't quibble, because we are happy with the user experience that comes immediately after. In theory you could pre-compile shaders for every variant of graphics hardware out there, ship them with the game, and then patch it as new hardware comes out. In practice we are happier to wait for the shaders to compile when we launch the game or get to the next level. It isn't realistic to expect every user to compile the application for their platform, nor is it reasonable to expect every app vendor to provide their source code to the customer to recompile.
I know a lot of people who are excited about LLVM + Clang, and JIT comes up a lot since LLVM is lightweight and fast. It would be nice to have an open-source (and free) C++ interpreter. We could use LLVM for things like auto-vectorization JIT in a snap. People get it. Regards, Luke

On 12/10/2010 22:34, Daniel Herring wrote:
Unlike Joel, I do see the advantages of using a JIT for high-performance computing, and the potential benefit of including such technology in NT2, although those benefits would be minimal in the market NT2 currently targets. I think, however, that this is getting way off-topic. We are only looking at submitting the SIMD abstraction layer to Boost, nothing more, and certainly not the whole of NT2. There is much more to a numerical computation library or runtime than just SIMD. People have expressed interest in using a SIMD abstraction layer to speed up uBLAS, but it has various implications beyond that, and can also be used to give a significant speedup to text processing, compression algorithms, or whatever else you might think of. I want to repeat what the SIMD library is, in case it hasn't been clear: it only provides an abstraction that allows one to model SIMD registers, and tools to help packetize data into register-sized packets while taking care of alignment considerations; all of it while ensuring that operations on the packets translate directly into the expected SIMD instructions, for many different architectures, on all major C++ compilers. As a Boost library, it intends to do only one job, in a generic and minimal fashion, and do it well. NT2.SIMD or Boost.SIMD would take all its code-generation decisions at compile time according to what the compiler claims to support, like a regular C++ library. If the user of the library wants to take those decisions at runtime, it will be up to them to JIT the code (which shouldn't be too difficult, since we aim to provide strong integration with Clang), use a strategy similar to that of the Intel compiler (which, as already said in this thread, the library integrates with very well), or use an environment or runtime built upon the library that already does it.

On 13/10/10 09:39, Mathias Gaunard wrote:
I was unclear and a bit spun yesterday, for out-of-ML reasons. I have nothing against JIT; we even JIT OpenCL at the moment for GPGPU support. I was just a bit wary of JIT for the problem at hand and the proposal we were making.
As a Boost library, it intends to do only one job, in a generic and minimal fashion, and do it well.
Agreed

Sorry, I haven't read the whole thread. Another question: the SSE series on x86 provides several ways to catch errors, specifically the control word/flags and exceptions/signals (I'm not aware of other platforms). Are you going to model such subtle platform-specific details? And if yes, how? As a C++ programmer I would prefer the C++ exception mechanism. -- Pavel P.S. if you notice a grammar mistake or weird phrasing in my message please point it out

On 13/10/2010 16:58, DE wrote:
There is no support for catching such errors, at least not yet. Translating them to C++ exceptions sounds to me like translating SIGABRT or other errors to C++ exceptions, in the fashion of what Visual C++ does with SEH; i.e. not a good idea. On a technical note, it might also be complicated to do: since that state is global, it could be hard to get right in the presence of threads. We'll nevertheless look into the details of this.

Daniel Herring wrote:
But in practice it doesn't help much. As a non-JIT developer, you can figure out the minimum requirements for your program to run decently. Optimize for that! If your customer has better hardware, fine, it will run even faster. If your calculations need SIMD extensions, no JITing in the world can fix that on my 486. Bo Persson

joel falcou wrote:
Could you share the results?
Tell me which stuff you JIT with ;)
I've been talking about the Ct/RapidMind stuff, though my discussion has been about the general idea and not the specifics of what they may have implemented. I don't know exactly what is in there and haven't tried it myself yet.
You can write IR to memory, or load it as part of the process image, run the backend of the compiler on that, and dynamically link in the result by replacing some function pointer in a table with the address of the generated executable code. Implement it as a try/catch around the function call, initialize the function pointer in the table to a null pointer, and you get near-zero runtime overhead for subsequent function calls. Allow the JIT compiler to inline these functions into each other and you can reduce the function overhead even further. Alternately, implement it as something that runs at program startup and eliminate the indirection entirely. As to how it could be better: if the dynamically generated code makes use of new hardware features not available at the time the original code was compiled... If the dynamically generated code is better than the statically generated code because it has better algorithms for optimizing vector instructions... If the compile time is negligible compared to the runtime of the resulting code... If the dynamically generated code offloads work to the GPU through shared L3 cache... If the dynamically generated code multithreads in addition to vectorizing and dynamically schedules the work onto all available cores... Need I go on? I shall. It doesn't make sense for vector intrinsics compiled by the C++ compiler to outperform Fortran compiled to vector instructions by the Fortran compiler. I know perfectly well that it is the same backend compiler, but the difference is that with C++ the compiler can't know which optimizations are safe, while the Fortran compiler easily can. As long as the embedded language allows the compiler to make the assumptions needed to enable optimizations that the C++ compiler can't, the code it generates will outperform the C++ compiler's, even if it's the exact same backend compiler being used. There is practically no way you can tell the C++ compiler which assumptions are safe; there just aren't sufficient pragmas and flags for that, and even if there were, they wouldn't be portable. In any case, I don't really know much about this stuff; I encourage you to look at what's in there yourself. It could be that it's much less good than I imagine it to be based upon my brief reading of the description. You could be a lot smarter than the guys from MIT who did it, who knows? Regards, Luke
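A minimal sketch of the function-pointer-table mechanism described here, using an explicit null check rather than a trap on the null pointer, and assuming a hypothetical jit_compile() that turns stored IR into executable code (no real JIT library API is implied):

#include <cstddef>

typedef void (*kernel_fn)(float*, std::size_t);

// Hypothetical: compiles the IR recorded for this kernel and returns
// a pointer to the freshly generated machine code.
kernel_fn jit_compile(int kernel_id);

// One slot per kernel; null until the kernel is first called.
static kernel_fn kernel_table[16];

void call_kernel(int id, float* data, std::size_t n)
{
    if (!kernel_table[id])                 // first call: compile lazily
        kernel_table[id] = jit_compile(id);
    kernel_table[id](data, n);             // later calls: one indirect jump
}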

I've been talking about the Ct/RapidMind stuff, though my discussion has been about the general idea and not the specifics of what they may have implemented. I don't know exactly what is in there and haven't tried it myself yet.
I am sure both the JIT approach and the static metaprogramming approach have their places. My issue with Ct/RapidMind/ArBB is the unnaturalness it introduces for the programmer. It seems the programmer has to be aware of "retained" mode vs. immediate mode when coding his algorithms. For some problems, I believe static metaprogramming (a la NT2) can be more natural for a C++ programmer.
If the dynamically generated code makes use of new hardware features not available at the time the original code was compiled....
Can't beat JIT in the above scenario. But the above scenario is arguably fictitious. HPC people will definitely recompile their code on newer platforms.
If the compile time is negligible compared to the runtime of the resulting code...
Vectorizing is no easy task; state-of-the-art algorithms are O(N^2) in the size of a basic block, so JIT compilation has non-negligible overhead. Unless programmers using a JIT are aware of this (using "retained" mode to pre-compile functions once and call them many times), they can easily blow away the benefits. Manjunath

Manjunath Kudlur wrote:
Manycore with SSE is going mainstream. Desktop applications of the near future will look more and more like HPC applications of today. Desktop users will not recompile their code. Regards, Luke

On 12/10/10 20:41, Simonson, Lucanus J wrote:
If the dynamically generated code makes use of new hardware features not available at the time the original code was compiled....
Well, does recompiling cost you that much?
If the dynamically generated code is better than the statically generated code because it has better algorithms for optimizing vector instructions...
Care to elaborate?
If the compile time is negligible compared to the runtime of the resulting code...
I'm asking for an example; this seems more than fuzzy.
If the dynamically generated code offloads work to the GPU through shared L3 cache...
We can do that
If the dynamically generated code multithreads in addition to vectorizing and dynamically schedules the work onto all available cores...
We already do this, and there is no need for JIT here; a runtime scheduler is enough.
It doesn't make sense for vector intrinsics compiled by the C++ compiler to outperform Fortran compiled to vector instructions by the Fortran compiler. I know perfectly well that it is the same backend compiler, but the difference is that with C++ the compiler can't know which optimizations are safe, while the Fortran compiler easily can. As long as the embedded language allows the compiler to make the assumptions needed to enable optimizations that the C++ compiler can't, the code it generates will outperform the C++ compiler's, even if it's the exact same backend compiler being used. There is practically no way you can tell the C++ compiler which assumptions are safe; there just aren't sufficient pragmas and flags for that, and even if there were, they wouldn't be portable.
Then you miss the point of what we are trying to do. We don't, and never claimed to, vectorize ANY random C++ code, only instrumented code of ours. And in this case we know we can, because WE control everything. I guess blaming the poster was so much more fun than actually requesting details. And all in all, I guess trying to innovate on this in an OS project is useless too...

joel falcou wrote:
You can't have your cake and eat it too. If you leave the low level optimization to the C++ compiler you have to accept that it will punt when it can't be sure whether two pointers might be aliases, for example.
I wasn't trying to blame you for anything other than dismissing the JIT approach out of hand, and my only intention was to get you to spend some time looking into this Ct thing, which looks quite promising. I'm not sure why you would get frustrated with participating in OS, but it's clear I've upset you, and the wisecrack about being smarter than the guys who came up with Ct was uncalled for; I apologize. Regards, Luke

On 12/10/10 21:15, Simonson, Lucanus J wrote:
You can't have your cake and eat it too. If you leave the low level optimization to the C++ compiler you have to accept that it will punt when it can't be sure whether two pointers might be aliases, for example.
This can be checked quite cheaply at runtime after some ET-based exploration. We did it in NT2 and it works out of the box; hardly the definition of vaporware.

Notes on clang/LLVM + C++ + JIT stuff:

////////////////////////////////////////////////////////////////////
// Copyright 2008 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <iostream>
#include <boost/proto/core.hpp>
#include <boost/proto/context.hpp>
#include <boost/typeof/std/ostream.hpp>

namespace proto = boost::proto;

proto::terminal< std::ostream & >::type cout_ = {std::cout};

template< typename Expr >
void evaluate( Expr const & expr )
{
    proto::default_context ctx;
    proto::eval(expr, ctx);
}

int main()
{
    evaluate( cout_ << "hello" << ',' << " world" );
    return 0;
}
//]

wash@Pegasus:~/sandbox$ time clang++ -cc1 -triple x86_64-unknown-linux-gnu -emit-llvm-bc -x c++ hello.cpp

real    0m1.958s
user    0m1.676s
sys     0m0.276s

wash@Pegasus:~/sandbox$ lli hello.bc
hello, world
wash@Pegasus:~/sandbox$

There's nothing magic about JITing C++, even JITing non-trivial C++. It's also basically pointless, because the LLVM bitcode/LLVM assembly generated from a C++ program is absolutely, positively not portable. Starting in the earliest stages of compilation (preprocessing), platform dependencies enter the code (not to mention all the preprocessor workarounds). Add in alignment jazz. Plus sizes. Plus implementation details (GNU expands sizeof and assert in the preprocessor, for example). Plus the CXXABI. C++ is not Java, and Clang/LLVM are not magic. We can do cool things with Clang/LLVM, C++ and JIT, but at the end of the day, C++ is not Java.

Other stuff: I think I saw some comments implying that C compilers were faster or better than C++ compilers. I know that I saw people claim that C compilers were faster, better, or could optimize better than C++ compilers implemented using expression templates and metaprogramming techniques. Such claims are unfounded. http://clang.llvm.org/performance.html - Clang uses limited template metaprogramming techniques, and it is quickly shaping into a solid, production-quality open-source C++ compiler (I have compiled a semi-functional Linux kernel with Clang); it is by far superior to its C predecessor, GCC, when it comes to optimization. Boost.Wave, a standards-compliant C preprocessor, is implemented using Boost.Spirit. www.ll.mit.edu/HPEC/agendas/proc02/abstracts/mullin.pdf - a really smart MIT guy's paper on really cool optimizations, done in C++ with C++ TMP techniques, specifically expression templates.

-- Bryce Lelbach aka wash http://groups.google.com/group/ariel_devel

[probably my last post on this -- agreed it is OT, and the proposed SIMD framework would be interesting] On Wed, 13 Oct 2010, Bryce Lelbach wrote:
Agreed on the portability, but I think you missed the point. People are advocating something like JIT, not to VM bytecode but directly to machine code, exactly to get non-portable optimizations.
The normal example is Fortran being faster than C/C++ because it doesn't have to worry about issues such as pointer aliasing. Later, Daniel

At Wed, 13 Oct 2010 01:02:17 -0400, Bryce Lelbach wrote:
But all those things are not intrinsic to C++. The C++ abstract machine doesn't specify any of those platform dependencies. It's just that you need to do some of those optimizations in the front-end for full compilation to object code, and nobody is interested in C++ code that's link-incompatible with all the libraries, etc., that have already been compiled. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

[Repost: the first one doesn't appear to have made it through] Just trying to refocus this thread on what I meant it to be. We have a SIMD abstraction layer that we would like to eventually submit to Boost. For appropriate values of N, simd::pack<T, N> represents a SIMD register, which allows you to do the same operation on N elements in parallel with a single CPU instruction. pack<T, N> provides all the same operators as T, and can also detect a sequence of operations that exists on the CPU as a single instruction. It falls back to a loop if the architecture has no SIMD register of size sizeof(T)*N bytes. simd::pack<T, N> is also a Fusion sequence and a range. The library also provides a series of useful functions, like summing a pack or reordering its elements. What I would like to know is how people think we could integrate this system into iterators and ranges, so that existing algorithms could be adapted to treat N elements at a time instead of 1 and therefore get a potential speed gain. As I said, we currently have an iterator adapter that adapts a range of properly aligned Ts, contiguous at least in chunks of N elements, into a range of pack<T, N>s. Are there other utilities we could provide to help with the usage of SIMD? I was thinking of supporting adapting non-aligned ranges as well, padding them with some values, but that is not possible to do efficiently with the standard iterator model, which led to some discussion about an alternative iterator model that seems to be going nowhere. Any suggestion of features or tools the SIMD library could provide to make the life of the developer easier would be appreciated. With regard to alignment, we also provide a memory allocator that aligns correctly, and functions to get the next or previous aligned address from a given address.
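To illustrate the intended use, here is a sketch based only on the description above; since the library is not yet published, the pack operators, simd::sum and the pointer-based adaption below are illustrative stand-ins, not the real interface:

#include <cstddef>
#include <numeric>
#include <vector>

namespace simd {
    // Stand-in for the described pack type; a real pack<float, 4> would
    // map onto one SSE register.
    template <typename T, std::size_t N>
    struct pack { T v[N]; };

    template <typename T, std::size_t N>
    pack<T, N> operator+(pack<T, N> a, pack<T, N> const& b)
    {
        for (std::size_t i = 0; i != N; ++i) a.v[i] += b.v[i];
        return a;
    }

    // Horizontal sum of the N lanes of a pack.
    template <typename T, std::size_t N>
    T sum(pack<T, N> const& p)
    {
        T r = T();
        for (std::size_t i = 0; i != N; ++i) r += p.v[i];
        return r;
    }
}

// Sums a vector whose data is assumed aligned and whose size is assumed
// to be a multiple of 4; stands in for the iterator adapter (the cast
// glosses over the aliasing rules a real adapter would have to respect).
float sum4(std::vector<float> const& tab)
{
    typedef simd::pack<float, 4> pack4;
    pack4 const* first = reinterpret_cast<pack4 const*>(&tab[0]);
    pack4 const* last  = first + tab.size() / 4;
    pack4 zero = {};
    return simd::sum(std::accumulate(first, last, zero));
}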

On 10/13/2010 01:43 PM, Mathias Gaunard wrote:
"seems" being the operative word ;) I find the argument to investigate a framework for internal iteration / push iteration / visitation iteration / whatever convincing. I haven't been convinced yet (despite Dave's best efforts) that the standard, external iteration model, segmented or otherwise, will absolutely address your efficiency concerns. Have you benchmarked the performance difference between the standard iteration model and push iteration? - Jeff

I've only had a chance to read the Austern paper "Segmented Iterators and Hierarchical Algorithms" once. I agree that segmented iterators are rather appealing as an interface to a packs-of-PODs datatype. However, I'm not sure the conceptual interface of segmented iterators is rich enough. Luke Simonson has more experience than I do with building an iterator pattern around SIMD in a high-performance system, but I seem to remember that the pattern for SIMD iteration always looked like this:

Loop A?
Loop B*
Loop C?

where Loops A and C handle the unaligned start/end (if they exist) and Loop B* handles the (zero or more) aligned packs of the sequence. Handling the edge cases for A and C was definitely not the same: there are memory exceptions that can occur at the end of the range in Loop C that can't (or usually don't) occur in Loop A when doing unaligned loads/stores. I seem to remember it was insufficient to just mark "not Loop B"; it was necessary to explicitly mark foreach operations as "Loop A" vs. "Loop C". It seems this A/B/C division will need to be exposed through the iterator; see the sketch after this message. My other concern is that while it should be fairly straightforward to provide efficient SIMD instruction sequences for addition, subtraction, multiplication, (refined) division, and related fused-accumulate ops, many of the more "interesting" SIMD algorithms are highly sensitive to swizzling and permutation, prefetching, built-in conversion operations, etc. A SIMD library exposing just (for instance) the Ring or Field algebraic operations has its uses; however, I feel that in most real-world scenarios the unexposable details needed for high-performance coding will mean that the library is only used for toy applications. That is, unless you think you can also define platform-independent forwarding to swizzle/permute/conditional-lane-usage/etc. Another (particularly ugly) use of SIMD is to allow per-object aligned storage and loads of structures into/out of binary array blobs. This would basically look like an allocator for an object that uses the SIMD extensions for loading/storing data structures as binary blobs into and out of arrays. Have you considered an interface for this sort of usage?
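For reference, a minimal sketch of that A?B*C? structure for a unary in-place operation, assuming 16-byte SSE alignment; the placeholder kernels are illustrative, and a real Loop B body would use intrinsics or a pack type:

#include <cstddef>
#include <cstdint>

// Placeholder kernels; only the packed one would use SSE in real code.
inline void process_scalar(float& x) { x *= 2.0f; }
inline void process_pack(float* p) { for (int i = 0; i != 4; ++i) p[i] *= 2.0f; }

void process(float* first, float* last)
{
    // Loop A?: scalar prologue until first is 16-byte aligned (or we run out).
    while (first != last && reinterpret_cast<std::uintptr_t>(first) % 16 != 0)
        process_scalar(*first++);

    // Loop B*: zero or more whole packs of 4 floats, all aligned.
    for (; last - first >= 4; first += 4)
        process_pack(first);

    // Loop C?: scalar epilogue for the tail that cannot fill a pack.
    while (first != last)
        process_scalar(*first++);
}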

Smith, Jacob N wrote:
It seems to me a segmented iterator could provide a generic way to, for example, vectorize an algorithm on a deque, by letting me unpack the pointer and the distance to the end of the current contiguous address range. I could implement A?B*C? for each contiguous range without needing access to the internals of the deque data structure. However, the deque doesn't provide a segmented iterator interface, nor does anything else that I'm aware of, and I don't see that changing, so I don't see the point in implementing vector-accelerated algorithms with segmented-iterator-based interfaces to work with all the segmented-iterator data types that don't exist and probably never will. My work with the vectorized A?B*C? using a pointer-based interface and a generic callback functor argument (sketched below) showed that it could match the performance of hand-crafted assembly, with all of my loop unrolling, prefetching and similar handiwork. There was a lot of special-casing for the unaligned beginnings and ends of the source and destination address ranges. I made great use of it vectorizing some pretty complex general C-style code, and got nice speedups for data that was unaligned and for address ranges that were often quite small. It was a very productive abstraction and easy to apply. The benefit of abstracting away the complexity of memory access was a big win. Regular iterators didn't perform as well, but I did implement and use them, because they were better than nothing. I didn't look at segmented iterators. I think we might want a different abstraction than segmented iterators to vectorize algorithms that are abstracted away from data structures. Perhaps we should think in terms of arrays instead of scalar elements as the smallest unit. Abstractions and semantics are important precisely because they dictate how you think about a problem. I want simple abstractions and semantics that make the problem at hand easy to think about. So far I haven't heard anyone explain how a segmented iterator makes it easier, not harder, to vectorize an algorithm. Regards, Luke
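A sketch of the pointer-plus-callback interface described above, reusing the A?B*C? split: the traversal pushes either single floats (at the borders) or packs of four (in the aligned middle) into an overloaded functor. All names are illustrative, and the cast again glosses over aliasing details:

#include <cstddef>
#include <cstdint>

template <typename T, std::size_t N>
struct pack { T v[N]; };

// Pushes every element of [first, last) into f, packed where possible.
// f must be callable with both float& and pack<float, 4>&.
template <typename F>
F for_each_simd(float* first, float* last, F f)
{
    while (first != last && reinterpret_cast<std::uintptr_t>(first) % 16 != 0)
        f(*first++);                                  // Loop A?: scalar
    for (; last - first >= 4; first += 4)
        f(*reinterpret_cast<pack<float, 4>*>(first)); // Loop B*: packed
    while (first != last)
        f(*first++);                                  // Loop C?: scalar
    return f;                                         // like std::for_each
}

// Example functor: accumulates scalar and packed pushes separately,
// then combines them. The packed overload would use SSE in real code.
struct accumulator {
    float scalar;
    pack<float, 4> vec;
    accumulator() : scalar(0), vec() {}
    void operator()(float& x) { scalar += x; }
    void operator()(pack<float, 4>& p)
    { for (int i = 0; i != 4; ++i) vec.v[i] += p.v[i]; }
    float total() const
    { return scalar + vec.v[0] + vec.v[1] + vec.v[2] + vec.v[3]; }
};

// Usage: float s = for_each_simd(p, p + n, accumulator()).total();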

We did our homework, I can reassure you. We provide more than just operators on these packs, and we use Proto to detect fused-operation sequences and replace them before evaluation. As for swizzling and permutes, we have various potential interfaces for this and are trying to settle on one. Conversions are done through a simd::cast<T> operator. As for prefetching, the current solution is to provide a generic prefetch function one can use. In NT2, those prefetches are estimated and inserted during array evaluation, but here we deal with a lower level.
I fail to understand your use case. We already have a SIMD-compatible allocator, but what are you referring to?

Joel.Falcou@lri.fr wrote:
Hi, I would like to add a pointer to your lib on the libraries-under-construction page: http://svn.boost.org/trac/boost/wiki/LibrariesUnderConstruction. Could you please give some references where we can download the SIMD library, and a link to the documentation? Best, Vicente

On 14/10/10 07:49, vicente.botet wrote:
It's currently part of NT2 and needs to be extracted and boostified. NT2's on-going development (and I emphasize *on-going*) is at github.com/jfalcou/nt2. I am currently, slowly, refactoring and pushing our old code base into the new system, so things are in flux atm. The main area is include/nt2/sdk/simd.

At Wed, 13 Oct 2010 16:39:18 -0700, Smith, Jacob N wrote:
I think you should read the paper again and look at the implementation of one of the algorithms there. They all have that structure. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
participants (24)
- Andrew Sutton
- Bo Persson
- Bryce Lelbach
- Daniel Herring
- David Abrahams
- David Sankel
- DE
- Eric Niebler
- Felipe Magno de Almeida
- Jeffrey Lee Hellrung, Jr.
- Jeremy Maitin-Shepard
- Joel de Guzman
- Joel Falcou
- joel falcou
- Joel.Falcou@lri.fr
- Manjunath Kudlur
- Mathias Gaunard
- Robert Jones
- Robert Ramey
- Sebastian Redl
- Simonson, Lucanus J
- Smith, Jacob N
- Thomas Klimpel
- vicente.botet