Anyone interested in a generic loop (and loop unrolling) library ?

Loops, in particular loop-unrolling, can be made generic, easy to write, and accessible to everybody. With the help of Boost.Lambda (and a few other boost libraries), we can easily construct arbitrary loops that are unrolled at compile-time. I haven't implemented the full library yet, in fact I've implemented only a few very preliminary functionalities (hardly flawless), it serves only to illustrate the ideas. From what they can do, a possible Boost.Loop library looks very promising. However I feel that I need much help on the design of such a library. The implementation could be straightforward (with the help of so many good libraries in Boost), once a good design is found. Hopefully there are interested people here so I won't have to work on my own. (If there already exists such a library, please let me know.) In the example, you can see how easy it is to construct arbitrary (limited only by Boost.Lamda) loops that are generically unrolled at compile time. With an optimizing compiler, it can be as efficient as hand-written codes. Not only that, but also can we make (almost) any linear-algebra libraries work together without difficulty. In the example, we don't need any extra coding to make a c_array, std::vector, or even a column_view of a ublas::matrix to work together. In the following I give the full example code to illustrate what it's capable of and how to use it. The header file, loop.hpp, is attached at the end of this post. The example has been tested with apple-gcc4.0.1 . Thanks, Hui /**********************************************************************************************************/ #include <vector> #include <iostream> #include <iterator> #include <functional> #include <boost/array.hpp> #include <boost/numeric/ublas/matrix.hpp> #include <boost/numeric/ublas/matrix_proxy.hpp> #include <boost/numeric/ublas/io.hpp> #include "loop.hpp" using namespace std; using namespace boost; using namespace boost::lambda; using namespace boost::numeric; using namespace boost::numeric::ublas; using namespace loop; int idt(int i) { return i; } int main(int argc, char* argv[]) { const size_t N = 8, M = 3; boost::array<double,N> v; std::vector<double> w(N); matrix<double> m(N,M); // All loops are unrolled at compile time // indtrans acts like an array whose individual element equals its index index_transform<double(int)> indtrans(&idt); // equivalent to for(i=0;i<N;i+=2){ v[i]=w[i]=1; } loop_over< linear_range<0,N,2> >( _1 = _2 = 1, v, w ); // equivalent to for(i=1;i<N;i+=2){ v[i]=i; } loop_over< linear_range<1,N,2> >( _1 = _2, v, indtrans ); // equivalent to for(i=1;i<N;i+=2){ w[i]=i; } loop_over< linear_range<1,N,2> >( _1 = _2, w, indtrans ); // loop_over< linear_range<1,N,2> >( _1 = _2 = _3, v, w, indtrans ); // I wonder why this line doesn't compile with gcc4?? m.clear(); // set m to a zero matrix // equivalent to for(i=1;i<N;++i){ column(m,2)[i]=1+i*2; } loop_over< linear_range<0,N,1> >( _1 = 1+_2*2, loop::ref(column(m,2)), indtrans ); // equivalent to for(i=1;i<2;++i){ v[i]*=-2; } followed by for(i=N-3;i<N;++i){ v[i]*=-2; } loop_over< link_range<linear_range<0,2,1>,linear_range<N-3,N,1> > >( _1 *= -2, v); // vertically print out v and w, side by side loop_over< linear_range<0,N,1> >( cout << _1 << constant('\t') << _2 << constant('\n'), v, w ); cout<<endl; // print out m cout << m << endl << endl; double s = 0; // calculate the inner-product of w and column(m,2) (when the column is viewed as a vector/array) loop_over< linear_range<0,N,1> >( _1 += _2*_3, loop::ref(loop::scalar(s)), w, loop::ref(column(m,2)) ); cout << s << endl << endl; return 0; } /**********************************************************************************************************/ Here is the loop.hpp /********************************************************************************************************** * loop.hpp * Created by Hui Li on 6/26/08. ******************************************************************************************************** */ #ifndef LOOP_H #define LOOP_H #include <boost/static_assert.hpp> #include <boost/ref.hpp> #include <boost/lambda/lambda.hpp> #include <boost/function.hpp> #include <boost/function_types/function_arity.hpp> #include <boost/mpl/if.hpp> #include <boost/mpl/bool.hpp> #include <boost/mpl/assert.hpp> namespace loop { // the empty range struct empty_range {}; // a range starting from Start, ending at End, with stride Stride template < int Start, int End, int Stride = 1 > struct linear_range { BOOST_MPL_ASSERT(( boost::mpl::bool_<(Start*Stride<End*Stride)> )); static const int first = Start; typedef typename boost::mpl::if_ < boost::mpl::bool_< ((Start+Stride)*Stride < End*Stride) >, linear_range < Start+Stride, End, Stride >, empty_range >::type next_range; }; // link two arbitrary ranges template < typename LeftRange, typename RightRange > struct link_range { static const int first = LeftRange::first; typedef typename boost::mpl::if_ < typename boost::is_same < typename LeftRange::next_range, empty_range >::type, RightRange, link_range < typename LeftRange::next_range, RightRange > >::type next_range; }; // implementation of the unrolled loop over a range template < typename LambdaExpr, typename Range > struct loop_Impl { template < typename V1 > void operator()(LambdaExpr& expr, V1& v1) { expr(v1[Range::first]); loop_Impl < LambdaExpr, typename Range::next_range >()(expr,v1); } template < typename V1, typename V2 > void operator()(LambdaExpr& expr, V1& v1, V2& v2) { expr(v1[Range::first],v2[Range::first]); loop_Impl < LambdaExpr, typename Range::next_range >()(expr,v1,v2); } template < typename V1, typename V2, typename V3 > void operator()(LambdaExpr& expr, V1& v1, V2& v2, V3& v3) { expr(v1[Range::first],v2[Range::first],v3[Range::first]); loop_Impl < LambdaExpr, typename Range::next_range
()(expr,v1,v2,v3); } };
// looping over an empty range does nothing template < typename LambdaExpr > struct loop_Impl < LambdaExpr, empty_range > { template < typename V1 > void operator()(LambdaExpr&, V1){} template < typename V1, typename V2 > void operator()(LambdaExpr&, V1&, V2&){} template < typename V1, typename V2, typename V3 > void operator()(LambdaExpr&, V1&, V2&, V3&){} }; // helper functions template < typename Range, typename LambdaExpr, typename V1 > void loop_over(LambdaExpr& expr, V1& v1) { loop_Impl<LambdaExpr,Range>()(expr,v1); } template < typename Range, typename LambdaExpr, typename V1, typename V2 > void loop_over(LambdaExpr& expr, V1& v1, V2& v2) { loop_Impl<LambdaExpr,Range>()(expr,v1,v2); } template < typename Range, typename LambdaExpr, typename V1, typename V2, typename V3 > void loop_over(LambdaExpr& expr, V1& v1, V2& v2, V3& v3) { loop_Impl<LambdaExpr,Range>()(expr,v1,v2,v3); } // map a call to operator[] to a unary function all template < typename Function > struct index_transform { BOOST_STATIC_ASSERT(( boost::function_types::function_arity<Function>::value == 1 )); typedef boost::function<Function> func_type; typename func_type::result_type operator[](typename func_type::argument_type x) { return func(x); } template < typename FuncTp > index_transform(FuncTp f):func(f){} private: func_type func; }; // force removal of const-ness of a reference // not so elegant but should be okay template < typename T > T& ref(const T& t) { return const_cast<T&>(t); } // independent of whatever the index might be, it always returns the reference to the same variable template < typename T > struct scalar_Impl { T& operator[](int) { return ref; } scalar_Impl(T& t):ref(t){} private: T& ref; }; // scalar helper template < typename T > scalar_Impl<T> scalar(T& t) { return scalar_Impl<T>(t); } } // namespace loop #endif /**********************************************************************************************************/

Hui Li wrote:
Loops, in particular loop-unrolling, can be made generic, easy to write, and accessible to everybody. With the help of Boost.Lambda (and a few other boost libraries), we can easily construct arbitrary loops that are unrolled at compile-time.
Can you clarify exactly what you're trying to achieve? I don't think that loops, in general, and hard-to-write and not accessible to everybody. And speaking about loop unrolling -- the whole point of that is performance -- did you measure performance of the code using your approach and traditional code with suitable optimizations? - Volodya

Loop unrolling, per se, is not all that hard. But in combination with software pipelining and/or manual vectorization it can be a challenge to get it right. A template library could conceivable help. I generally use preprocessor macros in such situations but it quickly gets very ugly and very hard to maintain. However, the performance that can be gained by using such a library is largely a temporary thing. Compilers keep getting better and, in most cases, simply saying what you mean will allow a good compiler to make the necessary optimizations. If your compiler isn't that smart today chances are it will be tomorrow. So I wouldn't invest a whole lot of work in writing such a library nor in converting code to use it. Of course, even temporary wins have some value. If work does go ahead on this I will watch with interest and contribute when I can. On real code with real data I've achieved 60-70% speedups by manually unrolling, pipelineing, and vectorizing critical loops. The biggest problems with this approach, the factors that limit how much it gets done, are: 1) It takes a rocket scientist to make the code transformations safely and it takes a lot of low-level insight to know where the transformations are likely to help. 2) The performance gains are very sensitive to the microarchitecture of the execution platform so it is hard to justify for commercial software (that runs on a variety of platforms without recompilation) 3) The transformed code becomes essentially maintenance-proof. 4) To evaluate the effectiveness of the proposed transformation you really want to test a fairly large number of alternative transformations. Each one is very tedious to do: degree of unrolling .vs. alternative interleavings of pipeline stages .vs. data representation alternatives .vs. ... You can't test them all. A template library that successfully abstracted some of the mechanics of these transformations would, at a minimum, make it feasible to evaluate more alternatives on more microarchitectures and reduce the maintenance penalty of implementing the transformations. If it also made it feasible to deploy multiple runtime-selected implementations of a function from one source text that would be very interesting. -swn -----Original Message----- From: Vladimir Prus [mailto:vladimir@codesourcery.com] Sent: Sunday, June 29, 2008 1:15 AM To: boost@lists.boost.org Subject: Re: [boost] Anyone interested in a generic loop (and loopunrolling) library ? Hui Li wrote:
Loops, in particular loop-unrolling, can be made generic, easy to write, and accessible to everybody. With the help of Boost.Lambda (and a few other boost libraries), we can easily construct arbitrary loops that are unrolled at compile-time.
Can you clarify exactly what you're trying to achieve? I don't think that loops, in general, and hard-to-write and not accessible to everybody. And speaking about loop unrolling -- the whole point of that is performance -- did you measure performance of the code using your approach and traditional code with suitable optimizations? - Volodya

On Mon, Jun 30, 2008 at 10:27 AM, Stephen Nuchia <snuchia@statsoft.com> wrote:
Loop unrolling, per se, is not all that hard. But in combination with software pipelining and/or manual vectorization it can be a challenge to get it right. A template library could conceivable help. I generally use preprocessor macros in such situations but it quickly gets very ugly and very hard to maintain.
However, the performance that can be gained by using such a library is largely a temporary thing. Compilers keep getting better and, in most cases, simply saying what you mean will allow a good compiler to make the necessary optimizations. If your compiler isn't that smart today chances are it will be tomorrow. So I wouldn't invest a whole lot of work in writing such a library nor in converting code to use it.
Of course, even temporary wins have some value. If work does go ahead on this I will watch with interest and contribute when I can. On real code with real data I've achieved 60-70% speedups by manually unrolling, pipelineing, and vectorizing critical loops. The biggest problems with this approach, the factors that limit how much it gets done, are:
1) It takes a rocket scientist to make the code transformations safely and it takes a lot of low-level insight to know where the transformations are likely to help. 2) The performance gains are very sensitive to the microarchitecture of the execution platform so it is hard to justify for commercial software (that runs on a variety of platforms without recompilation) 3) The transformed code becomes essentially maintenance-proof. 4) To evaluate the effectiveness of the proposed transformation you really want to test a fairly large number of alternative transformations. Each one is very tedious to do: degree of unrolling .vs. alternative interleavings of pipeline stages .vs. data representation alternatives .vs. ... You can't test them all.
A template library that successfully abstracted some of the mechanics of these transformations would, at a minimum, make it feasible to evaluate more alternatives on more microarchitectures and reduce the maintenance penalty of implementing the transformations. If it also made it feasible to deploy multiple runtime-selected implementations of a function from one source text that would be very interesting.
-swn
-----Original Message----- From: Vladimir Prus [mailto:vladimir@codesourcery.com] Sent: Sunday, June 29, 2008 1:15 AM To: boost@lists.boost.org Subject: Re: [boost] Anyone interested in a generic loop (and loopunrolling) library ?
Hui Li wrote:
Loops, in particular loop-unrolling, can be made generic, easy to write, and accessible to everybody. With the help of Boost.Lambda (and a few other boost libraries), we can easily construct arbitrary loops that are unrolled at compile-time.
Can you clarify exactly what you're trying to achieve? I don't think that loops, in general, and hard-to-write and not accessible to everybody. And speaking about loop unrolling -- the whole point of that is performance -- did you measure performance of the code using your approach and traditional code with suitable optimizations?
- Volodya
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I would certainly be interested (in helping development as well). For what its worth, I c++0x-ified your code (more for demonstration, but still works well ... except range linking). template < typename ...Ranges> struct link_range; template < typename LeftRange, typename... Ranges> struct link_range<LeftRange, Ranges...> { static const int first = LeftRange::first; typedef typename boost::mpl::if_ < typename boost::is_same < typename LeftRange::next_range, empty_range >::type, typename link_range<Ranges...>::next_range, link_range < typename LeftRange::next_range, Ranges... > >::type next_range; }; template < typename LeftRange, typename RightRange > struct link_range<LeftRange, RightRange> { static const int first = LeftRange::first; typedef typename boost::mpl::if_ < typename boost::is_same < typename LeftRange::next_range, empty_range >::type, RightRange, link_range < typename LeftRange::next_range, RightRange > >::type next_range; }; // implementation of the unrolled loop over a range template < typename LambdaExpr, typename Range > struct loop_Impl { template < typename ...V1 > void operator()(LambdaExpr& expr, V1&... v1) { expr(v1[Range::first]...); loop_Impl < LambdaExpr, typename Range::next_range >()(expr,v1...); } }; // looping over an empty range does nothing template < typename LambdaExpr > struct loop_Impl < LambdaExpr, empty_range > { template < typename ...V1 > void operator()(LambdaExpr&, V1...){} }; // helper functions template < typename Range, typename LambdaExpr, typename ...V1 > void loop_over(LambdaExpr& expr, V1&... v1) { loop_Impl<LambdaExpr,Range>()(expr,v1...); }
participants (4)
-
Chris Fairles
-
Hui Li
-
Stephen Nuchia
-
Vladimir Prus