[multi-array] Performance? Difference from GIL or uBLAS?

Has anybody measured performance of multi-array? I am considering whether or not to use it, but when I googled multi_array I got a very old, but unresolved, discussion about poor performance measurements in experiments with optimized code that uses multi_array. (http://lists.boost.org/boost-users/2003/07/4423.php) Also, after a rough glance through multi_array and uBLAS it looks like GIL, uBLAS, and multi_array seem to repeat very similar concepts (and seem to use different syntax). Isn't that kind of bad? I have been playing with GIL in the past but now I am scratching my head about these other libraries. -- John

Also, after a rough glance through multi_array and uBLAS it looks like GIL, uBLAS, and multi_array seem to repeat very similar concepts (and seem to use different syntax). Isn't that kind of bad? I have been playing with GIL in the past but now I am scratching my head about these other libraries.
Beware that running ''some tests'' on an arbitrarily large array of any kind must take into account the fact that you may induce cache misses, especially in the above example, where I don't think even a quarter of the data fits into the L1 cache. Cache misses can lead to a tenfold increase in computing time, which may render the performance tests unusable. I have struggled quite a bit with optimized arrays and tried many ways to do this. It seems that getting all of the following at once is not easy:
- chained operator[] access syntax
- correct performance in the L1 cache
- an easy way to perform tiled computation
- compatibility with other high-performance settings (padding and alignment for SIMD, for example)
The best way so far is to use a compact, Numerical Recipes-like structure: for a D1*D2*...*Dn array, you allocate one D1*D2*...*Dn memory block plus n-1 arrays of pointers, one level per dimension. Returning the top-level pointer (a T with n stars, i.e. T**...*) lets you access element (i1, i2, ..., in) through n [] calls with no stride computation to do manually, which makes tiling easy to write as a multiple-loop nest over the array. And most of the time, a container that can't support tiling easily is useless in most real-life applications, since this simple technique can significantly increase runtime performance. I actually wrote a thesis on the subject and still struggle to get a decent library out of all my experiments. There is a prototype, but it's nowhere near completed (NT2 on SourceForge, for reference), in which I use expression templates to evaluate a somewhat optimal tiling size at compile time. If needed I can re-inject my experimental results here so we can all brainstorm about this, but I think the problem of an "easy-to-use and good performance" numeric data container is not easy and not currently addressed anywhere.
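A minimal 2D sketch of this Numerical Recipes-style layout (hypothetical helper names, not the NT2 code mentioned above): one contiguous block plus an array of row pointers, so a[i][j] resolves through two dereferences with no per-access stride multiplication.

#include <cstddef>

// Hypothetical NRC-style 2D allocation: one contiguous d1*d2 block of T plus
// d1 row pointers into it. a[i][j] is then two dereferences; the strides are
// computed once here, not on every element access.
template <class T>
T** nrc_alloc_2d(std::size_t d1, std::size_t d2)
{
    T*  block = new T[d1 * d2];     // contiguous data block
    T** rows  = new T*[d1];         // one pointer per row
    for (std::size_t i = 0; i < d1; ++i)
        rows[i] = block + i * d2;
    return rows;                    // use as rows[i][j]
}

template <class T>
void nrc_free_2d(T** a)
{
    delete[] a[0];                  // the contiguous block
    delete[] a;                     // the row-pointer array
}

The same idea extends to n dimensions with n-1 levels of pointer arrays; tiling then becomes an ordinary loop nest over the [][] syntax.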

joel falcou wrote:
Beware that running ''some tests'' on an arbitrarily large array of any kind must take into account the fact that you may induce cache misses
<snip>
I actually wrote a thesis on the subject and still struggle to get a decent library out of all my experiments. There is a prototype, but it's nowhere near completed (NT2 on SourceForge, for reference), in which I use expression templates to evaluate a somewhat optimal tiling size at compile time. If needed I can re-inject my experimental results here so we can all brainstorm about this, but I think the problem of an "easy-to-use and good performance" numeric data container is not easy and not currently addressed anywhere.
I think BLITZ++ used to have some very good performance comparisons with Fortran, IIRC. (Whatever happened to BLITZ++? Is it dead?) I understood the thread to be about doing the same operation on the same data, using multi_array versus doing the pointer arithmetic yourself. I am actually interested in the iterator performance in multi_array too, since its iterators seem to just wrap indexes. That _seems_ like it would be bad, but I dunno. --John

I think BLITZ++ used to have some very good performance comparisons with Fortran, IIRC. (Whatever happened to BLITZ++? Is it dead?)
Yes, it did; I used it and the invaluable material from Todd to build my own ET-based container library. I wish I knew what happened to Blitz indeed, as I think its code base - while verbose and untidy - was very interesting.
I am actually interested in the iterator performance in multi_array too, since its iterators seem to just wrap indexes. That _seems_ like it would be bad, but I dunno.
Iterators or pointers lead to almost the same performance (within a 5 to 10% margin, maybe). The real error people keep making is running tests on huge datasets without any cache-friendly algorithm.

joel falcou wrote:
I am actually interested in the iterator performance in multi_array too, since its iterators seem to just wrap indexes. That _seems_ like it would be bad, but I dunno.
Iterators or pointers lead to almost the same performance (within a 5 to 10% margin, maybe). The real error people keep making is running tests on huge datasets without any cache-friendly algorithm.
Did you mean to say 'index or pointer'? It seems like the trend in image
processing is to avoid using indexes in the inner loop if at all possible.
As far as the cache-friendly issue, the example used the same memory and
the same access pattern.
I repeated the test on my system with mingw g++ 4.3.1 and the -O2 flag,
and I got similarly bad results ...
BUT I figured out why the performance was so bad in that case!
First, the poster did not use BOOST_DISABLE_ASSERTS, so the multi_array
was doing the extra checks.
Second, the results of the operation were never read, so the compiler
removed the whole 'Pointers' test.
With that macro defined, both loops are completely removed by the compiler.
If I make the result volatile, then the running times are:
Pointer....: 532
Multi_array: 593
HOWEVER, when I switch to using iterators & pointers for the inner loop
the penalty is a factor of 10 (not 10%, but 1000%)
Pointer....: 516
Multi_array: 5141
Again, same access pattern, same memory.
--John
//---modified code----
#define BOOST_DISABLE_ASSERTS
#include "windows.h"
#include <iostream>
#include "boost\multi_array.hpp"
using namespace std;
using namespace boost;
const size_t DimX = 2000;
const size_t DimY = 1500;
const size_t DimZ = 1;
volatile int iVal; // global volatile, in an attempt to keep the loops from being optimized away
template<class It>
void test_helper(It begin, It end){
    while (begin != end)
        iVal = *begin++;
}
void testPointer(multi_array
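A plausible sketch of the two timed functions being compared (the post is cut off above; the signatures, loop structure and GetTickCount timing are guesses based on the discussion, not the poster's exact code):

typedef multi_array<int, 3> array_t;

void testPointer(array_t& a)
{
    DWORD t0 = GetTickCount();
    for (int run = 0; run < 10000; ++run)
        test_helper(a.data(), a.data() + a.num_elements());      // raw pointers
    cout << "Pointer....: " << GetTickCount() - t0 << endl;
}

void testMultiArray(array_t& a)
{
    DWORD t0 = GetTickCount();
    for (int run = 0; run < 10000; ++run)
        for (size_t x = 0; x < DimX; ++x)
            for (size_t y = 0; y < DimY; ++y)
                test_helper(a[x][y].begin(), a[x][y].end());      // multi_array iterators
    cout << "Multi_array: " << GetTickCount() - t0 << endl;
}

int main()
{
    array_t a(extents[DimX][DimY][DimZ]);
    testPointer(a);
    testMultiArray(a);
}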

John C. Femiani wrote:
Did you mean to say 'index or pointer'?
Yes, of course.
It seems like the trend in image processing is to avoid using indexes in the inner loop if at all possible.
Which is a nonsensical way of doing things, because code using pointer arithmetic and "leet" stuff like *(ptr++ + x*width) makes code hard to decipher and to parallelize afterward. I wish you a good deal of fun blocking pointer-based IP code for SMP or SIMD :/ It's preferable to stick to indexes, as they work fine on both sides (parallelization and performance).
Second, the results of the operation were never read, so the compiler removed the whole 'Pointers' test.
I always fall into this trap myself, even after years ;) I got similar results on my box too. But now, switch to Fortran-style indexing and it gets worse, even with indexes. Side note: a better metric than absolute time or cycles is cycles per element, as it lets us detect cache misses and gives performance figures that can be compared across architectures, since the metric is independent of most of the classic pitfalls of other performance measures.
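A sketch of that cycles-per-element metric (the helper name and the explicit clock-rate parameter are just illustrative):

#include <chrono>
#include <cstddef>

// Total time divided by runs*elements, converted to cycles with an assumed
// CPU frequency. The per-element figure jumps visibly when the working set
// falls out of L1/L2, and is comparable across machines in a way raw ticks
// are not.
double cycles_per_element(std::chrono::nanoseconds total,
                          std::size_t runs, std::size_t elements,
                          double cpu_ghz)
{
    double ns_per_element = double(total.count()) / (double(runs) * double(elements));
    return ns_per_element * cpu_ghz;   // at f GHz, 1 ns == f cycles
}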

It seems like the trend in image processing is to avoid using indexes in the inner loop if at all possible.
Which is a nonsensical way of doing things, because code using pointer arithmetic and "leet" stuff like *(ptr++ + x*width) makes code hard to decipher and to parallelize afterward. I wish you a good deal of fun blocking pointer-based IP code for SMP or SIMD :/ It's preferable to stick to indexes, as they work fine on both sides (parallelization and performance).
But with an index you need two variables, the base pointer + the index. And unless your compiler + instruction set allows you to address a memory location using base pointer + index, you are going to lose a lot of performance. IMHO a pointer is always faster; it lets you write the algorithm in a different way, probably avoiding the example you gave above.

But with an index you need two variables, the base pointer + the index. And unless your compiler + instruction set allows you to address a memory location using base pointer + index, you are going to lose a lot of performance. IMHO a pointer is always faster; it lets you write the algorithm in a different way, probably avoiding the example you gave above.
Did you do the actual performance check? On a decent compiler for non-exotic architectures the difference is really thin for an L1-cache-compliant number of elements. I have nothing against pointers anyway, but a good data structure like the Numerical Recipes multi-dimensional allocation makes things easier with a very small overhead.
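For anyone who wants to run that check themselves, a minimal stand-alone harness comparing base+index access against a running pointer on an L1-sized block (the sizes, repetition count and timer are arbitrary choices):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t h = 64, w = 64;             // 64*64 ints = 16 KB, fits in L1
    std::vector<int> data(h * w, 1);
    volatile long long sink = 0;                  // keeps the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int run = 0; run < 100000; ++run)
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < w; ++j)
                sink = sink + data[i * w + j];    // base + index, stride recomputed
    auto t1 = std::chrono::steady_clock::now();

    for (int run = 0; run < 100000; ++run)
        for (const int *p = data.data(), *e = p + h * w; p != e; ++p)
            sink = sink + *p;                     // single running pointer
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "index  : " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
              << "pointer: " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
}

On a decent optimizer the two usually land within a few percent of each other, which is the point being made above.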

As a complement, I took the time to prepare a sample program showing how I handle multi-dimensional arrays in my everyday code. It compares this to classic pointer access. I'll be glad to discuss the results of this method on different platforms and discuss its pros and cons. On most platforms, whenever the data fits in the cache, both methods end up within a few tenths of a percent of each other. For larger arrays the difference grows, but it can be reduced by using loop tiling or loop blocking. Moreover, the natural [][] syntax eases the way you can write complex element access patterns.

Working with indexes is definitively easier and more comfortable. And probably in many situations the overhead is not worth optimizing at all. I was only surprised when you said there is a performance gain using indexes. But I would hardly write something like this (from your array.cpp example):
for (int i = 0 ; i < h ; ++i)
for (int j = 0 ; j < w ; ++j) {
array[i][j] = uni() ;
}
You are losing performance unless the compiler is intelligent enough to avoid the multiplication at each step. When performance is really an issue I first re-think the algorithm, then I check the produced assembler listing to see what the compiler did. In the example above, at each step you have to calculate this address:
array + i*rowsize + j
If you are lucky the compiler can produce code that does this in one instruction; if not, you will get an overhead. Again, if h and w are low values you will not notice it. This example is of course much faster, and yes, it is not elegant nor clear:
int *p = array, *pend = &array[h][w];
while (p < pend)
*p++ = uni() ;

On Thu, 7 Aug 2008 21:09, Andrea Denzler wrote:
Working with indexes is definitively easier and more comfortable. And probably in many situations the overhead is not worth optimizing at all.
I was only surprised when you said there is a performance gain using indexes.
That's what my experiments have shown for months. Again, my hands were faster than my thoughts; apologies for the misunderstanding.
You are losing performance unless the compiler is intelligent enough to avoid the multiplication at each step.
I have yet to find a compiler that is not that intelligent. From VC6 to ICC and g++ since v3.x, I have never had to complain about the compiler output. Then again, most of the computation is prefetched during the array allocation (see the nrc_alloc_array function in my source).
When performance is really an issue then I first re-think the algorithm
Algorithmic optimisations are always best anyway.
after I check the produced assembler listing to see what the compiler did.
I used to do this, but I really think it's not necessary anymore with modern compilers.
If you are lucky the compiler can produce code that does this in one instruction; if not, you will get an overhead. Again, if h and w are low values you will not notice it.
And that's exactly what happens when you do loop tiling: the inner h and w are rather small (a tile often occupies less than half the L1 cache).
This example is of course much faster, and yes, it is not elegant nor clear: int *p = array, *pend = &array[h][w]; while (p < pend) *p++ = uni();
Well, I copy/pasted this into my array.cpp in place of the loop nest. My 2D array takes 9.938 s to iterate 10000 times over a 512*512 image of floats, while your "while loop" took 9.891 s. I only lose ~0.5% by using NRC allocation + indexing. It's indeed faster, but not by that much, and, indeed, far less elegant. If needed, I can try to implement a simili-multi_array using my method and compare it to the original one in terms of performance.
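A generic sketch of the loop tiling / blocking being referred to, on an NRC-style T** array (the tile size is a placeholder; the real value is whatever keeps a tile comfortably inside L1):

#include <algorithm>
#include <cstddef>

const std::size_t TILE = 64;   // placeholder tile edge

template <class T>
void process_tiled(T** a, std::size_t h, std::size_t w)
{
    for (std::size_t bi = 0; bi < h; bi += TILE)
        for (std::size_t bj = 0; bj < w; bj += TILE)
        {
            // the inner loops touch only one TILE x TILE block at a time
            const std::size_t ih = std::min(bi + TILE, h);
            const std::size_t iw = std::min(bj + TILE, w);
            for (std::size_t i = bi; i < ih; ++i)
                for (std::size_t j = bj; j < iw; ++j)
                    a[i][j] *= T(2);              // stand-in for the real kernel
        }
}

Because the NRC layout gives plain a[i][j] access, the tiled version is just two extra loops around the original nest.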

If you are lucky the compiler can produce code that does this in one instruction; if not, you will get an overhead. Again, if h and w are low values you will not notice it.
And that's exactly what happens when you do loop tiling: the inner h and w are rather small (a tile often occupies less than half the L1 cache).
This example is of course much faster, and yes, it is not elegant nor clear: int *p = array, *pend = &array[h][w]; while (p < pend) *p++ = uni();
Well, I copy/pasted this into my array.cpp in place of the loop nest. My 2D array takes 9.938 s to iterate 10000 times over a 512*512 image of floats, while your "while loop" took 9.891 s. I only lose ~0.5% by using NRC allocation + indexing. It's indeed faster, but not by that much, and, indeed, far less elegant.
When you use an index you pay the computation cost of calculating the address and of maintaining the two indexes. Only that. I don't think it has much to do with the L1 cache, especially if the few variables are kept in registers. Now how much is that cost? A few CPU cycles per address, maybe 4, not much more with a good compiler/CPU. If the uni() function above needs something like 50 CPU cycles, then the gain from switching to pointers is very low, as in your example. But if instead of the uni function you have a simple add that costs only 1 cycle, then you have a 5x performance loss. Even this may not matter if the total number of items is relatively low, and this is the common case. Try to do the pointer access with the simple "res += array[ti][tj];" instruction. In a program where I use a lot of images and arrays, the real gain was switching a few functions to handwritten assembler code: 6 times faster than an already good C++ algorithm. All the rest of the application is standard, elegant, "slow" code. But it doesn't impact the overall performance.

Try to do the pointer access with the simple "res += array[ti][tj];" instruction.
The results from my earlier post were from exactly that. The res = uni() is not the core of the computation, just the array initialisation :) The greater the computation, the smaller the overhead will be, IMHO. For a computation like res += array[i][j]*cos(array[i][j]) versus its while(begin++) equivalent, the overhead is never greater than 1%.
In a program where I use a lot of images and arrays, the real gain was switching a few functions to handwritten assembler code: 6 times faster than an already good C++ algorithm. All the rest of the application is standard, elegant, "slow" code. But it doesn't impact the overall performance.
I perfectly agree, but I think that in most cases going down to that low level is not needed, even in image processing. I would rather take a few minutes to SIMDify a piece of code, if possible, than rewrite it in inline assembly. Anyway, I'm on to writing this small multi_array with indexing to see how it fares.
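As an illustration of what "SIMDify" can mean in practice, a small SSE sketch (x86-only, unaligned loads, scalar tail; the alignment and padding concerns from earlier in the thread are ignored here):

#include <xmmintrin.h>   // SSE intrinsics
#include <cstddef>

// Adds two float arrays four elements at a time, falling back to a scalar
// loop for the remainder. This is the "few minutes of SIMDifying" option as
// opposed to rewriting the loop in inline assembly.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}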

I perfectly agree, but I think that in most cases going down to that low level is not needed, even in image processing. I would rather take a few minutes to SIMDify a piece of code, if possible, than rewrite it in inline assembly.
When Intel came out with its first SIMD instruction set I was happy to try it on my application. It was a failure. Even if one instruction operated on 3 data locations, its cost was 3 CPU cycles, while three integer instructions cost 1 cycle each. Also, with SIMD you had to load the registers first. In my case I lost performance. So I realized that there is a gain only in some specific cases - not in my situation, even though it is an image-processing algorithm. Adding more registers to the CPU would give more benefit to compilers and handwritten code, but adding registers was never a guideline at Intel. I hope that today the same SIMD instructions execute in only 1 cycle; I should check this one day or another. And, yes, optimizing at this level is a rare situation. P.S.: sorry Asif

When Intel came out with its first SIMD instruction set I was happy to try it on my application. It was a failure. Even if one instruction operated on 3 data locations, its cost was 3 CPU cycles, while three integer instructions cost 1 cycle each. Also, with SIMD you had to load the registers first. <snip>
MMX and SSE were rather a disaster on this point. Actually, I learned to play with SIMD using Altivec on PowerPC, and that's a completely different deal. SSSE3 and the upcoming SSE4 are rather good too.
And, yes, optimizing at this level is a rare situation.
I would be curious to know which kind of IP algorithm needs it, but this is maybe a topic that should go private instead of adding noise to the list.

Hi Andrea,
On 8/8/08, Andrea Denzler wrote:
If you are lucky the compiler can produce code that does this in one instruction; if not, you will get an overhead. Again if h and w are low ............
May I respectfully request that you please keep the subject line the same in your replies? Currently, each of your replies seems to have prefixed the subject line with "R:" (without the double quotes). This is particularly inconvenient for me, as I am not able to read all of your replies in the proper order because I am using GMail, which sequences all the messages of a thread by "having the same subject line". -- Thanks in advance, Asif

Andrea Denzler wrote:
Working with indexes is definitively easier and more comfortable. And probably in many situations the overhead is not worth optimizing at all.
I was only surprised when you said there is a performance gain using indexes.
But I would hardly write something like this (from your array.cpp example):
for (int i = 0 ; i < h ; ++i)
for (int j = 0 ; j < w ; ++j) {
array[i][j] = uni() ;
}
You are losing performance unless the compiler is intelligent enough to avoid the multiplication at each step. When performance is really an issue I first re-think the algorithm, then I check the produced assembler listing to see what the compiler did. In the example above, at each step you have to calculate this address:
array + i*rowsize + j
If you are lucky the compiler can produce code that does this in one instruction; if not, you will get an overhead. Again, if h and w are low values you will not notice it.
This example is of course much faster, and yes, it is not elegant nor clear:
int *p = array, *pend = &array[h][w];
while (p < pend)
*p++ = uni() ;
The standard algorithms use iterators; an array implementation that translates iterators to indexes and expects the indexes to be converted back into pointers is questionable to me, especially when my trivial test code showed the multi_array implementation was 10-20 times slower. This slowdown is way more than I expected. The multi_array implementation is extremely scary to me because it does not seem to be low-overhead at all. -- John

Yes, iterators, if available, make our code clean and elegant and without overhead. We use iterators even on std::vector. They are great.
The minimum overhead you have with indexes is calculating the address
(char *)array + i*sizeof(row) + j*sizeof(item)
and doubling the increment and loop check, since you have two loop variables to increment and test.
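A sketch of the strength reduction a decent compiler applies to such an indexed loop (or that you can write by hand): the i*rowsize term is hoisted out of the inner loop, leaving one store and one increment per element, essentially the same as the pointer version.

#include <cstddef>

void fill(int* array, std::size_t h, std::size_t w, int value)
{
    for (std::size_t i = 0; i < h; ++i)
    {
        int* row = array + i * w;     // address computed once per row, not per element
        for (std::size_t j = 0; j < w; ++j)
            row[j] = value;           // compiles down to *p++ style code
    }
}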

Andrea Denzler wrote:
Yes, iterators, if available, make our code clean and elegant and without overhead. We use iterators even on std::vector. They are great.
The minimum overhead you have with indexes is calculating the address (char *)array + i*sizeof(row) + j*sizeof(item), and doubling the increment and loop check, since you have two loop variables to increment and test.
But if the underlying structure is not a plain array but a class, you may also have boundary checks on the indexes. All this together can cost several CPU cycles.
I used BOOST_DISABLE_ASSERTS; that was supposed to remove the boundary checks.
Also, with an iterator/pointer you may use only a register (not a memory location), improving the speed even more.
But does that happen when the iterator is implemented using a base pointer and an index, along with three or four other pointers to stride arrays and extents? (See the sketch at the end of this message.)
It depends on the cost of the inner code; if it is just a simple arithmetic calculation (1 CPU cycle) then I expect something like 10 times slower.
Admittedly, the inner op is trivial, but so are MANY operations you would perform on an array, i.e. adding, multiplying, forming linear combinations, etc.
All the GIL examples I saw use iterators, and what I liked is that complicated image calculations can be stacked. If I have to flip an image and convert it to B/W, then traditionally I would do that in two steps; with GIL you do it in one step only, with nearly double the performance.
Are there performance measurements for GIL? GIL uses locators, which I love. GIL seems to be stuck on arrays that are an 'image' of 'pixels', which is not always the case for me. Ideally I would be able to construct an array of any type. I was looking into multi_array hoping there would be some analogy to GIL locators, but multi_array seems to be mimicking aspects of the syntax used in Fortran, Matlab, or Python. That's great from a high-level point of view, but I am a bit afraid because I don't understand the performance implications. I think that in the spirit of 'zero overhead', there should be some Boost requirement that certain libraries include performance measurements/graphs as part of their docs or build process. For things like multi_array or uBLAS or GIL, I think efficiency is important (not just asymptotic). For example, I would feel more comfortable if I could see something like: http://www.oonumerics.org/blitz/benchmarks/ (The only reason I don't wanna use BLITZ is that the last release was in 2005.) --John
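To make the base+index concern above concrete, here is a toy iterator that keeps a base pointer, an index and a stride, which is the general shape being questioned; it is not Boost.MultiArray's actual implementation.

#include <cstddef>

template <class T>
class index_iterator
{
    T*          base_;
    std::size_t idx_;
    std::size_t stride_;   // a real N-D iterator would carry strides/extents per dimension
public:
    index_iterator(T* base, std::size_t idx, std::size_t stride)
        : base_(base), idx_(idx), stride_(stride) {}

    T& operator*() const { return base_[idx_ * stride_]; }   // a multiply on every dereference
    index_iterator& operator++() { ++idx_; return *this; }
    bool operator!=(const index_iterator& o) const { return idx_ != o.idx_; }
};

Whether the compiler can turn the idx_*stride_ back into a moving pointer is exactly what the 10-20x numbers above are probing.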

All the GIL examples I saw use iterators, and what I liked is that complicated image calculations can be stacked. If I have to flip an image and convert it to B/W, then traditionally I would do that in two steps; with GIL you do it in one step only, with nearly double the performance.
I was going to reply earlier, but since you touched on the issues I wanted to mention, let me just suggest that cache awareness be considered, as a null inner loop has no obvious performance relationship to something that just writes a constant into incrementing addresses. Any iterator that preserves locality, and anything that reduces the number of "passes", can easily pay for some overhead, especially overhead of order zero (setup calculations that determine things like block sizes etc. but don't get repeated during iteration). I don't have links handy, but I think that FFTW has a lot of related literature on performance issues and, again, Intel has a lot of good optimization references on their site for their processors.
<snip>

I don't have links handy, but I think that FFTW has a lot of related literature on performance issues and, again, Intel has a lot of good optimization references on their site for their processors.
On a more global scale, should things like architecture-specific optimization helpers (memory alignment on arbitrary, architecture-related boundaries; SIMD extensions) be considered in scope for Boost?
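As one concrete example of such a helper, a small sketch of SIMD-friendly aligned allocation using std::aligned_alloc (C++17, so well after this thread; the 32-byte figure is just an example AVX-width boundary, not a recommendation):

#include <cstdlib>
#include <cstddef>
#include <new>

float* alloc_aligned_floats(std::size_t n)
{
    const std::size_t alignment = 32;
    // aligned_alloc requires the byte count to be a multiple of the alignment
    const std::size_t bytes = ((n * sizeof(float) + alignment - 1) / alignment) * alignment;
    void* p = std::aligned_alloc(alignment, bytes);
    if (!p) throw std::bad_alloc();
    return static_cast<float*>(p);
}
// release the buffer with std::free()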

joel falcou wrote:
But with an index you need two variables, the base pointer + the index. And unless your compiler + instruction set allows you to address a memory location using base pointer + index, you are going to lose a lot of performance. IMHO a pointer is always faster; it lets you write the algorithm in a different way, probably avoiding the example you gave above.
Did you do the actual performance check? On a decent compiler for non-exotic architectures the difference is really thin for an L1-cache-compliant number of elements.
I have nothing against pointers anyway, but a good data structure like the Numerical Recipes multi-dimensional allocation makes things easier with a very small overhead.
We are *comparing* implementations on the same machine, with the same
array, with the same simple access patterns and averaging the running
times over 1000 runs. I even toggled which test runs first or second.
IF the L1 cache misses are an issue, that would slow *both* down, so
multi_array must be EVEN WORSE than the test times showed, right?
I reduced the size of the array to reduce the odds of a cache miss (I hope).
That makes multi_array iterators 20 times slower instead of just 10
times slower!
Also both methods are proceeding through the data in a VERY easy to
predict order. If there are cache misses in this test then there will
probably be cache misses in real code, so it is fair to include their
effect in the comparison.
I think the conclusion is that the pointer-based approach has a
significant (up to 20 times on my system) performance improvement over
the base+index approach (at least as implemented by multi_array).
The command line is: g++ test.cpp -O2 -o test.exe
Results in total ticks for 10000 calls each:
size of array: 14400 bytes
Pointer....: 31
Multi_array: 625
Pointer....: 32
Multi_array: 640
Pointer....: 32
Multi_array: 625
Pointer....: 31
Multi_array: 625
Pointer....: 31
Multi_array: 641
Pointer....: 31
Multi_array: 625
Pointer....: 31
Multi_array: 641
Pointer....: 31
<snip>
#define BOOST_DISABLE_ASSERTS
#include "windows.h"
#include <iostream>
#include "boost\multi_array.hpp"
using namespace std;
using namespace boost;
const size_t DimX = 60;
const size_t DimY = 60;
const size_t DimZ = 1;
volatile int iVal;
template<class It>
void test_helper(It begin, It end){
    while (begin != end)
        iVal = *begin++;
}
void testPointer(multi_array

We are *comparing* implementations on the same machine, with the same array, with the same simple access patterns and averaging the running times over 1000 runs. I even toggled which test runs first or second. IF the L1 cache misses are an issue, that would slow *both* down, so multi_array must be EVEN WORSE than the test times showed, right?
I wasn't speaking of L1 cache misses on this particular test, sorry for the confusion.
I think the conclusion is that the pointer-based approach has a significant (up to 20 times on my system) performance improvement over the base+index approach (at least as implemented by multi_array).
Yes, I came to the same conclusion. I was just pointing out that the base+index approach is not the only alternative (see my attachment in my previous post). Again, sorry for our misunderstanding.
participants (5)
- Andrea Denzler
- Asif Lodhi
- joel falcou
- John C. Femiani
- Mike Marchywka