GIL Image Processing Library - specific comments

Hi all,

I have delved somewhat into the source code of GIL and would like to make some comments:

1. I like the concept of views to adapt images to algorithms' needs. Users can understand this quite easily. But there is also a danger of overuse (this is pointed out in the tutorial): doing all transformations on the fly may be very inefficient. For example, conversion between RGB and CMYK in an adaptor is OK, but RGB to LAB is not. For the latter cases, the GIL authors propose the idiom

    copy_pixels(some_expensive_view(source), some_other_view(destination));

A traditional STL-style solution would probably be easier:

    transform(source.begin(), source.end(), destination.begin(), SomeFunctor());

(since implementing a view is more involved than implementing a functor, at least in terms of reading documentation). On the other hand, a PixelDereferenceAdapter is semantically also a view, so the use of the view idea could be extended. The interpretation of pixel adaptors as views is more apparent when data accessors are used (they work essentially like the image views). The difference to the PixelDereferenceAdapter is mainly syntactic (using operator* rather than get/set functions), and it was already pointed out that this is not without problems (reference proxies are required for operator*, which are more difficult to implement than data accessors and may sometimes exhibit unexpected behavior).

2. I like the idea of handling unknown input formats by means of a variant pixel. However, it remains to be shown whether this solution is indeed practical for algorithm implementation. For example, when a binary operation is called for a variant with 4 possible types, the compiler is forced to generate 16 algorithm instantiations (for all pairs) for every result type. There are a number of design alternatives (for example coercion to a common type) which are probably better.

3. I don't like that the input/output image format (jpeg, png etc.) must be hardcoded in the program (by using jpeg_read_image() etc.).

4. GIL has no (public) test suite whatsoever.

5. A GIL image can only be used with POD pixel types, because the allocated memory is never initialized (by uninitialized_fill() or equivalent). Neither is the destructor for individual pixels called before deallocation.

6. GIL may have alignment problems. For example, the variant template allocates internal memory as "char[MAX_SIZE]" but then reinterpret_casts it to an image type. Casts from "signed short *" to "pixel<int16s, rgb_t> *" and similar (see the tutorial) are also problematic (is there a guarantee that sizeof(pixel<int16s, rgb_t>) == 3*sizeof(int16s), and that they have the same alignment?).

7. Is transform_pixel_position() able to handle image borders properly and efficiently (page 7 of the tutorial)? Otherwise, it is wrongly used (and may cause segfaults).

8. The cached_location_t is a nice, elegant optimization.

9. Generic algorithms that work efficiently for both planar and interleaved multi-channel images are not as easy as the authors imply. On page 5 of the tutorial, the authors recommend transform_channels() in order to unroll the inner loop over the channels. But this optimization is useless for planar images - in this case, one wants to work on one channel at a time, i.e. the channel loop must be the outer rather than the inner loop.

So much for now.

Best regards
Ulli

Ullrich Koethe wrote:
4. GIL has no (public) test suite whatsoever.
That's what I thought too. But, after an email to Lubomir, he pointed me to where the tests are. You can get them from the ASL repository through Perforce. GIL should make that more accessible.

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Hi Ulli, Thanks for spending time to review GIL. [...]
1. I like the concept of views to adapt images to algorithms' needs. Users can understand this quite easily. But there is also a danger of overuse (this is pointed out in the tutorial): doing all transformations on the fly may be very inefficient. For example, conversion between RGB and CMYK in an adaptor is OK, but RGB to LAB is not. For the latter cases, the GIL authors propose an idiom
copy_pixels(some_expensive_view(source), some_other_view(destination));
A traditional STL style solution would probably be easier:
transform(source.begin(), source.end(), destination.begin(), SomeFunctor());
(since implementing a view is more involved than implementing a functor, at least in terms of reading documentation).
They are not quite equivalent. Views can transform both the domain and the range of images. For example, a view may change the way its horizontal and/or vertical iterators advance, and similarly may perform a transformation upon dereferencing (which is what your SomeFunctor does). We find it appealing to represent this as a single object (view) rather than splitting it into multiple parts, such as ranges and accessors. Yes, I know these are conceptually different transformations, and they are implemented in different classes; they are just combined in the view. It is appealing because conceptually it just represents an abstract image. And you can pipe views nicely together.

Another difference is in performance: copy_pixels(src,dst) is usually faster than the STL equivalent

    std::copy(src.begin(), src.end(), dst.begin())

This is because the one-dimensional iterators are slower, as they have to deal with potential padding at the end of rows, and also because we provide performance specializations.
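To make the domain/range distinction concrete, here is a self-contained toy sketch; all names are invented for illustration and none of this is GIL's actual interface. A leaf view wraps pixel storage; one adaptor changes the values produced (range), another changes the traversal order (domain), and adaptors nest freely:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Invented names throughout -- a toy model of the idea, not GIL's API.
    struct gray_view {                       // leaf view over raw pixels
        const std::vector<unsigned char>* pix;
        std::size_t size() const { return pix->size(); }
        unsigned char operator()(std::size_t i) const { return (*pix)[i]; }
    };

    template <class View>
    struct inverted_view {                   // range transformation: v -> 255-v
        View base;
        std::size_t size() const { return base.size(); }
        unsigned char operator()(std::size_t i) const { return 255 - base(i); }
    };

    template <class View>
    struct flipped_view {                    // domain transformation: reversed traversal
        View base;
        std::size_t size() const { return base.size(); }
        unsigned char operator()(std::size_t i) const { return base(size() - 1 - i); }
    };

    int main()
    {
        std::vector<unsigned char> img(4);
        img[0] = 0; img[1] = 10; img[2] = 20; img[3] = 30;
        gray_view g = { &img };
        inverted_view<flipped_view<gray_view> > piped = { { g } };  // views pipe together
        for (std::size_t i = 0; i < piped.size(); ++i)
            std::cout << int(piped(i)) << ' ';                      // prints: 225 235 245 255
    }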
On the other hand, a PixelDereferenceAdapter is semantically also a view, so the use of the view idea could be extended.
PixelDereferenceAdapter is not a view. It is a function object to be invoked upon pixel access, similar to your Pixel Accessor. You can easily attach it to any iterator, locator or a view.
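The mechanism described here can be sketched in a few lines (hypothetical names, not GIL's actual PixelDereferenceAdapter): an iterator wrapper that applies a function object on every dereference.

    // Hypothetical sketch only: an iterator adaptor that invokes a function
    // object upon pixel access. Fn is any functor exposing result_type.
    template <class Iterator, class Fn>
    struct dereference_iterator {
        Iterator it;
        Fn fn;
        typename Fn::result_type operator*() const { return fn(*it); }
        dereference_iterator& operator++() { ++it; return *this; }
        bool operator!=(const dereference_iterator& o) const { return it != o.it; }
    };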
The interpretation of pixel adaptors as views is more apparent when data accessors are used (they work essentially like the image views). The difference to the PixelDereferenceAdapter is mainly syntactic (using operator* rather than get/set functions), and it was already pointed out that this is not without problems (reference proxies are required for operator*, which are more difficult to implement than data accessors and may sometimes exhibit unexpected behavior).
I understand your motivation for using pixel adaptors, but in my opinion the benefits of using PixelDereferenceAdapter outweigh its disadvantages. I will elaborate, but I am first waiting to hear from others on the thread we started on this subject.
2. I like the idea of handling unknown input formats by means of a variant pixel. However, it remains to be shown whether this solution is indeed practical for algorithm implementation. For example, when a binary operation is called for a variant with 4 possible types, the compiler is forced to generate 16 algorithm instantiations (for all pairs) for every result type. There are a number of design alternatives (for example coercion to a common type) which are probably better.
GIL incorporates a mechanism for active reduction of code bloat due to variant instantiations. I will present this approach at the LCSD workshop in the upcoming OOPSLA conference, which is in less than a week: http://lcsd.cs.tamu.edu/2006/
3. I don't like that the input/output image format (jpeg, png etc.) must be hardcoded in the program (by using jpeg_read_image() etc.).
We initially had combined I/O methods like:

    template <typename Image>
    void read_image(const char* file, Image& img) {
        switch (image_type_from_file_name(file)) {
          case kJPEG: jpeg_read_image(file,img); break;
          case kTIFF: tiff_read_image(file,img); break;
          ...
        }
    }

We decided to remove them for three reasons: first, the set of file formats is open and listing them in a switch statement is not very flexible. Second, using the above would require people to link against all the libraries while they may only need/have one. And third, there are plans to put into ASL a more advanced GIL I/O that allows for dynamic registration of file format modules. Of course, the above methods are trivial to implement and people might want to just write them for convenience. On the client side it is OK to be less generic...
4. GIL has no (public) test suite whatsoever.
We have a regression suite, though not on the web site. You will get it if you install GIL through ASL.
5. A GIL image can only be used with POD pixel types, because the allocated memory is never initialized (by uninitialized_fill() or equivalent). Neither is the destructor for individual pixels called before deallocation.
This was an item on our TODO list, but we don't know how it fell through the cracks. We will put it back on the list. Thanks! Good catch!
6. GIL may have alignment problems. For example, the variant template allocates internal memory as "char[MAX_SIZE]" but then reinterpret_casts it to an image type.
MAX_SIZE is determined by taking the maximum of the sizeof of each element in the MPL vector of image types with which the variant is instantiated. So for every variant it is different, and the appropriate size is evaluated at compile time.
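One plausible way to compute such a MAX_SIZE with Boost.MPL is sketched below; the image types are stand-ins and this is not GIL's actual code:

    #include <boost/mpl/vector.hpp>
    #include <boost/mpl/fold.hpp>
    #include <boost/mpl/sizeof.hpp>
    #include <boost/mpl/min_max.hpp>
    #include <boost/mpl/size_t.hpp>

    namespace mpl = boost::mpl;

    // ImageA/ImageB/ImageC stand in for the image types the variant holds.
    struct ImageA { char storage[16]; };
    struct ImageB { char storage[64]; };
    struct ImageC { char storage[32]; };

    typedef mpl::vector<ImageA, ImageB, ImageC> image_types;

    // MAX_SIZE = maximum sizeof over the type sequence, computed at compile time.
    typedef mpl::fold<
        image_types,
        mpl::size_t<0>,
        mpl::max<mpl::_1, mpl::sizeof_<mpl::_2> >
    >::type max_size;   // max_size::value == 64 here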
Casts from "signed short *" to "pixel<int16s, rgb_t> *" and similar (see the tutorial) are also problematic
This cast is necessary to go from the outside world (pointer to the beginning of the image) into GIL. It is not obvious to me how to avoid it. How does VIGRA deal with the case where the user constructs an 8-bit RGB image by providing VIGRA with a raw pointer to the first pixel?
7. Is transform_pixel_position() able to handle image borders properly and efficiently (page 7 of the tutorial)? Otherwise, it is wrongly used (may cause segfaults).
No, it doesn't handle border conditions. Actually, you found a typo in the tutorial: the function that contains transform_pixel_position should be called x_gradient_unguarded to indicate this. As the tutorial shows, x_gradient_unguarded is called from x_gradient using a subimage_view that excludes the first and last column.
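The guarded/unguarded split described here can be reconstructed in a self-contained sketch (my reconstruction of the pattern, not the tutorial's exact code):

    #include <vector>

    // The "unguarded" kernel assumes both horizontal neighbors exist; the
    // guard restricts it to interior columns so no out-of-bounds access occurs.
    struct FloatImage {
        int width, height;
        std::vector<float> data;
        FloatImage(int w, int h) : width(w), height(h), data(w * h) {}
        float& at(int x, int y) { return data[y * width + x]; }
        float at(int x, int y) const { return data[y * width + x]; }
    };

    void x_gradient_unguarded(const FloatImage& src, FloatImage& dst,
                              int x0, int x1)          // interior column range
    {
        for (int y = 0; y < src.height; ++y)
            for (int x = x0; x < x1; ++x)              // never reads x0-1 < 0 etc.
                dst.at(x, y) = (src.at(x + 1, y) - src.at(x - 1, y)) / 2;
    }

    void x_gradient(const FloatImage& src, FloatImage& dst)
    {
        x_gradient_unguarded(src, dst, 1, src.width - 1);  // exclude border columns
        for (int y = 0; y < src.height; ++y) {             // one-sided differences at borders
            dst.at(0, y) = src.at(1, y) - src.at(0, y);
            dst.at(src.width - 1, y) = src.at(src.width - 1, y) - src.at(src.width - 2, y);
        }
    }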
8. The cached_location_t is a nice, elegant optimization.
9. Generic algorithms that work efficiently for both planar and interleaved multi-channel images are not as easy as the authors imply.
Why not? (see below)
On page 5 of the tutorial, the authors recommend transform_channels() in order to unroll the inner loop over the channels. But this optimization is useless for planar images - in this case, one wants to work on one channel at a time, i.e. the channel loop must be the outer rather than the inner loop.
I wouldn't call it useless for planar images. Just the opposite - this code illustrates one of the nice abstractions in GIL: that you can write the algorithm once and have it work for both planar and interleaved images. There is typically no loss in efficiency even though the three planes are separated, because modern architectures can support many simultaneous cache lines.

Of course, this is not always the case. The optimal performance depends on the exact algorithm, the hardware, and the image size. There may be contexts for which processing the planes one at a time is noticeably faster. In this case a planar-only specialization of the algorithm can be provided. The point is that by using GIL you are not forced to provide a planar version of the algorithm. We have many versions of the same imaging algorithms at Adobe - ones that work for planar images, and ones for interleaved. GIL is a hope that these could be consolidated.

We have been discussing the accessors/DereferenceAdaptors and promotion traits issues in separate threads. Aside from these differences with VIGRA, do you see any issues with the GIL interfaces that would prevent porting the VIGRA algorithms to GIL, or that would affect their performance? If yes, what interfaces are missing and what algorithms need them?

Thank you again for your review!

Thanks,
Lubomir
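A concrete rendering of the two loop orders under discussion (a sketch, not GIL code; the standard interleaved and planar memory layouts are assumed):

    #include <cstddef>

    // Channel loop innermost: natural for interleaved storage, where the
    // three channel values of each pixel sit next to each other.
    void transform_inner(unsigned char* dst, const unsigned char* src,
                         std::size_t npixels)
    {
        for (std::size_t i = 0; i < npixels; ++i)      // pixel loop
            for (int c = 0; c < 3; ++c)                // channel loop (unrollable)
                dst[3*i + c] = 255 - src[3*i + c];
    }

    // Channel loop outermost: natural for planar storage, processing one
    // contiguous channel plane at a time.
    void transform_outer(unsigned char* dst, const unsigned char* src,
                         std::size_t npixels)
    {
        for (int c = 0; c < 3; ++c)                    // channel loop
            for (std::size_t i = 0; i < npixels; ++i)  // one plane at a time
                dst[c*npixels + i] = 255 - src[c*npixels + i];
    }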

Lubomir Bourdev wrote:
On the other hand, a PixelDereferenceAdapter is semantically also a view, so the use of the view idea could be extended.
PixelDereferenceAdapter is not a view. It is a function object to be invoked upon pixel access, similar to your Pixel Accessor. You can easily attach it to any iterator, locator or a view.
Consider the following view-like syntax:

    pixel_view_A(dest_iter) = pixel_view_B(src_iter);

Now, when you absorb the pixel_view objects into the iterators by means of proxies, you get the GIL style of adaptation:

    *pixel_view_iter_A = *pixel_view_iter_B;

If you replace the assignment with a normal member function, you get VIGRA data accessors (which, by the way, were invented by Dietmar Kuehl, IIRC):

    pixel_view_A.set(pixel_view_B(src_iter), dest_iter);

All these should be semantically equivalent -- the only difference is syntax and possibly performance. Hence I'm saying that pixel access modifiers are essentially views, and treating them that way may simplify the documentation.

The following advantages and disadvantages of either approach have already been pointed out:

Proxies:
  + nice syntax in algorithms
  + compatible with the STL
  + can be chained
  - harder to implement, harder to understand the implementation
  - may not work under some exotic circumstances

DataAccessors:
  + easy to implement, easy to understand
  + can be chained
  + always work (even on non-conforming compilers)
  - slightly less readable syntax in algorithms

To add some hard data to the comparison, I performed a benchmark where I timed two proxy implementations against the accessor approach for the task of converting a planar RGB image into an interleaved one and vice versa. It turned out that on my platform (Pentium M 1.7 GHz, cygwin gcc 3.4.4), both proxy implementations were 30 to 60 percent _slower_ than the equivalent data accessor version. The code is attached (it's a stand-alone program).
I understand your motivation for using pixel adaptors, but in my opinion the benefits of using PixelDereferenceAdapter outweigh its disadvantages.
I'm not convinced.
GIL incorporates a mechanism for active reduction of code bloat due to variant instantiations. I will present this approach at the LCSD workshop in the upcoming OOPSLA conference, which is in less than a week: http://lcsd.cs.tamu.edu/2006/
I'd like to read this paper.
3. I don't like that the input/output image format (jpeg, png etc.) must be hardcoded in the program (by using jpeg_read_image() etc.).
We initially had a combined I/O methods like:
    template <typename Image>
    void read_image(const char* file, Image& img) {
        switch (image_type_from_file_name(file)) {
          case kJPEG: jpeg_read_image(file,img); break;
          case kTIFF: tiff_read_image(file,img); break;
          ...
        }
    }
We decided to remove them for three reasons [snip] there are plans to put in ASL a more advanced GIL I/O that allows for dynamic registration of file format modules.
VIGRA's import/export library almost supports this (the only missing piece is the possibility to register new converters at runtime - at the moment, they must be known at compile time, but otherwise there is no need for a design modification).
6. GIL may have alignment problems. For example, the variant template allocates internal memory as "char[MAX_SIZE]" but then reinterpret_casts it to an image type.
MAX_SIZE is determined by taking the maximum of the size_of of each element in the MPL vector of image types with which the variant is instantiated. So for every variant it is different, and the appropriate size is evaluated at compile time.
I didn't say that the number of bytes was too small. I said that an image object might be required to start at a multiple-of-8 address (say), whereas a char array may start at an arbitrary address. So the reinterpret_cast may return a forbidden address.
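The concern, and one conventional C++03 workaround, can be illustrated as follows (my illustration, not GIL's actual code):

    // Illustration of the concern: 'memory' has only char alignment, yet it
    // is reinterpreted as an Image, which may require stricter alignment.
    struct Image { double d; int i; };
    const unsigned MAX_SIZE = sizeof(Image);

    struct variant_sketch {
        char memory[MAX_SIZE];              // alignment 1 -- may be misaligned for Image
        Image* get() { return reinterpret_cast<Image*>(memory); }  // formally undefined
    };

    // One conventional C++03 remedy: a union forces the storage to the
    // strictest alignment of its members.
    union aligned_storage_sketch {
        char   memory[MAX_SIZE];
        double align_double;                // imposes double's alignment
        long   align_long;
        void*  align_pointer;
    };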
Casts from "signed short *" to "pixel<int16s, rgb_t> *" and similar (see the tutorial) are also problematic
This cast is necessary to go from the outside world (pointer to the beginning of the image) into GIL. It is not obvious to me how to avoid it. How does VIGRA deal with the case where the user constructs an 8-bit RGB image by providing VIGRA with a raw pointer to the first pixel?
The DataAccessor takes care of this -- the iterator uses the original pointer (see the PlanarAccessor in the attached program).
We have been discussing the accessors/DereferenceAdaptors and promotion traits issues in separate threads. Aside from these differences with VIGRA, do you see any issues with the GIL interfaces that would prevent porting the Vigra algorithms to GIL, or that would affect their performance? If yes, what interfaces are missing and what algorithms need them?
Three things come to mind:
- many more type conversions, e.g. int8 => uint16, RGB <=> LAB etc.
- adaptors for the subsampled image view
- support for multi-dimensional images (e.g. volume data, image sequences)

Another important point is RGBA operations. In most cases, the alpha channel is to be treated differently from the other channels, so you can't just use component-wise operators as for the other color models. VIGRA doesn't have an explicit RGBA type (TinyVector<T, 4> can be used instead), because so far no-one came up with a convincing proposal for these operations. But without them, RGBA is pretty useless.

I don't think any of this is a show stopper, but it means considerable work. And, of course, the documentation has to be improved a lot... (I didn't look at the test suite.)

Regards
Ulli

    // Stand-alone benchmark: proxy-based planar iterators vs. data accessors
    #include <vector>
    #include <iostream>
    #include <time.h>

    #define BY_REFERENCE

    template <class T>
    struct RGBValue
    {
        RGBValue() {}
        RGBValue(T ir, T ig, T ib) : r(ir), g(ig), b(ib) {}
        T r, g, b;
    };

    typedef std::vector<RGBValue<unsigned char> > Interleaved;

    // Planar RGB array: three channel planes of 'size' elements each,
    // stored back-to-back in a single std::vector.
    template <class T>
    struct PlanarArray
    {
        typedef std::vector<T> Data;
        typedef typename Data::iterator DI;
        typedef typename Data::const_iterator DCI;

    #ifdef BY_REFERENCE
        // Proxy variant 1: the proxy holds references to the three channels.
        struct const_proxy
        {
            const_proxy(T const & ir, T const & ig, T const & ib)
            : r(const_cast<T&>(ir)), g(const_cast<T&>(ig)), b(const_cast<T&>(ib)) {}
            operator RGBValue<T> () const { return RGBValue<T>(r, g, b); }
            T & r;
            T & g;
            T & b;
        };
        struct proxy : public const_proxy
        {
            proxy(T const & ir, T const & ig, T const & ib)
            : const_proxy(ir, ig, ib) {}
            proxy & operator=(RGBValue<T> const & rhs)
            {
                this->r = rhs.r; this->g = rhs.g; this->b = rhs.b;
                return *this;
            }
        };
        struct iterator
        {
            iterator(DI p, int o) : ptr(p), offset(o) {}
            iterator & operator++() { ++ptr; return *this; }
            proxy operator*() const { return proxy(*ptr, ptr[offset], ptr[2*offset]); }
            DI ptr;
            int offset;
        };
        struct const_iterator
        {
            const_iterator(DCI p, int o) : ptr(p), offset(o) {}
            const_iterator & operator++() { ++ptr; return *this; }
            const_proxy operator*() const { return const_proxy(*ptr, ptr[offset], ptr[2*offset]); }
            DCI ptr;
            int offset;
        };
    #else
        // Proxy variant 2: the proxy holds the iterator and channel offset.
        struct const_proxy
        {
            const_proxy(DCI i, int o) : iter(i), offset(o) {}
            operator RGBValue<T> () const { return RGBValue<T>(*iter, iter[offset], iter[2*offset]); }
            DCI iter;
            int offset;
        };
        struct proxy
        {
            proxy(DI i, int o) : iter(i), offset(o) {}
            operator RGBValue<T> () const { return RGBValue<T>(*iter, iter[offset], iter[2*offset]); }
            proxy & operator=(RGBValue<T> const & rhs)
            {
                *iter = rhs.r; iter[offset] = rhs.g; iter[2*offset] = rhs.b;
                return *this;
            }
            DI iter;
            int offset;
        };
        struct iterator
        {
            iterator(DI p, int o) : ptr(p), offset(o) {}
            iterator & operator++() { ++ptr; return *this; }
            proxy operator*() const { return proxy(ptr, offset); }
            DI ptr;
            int offset;
        };
        struct const_iterator
        {
            const_iterator(DCI p, int o) : ptr(p), offset(o) {}
            const_iterator & operator++() { ++ptr; return *this; }
            const_proxy operator*() const { return const_proxy(ptr, offset); }
            DCI ptr;
            int offset;
        };
    #endif

        PlanarArray(int s) : size(s), data(3*s) {}
        iterator begin() { return iterator(data.begin(), size); }
        const_iterator begin() const { return const_iterator(data.begin(), size); }
        int size;
        std::vector<T> data;
    };

    typedef PlanarArray<unsigned char> Planar;

    // VIGRA-style data accessors: plain get/set function objects.
    template <class T>
    struct DefaultAccessor
    {
        template <class Iterator>
        RGBValue<T> const & operator()(Iterator i) const { return *i; }
        template <class Iterator>
        void set(RGBValue<T> const & rgb, Iterator i) const { *i = rgb; }
        int offset;
    };

    template <class T>
    struct PlanarAccessor
    {
        PlanarAccessor(int o) : offset(o) {}
        template <class Iterator>
        RGBValue<T> operator()(Iterator i) const
        { return RGBValue<T>(*i, i[offset], i[2*offset]); }
        template <class Iterator>
        void set(RGBValue<T> const & rgb, Iterator i) const
        { *i = rgb.r; i[offset] = rgb.g; i[2*offset] = rgb.b; }
        int offset;
    };

    int main()
    {
        const int repetitions = 10;
        const int size = 1000000;
        Interleaved ia(size);
        Planar pa(size);
        DefaultAccessor<unsigned char> d;
        PlanarAccessor<unsigned char> p(size);

        // planar => interleaved, reading through the proxy
        clock_t t1 = clock();
        for(int j = 0; j < repetitions; ++j)
        {
            Interleaved::iterator iia = ia.begin();
            Planar::const_iterator icpa = const_cast<Planar const &>(pa).begin();
            for(int k = 0; k < size; ++k, ++iia, ++icpa)
            {
                d.set(*icpa, iia);
            }
        }
        std::cerr << "read proxy: " << (float(clock() - t1) / CLOCKS_PER_SEC / repetitions) << std::endl;

        // planar => interleaved, reading through the accessor
        t1 = clock();
        for(int j = 0; j < repetitions; ++j)
        {
            Interleaved::iterator iia = ia.begin();
            Planar::DCI idcpa = const_cast<Planar const &>(pa).data.begin();
            for(int k = 0; k < size; ++k, ++iia, ++idcpa)
            {
                d.set(p(idcpa), iia);
            }
        }
        std::cerr << "read accessor: " << (float(clock() - t1) / CLOCKS_PER_SEC / repetitions) << std::endl;

        // interleaved => planar, writing through the proxy
        t1 = clock();
        for(int j = 0; j < repetitions; ++j)
        {
            Interleaved::const_iterator icia = const_cast<Interleaved const &>(ia).begin();
            Planar::iterator ipa = pa.begin();
            for(int k = 0; k < size; ++k, ++icia, ++ipa)
            {
                *ipa = d(icia);
            }
        }
        std::cerr << "write proxy: " << (float(clock() - t1) / CLOCKS_PER_SEC / repetitions) << std::endl;

        // interleaved => planar, writing through the accessor
        t1 = clock();
        for(int j = 0; j < repetitions; ++j)
        {
            Interleaved::const_iterator icia = const_cast<Interleaved const &>(ia).begin();
            Planar::DI idpa = pa.data.begin();
            for(int k = 0; k < size; ++k, ++icia, ++idpa)
            {
                p.set(d(icia), idpa);
            }
        }
        std::cerr << "write accessor: " << (float(clock() - t1) / CLOCKS_PER_SEC / repetitions) << std::endl;
    }

Ullrich Koethe wrote:
If you replace the assignment with a normal member function, you get VIGRA data accessors (which, by the way, were invented by Dietmar Kuehl, IIRC):
pixel_view_A.set(pixel_view_B(src_iter), dest_iter);
Here's a paper on property maps: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1873.html that may be relevant here.

Hi Ulli -- we just tried your test file with GCC 4.0.2, GCC 4.1.1 and Visual Studio 2005 (VC8). I used Cygwin for both GCC tests. My machine is a Xeon 3.2 dual CPU. Both the proxy and accessor implementations run at the SAME speed (release builds: -O2 for GCC and /O2 for Visual Studio). So could it be that GCC 3 is not doing a good job of optimizing the code?

Hailin

Hi Hailin,
we just tried your test file with GCC 4.0.2, GCC 4.1.1 and Visual Studio 2005 (VC8). I used Cygwin for both GCC tests. My machine is a Xeon 3.2 dual CPU. Both the proxy and accessor implementations run at the SAME speed (release builds: -O2 for GCC and /O2 for Visual Studio). So could it be that GCC 3 is not doing a good job of optimizing the code?
Apparently. So this is a non-issue then. Interestingly, the transformation interleaved => planar is faster than planar => interleaved. Is this also true on your platforms?

Ulli

Hi Ulli -- good question. Under GCC, all four versions (read/write, proxy/accessor) run at the same speed (fastest). However, under Visual Studio, interleaved=>planar is faster than planar=>interleaved with your original code (interleaved=>planar runs at the same speed as the ones under GCC). But after I added a method to RGBValue as follows:

    template <typename T2>
    RGBValue& operator=(const T2& pa) { r=pa.r; g=pa.g; b=pa.b; return *this; }

the proxy version for interleaved=>planar runs as fast as all the others. The only one left slower is the accessor version for planar=>interleaved. My guess is that in the accessor version, the compiler may create an unnecessary temporary object after dereferencing the planar proxy. This could be something interesting for you to look into.

Hailin

Hailin Jin wrote:
Hi Ulli -- good question.
Under GCC, all four versions (read/write, proxy/accessor) run at the same speed (fastest). However, under Visual Studio, interleaved=>planar is faster than planar=>interleaved with your original code (interleaved=>planar runs at the same speed as the ones under GCC). But after I added a method to RGBValue as follows:

    template <typename T2>
    RGBValue& operator=(const T2& pa) { r=pa.r; g=pa.g; b=pa.b; return *this; }

the proxy version for interleaved=>planar runs as fast as all the others. The only one left slower is the accessor version for planar=>interleaved. My guess is that in the accessor version, the compiler may create an unnecessary temporary object after dereferencing the planar proxy. This could be something interesting for you to look into.
Ok, if I haven't mentioned it in my review, I'll also want to see benchmarks. That'll be really nice.

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Ullrich Koethe wrote:
Hi Hailin,
we just tried your test file with GCC 4.0.2, GCC 4.1.1 and Visual Studio 2005 (VC8). I used Cygwin for both GCC tests. My machine is a Xeon 3.2 dual CPU. Both the proxy and accessor implementations run at the SAME speed (release builds: -O2 for GCC and /O2 for Visual Studio). So could it be that GCC 3 is not doing a good job of optimizing the code?
Apparently. So this is a non-issue then. Interestingly, the transformation interleaved => planar is faster than planar => interleaved. Is this also true on your platforms?
G++ is a poor compiler to test for optimization and speed. Use VC8 instead. It produces some of the tightest code so far.

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Joel de Guzman wrote:
G++ is a poor compiler to test for optimization and speed. Use VC8 instead. It produces some of the tightest code so far.
Ahem, can you please be a bit more specific? There are quite substantial differences between different versions. g++ 4.1 is actually very good (though, unfortunately, some of boost.mpl doesn't compile with g++ 4.1.1 due to some obscure bug).

Regards,
Stefan
-- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
Joel de Guzman wrote:
G++ is a poor compiler to test for optimization and speed. Use VC8 instead. It produces some of the tightest code so far.
Ahem, can you please be a bit more specific ? There are quite substantial differences between different versions. g++ 4.1 is actually very good (though, unfortunately some of boost.mpl doesn't compile with g++ 4.1.1 due to some obscure bug).
All of them. We've been testing Fusion with different compilers on different platforms, and so far g++ trails behind VC++ in all tests. If you're not convinced, please see Boost CVS. In libs/fusion/example/performance, you can find some tests. Try to beat VC8's results. g++ produces bigger and slower code in general. I chuckle when people show benchmarks with g++ alone and proclaim that library A is better than library B. I am not an MS fan, but the fact is that VC++ beats g++ with extreme TMP (inlining, templates, small objects optimization, etc.).

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Joel wrote:
Try to beat VC8's results. g++ produces bigger and slower code in general. I chuckle when people show benchmarks with g++ alone and proclaim that library A is better than library B. I am not an MS fan but the fact is that VC++ beats g++ with extreme TMP (inlining, templates, small objects optimization, etc).
Slightly off-topic, but how did Intel's 9.1 compiler fare (Win and/or Linux)? We've found it better for vectorisation, but we may be about to embark on using Fusion et al., so I wondered if you'd done any small object/inlining/templating tests with that too. I've often thought having some performance tests in the regression suite would be good, but I can see it being problematic to factor out different users and different CPUs et al.

Ullrich Koethe wrote:
Lubomir Bourdev wrote:
Three things come to mind:
- many more type conversions, e.g. int8 => uint16, RGB <=> LAB etc.
- adaptors for the subsampled image view
- support for multi-dimensional images (e.g. volume data, image sequences)
Another important point is RGBA operations. In most cases, the alpha channel is to be treated differently from the other channels, so you can't just use component-wise operators as for the other color models.
Yes! This was the point I was trying to make with Lubomir off list. I was stressing that with HSL/HSB too, you don't want to do component-wise operations. Yes, I'd like to see more color operations.
VIGRA doesn't have an explicit RGBA type (TinyVector<T, 4> can be used instead), because so far no-one came up with a convincing proposal for these operations. But without them, RGBA is pretty useless.
I don't think any of this is a show stopper, but it means considerable work.
Agreed. Hmmm... TinyVector<T, 4>... I think VIGRA should use Fusion for that instead ;-) Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Joel de Guzman wrote:
VIGRA doesn't have an explicit RGBA type (TinyVector<T, 4> can be used instead), because so far no-one came up with a convincing proposal for these operations. But without them, RGBA is pretty useless.
Hmmm... TinyVector<T, 4>... I think VIGRA should use Fusion for that instead ;-)
I had a look at Fusion, but I'm not sure whether it would be helpful in this context. TinyVector is based on three design goals: it should support the std::vector interface (except for resize etc.), it should be fast (you have millions of these beasts in a single image), and it should behave like a built-in arithmetic type (except for division, which is problematic because the zero vector is not the only one that may cause a division-by-zero error). Fusion may be more helpful in heterogeneous pixel types, but they often have very specific requirements that are perhaps better handled explicitly.

Regards
Ulli

Ullrich Koethe wrote:
Joel de Guzman wrote:
VIGRA doesn't have an explicit RGBA type (TinyVector<T, 4> can be used instead), because so far no-one came up with a convincing proposal for these operations. But without them, RGBA is pretty useless.
Hmmm... TinyVector<T, 4>... I think VIGRA should use Fusion for that instead ;-)
I had a look at Fusion, but I'm not sure whether it would be helpful in this context. TinyVector is based on three design goals: it should support the std::vector interface (except for resize etc.),

Like boost::array?

it should be fast (you have millions of these beasts in a single image),

Definitely.

and it should behave like a built-in arithmetic type (except for division, which is problematic because the zero vector is not the only one that may cause a division-by-zero error).

No problem. But have you seen Andy's work on matrices using fusion?

Fusion may be more helpful in heterogeneous pixel types, but they often have very specific requirements that are perhaps better handled explicitly.

Like what, for example?

Cheers,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Joel de Guzman wrote:
I had a look at Fusion, but I'm not sure whether it would be helpful in this context. TinyVector is based on three design goals: it should support the std::vector interface (except for resize etc.),
Like boost::array?
Similar, but more sophisticated. E.g. it has separate view templates that don't own the data, and provides a number of mathematical functions with loop unrolling.
like a built-in arithmetic type (except for division which is problematic because the zero vector is not the only one that may cause a division-by-zero error).
No problem. But have you seen Andy's work on matrices using fusion?
No, can you provide a pointer?
Fusion may be more helpful in heterogeneous pixel types, but they often have very specific requirements that are perhaps better handled explicitly.
Like what, for example?
How about RGBA? What would a Fusion version of alpha blending two pixels look like? The formula is

RGB' = alpha2*RGB2 + (1-alpha2)*alpha1*RGB1
alpha' = alpha2 + (1-alpha2)*alpha1

where pixel 2 is in front of pixel 1. Alternatively, with premultiplied alpha we have

RGBA' = RGBA2 + (1-alpha2)*RGBA1

When the individual channels are represented by uint8, we must add functionality to stay in the range 0...255.

Regards
Ulli
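For concreteness, the premultiplied formula above in uint8 arithmetic might look like the following sketch (my code, not VIGRA's or GIL's; the struct layout and helper are assumptions, and the channels are taken to be premultiplied so the sums cannot overflow):

    // Sketch only: "over" compositing with premultiplied alpha on uint8
    // channels, i.e. RGBA' = RGBA2 + (1-alpha2)*RGBA1.
    // Assumes channels are already premultiplied (c <= a), so results stay <= 255.
    struct RGBA8 { unsigned char r, g, b, a; };

    // (c * w) / 255 with correct rounding, for c, w in 0..255
    inline unsigned char scale255(unsigned char c, unsigned char w)
    {
        unsigned int t = c * static_cast<unsigned int>(w) + 128;
        return static_cast<unsigned char>((t + (t >> 8)) >> 8);
    }

    // 'front' is composited over 'back' (pixel 2 in front of pixel 1 above)
    inline RGBA8 composite_over(RGBA8 const & front, RGBA8 const & back)
    {
        unsigned char w = static_cast<unsigned char>(255 - front.a);
        RGBA8 out;
        out.r = static_cast<unsigned char>(front.r + scale255(back.r, w));
        out.g = static_cast<unsigned char>(front.g + scale255(back.g, w));
        out.b = static_cast<unsigned char>(front.b + scale255(back.b, w));
        out.a = static_cast<unsigned char>(front.a + scale255(back.a, w));
        return out;
    }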

Ullrich Koethe wrote:
Joel de Guzman wrote:
I had a look at Fusion, but I'm not sure whether it would be helpful in this context. TinyVector is based on three design goals: it should support the std::vector interface (except for resize etc.),
Like boost::array?
Similar, but more sophisticated. E.g. it has separate view templates that don't own the data, and provides a number of mathematical functions with loop unrolling.
There is some code in libs/fusion/example/performance that shows huge performance improvements from applying fusion algorithms to boost::array over std algorithms. This is due to the fact that fusion does the loop unrolling for you, automatically, since it does compile time recursion instead of loops.
Fusion may be more helpful in heterogeneous pixel types, but they often have very specific requirements that are perhaps better handled explicitly.
Like what, for example?
How about RGBA? What would a Fusion version of alpha blending two pixels look like? The formula is
RGB' = alpha2*RGB2 + (1-alpha2)*alpha1*RGB1
alpha' = alpha2 + (1-alpha2)*alpha1
where pixel 2 is in front of pixel 1. Alternatively, with premultiplied alpha we have
RGBA' = RGBA2 + (1-alpha2)*RGBA1
When the individual channels are represented by uint8, we must add functionality to stay in the range 0...255.
Ok, that will be the challenge. Fusion can definitely work on homogeneous as well as heterogeneous structures. And this case will be easier to optimize because of the use of ints (uint8) and simpler _flat_ structures, unlike Andy's more complex use of matrices and doubles (C++ can't actually have compile time floats). I bet Fusion's compile time metaprogramming will work in our favor this time. I'll get back to you on this, but it might take some time, OK?

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

"Joel de Guzman" <joel@boost-consulting.com> wrote in message news:eh85r6$ba$1@sea.gmane.org...
Ullrich Koethe wrote:
Joel de Guzman wrote:
VIGRA doesn't have an explicit RGBA type (TinyVector<T, 4> can be used instead), because so far no-one came up with a convincing proposal for these operations. But without them, RGBA is pretty useless.
Hmmm... TinyVector<T, 4>... I think VIGRA should use Fusion for that instead ;-)
I had a look at Fusion, but I'm not sure whether it would be helpful in this context. TinyVector is based on three design goals: it should support the std::vector interface (except for resize etc.),
Like boost::array?
it should be fast (you
have millions of these beasts in a single image),
Definitely.
and it should behave
like a built-in arithmetic type (except for division which is problematic because the zero vector is not the only one that may cause a division-by-zero error).
No problem. But have you seen Andy's work on matrices using fusion?
As far as the work on "tuple" matrices is concerned, though originally conceived to enable use of my Quan types in transform matrices:

http://quan.sourceforge.net/quan_matters/doc/html/index.html

the IMO more important use is to replace run time doubles with compile time "static" doubles, usually for values of 1 or 0. The effect of this is to reduce a typical 4 x 4 matrix multiply from 64 multiplies and 48 adds down to, for example, 9 multiplies and 9 adds in the case of a translation x rotation x translation transform. That is quite a profound reduction. Similar reductions are of course possible when applying the transform to vertices.

However there is a problem in VC7.1, which is that the compiler simply runs out of resources on relatively simple transforms using Fusion, and there is no way round that with Fusion AFAICS. OTOH there is no such problem in VC8 or gcc 4.1.1, the other two compilers I tested. However, rather than lose VC7.1, I opted to try a hand-rolled version: IOW I stripped Fusion out completely, removed the iterators, and provided custom vectors of 3, 9, 4 and 16 elements and custom rows and columns. This is not quite as neat as Fusion, where one algorithm can be applied to theoretically any combination of matrices; however, in looking at the assembler output from the hand-made version, I saw that by simplifying the programming and removing the extra layers of references, the compiler did now produce what looks to me perfect. (The example code here is simply a 3x3 rotation matrix multiplied by itself.)

N.B. as an improvement on perfect, it should also be well noted that because this is a simple test with local constants, the compiler has in fact NOT instantiated this assembler code at all in the main function, but has actually simply output constants. (This can be seen in the main assembler at the end.) This is an improvement on the Fusion version, where I guess the references do provide a barrier to some optimisations and functions were called in main. Be wary of short tests however ;-)

Note also the custom at_c functors, which I found useful. These enable the actual type of the result (reference, const reference, value) to be sorted out on an element by element basis. In fact the quanta::as_ref etc. are functors, so arbitrary functors could be substituted, e.g. multiply by a constant etc. IOW in light of this I am not sure now that using Fusion is optimal for what I want, but it did provide a good starting point and one could see this as optimising...

Source, with some extraneous stuff, is at the end. The assembler represents the mux(matrix,matrix) part before it is optimised out in this example. Finally the main assembler, showing output of a constant.

regards
Andy Little

    00001  dd 02              fld   QWORD PTR [edx]
    00003  dc 09              fmul  QWORD PTR [ecx]
    00005  dd 41 18           fld   QWORD PTR [ecx+24]
    00008  dc 4a 08           fmul  QWORD PTR [edx+8]
    0000b  de c1              faddp ST(1), ST(0)
    0000d  dd 18              fstp  QWORD PTR [eax]
    0000f  dd 42 08           fld   QWORD PTR [edx+8]
    00012  dc 49 20           fmul  QWORD PTR [ecx+32]
    00015  dd 02              fld   QWORD PTR [edx]
    00017  dc 49 08           fmul  QWORD PTR [ecx+8]
    0001a  de c1              faddp ST(1), ST(0)
    0001c  dd 58 08           fstp  QWORD PTR [eax+8]
    0001f  dd 42 20           fld   QWORD PTR [edx+32]
    00022  dc 49 18           fmul  QWORD PTR [ecx+24]
    00025  dd 42 18           fld   QWORD PTR [edx+24]
    00028  dc 09              fmul  QWORD PTR [ecx]
    0002a  de c1              faddp ST(1), ST(0)
    0002c  dd 58 18           fstp  QWORD PTR [eax+24]
    0002f  dd 41 08           fld   QWORD PTR [ecx+8]
    00032  dc 4a 18           fmul  QWORD PTR [edx+24]
    00035  dd 42 20           fld   QWORD PTR [edx+32]
    00038  dc 49 20           fmul  QWORD PTR [ecx+32]
    0003b  de c1              faddp ST(1), ST(0)
    0003d  dd 58 20           fstp  QWORD PTR [eax+32]

    int main()
    {
        matrix_type matrix(
            1., 2., zero(),
            4., 5., zero(),
            zero(), zero(), one()
        );

        typedef quanta::matrix_row<2,matrix_type,quanta::as_const_ref> row0_type;
        row0_type row0(matrix);
        std::cout << quanta::of_vector::at_c<2,quanta::as_const_ref>()(row0) <<'\n';

        typedef quanta::matrix_col<2,matrix_type,quanta::as_const_ref> col2_type;
        col2_type col2(matrix);
        std::cout << quanta::of_vector::at_c<2,quanta::as_const_ref>()(col2) <<'\n';

        quanta::dot_product<0,0,matrix_type::cols> dot;
        std::cout << dot(matrix,matrix) <<'\n';

        typedef quanta::matrix_mux<3,3,3,3> mux_type;
        mux_type mux;
        mux_type::result<matrix_type,matrix_type>::type result = mux(matrix,matrix);
        std::cout << result.at<0,0>() <<'\n';
    }

main function assembler for std::cout << result.at<0,0>() <<'\n';

    ; Line 84
    000c8  dd 05 00 00 00 00  fld   QWORD PTR __real@4022000000000000
    000ce  51                 push  ecx
    000cf  dd 1c 24           fstp  QWORD PTR [esp]
    000d2  e8 00 00 00 00     call  ??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@N@Z
                              ; std::basic_ostream<char,std::char_traits<char> >::operator<<

Andy Little wrote:
"Joel de Guzman" <joel@boost-consulting.com> wrote in message news:eh85r6$ba$1@sea.gmane.org...
Ullrich Koethe wrote:
Joel de Guzman wrote:
VIGRA doesn't have an explicit RGBA type (TinyVector<T, 4> can be used instead), because so far no-one came up with a convincing proposal for these operations. But without them, RGBA is pretty useless.
Hmmm... TinyVector<T, 4>... I think VIGRA should use Fusion for that instead ;-)
I had a look at Fusion, but I'm not sure whether it would be helpful in this context. TinyVector is based on three design goals: it should support the std::vector interface (except for resize etc.),

Like boost::array?

it should be fast (you have millions of these beasts in a single image),

Definitely.

and it should behave like a built-in arithmetic type (except for division, which is problematic because the zero vector is not the only one that may cause a division-by-zero error).

No problem. But have you seen Andy's work on matrices using fusion?
As far as the work on "tuple" matrices is concerned, though originally conceived to enable use in my Quan types in transform matrices:
http://quan.sourceforge.net/quan_matters/doc/html/index.html
The IMO more important use is to replace run time doubles with compile time "static" doubles usually for values of 1 or 0.
The effect of this is to reduce a typical 4 x 4 matrix multiply from 64 multiplies and 48 adds down to, for example, 9 multiplies and 9 adds in the case of a translation x rotation x translation transform. That is quite a profound reduction. Similar reductions are of course possible when applying the transform to vertices.

However there is a problem in VC7.1, which is that the compiler simply runs out of resources on relatively simple transforms using Fusion, and there is no way round that with Fusion AFAICS. OTOH there is no such problem in VC8 or gcc 4.1.1, the other two compilers I tested. However, rather than lose VC7.1, I opted to try a hand-rolled version: IOW I stripped Fusion out completely, removed the iterators, and provided custom vectors of 3, 9, 4 and 16 elements and custom rows and columns. This is not quite as neat as Fusion, where one algorithm can be applied to theoretically any combination of matrices; however, in looking at the assembler output from the hand-made version, I saw that by simplifying the programming and removing the extra layers of references, the compiler did now produce what looks to me perfect. (The example code here is simply a 3x3 rotation matrix multiplied by itself.)
N.B as an improvement on perfect, It should also be well noted that because this is a simple test with local constants, that the compiler has in fact Not instantiated this assembler code at all in the main function, but has actually simply outputs constants. (This can be seen in the main assembler at the end). This is an improvement on the Fusion version, where I guess the references do provide a barrier to some optimisations and functions were called in main. Be wary of short tests however ;-)
Note also the custom at_c functors, which I found useful. These enable the actual type of result... reference, const reference, value, to be sorted on a element by element basis. In fact the quanta::as_ref etc are functors so arbitrary functors could be substituted for e.g multiply by a constant etc.
IOW in light of this I am not sure now that using Fusion is optimal for what I want, but it did provide a good starting point and one could see this as optimising...
Source, with some extraneous stuff is at the end. The assembler represents the mux(matrix,matrix) part before its optimised out in this example. Finally the main assembler, showing output of a constant.
Andy, thanks for your analysis. Here are my thoughts:

1) Compilers will get better. That is implied in your statement that the problem you have with VC7.1 (internal structure overflow) is no longer present with VC8.

2) Fusion can and will be improved. Have you seen the latest code by Eric Niebler on segmented iterators and algorithms?

3) A tuple is a struct and a struct is a tuple. You can definitely mix hand-rolled structs and fusion tuples at will and provide algorithm "overloads" for optimization.

There's great benefit in generic programming. Certainly there are obstacles that hinder progress, and we generic programmers try to solve those as best we can. It's a journey. Anyway, thank you very much for your initial efforts!

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

"Joel de Guzman" <joel@boost-consulting.com> wrote in message
Andy, thanks for your analysis. Here are my thoughts:
1) Compilers will get better. That is implied in your statement that the problem you have with VC7.1 (internal structure overflow) is no longer present with VC8.
2) Fusion can and will be improved. Have you seen the latest code by Eric Niebler on segmented iterators and algorithms?
3) A tuple is a struct and a struct is a tuple. You can definitely mix hand-rolled structs and fusion tuples at will and provide algorithms "overloads" for optimization.
There's great benefit with generic programming. Certainly there are obstacles that hinder progress and we generic programmers try to solve those as best we can. It's a journey. Anyway, thank you very much for your initial efforts!
Hey, it's only really VC7.1 that prompted me to go for a hand-rolled version. Overall I would be happier sticking with Fusion, but.... (Also bear in mind that the quan::fusion matrix version was, as you say, an initial effort, and a Fusion matrix might be done much differently.)

In general, though, wading in deep... I have a feeling that the "everything's a list" approach and following the STL, as favoured by MPL and Fusion, may not be the best model to follow for compile time programming. This is no disrespect to MPL or Fusion.... (That said, I can't quantify why that is, but speed of compilation and compiler resources are a rough equivalent to memory use and runtime speed performance.)

IMO MPL particularly has some problems. However, bearing in mind that MPL represents some kind of leap into a brave new world of template metaprogramming, this is only to be expected, and therefore MPL is a great achievement, and much respect to Aleksey Gurtovoy. But I can't help feeling that it shouldn't be the final word, and I would hope for a more open analysis of MPL's good and bad points, without just getting the feeling that I am being criticised for attacking a helpless baby.

The "fusion" of compile time and runtime programming is a complicated but powerful beast, and more satisfactory, more 'whole', than TMP only. Fusion is built on MPL. That causes me problems, simply in the boost::type error messages. IMO this is indicative of the major problem in MPL: there are no types. IMO types are as important at compile time as at runtime.

I have a feeling that Fusion may be headed the same way as MPL, IOW it's built top down ("the compile time STL") rather than bottom up. The bottom up approach would start from applications (IOW real world use) and then "find" the generics. Unfortunately Boost can distort that view AFAICS because of its near standardisation, and I find that problematic. IMO reinventing the wheel is good because each iteration makes the wheel a little rounder, but premature standardisation leaves us paring away at a mostly square wheel. OTOH... what do I know :-)

As far as optimisation is concerned... as you say, VC8 is sweet, but on its own turf only; bear in mind that GCC must work on a large range of platforms, so it is unreasonable to expect it to compete head to head with VC8 on VC8's native territory. I do hope that ConceptGCC will be finished though, so that it conforms to its docs... then hopefully VC.XX will want to catch up with gcc in that department ;-)

And if you're thinking that the above makes no sense... .......... Yeah, maybe ;-)

regards
Andy Little

Andy Little wrote:
"Joel de Guzman" <joel@boost-consulting.com> wrote in message
Andy, thanks for your analysis. Here are my thoughts:
1) Compilers will get better. That is implied in your statement that the problem you have with VC7.1 (internal structure overflow) is no longer present with VC8.
2) Fusion can and will be improved. Have you seen the latest code by Eric Niebler on segmented iterators and algorithms?
3) A tuple is a struct and a struct is a tuple. You can definitely mix hand-rolled structs and fusion tuples at will and provide algorithms "overloads" for optimization.
There's great benefit with generic programming. Certainly there are obstacles that hinder progress and we generic programmers try to solve those as best we can. It's a journey. Anyway, thank you very much for your initial efforts!
Hey. It's only really VC7.1 that prompted me to go for a hand-rolled version. Overall I would be happier sticking with Fusion, but.... (Also bear in mind that the quan::fusion Matrix version was, as you say, an initial effort, and a Fusion Matrix might well be done quite differently, and better.)
BTW, did you install SP1 for VC7.1? http://tinyurl.com/hvfr9 ... just a thought...
Regards,
--
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net

On 10/19/2006 12:59 PM, Andy Little wrote:
"Joel de Guzman" <joel@boost-consulting.com> wrote in message news:eh85r6$ba$1@sea.gmane.org... [snip]
No problem. But have you seen Andy's work on matrices using fusion?
As far as the work on "tuple" matrices is concerned: though originally conceived to enable use of my Quan types in transform matrices:
http://quan.sourceforge.net/quan_matters/doc/html/index.html
the IMO more important use is to replace runtime doubles with compile-time "static" doubles, usually for values of 1 or 0.
Andy, if all the types are the same (e.g. all double), then maybe a tuple is too heavyweight. I'd think something like vector_c, maybe called matrix_c, would be more appropriate. Or maybe you want units with that, so that the elements would be a double, but a double with units of length/time or something else. Is that why tuples are being used instead of matrix_c?
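Something along these lines, perhaps (matrix_c here is a hypothetical name, sketched by analogy with mpl::vector_c, not an existing Boost component; it just stores one element type in a flat array):

    template <class T, int Rows, int Cols>
    struct matrix_c
    {
        T data[Rows * Cols];   // homogeneous storage, no tuple machinery
        T&       operator()(int r, int c)       { return data[r * Cols + c]; }
        T const& operator()(int r, int c) const { return data[r * Cols + c]; }
    };

    // Usage: every element is a plain double.
    // matrix_c<double, 3, 3> m = { { 1., 2., 0.,
    //                                4., 5., 0.,
    //                                0., 0., 1. } };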

"Larry Evans" <cppljevans@cox-internet.com> wrote in message news:eh99e2$4ch$1@sea.gmane.org...
[snip]
Andy, if all the types are the same (e.g. all double), then maybe a tuple is too heavyweight. I'd think something like vector_c, maybe called matrix_c, would be more appropriate. Or maybe you want units with that, so that the elements would be a double, but a double with units of length/time or something else. Is that why tuples are being used instead of matrix_c?
OK. The use I want to put matrices to is quite simple: I want to make 2D and 3D transform matrices for transforming points from one coordinate system to another. The first point is that these are essentially runtime rather than compile-time calculations. In the normal world, AFAICS, everything is a double. In that case making a 2D transform matrix is simple: you just need some container(s) of doubles which you then view as a matrix. AFAIK this is how uBLAS and MTL work, for example.

In the world of Quan, however, doubles are replaced by types representing (say) lengths in mm. This is a very different world from the "everything's a double" world. Let's take a multiply where everything's a double:

    typedef double T;
    T a, b;
    T c = a * b;

No problem. Apply this approach to a Quan length::mm:

    typedef quan::length::mm T;
    T a, b;
    T c = a * b; // Error

In the world of Quan, when you multiply 2 lengths you get something representing an area:

    typedef quan::length::mm T1;
    T1 a, b;
    typedef quan::area::mm2 T2;
    T2 c = a * b; // OK

(Whether you see this as useful or not is beside the point; this is simply what Quan does.) In fact there is a standard, obvious way to find the result type generically in Quan:

    typeof(a * b) c = a * b;

or

    auto c = a * b;

This is why I am a fan of Boost.Typeof ;-) Once you have access to the result_type of an operation, it can be applied to any two types for which operator* is defined. It could be quan::lengths, it could be doubles, it could be complex quantities, Colours ;-) It could also be a matrix. As in the length * length op above, the type of the result of multiplying 2 matrices is not the same as that of the arguments, but it too can be deduced in exactly the same way, using typeof(matrix1 * matrix2). This is why I am a fan of Fusion ;-)

In a matrix it is quite common to have many elements with a value of 0 or 1. In the case of Q * 1 we already know the result at compile time, and similarly for Q * 0. Although I know that the result of Q * 0 ---> Q(0), unfortunately in general the compiler doesn't. However, you can inform the compiler of your intent by using a special 'static' type, which in Quan I call a static_value. The static_value has a type and a value, but the value is a compile-time value, not a runtime value. In Quan I use the quan::meta::rational type as the value. To represent a static length in mm with a value of 0 mm and 1 mm:

    typedef static_value<quan::length::mm, meta::rational<0,1> > zero_mm;
    typedef static_value<quan::length::mm, meta::rational<1,1> > one_mm;

or a double:

    typedef static_value<double, meta::rational<0,1> > zero;
    typedef static_value<double, meta::rational<1,1> > one;

These static_value types are also PODs, IOW without ctors or user-defined assignment ops etc. This means the compiler can make assumptions and apply certain optimisations which it couldn't otherwise. static_values have runtime operations, both for ops with other static_values and for ops with runtime types. Because 'zero' is a different type to 'one', it is possible to provide separate overloads for particular types. The actual implementation is relatively complicated but doable, even for static Quan quantities. The rule for these ops is that wherever possible the result_type is a static_value rather than a runtime value:

    auto result = quan::length::mm(3.45678) * zero;

In this case the result type is predictable:

    static_assert( equal_to< typeof(result), zero_mm >::value );

IOW the result in this case is a static value.
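To make the mechanics concrete, here is a minimal self-contained sketch of the idea (deliberately simplified and hypothetical: Quan's real static_value also carries the unit and a full rational value, and provides many more overloads):

    #include <iostream>

    struct zero {};   // the value 0, carried in the type
    struct one  {};   // the value 1, carried in the type

    // Q * 0 ---> 0: the result is known statically; the runtime
    // argument is never even read.
    template <class Q> zero operator*(Q const&, zero) { return zero(); }

    // Q * 1 ---> Q: a pass-through the optimiser can elide entirely.
    template <class Q> Q operator*(Q const& q, one) { return q; }

    int main()
    {
        double q = 3.45678;
        zero r = q * zero();            // no floating-point op is emitted
        std::cout << q * one() << '\n'; // prints 3.45678; the multiply folds away
        (void)r;
    }

Because the overloads dispatch on the static types, the "multiplication" never touches the FPU at all.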
The actual calc, quan::length::mm(3.45678) * zero, is in fact a no-op, and the compiler knows all about it! The result is a dramatic improvement in performance. In my previous post I showed the VC8 assembly for a multiply of a matrix with some static_values by itself; it is repeated below for contrast. Here the transform is a rotation matrix (different transforms - scaling, translation, combos, etc. - have a different 'type').

    typedef quanta::of_vector::vector9<
        double, double, zero,
        double, double, zero,
        zero,   zero,   one
    > my_vect9;
    typedef quanta::matrix<3,3,my_vect9> matrix_type;
    matrix_type matrix(
        1., 2., zero(),
        4., 5., zero(),
        zero(), zero(), one()
    );

Assembler for the mux function (optimised out in the example):

    00001 dd 02     fld   QWORD PTR [edx]
    00003 dc 09     fmul  QWORD PTR [ecx]
    00005 dd 41 18  fld   QWORD PTR [ecx+24]
    00008 dc 4a 08  fmul  QWORD PTR [edx+8]
    0000b de c1     faddp ST(1), ST(0)
    0000d dd 18     fstp  QWORD PTR [eax]
    0000f dd 42 08  fld   QWORD PTR [edx+8]
    00012 dc 49 20  fmul  QWORD PTR [ecx+32]
    00015 dd 02     fld   QWORD PTR [edx]
    00017 dc 49 08  fmul  QWORD PTR [ecx+8]
    0001a de c1     faddp ST(1), ST(0)
    0001c dd 58 08  fstp  QWORD PTR [eax+8]
    0001f dd 42 20  fld   QWORD PTR [edx+32]
    00022 dc 49 18  fmul  QWORD PTR [ecx+24]
    00025 dd 42 18  fld   QWORD PTR [edx+24]
    00028 dc 09     fmul  QWORD PTR [ecx]
    0002a de c1     faddp ST(1), ST(0)
    0002c dd 58 18  fstp  QWORD PTR [eax+24]
    0002f dd 41 08  fld   QWORD PTR [ecx+8]
    00032 dc 4a 18  fmul  QWORD PTR [edx+24]
    00035 dd 42 20  fld   QWORD PTR [edx+32]
    00038 dc 49 20  fmul  QWORD PTR [ecx+32]
    0003b de c1     faddp ST(1), ST(0)
    0003d dd 58 20  fstp  QWORD PTR [eax+32]

... and this was actually optimised out, resulting in output of a constant with no runtime calc in the example. If I replace the static_values in the matrix with doubles (incidentally changing the type of the matrix, therefore):

    typedef quanta::of_vector::vector9<
        double, double, double,
        double, double, double,
        double, double, double
    > my_vect9;
    typedef quanta::matrix<3,3,my_vect9> matrix_type;
    matrix_type matrix(
        1., 2., 0.,
        4., 5., 0.,
        0., 0., 1.
    );

Assembler for the mux function (in this case the compiler calls the function from main in the modified example):

    00001 dd 02     fld   QWORD PTR [edx]
    00003 dc 09     fmul  QWORD PTR [ecx]
    00005 dd 42 08  fld   QWORD PTR [edx+8]
    00008 dc 49 18  fmul  QWORD PTR [ecx+24]
    0000b de c1     faddp ST(1), ST(0)
    0000d dd 41 30  fld   QWORD PTR [ecx+48]
    00010 dc 4a 10  fmul  QWORD PTR [edx+16]
    00013 de c1     faddp ST(1), ST(0)
    00015 dd 18     fstp  QWORD PTR [eax]
    00017 dd 41 08  fld   QWORD PTR [ecx+8]
    0001a dc 0a     fmul  QWORD PTR [edx]
    0001c dd 41 20  fld   QWORD PTR [ecx+32]
    0001f dc 4a 08  fmul  QWORD PTR [edx+8]
    00022 de c1     faddp ST(1), ST(0)
    00024 dd 41 38  fld   QWORD PTR [ecx+56]
    00027 dc 4a 10  fmul  QWORD PTR [edx+16]
    0002a de c1     faddp ST(1), ST(0)
    0002c dd 58 08  fstp  QWORD PTR [eax+8]
    0002f dd 41 10  fld   QWORD PTR [ecx+16]
    00032 dc 0a     fmul  QWORD PTR [edx]
    00034 dd 42 08  fld   QWORD PTR [edx+8]
    00037 dc 49 28  fmul  QWORD PTR [ecx+40]
    0003a de c1     faddp ST(1), ST(0)
    0003c dd 41 40  fld   QWORD PTR [ecx+64]
    0003f dc 4a 10  fmul  QWORD PTR [edx+16]
    00042 de c1     faddp ST(1), ST(0)
    00044 dd 58 10  fstp  QWORD PTR [eax+16]
    00047 dd 42 18  fld   QWORD PTR [edx+24]
    0004a dc 09     fmul  QWORD PTR [ecx]
    0004c dd 41 18  fld   QWORD PTR [ecx+24]
    0004f dc 4a 20  fmul  QWORD PTR [edx+32]
    00052 de c1     faddp ST(1), ST(0)
    00054 dd 41 30  fld   QWORD PTR [ecx+48]
    00057 dc 4a 28  fmul  QWORD PTR [edx+40]
    0005a de c1     faddp ST(1), ST(0)
    0005c dd 58 18  fstp  QWORD PTR [eax+24]
    0005f dd 41 20  fld   QWORD PTR [ecx+32]
    00062 dc 4a 20  fmul  QWORD PTR [edx+32]
    00065 dd 42 18  fld   QWORD PTR [edx+24]
    00068 dc 49 08  fmul  QWORD PTR [ecx+8]
    0006b de c1     faddp ST(1), ST(0)
    0006d dd 41 38  fld   QWORD PTR [ecx+56]
    00070 dc 4a 28  fmul  QWORD PTR [edx+40]
    00073 de c1     faddp ST(1), ST(0)
    00075 dd 58 20  fstp  QWORD PTR [eax+32]
    00078 dd 41 28  fld   QWORD PTR [ecx+40]
    0007b dc 4a 20  fmul  QWORD PTR [edx+32]
    0007e dd 42 18  fld   QWORD PTR [edx+24]
    00081 dc 49 10  fmul  QWORD PTR [ecx+16]
    00084 de c1     faddp ST(1), ST(0)
    00086 dd 41 40  fld   QWORD PTR [ecx+64]
    00089 dc 4a 28  fmul  QWORD PTR [edx+40]
    0008c de c1     faddp ST(1), ST(0)
    0008e dd 58 28  fstp  QWORD PTR [eax+40]
    00091 dd 41 18  fld   QWORD PTR [ecx+24]
    00094 dc 4a 38  fmul  QWORD PTR [edx+56]
    00097 dd 42 30  fld   QWORD PTR [edx+48]
    0009a dc 09     fmul  QWORD PTR [ecx]
    0009c de c1     faddp ST(1), ST(0)
    0009e dd 41 30  fld   QWORD PTR [ecx+48]
    000a1 dc 4a 40  fmul  QWORD PTR [edx+64]
    000a4 de c1     faddp ST(1), ST(0)
    000a6 dd 58 30  fstp  QWORD PTR [eax+48]
    000a9 dd 41 20  fld   QWORD PTR [ecx+32]
    000ac dc 4a 38  fmul  QWORD PTR [edx+56]
    000af dd 42 30  fld   QWORD PTR [edx+48]
    000b2 dc 49 08  fmul  QWORD PTR [ecx+8]
    000b5 de c1     faddp ST(1), ST(0)
    000b7 dd 41 38  fld   QWORD PTR [ecx+56]
    000ba dc 4a 40  fmul  QWORD PTR [edx+64]
    000bd de c1     faddp ST(1), ST(0)
    000bf dd 58 38  fstp  QWORD PTR [eax+56]
    000c2 dd 41 28  fld   QWORD PTR [ecx+40]
    000c5 dc 4a 38  fmul  QWORD PTR [edx+56]
    000c8 dd 42 30  fld   QWORD PTR [edx+48]
    000cb dc 49 10  fmul  QWORD PTR [ecx+16]
    000ce de c1     faddp ST(1), ST(0)
    000d0 dd 41 40  fld   QWORD PTR [ecx+64]
    000d3 dc 4a 40  fmul  QWORD PTR [edx+64]
    000d6 de c1     faddp ST(1), ST(0)
    000d8 dd 58 40  fstp  QWORD PTR [eax+64]

Call in main of the double version, a dramatic contrast to optimising to a constant (see my previous post):

    ; Line 82
    00081 8d 4c 24 34     lea  ecx, DWORD PTR _matrix$[esp+196]
    00085 83 c4 04        add  esp, 4
    00088 8b d1           mov  edx, ecx
    0008a 8d 44 24 78     lea  eax, DWORD PTR _result$[esp+192]
    0008e e8 00 00 00 00  call ??$?RU?$matrix@$02$02U?$vector9@NNNNNNNNNUas_value@quanta@@@of_vector@quanta@@@quanta@@U01@@?$matrix_mux@$02$02$02$02@quanta@@QBE?AU?$matrix@$02$02U?$vector9@NNNNNNNNNUas_value@quanta@@@of_vector@quanta@@@1@ABU21@0@Z
    ; quanta::matrix_mux<3,3,3,3>::operator()<quanta::matrix<3,3,quanta::of_vector::vector9<double,double,double,double,double,double,double,double,double,quanta::as_value> >,quanta::matrix<3,3,quanta::of_vector::vector9<double,double,double,double,double,double,double,double,double,quanta::as_value> > >
    ; Line 84
    00093 dd 44 24 78     fld  QWORD PTR _result$[esp+192]
    00097 83 ec 08        sub  esp, 8
    0009a dd 1c 24        fstp QWORD PTR [esp]
    0009d e8 00 00 00 00  call ??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@N@Z
    ; std::basic_ostream<char,std::char_traits<char> >::operator<<

The example is for 2D matrices. The results will not be quite as good for 3D types, where the ratio of static to runtime types is lower, but other transforms have a much lower cost than a rotation * rotation when using static values. More complex transforms may approach the all-runtime version more and more as static_values are replaced by runtime values as the calc proceeds, but this is a case of runtime types being supplied 'on demand' rather than always. In other cases runtime calcs are knocked out by being multiplied by a static zero. For a particular combination of transforms it is, I think, quite simple to work out the percentage improvement in terms of the number of adds and multiplies. I worked out a standard transform rotation round a point; I was incorrect in my previous post. The all-double version has 128 muxes and 96 adds; the static_value version has 9 multiplies and 9 adds. (I need to check this again, but I think it's correct.) That is a factor of 10 improvement, which isn't bad ;-)
regards
Andy Little

GIL incorporates a mechanism for active reduction of code bloat due to variant instantiations. I will present this approach at the LCSD workshop in the upcoming OOPSLA conference, which is in less than a week: http://lcsd.cs.tamu.edu/2006/
I'd like to read this paper.
I apologize, but I am not allowed to post the paper due to copyright issues. In a nutshell: when you invoke, say, a binary algorithm on a variant of, say, 10 types, you get 100 algorithm instantiations (one for every pair). This is often done via double dispatch, in which case you have 10+1 switch statements with 10 cases each. It is often the case that many of these instantiations will be identical at the assembly level. Consider, for example, copy_pixels(rgb8,rgb8) and copy_pixels(lab8,lab8). GIL has a mechanism that identifies algorithm instantiations that are assembly-level identical and instantiates them only once. It also optimizes the above-mentioned double dispatch; in particular, it turns this into a single switch statement containing just the appropriate set of cases (typically far fewer than 100). It results in code that is both faster and smaller, and sometimes compiles faster (though on average it compiles a lot slower). This optimization is disabled by default, primarily because it makes debugging harder and it can take a toll on compile time. But it is useful to enable it if you use variants extensively and care about code bloat. Some compilers can already detect and reuse assembly-level identical instantiations, in which case this technique results in only a marginal improvement. The code is inside the dynamic_image extension. The flag is GIL_REDUCE_CODE_BLOAT.
Lubomir
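Roughly, the idea is as follows (a sketch of the mechanism described above, not GIL's actual code; the pixel types and the reduce metafunction here are illustrative only):

    #include <cstddef>
    #include <cstring>

    // Two pixel types with identical memory layout: three 8-bit channels.
    struct rgb8 { unsigned char ch[3]; };
    struct lab8 { unsigned char ch[3]; };

    // Map each type to a canonical layout-equivalent representative, so an
    // algorithm body is instantiated once per layout rather than once per type.
    template <class T> struct reduce       { typedef T    type; };
    template <>        struct reduce<lab8> { typedef rgb8 type; };

    template <class P>
    void copy_pixels_impl(P const* src, P* dst, std::size_t n)
    {
        std::memcpy(dst, src, n * sizeof(P));
    }

    template <class P>
    void copy_pixels(P const* src, P* dst, std::size_t n)
    {
        // copy_pixels<rgb8> and copy_pixels<lab8> now share one _impl body.
        typedef typename reduce<P>::type base;
        copy_pixels_impl(reinterpret_cast<base const*>(src),
                         reinterpret_cast<base*>(dst), n);
    }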

Ulli wrote:
Lubomir wrote:
Do you see any issues with the GIL interfaces that would prevent porting the Vigra algorithms to GIL, or that would affect their performance? If yes, what interfaces are missing and what algorithms need them?
Three things come to mind:
- many more type conversions, e.g. int8 => uint16, RGB <=> LAB, etc.
- adaptors for the subsampled image view
- support for multi-dimensional images (e.g. volume data, image sequences)
My question was not what could be added to GIL; my question was what in your opinion needs to _change_ in the GIL interfaces and design, if anything. Adding new functionality in later versions of the library is perfectly normal, and I foresee additions to the GIL core once the algorithms are implemented. Nothing you mentioned requires changes (the PixelAccessor issue is discussed in another thread, and even for it we agree it is possible to have adapters).
Lubomir
participants (9)
- Andy Little
- Hailin Jin
- Joel de Guzman
- Larry Evans
- Lubomir Bourdev
- Paul Baxter
- Peter Dimov
- Stefan Seefeld
- Ullrich Koethe