How to structure libraries?

I may have a few components to propose as libraries. They mostly deal with low-level, architecture-dependent problems that get homogenized across compilers/platforms. Things like a class for padding a type to a given boundary, a SIMD computation DSEL, traits concerning low-level properties of types and/or compilers, and some more. What's the best way to handle this? A series of small-scale libraries or a larger framework like boost::architecture or something?
-- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

On Wed, Jan 14, 2009 at 3:08 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
I may have a few components to propose as libraries. They mostly deal with low-level, architecture-dependent problems that get homogenized across compilers/platforms. Things like a class for padding a type to a given boundary, a SIMD computation DSEL, traits concerning low-level properties of types and/or compilers, and some more.
What's the best way to handle this? A series of small-scale libraries or a larger framework like boost::architecture or something?
I was wondering when somebody would build something like this... I think a first version would be a larger framework - or a collection of smaller libraries (Boost.Arch?). That gives a focal point for reviewing, and the libraries can be broken out or the elements migrated later.
Andrew Sutton andrew.n.sutton@gmail.com

Andrew Sutton wrote:
On Wed, Jan 14, 2009 at 3:08 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
I may have a few components to propose as libraries. They mostly deal with low-level, architecture-dependent problems that get homogenized across compilers/platforms. Things like a class for padding a type to a given boundary, a SIMD computation DSEL, traits concerning low-level properties of types and/or compilers, and some more.
What's the best way to handle this? A series of small-scale libraries or a larger framework like boost::architecture or something?
I was wondering when somebody would build something like this...
I think a first version would be a larger framework - or a collection of the smaller libraries (Boost.Arch?).
I'd consider following the scheme adopted for Boost.TypeTraits, which in a sense is a collection of microlibraries (one per trait): this very compact scheme minimizes the impact, in terms of documentation, of adding a new feature.
Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

Andrew Sutton wrote:
I was wondering when somebody would build something like this...
Indeed ;)
I think a first version would be a larger framework - or a collection of the smaller libraries (Boost.Arch?). That gives a focal point for reviewing, and the libraries can be broken out or the elements migrated later.
Well, maybe having a structure like Boost.Math then? Currently planned libraries (some in development, some done) include:

* Arch.Memory: provides free functions and STL-compatible allocators for aligned memory allocation, and a macro to add aligned new/delete support to classes. Basically a platform-independent wrapper around posix_memalign and its ilk.

* Arch.Types: some low-level traits and generation classes, including things like:
-> make_integer, which builds an integer type able to hold at most N bytes and with a given signedness, e.g. make_integer<3,signed>::type returns int32_t.
-> typed_bitfield, which wraps a type T and a properly sized byte array and provides byte access via operator[]. Useful for decomposing large types into bytes for low-level operations.
-> padded<T,N>, which provides an Envelope/Letter class around type T whose sizeof is padded to a multiple of 2^N. It reuses compressed_pair for the empty-base-class optimisation.

* Arch.SIMD: a SIMD computation library that provides a small-scale binding to the internal representation of a SIMD register on SIMD-enabled machines. Using Proto, a small DSEL is provided so one can write code like:

vec<float> k = {5,5,5,5}, r;
r = ((2*k)/(3-k)).min();

and have vec generate proper SSEx or Altivec code. This also requires some large config code so we can detect which extension can be/is being used. Traits like is_simd, cardinal, vector_of and scalar_of are provided. The current plan is to see if vec can also act as a Fusion container, to see how to properly optimize fused operations (I have an actual PhD student on this), and to make it easily extensible for upcoming SIMD extensions (including AVX, SSE5 and the Cell SPU).
-> Depends on Memory and Types.

Some others may be done I think, and some are not meant to be there (like an eval_as meta-function that can compute the result type of numerical computations), but that's what I have at hand so far.
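The Arch.Types pieces above are easy to sketch. Below is a hypothetical, simplified make_integer along the lines described; the real proposal wraps Boost.Integer's int_t/uint_t, so the bool-for-signedness parameter and the hand-rolled selection here are illustrative only:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative sketch only: pick the smallest fixed-width integer type
// able to hold N bytes, signed or unsigned. The proposed library builds
// this on top of Boost.Integer's int_t/uint_t instead.
template <std::size_t N, bool Signed>
struct make_integer
{
    typedef typename std::conditional<N <= 1,
        typename std::conditional<Signed, std::int8_t,  std::uint8_t>::type,
        typename std::conditional<N <= 2,
            typename std::conditional<Signed, std::int16_t, std::uint16_t>::type,
            typename std::conditional<N <= 4,
                typename std::conditional<Signed, std::int32_t, std::uint32_t>::type,
                typename std::conditional<Signed, std::int64_t, std::uint64_t>::type
            >::type
        >::type
    >::type type;
};

// 3 bytes rounds up to a 32-bit integer, as in the example above.
static_assert(std::is_same<make_integer<3, true>::type, std::int32_t>::value,
              "make_integer<3,signed> yields int32_t");
```

The nesting of std::conditional is exactly the kind of boilerplate int_t/uint_t already hide, which is why Joel later concedes the component may be redundant.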
So I guess a proper clean-up of those things needs to be done before they're uploaded to the Vault (hoping they'll raise more interest than my puny type-ID library ;)).

Joel Falcou wrote:
* Arch.Types: some low-level traits and generation classes, including things like: -> make_integer, which builds an integer type able to hold at most N bytes and with a given signedness, e.g. make_integer<3,signed>::type returns int32_t.
Isn't that the same as Boost.Integer's int_t family? http://www.boost.org/doc/libs/1_36_0/boost/integer.hpp
* Arch.SIMD: a SIMD computation library that provides a small-scale binding to the internal representation of a SIMD register on SIMD-enabled machines. Using Proto, a small DSEL is provided so one can write code like:
vec<float> k = {5,5,5,5},r; r = ((2*k)/(3-k)).min();
and have vec generate proper SSEx or Altivec code. This also requires some large config code so we can detect which extension can be/is being used. Traits like is_simd, cardinal, vector_of and scalar_of are provided. The current plan is to see if vec can also act as a Fusion container, to see how to properly optimize fused operations (I have an actual PhD student on this), and to make it easily extensible for upcoming SIMD extensions (including AVX, SSE5 and the Cell SPU).
This sounds very interesting. Working with SSE intrinsics is rather a pain. Sebastian
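To illustrate Sebastian's point, here is roughly what the one-line DSEL expression above costs when written by hand with SSE2 intrinsics. Interpreting .min() as a horizontal min reduction is my assumption about the intended semantics; the scalar fallback branch is there only so the sketch compiles off x86:

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>  // SSE2

// Hand-written SSE2 version of r = ((2*k)/(3-k)).min() with k = {5,5,5,5}.
// Each lane computes (2*5)/(3-5) = -5, then two shuffle+min steps reduce
// the four lanes to a single scalar minimum.
inline float simd_expr_min()
{
    __m128 k = _mm_set1_ps(5.0f);
    __m128 q = _mm_div_ps(_mm_mul_ps(_mm_set1_ps(2.0f), k),
                          _mm_sub_ps(_mm_set1_ps(3.0f), k));
    __m128 t = _mm_min_ps(q, _mm_shuffle_ps(q, q, _MM_SHUFFLE(2, 3, 0, 1)));
    t        = _mm_min_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(t);
}
#else
// Scalar fallback for non-x86 targets: all four lanes are equal here,
// so the reduction is trivial. A real library would dispatch per target.
inline float simd_expr_min()
{
    return (2.0f * 5.0f) / (3.0f - 5.0f);
}
#endif
```

Burying this shuffle choreography behind an expression template is exactly the convenience being proposed.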

Sebastian Redl wrote:
Isn't that the same as Boost.Integer's int_t family?
Well, I needed a somewhat different interface: make_integer<3, signed>::type returns int32_t, i.e. the signed integer able to hold at least 3 bytes. make_integer doesn't do more than call int_t/uint_t anyway, so maybe it is redundant.

Joel Falcou wrote:
* Arch.SIMD: a SIMD computation library that provides a small-scale binding to the internal representation of a SIMD register on SIMD-enabled machines. Using Proto, a small DSEL is provided so one can write code like:
vec<float> k = {5,5,5,5}, r; r = ((2*k)/(3-k)).min();
and have vec generate proper SSEx or Altivec code.
Very interesting. If you need a motivating problem, have a look at the DCT code in libjpeg. I thought about trying to improve this using gcc's SIMD extensions (which I guess you'll use as the back-end of your code on that platform, right?) but quickly decided it was too complicated. (As to the "how to structure libraries?" question, I would prefer to make each independent unit a separate library in its own right, even if they are small. Bundling things up into collection-libraries will tend to hide the contents; see for example today's example of the binary constants feature, hidden inside Boost.Utility, and apparently undiscovered. I suggest that this SIMD code should be its own library, Boost.SIMD.) Phil.

Phil Endecott wrote:
Very interesting. If you need a motivating problem, have a look at the DCT code in libjpeg.
Well ^^ why not :)
I thought about trying to improve this using gcc's SIMD extensions (which I guess you'll use as the back-end of your code on that platform, right?) but quickly decided it was too complicated.
SSE2 intrinsics on Intel platforms and Altivec intrinsics on the PPC platform. I still have to see how to handle Visual Studio, which can't handle typed vector types properly :/
(As to the "how to structure libraries?" question, I would prefer to make each independent unit a separate library in its own right, even if they are small. Bundling things up into collection-libraries will tend to hide the contents; see for example today's example of the binary constants feature, hidden inside Boost.Utility, and apparently undiscovered. I suggest that this SIMD code should be its own library, Boost.SIMD.)
I was thinking about this too. But SIMD has a lot of dependencies on other small libraries (mainly my Memory one). What would happen if SIMD is accepted and Memory is not?

On Thursday 15 January 2009 09:02 am, Joel Falcou wrote:
* Arch.SIMD: a SIMD computation library that provides a small-scale binding to the internal representation of a SIMD register on SIMD-enabled machines. Using Proto, a small DSEL is provided so one can write code like:
vec<float> k = {5,5,5,5}, r; r = ((2*k)/(3-k)).min();
and have vec generate proper SSEx or Altivec code.
Isn't SIMD-type stuff what valarray was designed for? I don't think a SIMD-optimized valarray requires its own DSEL; there is the macstl implementation, which has been around for a while.
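For reference, the same computation expressed with the std::valarray Frank alludes to. The interface is there, but nothing in the standard obliges an implementation to emit SIMD code for it:

```cpp
#include <cassert>
#include <valarray>

// std::valarray gives whole-array expressions and reductions out of the
// box; whether any of this becomes SIMD code is purely a quality-of-
// implementation matter, not a guarantee of the interface.
inline float valarray_expr_min()
{
    std::valarray<float> k(5.0f, 4);                   // {5,5,5,5}
    std::valarray<float> r = (2.0f * k) / (3.0f - k);  // element-wise
    return r.min();                                    // horizontal reduction
}
```

The syntactic resemblance to the proposed vec<T> is what prompts the question; the disagreement below is about whether the codegen follows.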

Frank Mori Hess wrote:
Isn't SIMD-type stuff what valarray was designed for?
I don't think so, or you'll have to show me where in the STL sources of valarray this SIMD implementation lives.
I don't think a SIMD optimized valarray requires its own DSEL, there is the macstl implementation that has been around for a while.
It's not a SIMD valarray; vec<T> acts as a SIMD POD value. I know about macstl (is it still maintained, btw?) as I designed the very same kind of library during my PhD, but vec<T> has a completely different use case, i.e. writing low-level SIMD-enabled numerical algorithms without necessarily working through an array abstraction.

On Thursday 15 January 2009 13:17, Joel Falcou wrote:
Frank Mori Hess a écrit :
Isn't SIMD-type stuff what valarray was designed for?
I don't think so, or you'll have to show me where in the STL sources of valarray this SIMD implementation lives.
I was referring to my recollection of this post from the guy who designed valarray: http://www.oonumerics.org/oon/oon-list/archive/0493.html where he says it was intended to be highly optimizable for running on vector supercomputers.

Frank Mori Hess wrote:
I was referring to my recollection of this post from the guy who designed valarray:
http://www.oonumerics.org/oon/oon-list/archive/0493.html
where he says it was intended to be highly optimizable for running on vector supercomputers.
Well, it's clear that valarray implements some expression templates to optimize its evaluations. But beware: few compilers nowadays are able to automagically turn scalar code into proper, optimized SIMD code (or, again, I'll be pleased to be introduced to one). Automatic vectorization - like automatic parallelization in general - is still in its infancy and largely academic.

On Thursday 15 January 2009 12:46, Joel Falcou wrote:
Frank Mori Hess wrote:
I was referring to my recollection of this post from the guy who designed valarray:
http://www.oonumerics.org/oon/oon-list/archive/0493.html
where he says it was intended to be highly optimizable for running on vector supercomputers
But beware, few compilers nowadays are able to automagically turn scalar code into proper, optimized SIMD code (or, again, I'll be pleased to be introduced to one).
Automatic vectorization - like automatic parallelisation in general - is still in its infancy and largely academic.
Ahem: http://www.cray.com

Vectorization has been a known technology since the '60s. Companies like PGI and Pathscale provide auto-vectorization for SSE. Cray recently released a vectorizing compiler for the XT supercomputer. Auto-parallelization has been around since at least the '80s in production machines. I'm sure it was around even earlier than that.

The tricky thing about SSE is all of the instructions that don't fit neatly into operations expressible in higher-level languages. Things like permutations and shuffles. Compilers will use some of those instructions (for complex data, for example), but in general the intrinsics are needed to do really fancy stuff. Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.

Your simple SIMD expression example isn't terribly compelling. Any competent compiler should be able to vectorize a scalar loop that implements it. What would be compelling is a library to express things like the Cell's scratchpad. Libraries to do data staging would be interesting because more and more processors are going to add these kinds of local memory.

-Dave
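To make Dave's claim concrete, this is the kind of scalar loop that current optimizing compilers vectorize without any help (e.g. gcc at -O3, icc by default). The __restrict qualifiers, a common compiler extension, make the no-aliasing guarantee explicit so the independence analysis is trivial:

```cpp
#include <cassert>
#include <cstddef>

// A plain scalar loop: y = a*x + y. With aliasing ruled out via
// __restrict, an auto-vectorizer can turn this into SSE/Altivec code
// with no intrinsics and no library involved.
void saxpy(float a, const float* __restrict x,
           float* __restrict y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The thread's disagreement is not over loops like this one, but over idioms (shuffles, reductions, saturated arithmetic) that don't fall out of such analysis.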

David A. Greene wrote:
Ahem: http://www.cray.com
Two points:
1/ Not everyone has access to a Cray-like machine. Parallelization tools for CotS machines are not to be neglected and, on this front, lots of things need to be done.
2/ vector supercomputer != SIMD-enabled processor, even if the former may include the latter.
Auto parallelization has been around since at least the '80's in production machines. I'm sure it was around even earlier than that.
What do you call auto-parallelization? Are you telling me that, nowadays, I can take *any* source code written in C or C++ or whatever, compile it with some compiler specifying --parallel, and automagically get a parallel version of the code? If so, you'll have to send a memo to at least a dozen research teams (including mine) all over the world so they can stop working on this problem and move on to something else. Should I also assume that each time a new architecture comes out, those compilers also know the best way to generate code for it? I beg to differ, but automatic parallelization is far from "done". Then again, just looking at the problem of writing SIMD code: explain why we still get better performance for simple code when writing SIMD code by hand than when letting gcc auto-vectorize it?
Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.
Well, considering the question was first about how to structure the group of libraries I'm proposing, I apologize for not having taken the time to list all the features of those libraries. Moreover, even with a simple example, the fact that the library hides the differences between SSE2, SSSE3, SSE3, SSE4, Altivec, SPU-VMX and the forthcoming AVX is a feature on its own. Oh, and as specified in the former mail, the DSEL takes care of optimizing fused operations, so things like FMA are detected and replaced by the proper intrinsic when possible. Same with reductions like min/max, operations like b*c-a, or SAD on SSEx.
Your simple SIMD expression example isn't terribly compelling. Any competent compiler should be able to vectorize a scalar loop that implements it.
Well, sorry then to have given a simple example.
What would be compelling is a library to express things like the Cell's scratchpad. Libraries to do data staging would be interesting because more and more processors are going to add these kinds of local memory.
I don't see what you have in mind. Do you mean something like Hierarchically Tiled Arrays? Or some Cell-based development library? If the latter, I don't think Boost is the best home for it. As for HTA, lots of implementations already exist, and guess what, they just do the parallelization themselves instead of letting the computer do it.
Anyway, we'll be able to discuss the library itself and its features when a proper thread for it starts.

On Saturday 17 January 2009 03:33, Joel Falcou wrote:
David A. Greene wrote:
Ahem: http://www.cray.com
Two points: 1/ Not everyone has access to a Cray-like machine. Parallelization tools for CotS machines are not to be neglected and, on this front, lots of things need to be done.
gcc does some of this already. It's getting better. But it's far behind other compilers.
2/ vector supercomputer != SIMD-enabled processor even if the former may include the later.
A Cray XT machine is made up of AMD Barcelona processors. It is a SIMD machine. SIMD is nothing more than very short vectors. SSE lacks some nice hardware that allows compilers to vectorize more things but auto-vectorization is perfectly doable with SSE when the hardware supports it.
Auto parallelization has been around since at least the '80's in production machines. I'm sure it was around even earlier than that.
What do you call auto-parallelization ?
Are you telling me that, nowadays, I can take *any* source code written in C or C++ or whatever, compile it with some compiler specifying --parallel, and automagically get a parallel version of the code? If so, you'll
Yes, you'll get regions of parallel code. Compilers can look at loops and schedule iterations across threads or cores. In some cases they can schedule independent calls on different threads. The compiler won't do magical stuff though.
have to send a memo to at least a dozen research teams (including mine) all over the world so they can stop working on this problem and move on to something else. Should I also assume that each time a new architecture comes out, those compilers also know the best way to generate code for it? I beg to differ, but automatic parallelization is far from "done".
Did I say it was done? A compiler is not going to be able to take complex pointer-chasing code and parallelize it automatically. With some of the new parallel languages it has a better chance, but in most cases the information just isn't there. But your library didn't strike me as supporting general parallelization along the lines of futures or such things. Structuring high-level parallelism has almost[1] nothing to do with the minute machine details you've talked about so far.

You're taking this way too personally. I'm not suggesting that researchers close up shop. I'm suggesting that researchers should make sure they're tackling the correct problems. My experience tells me the problem is not the use of vector instructions[2] but rather lies in how to express high-level, task-based parallelism.

[1] Note that I said "almost." Some hardware features can greatly aid parallelization, such as the semaphore bits on the MTA. In these cases I would like to see architecture-specific optimizations within other libraries such as Boost.Futures and Boost.MPI.

[2] Except for special-purpose fields like graphics, where a general-purpose compiler probably hasn't been given the smarts because it's not worth it to the general-purpose compiler vendor.
Then again, by just looking at the problem of writing SIMD code : explain why we still get better performance for simple code when writing SIMD code by hand than letting gcc auto-vectorize it ?
Because gcc is not an optimizing compiler. That's not its focus. It's getting better, but I would encourage you to explore compilers from Intel, PGI and Pathscale. All of these easily beat gcc in terms of performance. Probably the biggest mistake academic researchers make is comparing their results to gcc. It's just not a valid comparison.
Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.
Well, considering the question was first about how to structure the group of library i'm proposing,
I apologize to not having taken the time to express all the features of those libraries.
Well, one would expect the author of a library to enumerate what it can do. If you have these kinds of constructs, that's great!
Moreover, even with a simple example, the fact that the library hides the differences between SSE2, SSSE3, SSE3, SSE4, Altivec, SPU-VMX and the forthcoming AVX is a feature
Only insofar as it handles idioms the compiler won't otherwise recognize.
on its own. Oh, and as specified in the former mail, the DSEL takes care of optimizing fused operations, so things like FMA are detected and replaced by the proper intrinsic when possible. Same with reductions like min/max, operations like b*c-a, or SAD on SSEx.
A good compiler will do that too. I don't know that current compilers will make use of PSADBW because the computation may not be general-purpose enough. But it shouldn't be hard to teach a compiler about the idiom. It's simply a specific kind of reduction.
Your simple SIMD expression example isn't terribly compelling. Any competent compiler should be able to vectorize a scalar loop that implements it
Well, sorry then to have given a simple example.
Your claim was that compilers would not be able to handle it. I countered the claim. I'm interested in any other examples you have.
What would be compelling is a library to express things like the Cell's scratchpad. Libraries to do data staging would be interesting because more and more processers are going to add these kinds of local memory
I don't see what you have in mind. Do you mean something like Hierarchically Tiled Arrays? Or some Cell-based development library? If the latter, I don't think Boost is the best home for it. As for HTA, lots of implementations already exist, and guess what, they just do the parallelization themselves instead of letting the computer do it.
I can't find a reference for HTA specifically, but my guess from the name of the concept tells me that it's probably already covered elsewhere, as you say. Still, it might be worthwhile to propose a Boost version. Other MPI implementations exist, for example, but that didn't stop Boost.MPI. The closest reference I found was http://portal.acm.org/citation.cfm?id=645605.662909 Is that what you mean? A Cell-based development library wouldn't be terribly useful. Something that provided the same idioms across architectures would be. As you say, the architecture abstraction is important.
Anyway, we'll be able to discuss the library in itself and its features when a proper thread for it will start.
I look forward to it. I think there's value here but we should figure out the focus and drop any unnecessary things. -Dave

Did I say it was done? A compiler is not going to be able take complex pointer-chasing code and parallelize it automatically. With some of the new parallel languages it has a better chance but in most cases the information just isn't there. But your library didn't strike me as supporting general parallelization along the lines of futures or such things. Structuring high-level parallelism has almost[1] nothing to do with the minute machine details you've talked about so far.
Then again, I didn't say it was meant to. I presented it as building blocks for DEVELOPERS needing to write code at this level of abstraction on this kind of hardware.
You're taking this way too personally. I'm not suggesting that researchers close up shop. I'm suggesting that researchers should make sure they're tackling the correct problems.
Sorry if I sound like this, but since I started this thread it seems that half the community
My experience tells me the problem is not the use of vector instructions[2] but rather lies in how to express high-level, task-based parallelism
Except maybe 90% of current HPC needs are simple data-parallelism. But of course, I may be wrong yet again as the following trend showed.
[2] Except for special-purpose fields like graphics where a general-purpose compiler probably hasn't been given the smarts because it's not worth it to the general-purpose compiler vendor.
Graphics, computer vision, physics, cryptography, etc. ... you name it. If I have a machine with X levels of hardware parallelism, I want to take advantage of all of them as they fit the algorithm.
Because gcc is not an optimizing compiler. That's not its focus. It's getting better but I would encourage you to explore compilers from Intel, PGI and Pathscale. All of these easily beat gcc in terms of performance.
They beat gcc for sure, except they still don't beat hand-written code in some simple cases. Let ICC optimize a 2D convolution kernel, and it'll still be 5x too slow, even though the optimisation required to do it is perfectly doable by a machine.
Probably the biggest mistake academic researchers do is compare their results to gcc. It's just not a valid comparison.
It's just what the rest of the community uses anyway ...
Well, one would expect the author of a library to enumerate what it can do. If you have these kinds of constructs, that's great!
*points at thread title* The fact that the thread was about structuring them *before* the proposal may be a good hint.
Your claim was that compilers would not be able to handle it. I countered the claim. I'm interested in any other examples you have.
As I said: a convolution kernel.
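For concreteness, the scalar form of the example Joel keeps citing: a 3x3 convolution over an image, the loop nest that reportedly still runs several times slower when left to the auto-vectorizer than when the SIMD is written by hand. This sketch is illustrative of the access pattern only, not code from the proposed library (borders skipped for brevity):

```cpp
#include <cassert>
#include <cstddef>

// Scalar 3x3 convolution over a single-channel, row-major image,
// skipping the one-pixel border. The overlapping, strided loads in the
// inner loops are what make this hard for auto-vectorizers.
void convolve3x3(const float* in, float* out,
                 std::size_t w, std::size_t h, const float k[9])
{
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x)
        {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += k[(dy + 1) * 3 + (dx + 1)]
                         * in[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
}
```

Vectorizing this well requires unaligned or shuffled loads that reuse neighbouring pixels across lanes, which is precisely the kind of idiom discussed above.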
I can't find a reference for HTA specifically, but my guess from the name of the concept tells me that it's probably already covered elsewhere, as you say.
David Padau is the key name.
A Cell-based development library wouldn't be terribly useful. Something that provided the same idioms across architectures would be. As you say, the architecture abstraction is important.
Go say that to half the video game companies that still use the Cell as a strange PPC and neglect the SPEs.
I look forward to it. I think there's value here but we should figure out the focus and drop any unnecessary things.
The focus is what I said: a homogeneous interface to SIMD extensions, no more, no less. I never claimed I wanted to tackle C++ parallel programming with only this library.
I consider this thread closed as it won't go further than slapping academic references on this and that without actual content. 20+ posts of off-topic is already enough. I had my answer from post 4. Thanks.

On Monday 19 January 2009 14:12, Joel Falcou wrote:
You're taking this way too personally. I'm not suggesting that researchers close up shop. I'm suggesting that researchers should make sure they're tackling the correct problems.
Sorry if I sound like this but since I started this thread it seems that half the community
My experience tells me the problem is not the use of vector instructions[2] but rather lies in how to express high-level, task-based parallelism
Did I lose a bit of your message?
Except maybe 90% of current HPC needs are simple data-parallelism. But of course, I may be wrong yet again as the following trend showed.
HPC codes are getting very complex. Task parallelism is very important. Happily, that coincides with the commodity community finally having to grapple with parallelism.
[2] Except for special-purpose fields like graphics where a general-purpose compiler probably hasn't been given the smarts because it's not worth it to the general-purpose compiler vendor.
Graphics, computer vision, physics, cryptography, etc. ... you name it. If I have a machine with X levels of hardware parallelism, I want to take advantage of all of them as they fit the algorithm.
Sure, a library for some of these special purpose areas could be valuable.
Because gcc is not an optimizing compiler. That's not its focus. It's getting better but I would encourage you to explore compilers from Intel, PGI and Pathscale. All of these easily beat gcc in terms of performance.
They beat gcc for sure, except they still don't beat hand-written code in some simple cases. Let ICC optimize a 2D convolution kernel, and it'll still be 5x too slow, even though the optimisation required to do it is perfectly doable by a machine.
The correct answer here is to file a bug with the compiler vendor. There's no way a library is going to be able to keep up with advances in compiler technology. That's because a library-based solution is essentially based on pattern-matching while a compiler solution is based on a more powerful dataflow and dependence abstraction.
Probably the biggest mistake academic researchers do is compare their results to gcc. It's just not a valid comparison.
It's just what the rest of the community use anyway ...
No, it isn't. It's what some of the community uses. A compiler/software performance paper that compares results to gcc is next to useless.
Well, one would expect the author of a library to enumerate what it can do. If you have these kinds of constructs, that's great!
*points at thread title* The fact the thread was about structuring them *before* the proposal may be a good hint.
That approach seems backward to me. Why would one try to figure out how to structure a library before one even knows what's in it?
Your claim was that compilers would not be able to handle it. I countered the claim. I'm interested in any other examples you have.
As I said : convolution kernel.
Ok, so you found a good example where a compiler produced suboptimal code but admitted it would be readily vectorizable automatically. Again, the answer is to fix the compiler.
I can't find a reference for HTA specifically, but my guess from the name of the concept tells me that it's probably already covered elsewhere, as you say.
David Padau is the key name.
Ok, thanks for the info.
A Cell-based development library wouldn't be terribly useful. Something that provided the same idioms across architectures would be. As you say, the architecture abstraction is important.
Go say that to half the video game companies that still use the Cell as a strange PPC and neglect the SPEs.
We're talking about Boost here. Cell libraries for specific development communities could be very useful. But I don't think it's appropriate for Boost.
I look forward to it. I think there's value here but we should figure out the focus and drop any unnecessary things.
The focus is what I said: a homogeneous interface to SIMD extensions, no more, no less.
I guess I need to see more of what you mean. I still don't have a good picture of what this would provide beyond current or near-future compiler capabilities. Doesn't gcc provide such extensions already? -Dave
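On Dave's closing question: yes, gcc (and clang) expose generic vector types, shown below. They cover element-wise arithmetic nicely, but they are compiler-specific, and idioms like horizontal reductions still need builtins or manual lane access, which is part of the gap a portable Boost.SIMD would fill (compiles with a reasonably recent gcc or clang):

```cpp
#include <cassert>

// gcc/clang generic vector extension: a 16-byte vector of four floats.
typedef float v4sf __attribute__((vector_size(16)));

inline float gcc_vector_expr_min()
{
    v4sf k = {5.0f, 5.0f, 5.0f, 5.0f};
    v4sf r = (2.0f * k) / (3.0f - k);  // element-wise, no intrinsics

    // No portable horizontal min here: fall back to per-lane access.
    float m = r[0];
    for (int i = 1; i < 4; ++i)
        if (r[i] < m) m = r[i];
    return m;
}
```

The arithmetic lowers to SSE/Altivec automatically, but the reduction does not, and nothing here works under Visual Studio: both points raised earlier in the thread.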

The correct answer here is to file a bug with the compiler vendor. There's no way a library is going to be able to keep up with advances in compiler technology. That's because a library-based solution is essentially based on pattern-matching while a compiler solution is based on a more powerful dataflow and dependence abstraction.
I can't see how static flow analysis on a stripped-down intermediate representation could go further than code restructuring based on intentional code analysis. There was a typo: it's DAVID PADUA, not PADAU.

On Monday 19 January 2009 15:01, Joel Falcou wrote:
The correct answer here is to file a bug with the compiler vendor. There's no way a library is going to be able to keep up with advances in compiler technology. That's because a library-based solution is essentially based on pattern-matching while a compiler solution is based on a more powerful dataflow and dependence abstraction.
I can't see how static flow analysis on a stripped down intermediate representation could go further than code restructuring based on intentional code analysis.
Well, it all depends on how stripped-down it is. Some intermediate forms are better for this than others. Again, I'm talking low-level parallelism here. Higher-level constructs are much more difficult to recognize mechanically.
There was a typo : it's DAVID PADUA not PADAU.
I figured that might be the case. :) -Dave

on Thu Jan 15 2009, Frank Mori Hess <frank.hess-AT-nist.gov> wrote:
Isn't SIMD-type stuff what valarray was designed for?
The fact is that nobody really knows what valarray was designed for. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Joel Falcou wrote:
* Arch.Memory : Provides free functions and STL-compatible allocators for aligned memory allocation, and a macro to add aligned new/delete support to classes. Basically a platform-independent wrapper around posix_memalign and its ilk.
It is possible to do better than posix_memalign. When you use posix_memalign, you do not provide the alignment information when freeing the memory, which reduces its efficiency. In C++, however, you always know the alignment of the memory when you free it.
* Arch.Types: some low-level traits and generation classes, including things like: -> make_integer, which builds an integer type able to hold at most N bytes with a given signedness, e.g. make_integer<3,signed>::type returns int32_t.
I'm fairly sure there is already a library that does just that.
-> typed_bitfield, which wraps a type T and a properly sized byte array and provides byte access via operator[]. Useful for decomposing large types into bytes for low-level operations.
aligned_storage from type traits?
-> padded<T,N>, which provides an Envelope/Letter class around type T whose sizeof is padded to a multiple of 2^N. It reuses compressed_pair for empty-base-class optimisation.
Again, aligned_storage?
* Arch.SIMD : a SIMD computation library that provides a small scale binding to internal representation of a SIMD register on SIMD-enabled machine. Using Proto, a small DSEL is provided so one can write code like :
vec<float> k = {5,5,5,5},r; r = ((2*k)/(3-k)).min();
How would it work without writing vec<float, 4> k = {5, 5, 5, 5}; ? How can the minimum of a vector yield a vector? And isn't that already provided by Boost.uBlas?

Mathias Gaunard a écrit :
It is possible to do better than posix_memalign. When you use posix_memalign, you do not provide the alignment information when freeing the memory. This does reduce the efficiency of posix_memalign.
In C++, however, you always know the alignment of the memory when you free it.
I just checked this out indeed. Good to know for later.
I'm fairly sure there is already a library that does just that.
Yes I just rediscovered it myself too.
-> typed_bitfield, which wraps a type T and a properly sized byte array and provides byte access via operator[]. Useful for decomposing large types into bytes for low-level operations.
aligned_storage from type traits?
Hmmm, guess I have to check this, but I don't think it does what I think it does. I'll have a look. Mind you, if it's already done, then that's less for me to maintain.
How would it work without writing vec<float, 4> k = {5, 5, 5, 5}; ?
How can the minimum of a vector yield a vector? And isn't that already provided by Boost.uBlas?
Guess I typed this too early or too late in the evening/morning. More details: vec<T> encapsulates ONE SIMD register, not an array. So the type is enough to know, for a given platform, how many components fit in the SIMD type (basically size of vector = 16/sizeof(T) for a 128-bit-wide SIMD extension, which is of course handled correctly for narrower or wider extensions). Then again, there is a float missing before r, as the min is indeed a scalar. As for uBlas: uBlas handles arrays of data and may use SIMD extensions. This library provides a POD-like type for SIMD values.

Joel Falcou wrote:
More details: vec<T> encapsulates ONE SIMD register, not an array. So the type is enough to know, for a given platform, how many components fit in the SIMD type (basically size of vector = 16/sizeof(T) for a 128-bit-wide SIMD extension, which is of course handled correctly for narrower or wider extensions).
So it's not portable at all. How great.
Then again, there is a float missing before r, as the min is indeed a scalar.
You had declared r to be a vec<float>.
As for uBlas: uBlas handles arrays of data and may use SIMD extensions. This library provides a POD-like type for SIMD values.
How is that any different?

Mathias Gaunard a écrit :
So it's not portable at all. How great.
Do you really think I can't hide this somehow? This kind of thing *is* abstracted into the SIMD traits, which use compile-time defines to know how large a vector is on each platform. Thanks for thinking I am that incompetent.
You had declared r to be a vec<float>.
Which is called a typo. As I said, it's meant to be float r;
How is that any different?
Because sometimes you want to perform SIMD operations on something other than an array of values. Moreover, it's not like the uBlas array interface is sacred or anything. This library is aimed at building platform-independent SIMD algorithms handling a SIMD vector as a POD, so having a std::vector<vec<T>>, a std::valarray<vec<T>> or whatever can be done depending on the interface you need for your container.

Joel Falcou wrote:
Do you really think I can't hide this somehow? This kind of thing *is* abstracted into the SIMD traits, which use compile-time defines to know how large a vector is on each platform. Thanks for thinking I am that incompetent.
How are you supposed to code in a portable way? If I write vec<float> = {5., 5., 5., 5.}; the code will only compile if the SIMD register is 4 floats big on that architecture (as you said yourself). If it was vec<float, 4> = {5., 5., 5., 5.}; the library could actually fall back to something else to make the code work...
You had declared r to be a vec<float>.
Which is called a typo. As I said, it's meant to be float r;
I was clarifying. You said you didn't declare 'r'.
How is that any different?
Because sometimes you want to perform SIMD operations on something other than an array of values.
Like what? That's what it is as far as I can see. That's what a vector is. N times the same type. And SIMD performs the same operation on all elements. I really don't see the difference. I don't see much difference with uBlas either, except yours is a POD, which doesn't bring anything useful as far as I can see.

uBlas doesn't do any explicit vectorization. In an ideal world the compiler would handle this and emit optimal code for whatever architecture. Back in the real world, speed issues make uBlas basically unusable for my work. Better compiler technology would help, BUT, I think that it is a mistake to simply blame poor compilers. A high-level library like uBlas has a great deal of compile-time knowledge about data layout and computational structure that a compiler optimization pass on IR code does not. In such circumstances I think it is reasonable and logical to shift some effort from the compiler to library-side code generation.

It's worth looking at the Eigen2 library (http://eigen.tuxfamily.org/), which is what I now use for high-performance linear algebra. It has its own small-scale binding to SIMD ops (SSE2, Altivec, or fall-back to C) and expresses all vectorizable calculations in terms of "packets," basically generalized SIMD registers. For SSE, a packet of floats is 4 floats, a packet of doubles is 2 doubles. If the platform doesn't have vector instructions, then a float packet is just a single float.

I think this is a useful abstraction, and it demonstrates the difference between something like Boost.SIMD and uBlas. The SIMD library is concerned with operating with maximal efficiency on fixed-size "packets" of data, where the size of a packet is determined by the data type and available instruction set. This can be used as a building block by, say, uBlas in operating on general arrays of data.

Although I would be very happy to use Boost.SIMD directly as an end-user, I think that its greatest impact would be in other libraries. uBlas, dynamic bitset, GIL, and Math are Boost libraries which spring to mind as potentially benefiting enormously from a good cross-platform wrapper for SIMD operations.
In fact, I had been thinking recently about writing my own version of a Boost.SIMD library based on Proto and Eigen2's packet model, but I'm very happy that Joel has taken the lead and actually produced some working code. I think a good Boost.SIMD library would be tremendously exciting, and I'm eager to see some code in the Vault. I'm a little surprised at the apparent hostility on the list so far. -Patrick

On Sun, Jan 18, 2009 at 3:59 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Joel Falcou wrote:
Do you really think I can't hide this somehow?
This kind of thing *is* abstracted into the SIMD traits, which use compile-time defines to know how large a vector is on each platform. Thanks for thinking I am that incompetent.
How are you supposed to code in a portable way?
If I write vec<float> = {5., 5., 5., 5.};
the code will only compile if the SIMD register is 4 floats big on that architecture (as you said yourself).
If it was vec<float, 4> = {5., 5., 5., 5.}; the library could actually fall back to something else to make the code work...
You had declared r to be a vec<float>.
Which is called a typo. As I said, it's meant to be float r;
I was clarifying. You said you didn't declare 'r'.
How is that any different?
Because sometimes you want to perform SIMD operations on something other than an array of values.
Like what? That's what it is as far as I can see. That's what a vector is. N times the same type. And SIMD performs the same operation on all elements.
I really don't see the difference. I don't see much difference with uBlas either, except yours is a POD, which doesn't bring anything useful as far as I can see.

Le Lun 19 janvier 2009 07:49, Patrick Mihelich a écrit :
uBlas doesn't do any explicit vectorization. In an ideal world the compiler would handle this and emit optimal code for whatever architecture. Back in the real world, speed issues make uBlas basically unusable for my work. Better compiler technology would help, BUT, I think that it is a mistake to simply blame poor compilers. A high-level library like uBlas has a great deal of compile-time knowledge about data layout and computational structure that a compiler optimization pass on IR code does not. In such circumstances I think it is reasonable and logical to shift some effort from the compiler to library-side code generation.
Exactly, and that's the same need I had for my own linear algebra library.
I think this is a useful abstraction, and demonstrates the difference between something like Boost.SIMD and uBlas. The SIMD library is concerned with operating with maximal efficiency on fixed-size "packets" of data, where the size of a packet is determined by the data type and available instruction set. This can be used as a building block by, say, uBlas in operating on general arrays of data.
Yes, and that's why I proposed it to the list. I had needs that I covered as well as I could, but I'm sure seasoned Boost devs and users will find things I didn't catch at first. Hence the proposal I made.
Although I would be very happy to use Boost.SIMD directly as an end-user,
You made my day :)
I think that its greatest impact would be in other libraries. uBlas, dynamic bitset, GIL, and Math are Boost libraries which spring to mind as potentially benefiting enormously from a good cross-platform wrapper for SIMD operations.
Yup, crawling the ML showed that lots of users have had performance issues with those sometimes.
In fact, I had been thinking recently about writing my own version of a Boost.SIMD library based on Proto and Eigen2's packet model, but I'm very happy that Joel has taken the lead and actually produced some working code.
I plan to do some clean-up, "boostify" my code, and upload something either to the Vault or to my webpage in the upcoming weeks. What I seek is experience and feedback from users of "exotic" compilers I can't access (Borland and such) and from Visual Studio users, because I had a hard time getting a proper abstraction going with VC++.
I think a good Boost.SIMD library would be tremendously exciting, and I'm eager to see some code in the Vault. I'm a little surprised at the apparent hostility on the list so far.
So am I.

Mathias Gaunard a écrit :
Then that's an implementation issue. It could be solved by submitting a patch.
Ok, I'm really curious to hear what you think is worthy of being a library and what is only worth a mere patch? I thought the Boost ML was mostly untouched by c++.moderated pedantry, but I guess I was wrong.
Have you ever tried optimizing *any* code using vector intrinsics on any kind of platform? Do you really think that if it was that easy, a lot more libraries would *already* provide such a mechanism?

Joel Falcou wrote:
Ok, I'm really curious to hear what you think is worthy of being a library and what is only worth a mere patch?
Why not modify Boost.uBlas so that it uses vectorization, then use Boost.uBlas as a basis to write other vectorization-aware libraries? Introducing a new library that does it then rewriting Boost.uBlas in terms of that library seems to be more work; especially since both libraries would have a very similar interface.
Have you ever tried optimizing *any* code using vector intrinsics on any kind of platform ?
I have used the GCC vector intrinsics, yes. I didn't find it that hard. At least much easier than writing inline assembly. Reading that sentence, however, I'm not too sure we're talking about the same thing. Are you saying those intrinsics aren't enough to generate optimal code?

On Mon, Jan 19, 2009 at 3:10 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Joel Falcou wrote:
Ok, I'm really curious to hear what you think is worthy of being a library and what is only worth a mere patch?
Why not modify Boost.uBlas so that it uses vectorization, then use Boost.uBlas as a basis to write other vectorization-aware libraries?
Modifying Boost.uBlas to use vectorization necessarily implies writing a generic wrapper for SIMD instructions, as well as other support infrastructure including aligned memory allocation. I believe this is more or less what Joel has proposed. At least, this is the only sensible way I see to vectorize uBlas. It is the way that has worked well for other linear algebra libraries. Perhaps you can suggest a specific alternative approach?
Using Boost.uBlas as a basis for other vectorization-aware libraries doesn't make sense to me because not all applications that can benefit from SIMD match the linear algebra domain. How would you write Boost.Dynamic Bitset in terms of uBlas? Why would uBlas even expose a bit vector or logical operations? It doesn't fit the domain. However, it is possible to write both libraries in terms of low-level data packets with efficient operations, and this is what Boost.SIMD would provide.
Have you ever tried optimizing *any* code using vector intrinsics on any kind of platform?
I have used the GCC vector intrinsics, yes. I didn't find it that hard. At least much easier than writing inline assembly.
Well, that isn't a very high bar :). Have you tried customizing a function for different versions of SSE, or gone back later and changed what data type you're using? My experience with using vector intrinsics is that there is abundant room for it to be easier. -Patrick

On Mon, Jan 19, 2009 at 8:09 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Patrick Mihelich wrote:
uBlas doesn't do any explicit vectorization.
Then that's an implementation issue. It could be solved by submitting a patch.
Well, that's a pretty glib statement. I wonder if you have any idea what such a "patch" (I would say rewrite) would entail? Eigen2 was written from the ground up around the packet concept, which allows it to portably leverage SIMD instruction sets when available. uBlas I think would need to be restructured similarly to take advantage of SSE, Altivec, etc. in a manageable way. This is certainly doable, but there's a reason no one has actually done it yet: writing SIMD-enabled code for multiple architectures is currently really unpleasant. A SIMD wrapper library would solve this unpleasantness, and IMO is a prerequisite for a SIMD-accelerated uBlas.

Mathias Gaunard a écrit :
How are you supposed to code in a portable way?
If I write vec<float> = {5., 5., 5., 5.};
the code will only compile if the SIMD register is 4 floats big on that architecture (as you said yourself).
If it was vec<float, 4> = {5., 5., 5., 5.}; the library could actually fallback to something else to make the code work...
Fact is, there is no SIMD extension with fewer than 4 floats per register, and all upcoming ones go wider than that. In this case, if your float vector ends up having 8 elements, it will just follow the rule of the {} constructor and fill the rest with 0. I can see the problem, however, and maybe this notation should be removed in favor of a make_vector function.
participants (11): Andrew Sutton, David A. Greene, David Abrahams, Frank Mori Hess, joaquin@tid.es, joel falcou, Joel Falcou, Mathias Gaunard, Patrick Mihelich, Phil Endecott, Sebastian Redl