[boost::ublas] low level of performance with xlC
I'd like to get your point of view on the following topic: the runtime performance of binaries built with the IBM VisualAge C++ compilers.

A colleague and I have recently been investigating various performance matters on a subset of platforms: Power 4 and Power 5 architectures running different flavors of the AIX operating system, using binaries built with VisualAge C/C++. We built and ran the boost::uBlas regression tests using either the VisualAge C/C++ v.9 or the GCC 4.0 compilers. Using similar levels of optimization, we observed that the GCC binaries ran much faster. The same tests were performed on a Linux 64-bit machine using a standard Core2 processor and GCC 4.2, and performance was again much higher.

Are we missing something here, something that could allow us to get a decent level of performance with the IBM compilers?

Thanks,
Eloi

--
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
Company Fax: +32 10 454 626
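For readers who want to reproduce this kind of compiler comparison outside the full regression suite, a small stand-alone timing harness can help. The sketch below is not taken from ublas/bench2: `dense_prod`, `BenchResult` and `measure` are hypothetical names, the kernel is a plain dense matrix-vector product on C-style arrays, and it assumes a C++11 compiler for `std::chrono`. It repeats the kernel until a minimum wall-clock time has elapsed, because a coarse timer can otherwise report zero seconds for a short run, which makes the derived Mflops figure infinite.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Dense product y = A * x on plain arrays (A is n x n, row-major).
// Each call performs 2 * n * n floating-point operations.
void dense_prod(const std::vector<double>& a, const std::vector<double>& x,
                std::vector<double>& y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j) sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

// Hypothetical result record for this sketch.
struct BenchResult {
    double seconds;    // wall-clock time of the last timed batch
    double mflops;     // rate derived from that batch
    std::size_t runs;  // number of kernel calls in that batch
    double checksum;   // y[0], returned so the result is actually used
};

// Doubles the number of kernel calls until one batch takes at least
// min_seconds, so the elapsed time is never reported as zero.
BenchResult measure(std::size_t n, double min_seconds) {
    assert(n > 0 && min_seconds > 0.0);
    std::vector<double> a(n * n, 1.0), x(n, 2.0), y(n, 0.0);
    std::size_t runs = 1;
    for (;;) {
        const auto start = std::chrono::steady_clock::now();
        for (std::size_t r = 0; r < runs; ++r) dense_prod(a, x, y, n);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        if (elapsed.count() >= min_seconds) {
            const double flops = 2.0 * static_cast<double>(n) *
                                 static_cast<double>(n) *
                                 static_cast<double>(runs);
            return {elapsed.count(), flops / elapsed.count() / 1e6, runs, y[0]};
        }
        runs *= 2;  // batch too short to time reliably: try a longer one
    }
}
```

Calling `measure(64, 0.05)` from a program built once with each compiler and its usual flags gives one finite Mflops figure per build. The absolute values depend on the machine and are only meaningful relative to each other.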
Hi there,

I guess this topic isn't very popular, since few people in the community (want to) use this compiler, but I think the detailed numbers below (from the ublas/bench2 tests) are quite interesting for those who are willing to use AIX/Power5 platforms.

Here is a brief view of the benches we ran (the output below comes from the ublas/bench2 tests and covers outer_prod, prod (matrix, vector) and matrix + matrix):

A/ was performed on a p505 IBM server (one dual-core p5+ processor running at 1.9 GHz) with either xlC (v.9) or g++ (4.2);
B/ was performed on a Linux 64-bit platform (one Core2 dual-core processor running at 2.4 GHz) with g++ (4.2).

Briefly, and especially for coordinate_matrix and compressed_matrix:
- the GCC Linux 64-bit platform outperforms the p5 server platform running AIX, whether the AIX binaries are built with g++ or xlC;
- on AIX, the GCC binaries are much more efficient than the VisualAge ones.

A/1) xlC using the following command line:
"xlC_r -v -qphsinfo -qalign=full -qstrict -qinline -qeh -qrtti -O2 -qxflag=erratadce -q64 -qarch=pwr5 -qtune=pwr5 -qmaxmem=-1 -qtemplateregistry -DFC_LINK -DBOOST_UBLAS_ENABLE_PROXY_SHORTCUTS -DBOOST_LIB_DIAGNOSTIC -DBOOST_DISABLE_THREADS -DBOOST_ALL_NO_LIB -DNDEBUG -q64"

    bench_2
    outer_prod
      C array                                     elapsed: 0.009996 s, 515.19 Mflops
      compressed_matrix, compressed_vector safe   elapsed: 1.22936 s, 4.18904 Mflops
      compressed_matrix, compressed_vector fast   elapsed: 0.959484 s, 5.3673 Mflops
      coordinate_matrix, coordinate_vector safe   elapsed: 1.34932 s, 3.81663 Mflops
      coordinate_matrix, coordinate_vector fast   elapsed: 1.03945 s, 4.9544 Mflops
    prod (matrix, vector)
      C array                                     elapsed: 0 s, INF Mflops
      compressed_matrix, compressed_vector safe   elapsed: 0.609609 s, 7.03981 Mflops
      compressed_matrix, compressed_vector fast   elapsed: 0.419836 s, 10.2219 Mflops
      coordinate_matrix, coordinate_vector safe   elapsed: 1.00945 s, 4.25137 Mflops
      coordinate_matrix, coordinate_vector fast   elapsed: 0.799695 s, 5.36646 Mflops
    matrix + matrix
      C array                                     elapsed: 0 s, INF Mflops
      compressed_matrix safe                      elapsed: 1.96893 s, 2.61556 Mflops
      compressed_matrix fast                      elapsed: 1.64907 s, 3.12287 Mflops
      coordinate_matrix safe                      elapsed: 3.4183 s, 1.50655 Mflops
      coordinate_matrix fast                      elapsed: 3.06852 s, 1.67828 Mflops

A/2) g++ using the following command line:
"g++ -pthread -maix64 -Wall -Wno-inline -ftemplate-depth-100 -finline-functions -O1 -DAdd_ -DMALLOC_RET_VOID=1 -DUSE_STDARG=1 -DHAVE_STDARG_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRING_H=1 -DHAVE_STDLIB_H=1 -DUSE_STDARG -DNOMINMAX -DBOOST_UBLAS_ENABLE_PROXY_SHORTCUTS -DBOOST_LIB_DIAGNOSTIC -DBOOST_DISABLE_THREADS -DBOOST_ALL_NO_LIB -DNDEBUG -mcpu=power5"

    bench_2
    outer_prod
      C array                                     elapsed: 0.009887 s, 520.87 Mflops
      compressed_matrix, compressed_vector safe   elapsed: 0.58977 s, 8.73195 Mflops
      compressed_matrix, compressed_vector fast   elapsed: 0.289889 s, 17.7649 Mflops
      coordinate_matrix, coordinate_vector safe   elapsed: 0.509802 s, 10.1016 Mflops
      coordinate_matrix, coordinate_vector fast   elapsed: 0.209922 s, 24.5322 Mflops
    prod (matrix, vector)
      C array                                     elapsed: 0.009996 s, 429.325 Mflops
      compressed_matrix, compressed_vector safe   elapsed: 0.279718 s, 15.3424 Mflops
      compressed_matrix, compressed_vector fast   elapsed: 0.129946 s, 33.0255 Mflops
      coordinate_matrix, coordinate_vector safe   elapsed: 0.439832 s, 9.75721 Mflops
      coordinate_matrix, coordinate_vector fast   elapsed: 0.279893 s, 15.3328 Mflops
    matrix + matrix
      C array                                     elapsed: 0.009998 s, 515.087 Mflops
      compressed_matrix safe                      elapsed: 1.22945 s, 4.18874 Mflops
      compressed_matrix fast                      elapsed: 0.909645 s, 5.66137 Mflops
      coordinate_matrix safe                      elapsed: 1.5694 s, 3.28142 Mflops
      coordinate_matrix fast                      elapsed: 1.26953 s, 4.0565 Mflops

B/ g++ using the following command line:
"g++ -pthread -Wall -Wno-inline -ftemplate-depth-100 -finline-functions -O1 -DAdd_ -DMALLOC_RET_VOID=1 -DUSE_STDARG=1 -DHAVE_STDARG_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRING_H=1 -DHAVE_STDLIB_H=1 -DUSE_STDARG -DNOMINMAX -DBOOST_UBLAS_ENABLE_PROXY_SHORTCUTS -DBOOST_LIB_DIAGNOSTIC -DBOOST_DISABLE_THREADS -DBOOST_ALL_NO_LIB -DNDEBUG"

    bench_2
    outer_prod
      C array                                     elapsed: 0 s, inf Mflops
      compressed_matrix, compressed_vector safe   elapsed: 0.19 s, 27.1044 Mflops
      compressed_matrix, compressed_vector fast   elapsed: 0.15 s, 34.3323 Mflops
      coordinate_matrix, coordinate_vector safe   elapsed: 0.17 s, 30.2932 Mflops
      coordinate_matrix, coordinate_vector fast   elapsed: 0.12 s, 42.9153 Mflops
    prod (matrix, vector)
      C array                                     elapsed: 0.01 s, 429.153 Mflops
      compressed_matrix, compressed_vector safe   elapsed: 0.09 s, 47.6837 Mflops
      compressed_matrix, compressed_vector fast   elapsed: 0.07 s, 61.3076 Mflops
      coordinate_matrix, coordinate_vector safe   elapsed: 0.17 s, 25.2443 Mflops
      coordinate_matrix, coordinate_vector fast   elapsed: 0.14 s, 30.6538 Mflops
    matrix + matrix
      C array                                     elapsed: 0 s, inf Mflops
      compressed_matrix safe                      elapsed: 0.53 s, 9.71668 Mflops
      compressed_matrix fast                      elapsed: 0.48 s, 10.7288 Mflops
      coordinate_matrix safe                      elapsed: 0.71 s, 7.2533 Mflops
      coordinate_matrix fast                      elapsed: 0.65 s, 7.92283 Mflops

I'd appreciate any feedback on this topic, and any explanation for these results.

Thanks,
Eloi

Eloi Gaudry wrote:
I'd like to get your point of view on the following topic: the runtime performance of binaries built with the IBM VisualAge C++ compilers.
A colleague and I have recently been investigating various performance matters on a subset of platforms: Power 4 and Power 5 architectures running different flavors of the AIX operating system, using binaries built with VisualAge C/C++. We built and ran the boost::uBlas regression tests using either the VisualAge C/C++ v.9 or the GCC 4.0 compilers. Using similar levels of optimization, we observed that the GCC binaries ran much faster. The same tests were performed on a Linux 64-bit machine using a standard Core2 processor and GCC 4.2, and performance was again much higher.
Are we missing something here, something that could allow us to get a decent level of performance with the IBM compilers?
Thanks, Eloi
-- Eloi Gaudry Free Field Technologies Axis Park Louvain-la-Neuve Rue Emile Francqui, 1 B-1435 Mont-Saint Guibert BELGIUM Company Phone: +32 10 487 959 Company Fax: +32 10 454 626
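To make the gap between the three builds easier to read, the ratios can be computed directly from the posted figures. The sketch below copies the Mflops values of the compressed_matrix "fast" rows from the three bench_2 outputs above (A/1 xlC on AIX, A/2 g++ on AIX, B g++ on Linux); the `Row` struct and the `ratio` and `print_ratios` helpers are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Mflops figures copied from the posted bench_2 output
// (compressed_matrix rows, "fast" variant).
struct Row {
    const char* bench;
    double xlc_aix;    // A/1: xlC on the p505 server
    double gcc_aix;    // A/2: g++ on the p505 server
    double gcc_linux;  // B:   g++ on the Core2 Linux machine
};

static const Row rows[] = {
    {"outer_prod",            5.3673,  17.7649, 34.3323},
    {"prod (matrix, vector)", 10.2219, 33.0255, 61.3076},
    {"matrix + matrix",       3.12287, 5.66137, 10.7288},
};

// How many times faster the first build is than the second.
double ratio(double faster, double slower) {
    assert(slower > 0.0);
    return faster / slower;
}

// Prints one line per benchmark with both ratios relative to xlC on AIX.
void print_ratios() {
    for (std::size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); ++i) {
        std::printf("%-24s g++/xlC on AIX: %.2fx   Linux g++/xlC: %.2fx\n",
                    rows[i].bench,
                    ratio(rows[i].gcc_aix, rows[i].xlc_aix),
                    ratio(rows[i].gcc_linux, rows[i].xlc_aix));
    }
}
```

The two AIX builds were not compiled at the same optimization level (-O2 for xlC, -O1 for g++), so these ratios compare the builds as posted rather than the compilers at equal settings.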