Regression testing: too many runners?


Guys,
As you may know, I'm working on automatic regression testing of Boost on Android. I've spent several weeks debugging it, and it now seems close to completion - i.e. we'll soon have fully automatic 24x7 regression testing, with all results uploaded to the Boost public FTP. As you can see, some results are already being uploaded on a regular basis - http://www.boost.org/development/tests/master/developer/summary.html (at the moment there are results for just a few libraries, since I've limited testing to those libraries to save time while debugging the testing scripts; as soon as I finish debugging, I'll enable testing of all libraries).
However, I've realized that for thorough testing, the matrix of variants becomes really big. Right now there are nine runners for Android, and that number is in fact very limited. They differ by target ABI (three ARM variants, x86 and x86_64) and Android version (API level 19 - Android 4.4 - and API level 21 - Android 5.0). It also makes sense to test on Android 4.0, 4.1, 4.2 and 4.3, since their market share is still large (see https://developer.android.com/about/dashboards/index.html). Add the MIPS target ABI (not yet included in the testing) and multiply by two to run tests with both default settings and with -std=c++11 - and you get a really big total number of runners.
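Just to make the combinatorics concrete, here is a back-of-the-envelope sketch in Python. The ABI names are the ones we actually build for (see below); the API-level list is only an assumption based on the Android versions mentioned above, not a final runner list:

    # Rough size of the full variant matrix (illustrative only).
    abis = ["armeabi", "armeabi-v7a", "armeabi-v7a-hard",
            "x86", "x86_64", "mips"]
    # Assumed API levels: 14 ~ Android 4.0, 16 ~ 4.1, 17 ~ 4.2,
    # 18 ~ 4.3, 19 ~ 4.4, 21 ~ 5.0
    api_levels = [14, 16, 17, 18, 19, 21]
    dialects = ["default", "-std=c++11"]

    runners = [(a, l, d) for a in abis for l in api_levels for d in dialects]
    print(len(runners))  # 6 * 6 * 2 = 72 runners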
I'm asking the community for advice. It's not a big problem for us to run all these tests in all variants, but I'm unsure whether such a wide table of runners will be acceptable to the Boost community - I'm afraid it will look like a flood. We also publish Android-only results on https://boost.crystax.net/master/developer/summary.html, and we'll definitely display the results from all runners there. Please let me know whether the same approach would work for http://www.boost.org/development/tests/master/developer/summary.html, or whether I should somehow limit the runners uploaded to the Boost FTP.
-- Dmitry Moskalchuk
The only thing I would say is that if the number of runners is too high, there's a danger of developers drowning in data. For me, the most interesting variants are the machine architecture ones (currently most of the tests run on Intel), and the least interesting are the C++ dialect ones, as we already have good coverage there - and I assume there's not much difference between GCC in C++11 mode on Android and GCC on Linux or whatever? John.

The only thing I would say is that if the number of runners is too high, there's a danger of developers drowning in data. For me, the most interesting variants are the machine architecture ones (currently most of the tests run on Intel), and the least interesting are the C++ dialect ones, as we already have good coverage there - and I assume there's not much difference between GCC in C++11 mode on Android and GCC on Linux or whatever?
Well, they shouldn't differ; however, how can you know that without thorough testing? Please note that Android is not Linux. Linux is very stable compared to Android; many (even obvious) things work differently on Android due to its highly non-standard libc (Bionic). Also, the compilers are built from our own sources, which are synced with upstream from time to time, so no one can guarantee that GCC in C++11 mode on Android behaves the same as GCC (the same version) on Linux. Nevertheless, even if C++ dialects don't mean much to you (and maybe to others), testing on different Android versions still produces a really big matrix of variants. What about that? -- Dmitry Moskalchuk

John Maddock wrote:
I assume there's not much difference between GCC in C++11 mode on Android and GCC on Linux or whatever?
Hmmm - I'm not totally convinced of that. I would expect some differences between the ARM architecture (Android) and the Intel architecture (Linux). Robert Ramey

2015-02-08 0:15 GMT+04:00 Dmitry Moskalchuk:
However, I've realized that for thorough testing, the matrix of variants becomes really big. Right now there are nine runners for Android, and that number is in fact very limited. They differ by target ABI (three ARM variants, x86 and x86_64) and Android version (API level 19 - Android 4.4 - and API level 21 - Android 5.0). It also makes sense to test on Android 4.0, 4.1, 4.2 and 4.3, since their market share is still large (see https://developer.android.com/about/dashboards/index.html). Add the MIPS target ABI (not yet included in the testing) and multiply by two to run tests with both default settings and with -std=c++11 - and you get a really big total number of runners.
The MIPS platform is not well tested, so running tests on MIPS and ARM is badly needed. The Android API level is not really essential for Boost - almost all Boost libraries work well on Android 2.3.3. Having the minimal (2.3 or higher) and, optionally, the maximal API level covered in tests seems more than enough. C++11/C++14 may enable some code paths that are not tested with C++98; just enable those on modern compilers (GCC 4.9+, Clang 3.5+).
A few more notes:
* Hard/soft floats may be essential for Boost.Math and compiler testing.
* Some embedded developers compile code without RTTI and without exceptions; testing this use case could be useful (see the sketch below).
* Some of the tests in Thread, Atomic and ASIO only make sense if the host is multicore; running tests on single-core hosts may not be really valuable.
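To make the float-ABI and no-RTTI/no-exceptions notes concrete, here is a minimal sketch in Python that just prints candidate b2 invocations. The GCC flags (-mfloat-abi, -fno-rtti, -fno-exceptions) and the b2 cxxflags= option are real; the "gcc-arm" toolset name and the particular combinations are assumptions for illustration:

    # Illustrative only: print b2 command lines for the extra variants.
    # Assumes a cross-compiler registered as "gcc-arm" in user-config.jam.
    float_abis = ["softfp", "hard"]                    # ARM soft/hard float
    feature_flags = ["", "-fno-rtti -fno-exceptions"]  # embedded use case

    for fabi in float_abis:
        for extra in feature_flags:
            cxxflags = ("-mfloat-abi=%s %s" % (fabi, extra)).strip()
            print('b2 toolset=gcc-arm cxxflags="%s" stage' % cxxflags)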
I'm asking the community for advice. It's not a big problem for us to run all these tests in all variants, but I'm unsure whether such a wide table of runners will be acceptable to the Boost community - I'm afraid it will look like a flood. We also publish Android-only results on https://boost.crystax.net/master/developer/summary.html, and we'll definitely display the results from all runners there. Please let me know whether the same approach would work for http://www.boost.org/development/tests/master/developer/summary.html, or whether I should somehow limit the runners uploaded to the Boost FTP.
I think that more is better. There are a lot of regression testers right now, but I have no feeling of drowning in data: just scroll to the yellow "fail" label and investigate the issue. -- Best regards, Antony Polukhin

On 09/02/15 13:19, Antony Polukhin wrote:
The MIPS platform is not well tested, so running tests on MIPS and ARM is badly needed.
As of now, we run Android tests on ARM (three ABIs - armeabi, armeabi-v7a and armeabi-v7a-hard), x86 and x86_64 targets. We'll add MIPS to the list of active architectures as soon as we fix the MIPS-specific issues that prevent us from running even simple C++ code using GNU libstdc++. We'll also run tests on ARM64 (AArch64) as soon as Google releases an ARM64 emulator, or we figure out some other way to do it (for example, keeping a dedicated Nexus 9 tablet constantly plugged into the CI server).
The Android API level is not really essential for Boost - almost all Boost libraries work well on Android 2.3.3. Having the minimal (2.3 or higher) and, optionally, the maximal API level covered in tests seems more than enough.
In fact, it's important to test with all more-or-less current API levels. The Android libc is very unstable and differs from one Android version to another. We've tried to minimize such differences in the CrystaX NDK, but without thorough testing we can't guarantee it will work on all Android versions. Please note that Boost regression testing is not only testing of Boost itself; it's testing of the CrystaX NDK too. We already have a set of automatic tests where we check many things to ensure POSIX-compatible behavior, but having the Boost tests pass successfully at all API levels will make us even more confident that there are no problems.
C++11/C++14 may enable some code paths that are not tested with C++98; just enable those on modern compilers (GCC 4.9+, Clang 3.5+).
A few more notes:
* Hard/soft floats may be essential for Boost.Math and compiler testing.
* Some embedded developers compile code without RTTI and without exceptions; testing this use case could be useful.
* Some of the tests in Thread, Atomic and ASIO only make sense if the host is multicore; running tests on single-core hosts may not be really valuable.
This makes sense to me. Thank you for pointing it out.
I think that more is better. There are a lot of regression testers right now, but I have no feeling of drowning in data: just scroll to the yellow "fail" label and investigate the issue.
Great! It definitely makes sense to me; I just don't want to violate community norms or make people unhappy. As long as having many runners in the results table is OK with you all, it's OK with me too. -- Dmitry Moskalchuk
participants (4)
- Antony Polukhin
- Dmitry Moskalchuk
- John Maddock
- Robert Ramey