Boost Library Testing - a modest proposal - was boost.test regression or behavior change (was Re: Boost.lockfree)
I believe this whole thread started from the changes in Boost.Test such that it can no longer support testing of C++03 compatible libraries. This is totally unrelated to the testing of Boost libraries. Here is what I would like to see:

a) Local testing by library developers.

Of course library developers need this in order to develop and maintain libraries. Currently we have this, and it has worked quite well for many years. Making Boost.Test require C++11+ throws a monkey wrench into things for the libraries which use it. But that's only temporary. Libraries whose developers feel they need to maintain compatibility with C++98 can move to lightweight test with relatively little effort.

Developers who are concerned that the develop branch is a "soup" can easily isolate themselves from this by testing against the master branch of all the other libraries. The Boost modularization system with git has made this very simple and practical (thank you Beman!). So - not a problem.

b) Testing on other platforms.

We have a system which has worked pretty well for many years. Still, it has some features that I'm not crazy about:

i) It doesn't scale well - as Boost gets bigger, the testing load gets bigger.
ii) It tests the develop branch of each library against the develop branch of all the other libraries - hence we have a testing "soup" where a test might show a failure, but that failure might not be related to the library under test but to some other library. This diminishes the utility of the test results in tracking down problems.
iii) It relies on volunteer testers to select the compilers/platforms to test under. So it's not exhaustive, and the selection might not reflect what people are actually using.

I would like to see us encourage our users to test the libraries that they use. This system would work in the following way:

a) A user downloads/builds Boost.
b) He decides he's going to use libraries X and Y.
c) He runs a tool which tells him which libraries he has to test. This would be the result of a dependency analysis. We have tools which do similar dependency analysis, but they would have to be slightly enhanced to distinguish between testing, deployment, etc. I don't think this would be a huge undertaking given the work that has already been done. (A sketch of such a dependency walk follows below.)
d) He runs the local testing setup on those libraries and their dependencies.
e) He uploads the test results to a dashboard similar, if not identical, to the current one.
f) We would discourage users from just using the Boost libraries without running their own tests. We would do this by exhortation and by refusing to support users who have been unwilling to run and post local tests.

This would give us the following:

a) A scalable testing setup which could handle a Boost containing any number of libraries.
b) All combinations of libraries/platforms/compilers actually being used would be those being tested, and vice versa. We would have complete and efficient test coverage.
c) We would have statistics on which libraries are being used - something we are sorely lacking now.
d) We would be encouraging better software development practices. Some time ago someone posted that he had a problem but couldn't run the tests because "management" wouldn't allocate the time - and this was a critical human life safety app. He escaped before I could wheedle out of him which company he worked for.

And best of all - we're almost there!!!! We'd only need to:

a) slightly enhance the dependency tools we've crafted but aren't actually using;
b) develop a tool to post the local results to a common dashboard;
c) enhance the current dashboard to accept these results.

Robert Ramey
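[Editorial note: for concreteness, here is a minimal sketch of the kind of dependency walk step c) calls for. The dependency map, the library names, and the "used" list are purely illustrative placeholders, not the output of any existing Boost tool; a real tool would populate the map from the modularized repositories.]

    // Illustrative only: compute the transitive closure of library
    // dependencies, i.e. everything a user of X and Y would need to test.
    // The dependency data below is made up, not real Boost data.
    #include <iostream>
    #include <map>
    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    int main()
    {
        // Hypothetical dependency data; a real tool would extract this
        // from the modularized repositories.
        std::map<std::string, std::vector<std::string>> deps = {
            {"serialization", {"core", "type_traits", "mpl"}},
            {"mpl",           {"core", "preprocessor"}},
            {"type_traits",   {"core"}}
        };

        // The libraries the user has decided to use (X and Y).
        std::vector<std::string> used = {"serialization", "type_traits"};

        // Breadth-first walk over the dependency map collects the full
        // set of libraries whose tests should be run.
        std::set<std::string> to_test;
        std::queue<std::string> pending;
        for (auto const& lib : used)
            pending.push(lib);
        while (!pending.empty())
        {
            std::string lib = pending.front();
            pending.pop();
            if (!to_test.insert(lib).second)
                continue;                     // already scheduled
            for (auto const& dep : deps[lib])
                pending.push(dep);
        }

        for (auto const& lib : to_test)
            std::cout << lib << '\n';
    }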
On 09/10/15 18:37, Robert Ramey wrote:
I believe this whole thread started from the changes in Boost.Test such that it can no longer support testing of C++03 compatible libraries. This is totally unrelated to the testing of Boost libraries.
The thread started because boost.test broke something used by other libraries, in a development branch, which raised some misunderstanding about the purpose of this branch and the overall workflow. As a side note, I reverted the changes so that C++11 is not required, except for the features that explicitly state this requirement in the documentation of 1.59 (mainly datasets, but also some forms of test declaration and test assertions).
Here is what I would like to see:
a) local testing by library developers.
Of course library developers need this in order to develop and maintain libraries.
Currently we have this, and it has worked quite well for many years. Making Boost.Test require C++11+ throws a monkey wrench into things for the libraries which use it. But that's only temporary. Libraries whose developers feel they need to maintain compatibility with C++98 can move to lightweight test with relatively little effort.
I do not think that local testing has ever been an issue. The value of the dashboard is in the scalability of the testing wrt. platform/compiler combinations, especially for configurations that are hard to find today (eg. MSVC7) and/or hard to set up (eg. Android). I would also like to emphasize the difference between the unit testing tool (boost.test or lightweight test) and the test driver (bjam):
- The "API" for running the test bed is bjam. This is used by developers and by the regression testing workflow.
- The API for writing tests can be whatever the developer likes; boost.test is just one choice, and it is not directly seen by the regression dashboard.
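[Editorial note: for concreteness, a minimal sketch of the kind of move to lightweight test mentioned above; the test name and checked expression are made up. Whichever framework is used, the driver (bjam) only sees whether the test program builds and what exit status it returns.]

    // Test program 1: a minimal Boost.Test case (single-header variant).
    #define BOOST_TEST_MODULE example
    #include <boost/test/included/unit_test.hpp>

    BOOST_AUTO_TEST_CASE(addition)
    {
        BOOST_CHECK_EQUAL(1 + 1, 2);
    }

    // Test program 2: roughly the same check written against
    // boost/core/lightweight_test.hpp, which compiles as C++98.
    #include <boost/core/lightweight_test.hpp>

    int main()
    {
        BOOST_TEST_EQ(1 + 1, 2);         // records a failure if not equal
        return boost::report_errors();   // non-zero exit status on failure
    }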
Developers who are concerned that the develop branch is a "soup" can easily isolate themselves from this by testing against the master branch of all the other libraries. The Boost modularization system with git has made this very simple and practical (thank you Beman!).
So - not a problem.
Right: this is trivial locally, yet this is not the current workflow of the regression dashboard. The complaints started because of failures in develop, and because of workflow considerations + safe increments. As a developer, I would like to test my library on many runners (and as fast as possible).
b) Testing on other platforms.
We have a system which has worked pretty well for many years. Still it has some features that I'm not crazy about.
i) it doesn't scale well - as boost gets bigger the testing load gets bigger.
I suggested a test procedure based on "stages of quality" in my previous post:
- fast feedback from continuous runners, giving a quick status on some mainstream compilers. Runners may have overlapping configurations/setups, so that the load is balanced somehow;
- scheduling of less available runners on candidates selected from the previous stage. The interface can be advancing a git branch, with those runners picking up that branch only.
ii) it tests the develop branch of each library against the develop branch of all the other libraries - hence we have a testing "soup" where a test might show failure but this failure might not be related to the library under test but some other library. It diminishes the utility of the test results in tracking down problems.
Exactly, but also not being able to track down the history of the versions on the current dashboard is far from helping. As a developer, I would like to see a summary of, eg., the number of failing tests vs. the number of tests, and *per revision*.
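[Editorial note: as a rough illustration of the kind of digest meant here, one could collapse individual test outcomes into failing/total counts keyed by revision. The record layout, field names, and revision ids below are invented, not anything the current dashboard produces.]

    // Sketch: aggregate per-test results into a per-revision summary.
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct test_result
    {
        std::string revision;   // e.g. the git commit that was tested
        bool        passed;
    };

    int main()
    {
        // Invented sample data standing in for uploaded runner results.
        std::vector<test_result> results = {
            {"a1b2c3d", true}, {"a1b2c3d", false}, {"e4f5a6b", true}
        };

        // revision -> (failing, total)
        std::map<std::string, std::pair<int, int>> digest;
        for (auto const& r : results)
        {
            auto& entry = digest[r.revision];
            ++entry.second;
            if (!r.passed)
                ++entry.first;
        }

        for (auto const& d : digest)
            std::cout << d.first << ": " << d.second.first
                      << " failing of " << d.second.second << " tests\n";
    }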
iii) it relies on volunteer testers to select compilers/platforms to test under. So it's not exhaustive and the selection might not reflect that which people are actually using.
I would say that it would be good if each runner published its setup (not the runtime, but how it has been deployed), and maybe a script for being able to reproduce this runner. I am thinking of Docker (and how easy it makes it to fully describe a system); there are tools for the other platforms, though they are more complicated. The idea behind this is to be able to reproduce the runners, so that they are identified not by name (eg. teeks99-08) but by properties (eg. win2012R2-64on64, msvc-12). I am not saying that the current setup should not be followed; I am suggesting a way to address the scalability issue. With that we can have equivalent runners and balance the load.
I would like to see us encourage our users to test the libraries that they use. This system would work in the following way.
If by users you mean the post-release /end users/, are you expecting post-release feedback? I am not sure I understand. BTW, do we have numbers on how many people download a release candidate?
a) A user downloads/builds boost.
b) he decides he's going to use library X, and Y
c) he runs a tool which tells him which libraries he has to test. This would be the result of a dependency analysis. We have tools which do similar dependency analysis but they would have to be slightly enhanced to distinguish between testing, deployment, etc. I don't think this would be a huge undertaking given the work that has already been done.
d) he runs the local testing setup on those libraries and their dependents.
e) he uploads the test results to a dashboard similar if not identical to the current one.
So we should expect HTML pages with 10,000 columns. I think, again, the information needs to be digested.
f) we would discourage users from just using the boost libraries without running their own tests. We would do this by exhortation and by refusing to support users who have been unwilling to run and post local tests.
Mmmm... sounds bad to me.
This would give us the following:
a) a scalable testing setup which could handle a Boost containing any number of libraries.
And what about just a randomized test? Say we have an ever-growing number of tests N (big), but the acceptance of running all N decreases as N grows. Say we limit each run to M << N tests (say 100), chosen by a uniform shuffle: the feedback would be much faster, the acceptance much higher. On our side, we would need some machinery to digest this information based on the environment setup.
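[Editorial note: a small sketch of that sampling idea, with a made-up test list and an arbitrary budget of M = 100: shuffle the full set uniformly and run only the first M entries, so that across many runs every test is exercised with equal probability.]

    // Sketch: pick M of N tests uniformly at random for one run.
    #include <algorithm>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    int main()
    {
        // Placeholder names standing in for the full test suite (N tests).
        std::vector<std::string> all_tests;
        for (int i = 0; i < 1000; ++i)
            all_tests.push_back("test_" + std::to_string(i));

        std::size_t const M = 100;                 // per-run budget, M << N
        std::mt19937 rng(std::random_device{}());  // fresh seed each run
        std::shuffle(all_tests.begin(), all_tests.end(), rng);

        // The first M entries form this run's uniformly sampled subset.
        for (std::size_t i = 0; i < M && i < all_tests.size(); ++i)
            std::cout << all_tests[i] << '\n';
    }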
b) All combinations of libraries/platforms/compilers actually being used would be those being tested and vice versa. We would have complete and efficient test coverage.
c) We would have statistics on libraries being used. Something we are sorely lacking now.
I am wondering why this would be relevant.
d) We would be encouraging better software development practices. Some time ago someone posted that he had a problem but couldn't run the tests because "management" wouldn't allocate the time - and this was a critical human life safety app. He escaped before I could wheedle out of him which company he worked for.
And best of all - We're almost there !!!! we'd only need to:
a) enhance slightly the dependency tools we've crafted but aren't actually using.
The dependencies are indirectly tested, I would say, so testing the dependencies is a /nice to have/; if I am using X, which depends on Y, testing X should in most cases be enough. If it happens that some breakage goes unnoticed through the tests of X, having tested Y might have helped, but this is not trivial: the coverage of X should be improved instead.
b) develop a tool to post the local results to a common dashboard c) enhance the current dashboard to accept these results.
Several tools exist already, eg. CDash together with CMake. Why spend that much effort developing our own tools? Our expectations are not that different from those of many other open or closed source software projects: we want quick and/or wide feedback on the development state of Boost. Raffi
On 10/9/15 10:54 AM, Raffi Enficiaud wrote:
It's hard to tell, but it seems to me that so far we're in agreement.
b) Testing on other platforms.
We have a system which has worked pretty well for many years. Still it has some features that I'm not crazy about.
i) it doesn't scale well - as boost gets bigger the testing load gets bigger.
I suggested a test procedure based on "stages of quality" in my previous post:
- fast feedback from continuous runners, giving a quick status on some mainstream compilers. Runners may have overlapping configurations/setups, so that the load is balanced somehow;
- scheduling of less available runners on candidates selected from the previous stage. The interface can be advancing a git branch, with those runners picking up that branch only.
This is a pretty elaborate setup, and also fairly ambiguous to me. It seems like implementing such a thing would be quite an effort - by whom, I don't know.
ii) it tests the develop branch of each library against the develop branch of all the other libraries
...
Exactly,
OK - so we're in agreement about this.
but also not being able to track down the history of the versions on the current dashboard is far from helping. As a developer, I would like to see a summary of, eg., the number of failing tests vs. the number of tests, and *per revision*.
I don't think such information would be useful to me. But maybe that's just me.
iii) it relies on volunteer testers to select compilers/platforms to test under. So it's not exhaustive and the selection might not reflect that which people are actually using.
I would say that it would be good if each runner published its setup (not the runtime, but how it has been deployed), and maybe a script for being able to reproduce this runner. I am thinking of Docker (and how easy it makes it to fully describe a system); there are tools for the other platforms, though they are more complicated.
The idea behind this is to be able to reproduce the runners, so that they are identified not by name (eg. teeks99-08) but by properties (eg. win2012R2-64on64, msvc-12). I am not saying that the current setup should not be followed; I am suggesting a way to address the scalability issue. With that we can have equivalent runners and balance the load.
Sounds very ambitious and complex.
I would like to see us encourage our users to test the libraries that they use. This system would work in the following way.
If by users you mean the post-release /end users/, are you expecting post-release feedback? I am not sure I understand.
This suggestion doesn't address pre-release issues. Frankly, except for a few issues (develop vs master) cited above, I don't think they are a big problem, and I think the current testing setup is adequate. But this system can really only test the combinations that the testers select. The problem comes up after release, when one gets bug reports from users of the released library. I would like to get these sooner rather than later, and on the platforms that people are actually using. I often get issues reported which are related to the user's current configuration, but the user hasn't run the latest tests on his current setup, so all I get is a complaint. If the user ran the tests on the libraries which he's using (which he should be doing in any case!), I'd have a lot more to work with, and bugs would get discovered and addressed sooner with less effort. Of course, if users want to switch to the develop branch on those libraries they use and run the tests pre-release - that would be great. But I'm not really expecting many people to do that.
BTW, do we have numbers on how many people download a release candidate?
I'm guessing we do.
a) A user downloads/builds boost.
...
So we should expect HTML pages with 10,000 columns. I think, again, the information needs to be digested.
LOL - that would be great!!! Of course, if such a proposal were to be so wildly successful as to create such a problem, we'd have to upgrade our archiving and querying of test results. I'm not losing any sleep over this issue right now.
f) we would discourage users from just using the boost libraries without running their own tests. We would do this by exhortation and by refusing to support users who have been unwilling to run and post local tests.
Mmmm... sounds bad to me.
LOL - we can't agree on everything.
This would give us the following:
a) a scalable testing setup which could handle a Boost containing any number of libraries.
And what about just a randomized test?
I don't see how that would be better.
c) We would have statistics on libraries being used. Something we are sorely lacking now.
I am wondering why this would be relevant.
OK - it's not really relevant as far as testing is concerned. This information would become available as a side effect. But it would be extremely useful to know that library X has N users. This would help indicate which libraries might be considered for elimination from the standard Boost distribution. If something like "boost/shared_ptr" is used by only 10 people, it would be interesting to know. If the serialization library is only used by 10 people, it would be very interesting to know. Etc.
And best of all - We're almost there !!!! we'd only need to:
a) enhance slightly the dependency tools we've crafted but aren't actually using.
The dependencies are indirectly tested I would say, so testing the dependencies is a /nice to have/, but if I am using X that depends on Y, testing X should in most cases be enough.
Let's suppose I'm going to use some Boost library X, and Y through a dependency, as part of the aircraft control system of the next 400-person passenger plane. Wouldn't you feel safer if all the code used in the system were tested? Would you say it's good enough to test only some of it? And if you can run the tests almost for free, is there any reason you would skip it? Basically, if I'm going to deploy X in my product and it depends on Y and Z, all of those should be tested in my environment. And there's absolutely no reason not to do this. OK - I didn't explain this well.
b) develop a tool to post the local results to a common dashboard c) enhance the current dashboard to accept these results.
Several tools exist already, eg. CDash together with CMake. Why spend that much effort developing our own tools? Our expectations are not that different from those of many other open or closed source software projects: we want quick and/or wide feedback on the development state of boost.
I totally agree. But it's not that simple when you get down to the details. I have personal experience with CDash: I've used it as part of the Safe Numerics library, to be found at www.blincubator.com, and I've recommended its usage and described how to use it at that same web site. So I'm more familiar with it than most. It's pretty tightly coupled to CMake and CTest, and I don't see an obvious way to use it with our bjam test setup. How about replacing bjam with CMake? Interesting, but not simple either, as they don't really match in capability. And the test reporting isn't quite up to our needs. Having a bit of experience with all this in the context of Boost, I still believe the path I've proposed is the best one. Robert Ramey
On 09 Oct 2015, at 21:47, Robert Ramey wrote:
On 10/9/15 10:54 AM, Raffi Enficiaud wrote:
It's hard to tell, but it seems to me that so far we're in agreement.
b) Testing on other platforms.
We have a system which has worked pretty well for many years. Still it has some features that I'm not crazy about.
i) it doesn't scale well - as boost gets bigger the testing load gets bigger.
I suggested a test procedure based on "stages of quality" in my previous post:
- fast feedback from continuous runners, giving a quick status on some mainstream compilers. Runners may have overlapping configurations/setups, so that the load is balanced somehow;
- scheduling of less available runners on candidates selected from the previous stage. The interface can be advancing a git branch, with those runners picking up that branch only.
This is a pretty elaborate setup, and also fairly ambiguous to me. It seems like implementing such a thing would be quite an effort - by whom, I don't know.
I am not sure any real solution to the testing needs of Boost is simple. So it may be that some elaborate setup is needed. Let's hope not.
ii) it tests the develop branch of each library against the develop branch of all the other libraries
...
Exactly,
OK - so we're in agreement about this.
but also not being able to track down the history of the versions on the current dashboard is far from helping. As a developer, I would like to see a summary of, eg., the number of failing tests vs. the number of tests, and *per revision*.
I don't think such information would be useful to me. But maybe that's just me.
iii) it relies on volunteer testers to select compilers/platforms to test under. So it's not exhaustive and the selection might not reflect that which people are actually using.
I would say that it would be good if each runner published its setup (not the runtime, but how it has been deployed), and maybe a script for being able to reproduce this runner. I am thinking of Docker (and how easy it makes it to fully describe a system); there are tools for the other platforms, though they are more complicated.
The idea behind this is to be able to reproduce the runners, so that they are identified not by name (eg. teeks99-08) but by properties (eg. win2012R2-64on64, msvc-12). I am not saying that the current setup should not be followed; I am suggesting a way to address the scalability issue. With that we can have equivalent runners and balance the load.
Sounds very ambitious and complex.
I would like to see us encourage our users to test the libraries that they use. This system would work in the following way.
If by users you mean the post-release /end users/, are you expecting post-release feedback? I am not sure I understand.
This suggestion doesn't address pre-release issues. Frankly, except for a few issues (develop vs master) cited above I don't think they are a big problem and I think the current testing setup is adequate.
But this system can really only test the combinations that the testers select. The problem comes up after release, when one gets bug reports from users of the released library. I would like to get these sooner rather than later, and on the platforms that people are actually using. I often get issues reported which are related to the user's current configuration, but the user hasn't run the latest tests on his current setup, so all I get is a complaint. If the user ran the tests on the libraries which he's using (which he should be doing in any case!), I'd have a lot more to work with, and bugs would get discovered and addressed sooner with less effort.
Of course if users want to switch to develop branch on those libraries they use and run the tests pre-release - that would be great. But I'm not really expecting many people to do that.
That is really the big question. Is it realistic to hope for sufficient numbers of users, with interesting configurations that are not already very well tested, to set up test runners like this? I do not think it is something users would not seriously consider if challenged, but in practice this depends on practical things, such as:
- time to set up and maintain the test runner vs. just running tests privately on what you use;
- hardware and software availability, including licenses, in an environment which for security reasons often needs to be isolated from the rest of the local development environment;
- willingness to publish their use of Boost. This may be more important than many expect, as this is not a boolean state; there is a lot of fuzziness, and uncertain users are harder to convince.
Clearly, if there are ways of improving this, so it is simpler, easier, less costly, more private, then there may be good hope. If it is not improved, I am a bit sceptical that a sufficient number of interesting new test runners will arrive. But I guess it is hard to know if you do not try. — Bjørn
participants (3):
- Bjørn Roald
- Raffi Enficiaud
- Robert Ramey