
Aleksey Gurtovoy wrote:
Martin Wille writes:
The people involved in creating the test procedure have put a great deal of effort into it, and the resulting system does its job nicely when it happens to work correctly. However, apparently, the overall complexity of the testing procedure has grown beyond our management capabilities.
Honestly, I don't see what you conclude that from, much less how it's apparent. Having said that...
- Many reports of "something doesn't work right", often related to post-processing, and less than optimal responses to those. We all understand that you and Misha are under time constraints and therefore aren't able to answer immediately. Having only two people who are able to fix these things is one small part of our problems. The fact that people do not know who is responsible for finding out which part of the testing procedure is going wrong seems to indicate a management problem.
- Bugs suddenly go away, and the people involved in tracking them down do not understand what was causing them. This kind of problem is probably related to the build system. I consider this one fairly dangerous, actually.
- We're not really able to tell when a bug started to get reported.
Maybe, we should take a step back and collect all the issues we have and all knowledge about what is causing these issues.
... this is a good idea. Making the issues visible definitely helps in keeping track of where we are and what still needs to be done, and quite possibly in soliciting resources to resolve them.
I'll make a start, I hope others will contribute to the list. Issues and causes unordered (please, excuse any duplicates):
I'll comment on the ones I have something to say about.
- testing takes a huge amount of resources (HD, CPU, RAM, people operating the test systems, people operating the result rendering systems, people coding the test post processing tools, people finding the bugs in the testing system)
True. It's also a very general observation; I don't see how having it here helps us.
I'm under the impression that some people did not know how many resources testing actually consumes. I've seen reactions of surprise when I mentioned the CPU time, HD space, or RAM consumed by the tests. Pleas for splitting test cases were ignored (e.g. random_test).
- the testing procedure is complex
Internally, yes. The main complexity, and _the_ source of fragility, lies in the "bjam results to XML" stage of processing. I'd say it's one of the top 10 issues; solving it would substantially simplify everybody's life.
I agree. This processing step has to deal with the build system (which is complex in itself) and with different compiler output. Other complexity probably stems from having to collect and display test results that reflect different CVS checkout times.
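To make the fragility concrete, here is a minimal sketch of what such a stage has to do. The log formats, toolset names, and function names below are invented for illustration; the real tools are considerably more involved:

```python
import re
import xml.sax.saxutils as saxutils

# Hypothetical, simplified sketch of a "bjam results to XML" stage:
# each toolset prints pass/fail lines in its own format, so the parser
# needs one pattern per toolset (these patterns are illustrative).
PATTERNS = {
    "gcc": re.compile(r"^\*\*passed\*\* (?P<test>\S+)"),
    "msvc": re.compile(r"^PASS: (?P<test>\S+)"),
}

def results_to_xml(toolset, lines):
    """Convert raw build/test log lines into a small XML fragment."""
    pattern = PATTERNS[toolset]
    out = ["<test-run toolset=%s>" % saxutils.quoteattr(toolset)]
    for line in lines:
        m = pattern.match(line)
        if m:
            out.append('  <test name=%s status="pass"/>'
                       % saxutils.quoteattr(m.group("test")))
    out.append("</test-run>")
    return "\n".join(out)
```

Every new toolset, and every change in a compiler's output format, means touching this parsing layer, which is exactly where the fragility shows up.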
- the code-change to result-rendering process takes too long
Not anymore. In any case, there is nothing in the technology used (XSLT) that would make this an inherent bottleneck. It became one because the original implementation of the reporting tools just wasn't written for the volume of data the tools are asked to handle nowadays.
*This* step might be a lot faster now (congrats, this is a *big* improvement). However, there are still other factors that make the code-change to result-rendering process take too long.
- bugs in the testing procedure take too long to get fixed
I think all I can say on this one is said here -- http://article.gmane.org/gmane.comp.lib.boost.devel/119341.
I'm not trying to imply that you or Misha aren't doing enough. However, the fact that only two people have the knowledge of, and access to, the result collection stage of the testing process is a problem in itself.
- incremental testing doesn't work flawlessly
That's IMO another "top 10" issue that hurts a lot.
- deleting tests requires manual purging of old results in an incremental testing environment.
Just an example of the above, IMO.
Right. However, it's one of the more difficult problems to solve. The build system would have to be expanded to make it delete results for tests which don't exist anymore.
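As a sketch of what such a purge could look like: the one-result-file-per-test layout and the function name below are assumptions for illustration, not the actual build system's layout.

```python
import os

def purge_stale_results(results_dir, current_tests):
    """Remove result files for tests that no longer exist.

    Assumed (hypothetical) layout: one "<testname>.xml" result file per
    test in results_dir; current_tests is the set of test names the
    build system currently knows about.
    """
    removed = []
    for name in os.listdir(results_dir):
        base, ext = os.path.splitext(name)
        if ext == ".xml" and base not in current_tests:
            os.remove(os.path.join(results_dir, name))
            removed.append(base)
    return removed
```

The hard part is not the deletion itself but getting a reliable list of currently existing tests out of the build system, which is what would require expanding it.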
- lousy performance of Sourceforge - resource limitations at Sourceforge (e.g. the number of files there)
This doesn't hurt us anymore, does it?
It hurts every time the result collecting stage doesn't work correctly. We're not able to generate our own XML results and upload them, due to the SF resource limits.
- test results aren't easily reproducible. They depend much on the components on the respective testing systems (e.g. glibc version, system compiler version, python version, kernel version and even on the processor used on Linux)
True. There isn't much we can do about it, though, is there?
You're probably right. However, I wanted to mention this point, because someone might have an idea how to address it. I guess it boils down to needing more testers in order to see more flavours of similar environments.
- becoming a new contributor for testing resources is too difficult.
I don't think that's true anymore. How much simpler can it become -- http://www.meta-comm.com/engineering/regression_setup/instructions.html?
Hmm, recent traffic on the testing reflector seemed to indicate it isn't too simple. This might be caused by problems with the build system.
- we're supporting compilers that compile languages significantly different from C++.
Meaning significantly non-conforming compilers or something else?
Yes, significantly non-conforming compilers.
- post-release displaying of test results apparently takes too much effort. Otherwise, it would have been done.
Huh? They were on the website (and still are) the day the release was announced. See http://www.meta-comm.com/engineering/boost-regression/1_32_0/developer/summa...
Well, I take that back then. However, this URL seems not to be well known. Not a problem then.
- some library maintainers feel the need to run their own tests regularly. Ideally, this shouldn't be necessary.
Agreed ("regularly" is a key word here). IMO the best we can do here is to ask them to list the reasons for doing so.
One reason surely is that the test environments or the test cycles available are somehow unsatisfactory. I would understand either. More testers would help here, too.
- test post processing has to work on output from different compilers. Naturally, that output is formatted differently.
What's the problem here?
Isn't it a problem? Don't we parse the output from the compilers?
- several times the post processing broke due to problems with the XSLT processor.
And twice as often it broke due to somebody's erroneous checkin. The latter is IMO much more important to account for and handle gracefully. Most of the XSLT-related problems of the past were caused by inadequate usage, such as transformation algorithms not prepared for the huge volume of data we are now processing.
Do you expect the recent updates to be able to handle a significantly higher volume? That would be a big improvement. I'm asking because I had the impression some parts of the XSLT processing used O(n^2) algorithms (or worse). My local tests with changing the length of pathnames seemed to indicate that (replacing "/home/boost" with "/boost" resulted in a significant speedup of the XSLT processor).
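For what it's worth, a classic source of accidental O(n^2) behaviour in string-heavy processing is repeated concatenation, which copies the whole accumulated string on every step; that would be consistent with shorter pathnames speeding things up. A Python analogy of the quadratic vs. linear pattern (not the actual report code):

```python
# Illustration only: both functions produce the same string, but the
# first copies the whole accumulated result on every iteration, which
# is O(n^2) in total string length; the single join is O(n).
def concat_quadratic(parts):
    result = ""
    for p in parts:
        result = result + p  # full copy of result each time
    return result

def concat_linear(parts):
    return "".join(parts)
```

If the XSLT templates build report strings the first way, shrinking every pathname shrinks n and the running time drops superlinearly, which matches the observation above.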
- there's no way of testing experimental changes to core libraries without causing reruns of most tests (imagine someone would want to test an experimental version of some part of MPL).
Do you mean running library tests only off the branch?
Yes, and running only a reduced set of tests for that if possible. I think this would help the library maintainers.
- switching between CVS branches during release preparations takes additional resources and requires manual intervention.
What do you think of this one -- http://article.gmane.org/gmane.comp.lib.boost.devel/119337?
I'm with Victor on this point; for the testers (and hopefully there'll be more of them one day) it's significantly easier not to have to change anything during the release preparations. This could be achieved by using the CVS trunk as the release branch until the actual release gets tagged. Development would have to continue on a branch and be merged back into the trunk after the release.

Ideally, the testers would be able to run the tests without having to attend the runs. This is currently not possible. (Just as an example: while I'm writing this I notice that, apparently, I'm unable to upload test results now because of an error caused by one of the Python scripts: "ImportError: No module named ftp")
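Incidentally, Python's standard library names its FTP module "ftplib", not "ftp", so an "import ftp" in the upload script would fail with exactly that error unless some third-party "ftp" module is expected to be installed. A minimal upload sketch using only the stdlib module (host, credentials, and file names are placeholders, not the actual script's values):

```python
import ftplib

def upload_results(host, user, password, local_path, remote_name):
    """Upload a results file via the stdlib ftplib module.

    All arguments are placeholders for whatever the real upload
    script is configured with.
    """
    ftp = ftplib.FTP(host)
    ftp.login(user, password)
    with open(local_path, "rb") as f:
        ftp.storbinary("STOR " + remote_name, f)
    ftp.quit()
```

If the script in question really does "import ftp", either renaming the import or documenting the missing dependency would make unattended runs less likely to die mid-upload.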
Finally, thanks for putting this together!
I hoped other people would contribute to the list; I'm sure there's a lot more to say about testing. E.g. it would be nice to have some sort of history of recent regression results. It would be nice to be able to split the runs vertically (running tests for a smaller set of toolsets) and horizontally (running tests for a smaller set of libraries) easily; I realize, though, that presenting the results would become more difficult.

Regards,
m