[1.33.0] Let's start preparations...

Let's start revving up to release Boost 1.33.0. Personally, I'd like to get it out the door by mid-April at the latest, and I'm offering to manage this release. To get things rolling, I'd like library maintainers to add their goals for the 1.33.0 release to the checklist on the Wiki. This will give us all an indication of our progress toward the feature set we want for this release. A few weeks before the release, we'll freeze the feature set, bump the remaining list to 1.34.0, and work on portability and stability until we get it right. The checklist is here: http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?1.33.0_Checkl... Depending on the size of the list, we'll decide when to feature freeze. I'll be travelling for the next week, but will try to answer e-mail as I find time. Doug

On Sun, 06 Mar 2005 20:52:36 -0500, Douglas Gregor wrote
I've added a list of date-time things. One of the things on my list is fixing the PDF generation of the BoostBook stuff for date-time. I think the problem that stopped this from happening in 1.32 was fixed, but date-time got removed by hand and never got put back in. Unfortunately, I can't test whether it is working, as we've spent several hours trying to get BoostBook PDF generation working to no avail (we will probably try again in the next couple of weeks to isolate the problem). Anyway, I'd like to see if anyone can get this to generate early on so we can make any needed changes. So Doug, if you or Aleksey or someone who has PDF generation working can try to build an early PDF file, I would appreciate it... Jeff

At Sunday 2005-03-06 18:52, you wrote:
thank you for your offer, but if you don't get the damned regression testing working FIRST (it's been non-responsive since Report Time: Fri, 04 Mar 2005 06:30:29 +0000.... that's SEVENTY-TWO (72) hours), you're not gonna have any testers. AND can we talk about HOW we're going to manage the cvs this time (with the testers also?) or are we gonna do it the way we've been muddling through on the past releases? I submit that having the regression testers do ANYTHING other than let the automated stuff run is essentially unacceptable.
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

"Victor A. Wagner Jr." <vawjr@rudbek.com> writes:
Please keep your language civil.
testing working FIRST (it's been non-responsive
Can you please be more specific about what has been non-responsive? I doubt anyone can fix anything without more information.
-- Dave Abrahams Boost Consulting www.boost-consulting.com

At Monday 2005-03-07 07:02, you wrote:
Hi Dave, thanks for the kind lesson in political correctness. I'll give it all the attention I give all "PC" edicts. BTW, rule #1 is "get the student's attention" and for sure we needed to get SOMEONE's attention. Results are more important than someone's feelings, IMO.
<hostile> Here's a lesson in using your brain. If you'd looked at the meta-comm regression page any time since Friday morning (see the paste below) you would have seen that nothing was changing. That's what "non-responsive" means. OR, if you'd been reading the boost-testing echo, you would have noticed that I commented that the regression results weren't being updated (so did Rene)... nada/zip/zilch for response. Yes, I'm hostile. I don't need some "kid" telling me how to communicate (I need _results_).</hostile>

I _do_ note that it seems to be working now (approximately 1530 UT on Monday, March 7)... well, sorta.

Now that we have a dialog going: I also note that I _still_ cannot check the results of the changes I made Thursday night to localtime_test. Although the webpage asserts localtime_test failed on my machine (it does; for some reason, in their <sarcasm>infinite wisdom and desire to innovate</sarcasm>, Microsoft have apparently decided that attempting to format any date before 1900 will cause an exception), when I click on the "fail" link, I get a "page missing" (not particularly useful). I further note that there are some "white spaces" the regression results show for me.

IF we're going to have automated testing, then someone _else_ has to do something so that _all_ of the results show up. My tests are run using "scheduled tasks" on a Windows XP Pro system, every 6 hours, under their own logon (clicking on the RudbekAssociates link will tell you more than you want to know). I've done everything I can think of thus far to make them completely automatic, which is the _only_ rational way to run regression tests. As soon as you _require_ manual intervention you run the risk (probability 1) that your results will be inaccurate.

Sooooooo... Let's get the regression test system up to snuff. Let's make it completely "hands off" for the persons volunteering their (personal & computer) time to run the tests. In other words: let's get it right.
Victor A. Wagner Jr. http://rudbek.com

Victor A. Wagner Jr. wrote:
Labeling something hostile doesn't make it acceptable. (Consider: <pornography>http://tinyurl.com/6lv3e</pornography>) Jonathan

Victor A. Wagner Jr. writes:
Nobody besides yourself can guarantee you fast response time all the time. People get busy and have obligations outside of Boost. Two main problems with the current state of affairs are that: a) There is a limited number of people knowledgeable enough to fix the issues with regression reporting effectively, and b) The machine that is running the reports is only accessible to us (Meta), which makes it impossible for another Boost developer to step in and fix things on the occasion when we are swamped (SourceForge is not an answer to this one, in particular because the sheer amount of processed data is overwhelming for their machines). Until these are resolved, an occasional delay in getting things to work again after an unexpected breakage is inevitable. Having said that, we are trying our best to be responsive within a reasonable time frame even when everybody here is busy.
Fixed now, http://tinyurl.com/3ppqs.
I further note that there are some "white spaces" the regression results show for me.
Sparse "white spaces" in Python are pending a fix to bjam (http://thread.gmane.org/gmane.comp.lib.boost.build/6582), the property_map one needs to be looked at, and the others look normal to me. Python tests issue aside, none of them indicates loss of valuable information.
Agreed 100%.
Ditto. -- Aleksey Gurtovoy MetaCommunications Engineering

Hi Dave, thanks for the kind lesson in political correctness. I'll give it all the attention I give all "PC" edicts.
<hostile> Here's a lesson in using your brain.
Victor, apart from the fact that your message was offensive (and has caused the moderators to receive complaints), you will very likely get a better response if you moderate your tone: please remember that everyone around here is a volunteer. Be assured that we do appreciate you using your computer's time for running the regression tests, and accept that you have raised some valid points; however, if you want another volunteer to spend their time on these, haranguing them is certainly not the best way to go. It is unfortunate that the MetaComm test result site started failing just prior to the weekend, but remember it was fixed on Monday, and with a great deal of good grace on Misha's part as well. Boost only functions by cooperation; I trust that all involved will continue that into the future. Yours in "Moderator Mode", John Maddock.

At Tuesday 2005-03-08 09:59, you wrote:
I never forget that this is a volunteer organization. And I _hope_ what I put in the <hostile>...</hostile> was offensive; it was intended to be. I'd hate to think that I'm accidentally offensive but can't be when I want to. Now, since you also quoted my snide comment about "PC", I'm going to have to infer that some thought/think that comment was ill conceived. Would a "Yes, massah." have been more appropriate?

Not to put too fine a point on it, but if some people took offense at my use of the word "damned", then some people need to reconsider that the standard phrase "take offense" implies action on the part of the newly offended one. I live in a free (well, mostly) country where people are free to do many things, including taking offense at whatever strikes their fancy. Their choice to do so does not, and cannot, impose any burden on me. I applaud their exercise of their rights and will fight to the death to defend such rights. They _still_ cannot bind me to issue only comments/statements at which _they_ will not "take offense". I grant that the word is considered vulgar (people ought to look up what that word really means, and where it came from), and I accept that I'm likely a vulgar person. I'll try to clean up my act.

What "set me off" (and I'm free to do that, just as people are free to take offense) was being unfairly accused of submitting an incomplete bug report. That the report might not be clear to everyone wasn't the issue here; I suspect that everyone involved in the regression testing understood what I'd said. It seemed somewhat out of place to publicly criticize my choice of emphatic (a word I use to refer to a method of emphasizing something: italics, underlining, throwing in a "swear word", etc.) and then apparently not be able to finish reading the entire sentence before complaining that there is insufficient information.
I suppose a private response would be considered by most to be more appropriate, and indeed if it had been made privately, I would have responded thusly.
Be assured that we do appreciate you using your computer's time for running the regression tests, and accept that you have raised some valid points;
If it were just the computer time, I probably wouldn't have reacted. Other than trying to verify OGR in the idle time, the machine can easily afford it. It's the personal time also (I'm a volunteer, too). I see it as my responsibility, since I've said I'd run the tests, to keep up to date on the boost-testing e-mail echo and to verify that my tests are running correctly. Since I seem to be the only person running tests on VC 8.0, I also look at all the failures to see if there's something that can easily be fixed in _any_ of the tests or the Boost libraries which show up as failures only for VC 8.0.

I'd chosen, Thursday night... about an hour before my next test run, to try to fix some exceptions which were being generated in the date_time tests. For those who don't know, VC 8.0 (in debug code generation) has a library that checks iterators pretty thoroughly, and it will find (and has found in the past) problems that just don't show up elsewhere. So I chased down the ones I could find (and fixed the relevant libraries... after running several local tests, of course) and checked them in.

Now I'm operating in two modes, tester and library author/maintainer (no, I'm not any of the primary authors of the date_time stuff), so I'm doubly interested in the results of the complete regression (I can only test flavors of VC++). It's _possible_ that I've fixed things for VC++ 8.0 and killed some other compiler/runtime (given the nature of the fixes it's _staggeringly_ unlikely, but... that's why we have regression tests).

So, it's in this frame that someone mentions to me (irc, efnet, #boost) that someone has suggested we release 1.33 by April 15. I figure someone ought to say, ummm, wait a minute. It was also _very_ important that anyone volunteering to manage a release _UNDERSTAND_ that there is a showstopper problem with the regression testing. Hence, the original message. The follow up you've clearly seen.
however, if you want another volunteer to spend their time on these, haranguing them is certainly not the best way to go.
who did I harangue?
Agreed, and I have no animosity towards meta-comm (nor anyone for that matter...well some (we'll leave them unnamed) politicians, lol).
Boost only functions by cooperation, I trust that all involved will continue that into the future.
Likewise, I believe the success of C++ is directly tied to how well boost seduces (boost reacts faster than the committee).
Yours in "only mode I've got", Victor A. Wagner Jr. http://rudbek.com

At Tuesday 2005-03-08 13:50, I wrote: [deleted as irrelevant]
Likewise, I believe the success of C++ is directly tied to how well boost seduces (boost reacts faster than the committee).
that SHOULD read...: Likewise, I believe the success of C++ is directly tied to how well boost succeeds (boost reacts faster than the committee). I'd typoed both "success" and "succeeds" with a single c. When the spell checker complained about "sucess", it picked "success" as a correction. Then it complained about "suceeds" and I just said OK (assuming, obviously incorrectly) that it would pick "succeeds" (silly me). Sorry for any confusion. [also irrelevant] Victor A. Wagner Jr. http://rudbek.com

Victor A. Wagner Jr. wrote:
I liked your first version better... Maybe it was a typo, or _maybe_ it was a subconscious nod to Boost's enticing nature ;) - james -- __________________________________________________________ James Fowler, Open Sea Consulting http://www.OpenSeaConsulting.com, Marietta, Georgia, USA Do C++ Right. http://www.OpenCpp.org, opening soon!

Victor, Therein lies the problem: many people have found that your messages convey a distinctly hostile and angry manner. If everyone posted in the same way, Boost would quickly degenerate into a flame war, and I'm sure you will appreciate why that would be a "bad idea". If you can't see why that's a bad idea, please say so and we'll bump your messages back into the moderation queue. In short: we all get angry, annoyed and upset from time to time; most Boosters are remarkably restrained at not letting it show, and instead post cogent and well reasoned messages (when they've calmed down, probably). Please, let's move on; I know I have better things to do, and I'm sure you do too. Still in moderator mode, John Maddock.

David Abrahams wrote:
Whatever tone might be appropriate or not... Several testers have raised issues and pleaded for better communication several (probably many) times. Most of the time, we seem to get ignored, unfortunately. I don't want to accuse anyone of voluntarily neglecting our concerns. However, I think we apparently suffer from a "testing is not too well understood" problem at several levels.

The tool chain employed for testing is very complex (due to the diversity of compilers and operating systems involved) and too fragile. Complexity leads to lack of understanding (among the testers and among the library developers), to false assumptions, and to lack of communication. It additionally causes long delays between changing code and running the tests, and between running the tests and the results being rendered. This in turn makes isolating bugs in the libraries more difficult. Fragility leads to the testing procedure breaking often, breaking without getting noticed for some time, and breaking without anyone being able to recognize immediately exactly what part broke.

This is a very unpleasant situation for anyone involved, and it causes a significant level of frustration, at least among those who run the tests (e.g. seeing one's own test results not being rendered for several days, or seeing the test system being abused as a change announcement system, isn't exactly motivating). Please understand that a lot of resources (human and computer) are wasted due to these problems. This waste is most apparent to those who run the tests. However, most of the time, issues raised by the testers seemed to get ignored. Maybe that was just because we didn't yell loud enough, or we didn't know whom to address or how to fix the problems.

Personally, I don't have any problem with the words Victor chose. Other people might. If you're one of them, then please understand that we feel there's something going very wrong with the testing procedure, and we're afraid it will go on that way and we'll lose a lot of the quality (and the reputation) Boost has. The people involved in creating the test procedure have put very much effort into it, and the resulting system does its job nicely when it happens to work correctly. However, apparently, the overall complexity of the testing procedure has grown above our management capabilities. This is one reason why release preparations take so long.

Maybe we should take a step back and collect all the issues we have and all knowledge about what is causing these issues. I'll make a start; I hope others will contribute to the list. Issues and causes, unordered (please excuse any duplicates):

- testing takes a huge amount of resources (HD, CPU, RAM, people operating the test systems, people operating the result rendering systems, people coding the test post-processing tools, people finding the bugs in the testing system)
- the testing procedure is complex
- the testing procedure is fragile
- the code-change to result-rendering process takes too long
- bugs in the testing procedure take too long to get fixed
- changes to code that will affect the testing procedure aren't communicated well
- incremental testing doesn't work flawlessly
- deleting tests requires manual purging of old results in an incremental testing environment
- the number of target systems for testing is rather low; this results in questionable portability
- lousy performance of SourceForge
- resource limitations at SourceForge (e.g. the number of files there)
- between releases the testing system isn't as well maintained as during the release preparations
- test results aren't easily reproducible; they depend much on the components on the respective testing systems (e.g. glibc version, system compiler version, Python version, kernel version, and even on the processor used on Linux)
- library maintainers don't have access to the testing systems; this results in longer test-fix cycles
- changes which will cause heavy load at the testing sites never get announced in advance; this is a problem when testing resources have to be shared with the normal workload (like in my case)
- changes that require old test results to get purged usually don't get announced
- becoming a new contributor of testing resources is too difficult
- we're supporting compilers that compile languages significantly different from C++
- there's no common concept of which compilers to support and which not
- post-release displaying of test results apparently takes too much effort; otherwise, it would have been done
- tests are run for compilers for which they are known to fail; 100% waste of resources here
- known-to-fail tests are rerun although the dependencies didn't change
- some tests are insanely big
- some library maintainers feel the need to run their own tests regularly; ideally, this shouldn't be necessary
- test post-processing has to work on output from different compilers; naturally, that output is formatted differently
- test post-processing makes use of very recent XSLT features
- several times the post-processing broke due to problems with the XSLT processor
- XSLT processing takes long (merging all the components that are input to the result rendering takes ~1 hour just for the tests I run)
- the number of tests is growing
- there's no way of testing experimental changes to core libraries without causing reruns of most tests (imagine someone wanted to test an experimental version of some part of MPL)
- switching between CVS branches during release preparations takes additional resources and requires manual intervention

I'm sure testers and library developers are able to add a lot more to the list. Regards, m

Martin Wille <mw8329@yahoo.com.au> writes:
I'll make a start, I hope others will contribute to the list. Issues and causes unordered (please, excuse any duplicates):
- the code-change to result-rendering process takes too long
We are working on result-rendering process part. We recognize that this has been a big bottleneck in the past and hopefully have significantly improved it.
- resource limitations at Sourceforge (e.g. the number of files there)
My problem with SF is that we do not have much control over the environment there.
- between releases the testing system isn't as well maintained as during the release preparations.
Well, that's because in the 1.32.0 timeframe a lot of effort went into getting the release out and creating the testing tools to ensure its adequate quality, and it was done by the same group of people, who obviously need time to catch up with the other things they have.
- library maintainers don't have access to the testing systems; this results in longer test-fix cycles.
I investigated this a little bit: the current licensing for commercial compilers doesn't permit access to the testing systems by people other than the licensee (not so with free compilers). If somebody with more influence could make some kind of arrangement with compiler vendors, we for one could try to give people reasonable free access to our test environment.
Can be alleviated by splitting testing vertically (by toolsets)
- becoming a new contributor for testing resources is too difficult.
http://www.meta-comm.com/engineering/regression_setup/instructions.html (Of course this needs to be easily accessible from main boost site)
- post-release displaying of test results apparently takes too much effort. Otherwise, it would have been done.
http://www.meta-comm.com/engineering/boost-regression/1_32_0/developer/summa... (Of course this needs to be more easily accessible from main boost site)
- test post processing has to work on output from different compilers. Naturally, that output is formatted differently.
Testing needs a better support from Boost.Build. Till that is implemented we will depend on parsing the bjam output.
- test post processing makes use of very recent XSLT features.
It is such an _enormous_ (trust me) pain to do without them. And it would take quite a significant effort to manage w/o XSLT.
Hopefully, not anymore. We merge all results in less than 30 minutes now.
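For the curious, the merge step being discussed can be pictured with a much-simplified sketch. The element names below are invented and the real tools use XSLT rather than Python; this only illustrates the idea of folding per-runner result documents into one combined report.

```python
# Hypothetical sketch of merging per-runner XML result documents into a
# single report document. Element names ("test-run", "all-results") are
# invented for illustration; the actual tools use XSLT.
import xml.etree.ElementTree as ET


def merge_runner_results(xml_strings):
    """Combine several runner result documents under one root element."""
    root = ET.Element("all-results")
    for s in xml_strings:
        root.append(ET.fromstring(s))  # one subtree per test runner
    return root
```

The real bottleneck Aleksey and Misha describe was not the merge itself but doing it over a very large volume of data, which is why the reporting tools had to be rewritten for speed.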
I'm sure testers and library developers are able to add a lot more to the list.
Thanks for constructive feedback. -- Misha Bergal MetaCommunications Engineering

Misha Bergal wrote:
Martin Wille writes:
[...]
This is good news!
Yes, that's another problem.
That'd be great. I'm wondering whether companies that are both hardware vendors and compiler vendors would be willing to support Boost testing by donating resources (hardware and compilers). However, legal issues would have to be examined closely. We can't make promises like "the next Boost version will support this compiler or that operating system".
That currently would require using several runner-ids, wouldn't it?
Agreed.
Have databases been considered for storing the test results? Regards, m

Martin Wille <mw8329@yahoo.com.au> writes:
Misha Bergal wrote:
Martin Wille writes:
[...]
Needs to be worked at. [...]
It is the results format (not multiple runner-ids) that is a problem, right? If this is the case, would the following format eliminate your problems:

Martin Wille
Tue, 08 Mar 2005 10:00:22 +0000
Tue, 08 Mar 2005 13:00:22 +0000
gcc-xxx gcc-xxx gcc-xxx gcc-xxx gcc-xxx gcc-xxx gcc-xxx gcc-xxx
Briefly. I thought they would help a lot if we were to generate the results dynamically. In that case, instead of pregenerating all results, we would generate them in the front-end on the fly. One of our main requirements for the Boost-wide regression log processor was to minimize environment requirements for the processing and web site scripts. We didn't want a web front-end, because it would tie us to a particular technology (PHP, CGI, Java or ASP.NET), which would reduce our hosting choices. The same goes for a database. Currently, the processing stuff (boost_wide_report.py) requires just Python, xsltproc and some stuff on IIS/Apache to do the on-demand extracting of files from the zip results archive. -- Misha Bergal MetaCommunications Engineering
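The "on-demand extracting of files from the zip results archive" Misha mentions can be sketched roughly as follows. The archive layout and function names are invented for illustration; only the idea, serving individual logs straight out of the uploaded zip instead of pregenerating every page, comes from the message.

```python
# Hypothetical sketch of on-demand extraction from a zip results archive:
# individual test logs stay inside the archive a tester uploaded and are
# pulled out only when somebody asks for them.
import io
import zipfile


def make_results_archive():
    """Build a small in-memory archive, as a tester upload might look."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("gcc-3.4/date_time/localtime_test.txt", "PASS")
        z.writestr("gcc-3.4/python/embedding_test.txt", "FAIL: ImportError")
    buf.seek(0)
    return buf


def extract_log(archive, member):
    """Pull one log out of the archive on demand; None if it's missing."""
    with zipfile.ZipFile(archive) as z:
        try:
            return z.read(member).decode("utf-8")
        except KeyError:
            return None
```

The appeal of this design is exactly what Misha states: it needs no PHP/CGI/Java/ASP.NET front-end, just a web server able to hand a request to a small script.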

Aleksey Gurtovoy <agurtovoy@meta-comm.com> writes:
Yes.
_and_ group access to commercial compilers?
In some form, I think. I'm not sure we would all be able to log in and run a commercial compiler, but if OSL had a BuildBot slave running these compilers, we wouldn't have to. A scheme something like this doesn't necessarily depend on the use of BuildBot either. [followups to boost-testing] -- Dave Abrahams Boost Consulting www.boost-consulting.com

[Please follow up on Boost.Testing list] Martin Wille writes:
And in almost all cases "something doesn't work right" usually ended up being a temporary breakage caused either by newly implemented functionality in the regression tools' chain / internal environment changes on our side, or malfunctioning directly related to incremental runs/jam log parsing. The only thing the former cases indicate is that the tools are being worked on, and only _possibly_ that the people doing the work are taking somewhat more risk of breaking things than, say, during the release. In any case, this by no means indicates loss of control -- quite the opposite. The latter cases, as we all agree, _are_ tips of the seriously hurting issues that need to be resolved ASAP. Yet it's nothing new.
Less than optimal responses on those.
Well, I disagree with this particular angle of looking at the situation. Given the history of the recent issues which _I_ would classify as suboptimally resolved/responded to, for me the above statement is equivalent to saying: "Something didn't work right recently and it seemed like the problem might well be on the reporting side -- I'd expect the corresponding maintainers to look at it proactively and sort things out". Needless to say, I consider this neither a fair nor a productive way of looking at things.
IMO the problem is not that people don't know who is responsible (in fact, assigning a single person to be responsible is going to bring us back to square one) but rather that nobody steps up and says "I'll research this and report back" -- in a timely manner, that is. Is it a management problem? Rather lack of resources, I think.
Same here. Yet again, we've been having these problems from the day one. If your point is that it's time to solve them, I agree 100%.
- We're not really able to tell when a bug started to get reported.
I'm not sure I understand this one. Could you please provide an example?
OK.
Is it really a problem nowadays? I think we have timestamps in every possible place, and they make things pretty obvious.
I think the answer to this is further splitting of work among distributed machines.
It is. Anybody who feels interested enough to be filled in on this is more than welcome to join. [...]
I'd say we just need a backup results-processing site.
If you are talking about CodeWarrior on OS X saga, then it is more build system-related than anything else. [...]
Oh, I thought you were referring to something else. Yes, as we've agreed before, the need to post-process the output is probably the biggest source of problems.
Yes, and we have implemented only the most obvious optimizations. If there is further need to speed up things, we'll speed them up.
What about the tarballs, though?
It's on our TODO list -- http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?Boost.Testing.
It would be nice to be able a to split the runs vertically (running tests for a smaller set of toolsets)
Isn't this possible now?
and horizontally (running tests for a smaller set of libraries) easily;
Agreed.
I realize, though, that presenting the results would become more difficult.
Nothing we can't figure out. -- Aleksey Gurtovoy MetaCommunications Engineering

Martin Wille writes:
I agree with everything that is said above...
.. but not this one. I, for one, don't feel this way. There is work to be done and issues to be resolved, true, but people are working on it and things do improve substantially over time. Comparing to pre-1.32 testing procedures and practices we are now on a totally different level of usability, coverage, simplicity of installation, and overall usefulness of the regression tools.
and we're afraid it will go on that way and we'll lose a lot of the quality (and the reputation) Boost has.
I think you are overstating things. If anything, things got significantly better on this front. The quality was poorly enforced until very recently -- we simply had no tools to get a more or less accurate, comprehensive picture of it. We do now, and people are working on moving things further forward. Not fast enough? Somebody who feels that way should give them a hand, then.
Honestly, I don't see from what you conclude that, much less how it's apparent. Having said that...
Maybe, we should take a step back and collect all the issues we have and all knowledge about what is causing these issues.
... this is a good idea. Making the issues visible definitely helps in keeping track of where we are and what still needs to be done, and quite possibly in soliciting resources to resolve them.
I'll make a start, I hope others will contribute to the list. Issues and causes unordered (please, excuse any duplicates):
I'll comment on the ones I have something to say about.
True. It's also a very general observation. Don't see how having it here helps us.
- the testing procedure is complex
Internally, yes. The main complexity and _the_ source of fragility lies in the "bjam results to XML" stage of processing. I'd say it's one of the top 10 issues; by solving it we can substantially simplify everybody's life.
- the testing procedure is fragile
See the above.
- the code-change to result-rendering process takes too long
Not anymore. In any case, there is nothing in the technology used (XSLT) that would make this an inherent bottleneck. It became one because the original implementation of the reporting tools just wasn't written for the volume of processed data the tools are asked to handle nowadays.
- bugs in the testing procedure take too long to get fixed
I think all I can say on this one is said here -- http://article.gmane.org/gmane.comp.lib.boost.devel/119341.
- incremental testing doesn't work flawlessly
That's IMO another "top 10" issue that hurts a lot.
- deleting tests requires manual purging of old results in an incremental testing environment.
Just an example of the above, IMO.
- the number of target systems for testing is rather low; this results in questionable portability.
Yes, we need more volunteers. Another "top 10" item.
- lousy performance of Sourceforge - resource limitations at Sourceforge (e.g. the number of files there)
This doesn't hurt us anymore, does it?
True. There isn't much we can do about it, though, is there?
- becoming a new contributor for testing resources is too difficult.
I don't think that's true anymore. How much simpler can it become -- http://www.meta-comm.com/engineering/regression_setup/instructions.html?
- we're supporting compilers that compile languages significantly different from C++.
Meaning significantly non-conforming compilers or something else?
- there's no common concept of which compilers to support and which not.
I think the criteria have been formulated several times.
- post-release displaying of test results apparently takes too much effort. Otherwise, it would have been done.
Huh? They were on the website (and still are) the day the release was announced. See http://www.meta-comm.com/engineering/boost-regression/1_32_0/developer/summa...
- tests are run for compilers for which they are known to fail. 100% waste of resources here.
Agreed 100%. Also "top 10" item.
- known-to-fail tests are rerun although the dependencies didn't change.
Ditto.
- some library maintainers feel the need to run their own tests regularly. Ideally, this shouldn't be necessary.
Agreed ("regularly" is a key word here). IMO the best we can do here is to ask them to list the reasons for doing so.
- test post processing has to work on output from different compilers. Naturally, that output is formatted differently.
What's the problem here?
- several times the post processing broke due to problems with the XSLT processor.
And twice as often it broke due to somebody's erroneous checkin. The latter is IMO much more important to account for and handle gracefully. Most of the XSLT-related problems of the past were caused by inadequate usage, such as transformation algorithms not prepared for the huge volume of data we are now processing.
Already fixed.
- the number of tests is growing
And more distributed testing is the only answer to this.
Do you mean running library tests only off the branch?
- switching between CVS branches during release preparations takes additional resources and requires manual intervention.
What do you think of this one -- http://article.gmane.org/gmane.comp.lib.boost.devel/119337? Finally, thanks for putting this together! -- Aleksey Gurtovoy MetaCommunications Engineering

Aleksey Gurtovoy wrote:
Martin Wille writes:
- many reports of "something doesn't work right", often related to post-processing, and less than optimal responses to those. We all do understand that you and Misha are under time constraints and therefore aren't able to answer immediately. Having only two people who are able to fix these things is one small part of our problems. The fact that people do not know who would be responsible for finding out what part of the testing procedure is going wrong seems to indicate a management problem.
- bugs suddenly go away and the people involved in tracking them down do not understand what was causing them. This kind of problem is probably related to the build system. I consider this one fairly dangerous, actually.
- We're not really able to tell when a bug started to get reported.
I'm under the impression some people did not know how much resources testing actually costs. I've seen reactions of surprise when I mentioned the CPU time, HD space or RAM consumed by the tests. Pleas for splitting test cases were ignored (e.g. random_test).
I agree. This processing step has to deal with the build system (which is complex in itself) and with different compiler output. Other complexity probably stems from having to collect and display test results that reflect different cvs checkout times.
*This* step might be a lot faster now (congrats, this is a *big* improvement). However, there still are other factors which make the code-change-to-result-rendering process take too long.
I'm not trying to imply Misha or you wouldn't do enough. However, the fact that only two people have the knowledge and the access to the result collection stage of the testing process is a problem in itself.
Right. However, it's one of the more difficult problems to solve. The build system would have to be expanded to make it delete results for tests which don't exist anymore.
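Alternatively, a small script outside the build system could reconcile the results directory against the current test list after each run. The sketch below assumes one result file per test; the directory layout and file naming are purely hypothetical, not what the Boost tools actually use:

```python
import os

def purge_stale_results(results_dir, current_tests, suffix=".xml"):
    """Delete result files for tests that no longer exist in the suite,
    and return the names of the tests whose results were purged."""
    removed = []
    for name in os.listdir(results_dir):
        if not name.endswith(suffix):
            continue
        test = name[: -len(suffix)]
        if test not in current_tests:
            os.remove(os.path.join(results_dir, name))
            removed.append(test)
    return sorted(removed)

# Usage: results for `old_test` linger although the test was deleted.
os.makedirs("results", exist_ok=True)
for t in ("alive_test", "old_test"):
    open(os.path.join("results", t + ".xml"), "w").close()
print(purge_stale_results("results", {"alive_test"}))  # → ['old_test']
```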
It hurts every time the result collecting stage doesn't work correctly. We're not able to generate our own XML results and upload them due to the SF resource limits.
You're probably right. However, I wanted to mention this point, because someone might have an idea how to address it. I guess it boils down to needing more testers in order to see more flavours of similar environments.
Hmm, recent traffic on the testing reflector seemed to indicate it isn't too simple. This might be caused by problems with the build system.
Yes, significantly non-conforming compilers.
Well, I take that back then. However, this URL seems not to be well known. Not a problem then.
One reason surely is that the test environments or the test cycles available are somehow unsatisfactory. I would understand either. More testers would help here, too.
It isn't a problem? We don't parse the output from the compilers?
Do you expect the recent updates to be able to handle a significantly higher volume? This would be a big improvement. I'm asking because I had the impression some parts of the XSL processing used O(n^2) algorithms (or worse). My local tests with changing the length of pathnames seemed to indicate that (replacing "/home/boost" with "/boost" resulted in a significant speedup of the XSLT processor).
Yes, and running only a reduced set of tests for that if possible. I think this would help the library maintainers.
I'm with Victor on this point; for the testers (and hopefully there'll be more of them one day) it's significantly easier not to have to change anything during the release preparations. This could be achieved by using the CVS trunk as the release branch until the actual release gets tagged. Development would have to continue in a branch and be merged back into the trunk after the release. Ideally, the testers would be able to run the tests without having to attend the runs. This is currently not possible. (Just as an example: while I'm writing this I recognize that, apparently, I'm unable to upload test results now because of an error caused by one of the Python scripts: "ImportError: No module named ftp")
Finally, thanks for putting this together!
I hoped other people would contribute to the list; I'm sure there's a lot more to say about testing. E.g. it would be nice to have some sort of history of recent regression results. It would be nice to be able to split the runs vertically (running tests for a smaller set of toolsets) and horizontally (running tests for a smaller set of libraries) easily; I realize, though, that presenting the results would become more difficult. Regards, m

On Mar 8, 2005, at 4:52 PM, Martin Wille wrote:
To improve the reproducibility of results and make testing more predictable, we might want to have the regression scripts always check out using a given date/time tag, e.g., 12:00am EST each night. That way, all of the tests for the day will be on the same code. If it helps fix other problems with regression testing, great! Doug
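For what it's worth, a checkout pinned to a shared daily timestamp might look like the sketch below. The CVS root and module name are placeholders, not Boost's actual ones, 12:00am EST is written as 05:00 UTC assuming standard time, and the command is echoed to keep it a dry run:

```shell
# Sketch: pin every tester's checkout to the same daily snapshot.
# Server path and module name are hypothetical placeholders.
STAMP="$(date -u +%Y-%m-%d) 05:00 UTC"   # 12:00am EST == 05:00 UTC
echo cvs -d :pserver:anonymous@cvs.example.org:/cvsroot/boost \
    checkout -D "$STAMP" boost
```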

At Tuesday 2005-03-08 12:16, you wrote:
wow, you want me to only run the tests once a day instead of 4 times? Surely, you're not suggesting that I'd get differing results if I checked out using the same time more than once.
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

On Mar 8, 2005, at 10:39 PM, Victor A. Wagner Jr. wrote:
Obviously, this is not the case. Looking at the summary page, however, we get a view across many CVS states, so it's hard to tell which version of the source code we're looking at. The answer isn't to submit only once per day (more testing is better, always!), but to have at least one build from each tester that references the source code at 12:00am EST. When a tester is capable of submitting more builds in a day, we have the most up-to-date build AND the 12:00am EST build in the results, which will make it very easy to see what we've broken in a given day. Doug

Douglas Gregor wrote:
Would keeping a history of previous builds that are marked with the state of the CVS they tested solve the problem? If so, that's what BuildBot does. [[ follow ups to boost-testing ]] -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

Vladimir Prus wrote:
I think the real answer is Subversion with its repository-wide revision number, which can be shown next to test results.
Nice idea. How would we automatically find an agreement about which revision to test? It's much more usable than a date -- I don't even know for sure what 12:00 EST is or how to get it ;-)
You don't need to know what (or when) 12:00 EST is. You just have to know what options to pass to cvs ;) Regards, m

Vladimir Prus wrote:
BuildBot supports both SubVersion and CVS. And keeps track of the "version" that is getting tested. For SVN it's the revision number, and for CVS it's a checkout time stamp (based on the actual change notification). [[ follow ups to boost-testing ]] ;-) -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

Douglas Gregor <doug.gregor@gmail.com> writes:
Doug, I want to put in writing some things which I believe are important to state explicitly. I believe that, in the end, you are the release manager for 1.33.0. As such, you have a great influence on what the testing group's goals are - you are our "main customer". You just need to tell us what you consider to be your main problems and their priorities. For example, * Developers/Release manager need to see what changes caused a particular test to fail. * The release manager should be able to set what revision of the codebase is getting tested and get all results for that particular revision. * When a developer checks something in, she should see the results of that checkin ASAP. Something like telling the testing system: test my library on all toolsets and return me the results, quick! We will then see what we can do for you in time for 1.33.0. -- Misha Bergal MetaCommunications Engineering

On Mar 11, 2005, at 10:08 PM, Misha Bergal wrote:
I've been a rather silent customer, but time is again (temporarily) my friend :)
I think getting more immediate feedback about new failures would go a long way in helping us determine which changes cause failures. The GCC developers have a wonderful system that complains very loudly when someone checks in broken code. When new regressions are found, it: (1) Determines what code has changed since the last-known-good version (2) Determines who made changes to the repository since then (3) E-mails everyone that made changes, giving them a summary of the new failures and a link to the log file. Alternatively, we could spam the developer list or a regression-testing list (in decreasing order of effectiveness). (4) Keeps e-mailing everyone until the problem is fixed. The last one isn't really necessary, but having something that does the first three would be *great*. It might even be nice to know when we fix a regression (it can sometimes be hard to tell given the compiler status tables).
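Steps (1)-(3) could be prototyped along these lines. This is a pure-Python sketch over stub data; a real version would ask CVS what changed since the last good run and send the drafts via SMTP, and every name, file, and URL here is hypothetical:

```python
def build_notifications(changes, new_failures, log_url):
    """Group the new failures by the committers whose checkins may have
    caused them, and draft one summary message per committer."""
    committers = sorted({c["author"] for c in changes})
    summary = "\n".join(f"  FAIL {t}" for t in sorted(new_failures))
    messages = {}
    for who in committers:
        files = [c["file"] for c in changes if c["author"] == who]
        messages[who] = (
            f"New regressions since the last good run:\n{summary}\n"
            f"Your changed files: {', '.join(files)}\n"
            f"Full log: {log_url}"
        )
    return messages

# Stub data standing in for "what changed since the last-known-good run".
changes = [
    {"author": "alice", "file": "libs/foo/bar.cpp"},
    {"author": "bob", "file": "boost/baz.hpp"},
]
msgs = build_notifications(changes, {"foo_test"}, "http://example.org/log")
print(sorted(msgs))  # → ['alice', 'bob']
```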
* Release manager should be able to set what revision of codebase is getting tested and get all results for that particular revision
This is extremely important. Being able to easily change all of the regression testing to a branch means that we might be able to move to a more lightweight release management process. If it's easy for the release manager to switch over testing, stabilize the branch in a week or two, then release, we'll get releases out without so much fuss.
This takes more hardware than we currently have access to, so it's not very high priority to me. Doug

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:000601c53704$45a0f360$6501a8c0@pdimov2...
"test-first" really does help. I normally develop under Windows, but have installed Linux on an old machine, and also bought a cute little Mac Mini to test on Mac OS X. Being able to cycle tests quickly on all three platforms really does speed multi-platform development. If we put Boost's collective mind to it, perhaps we can come up with a way to cycle tests much more quickly. I suspect we can find the machine resources, particularly if we can make the testing process robust enough that the test framework can be started, and then forgotten for months at a time. --Beman

On Apr 1, 2005, at 8:46 PM, Beman Dawes wrote:
OSL has two x86 Linux boxes we can spare (and should be usable in a week or so). I'm thinking one of them can be an interactive testing farm and perhaps the other can be set up for "continuous" testing, e.g., each time a CVS commit is performed [1], regression tests are re-run [2] and the results posted immediately (potentially coupled with a script that automatically complains to the committer, as I've mentioned before). [1] Actually, it should wait about 5 minutes so that other related commits can get into the repository first. [2] Of course, if a test is already running then the re-run request should be queued.
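The two footnotes amount to a small scheduling policy. A sketch of it follows, with all names hypothetical and times in seconds:

```python
QUIET = 300  # footnote [1]: wait ~5 minutes after the last commit

class RunScheduler:
    """Coalesce commits into test runs: a run starts only after the
    repository has been quiet for QUIET seconds, and commits arriving
    while a run is active are folded into a single queued rerun."""

    def __init__(self):
        self.last_commit = None
        self.running = False
        self.rerun_queued = False

    def on_commit(self, now):
        if self.running:
            self.rerun_queued = True      # footnote [2]: queue, don't stack
        else:
            self.last_commit = now        # restart the quiet timer

    def poll(self, now):
        """Return True if a test run should start now."""
        if self.running or self.last_commit is None:
            return False
        if now - self.last_commit >= QUIET:
            self.running, self.last_commit = True, None
            return True
        return False

    def on_run_finished(self, now):
        self.running = False
        if self.rerun_queued:
            self.rerun_queued = False
            self.last_commit = now - QUIET   # eligible immediately

s = RunScheduler()
s.on_commit(0)
s.on_commit(120)      # a related commit restarts the quiet timer
print(s.poll(200))    # False: only 80s of quiet so far
print(s.poll(420))    # True: 300s of quiet, the run starts
```

This is essentially the "tree stable timer" idea; BuildBot (mentioned elsewhere in this thread) implements the same policy.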
Yes, this would help greatly. Like OSL, I'm sure many companies and universities can spare some bandwidth and compute cycles but not labor. Make testing easy and we get more testing. Doug

On Fri, 1 Apr 2005 22:20:39 -0500, Douglas Gregor wrote
I notice that the Rudbek and Laru tests are running several times a day (thx guys). This makes it much faster to see VC results -- very helpful for someone like me who is gcc-centric. It seems that by providing a smaller number of compilers they can cycle faster than some of the bigger testing farms. I'm hoping I'll get a chance to set up a Linux machine here that can run some 'release mode' gcc tests for the 1.33 release push. It will be a week or so before I get time to try it, though... Jeff

Douglas Gregor wrote:
And you just described what Buildbot does :-) The basic functionality for mine has been running this week without problems: http://build.redshift-software.com:9990/ -- And I'll be adding things like posting results, individual running of tests, and selection of tests to run this weekend. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

Martin Wille <mw8329@yahoo.com.au> writes:
I've always thought that a design that gets information by processing stdout from bjam would be fragile. Furthermore, it means we can't use the -j option with bjam, which, even on uniprocessors, can speed up builds considerably. The build system itself should be writing the XML. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Aleksey Gurtovoy <agurtovoy@meta-comm.com> writes:
Not by itself, but I think it should be possible to build some target types that, along with Python scripts (or a C++ tool we build), would do it. [followups to boost-testing] -- Dave Abrahams Boost Consulting www.boost-consulting.com

Aleksey Gurtovoy wrote:
I'm verifying V2 operation on regression tests *now* and I think it's pretty close. But you're asking about a new feature which is not present in V1, either. I would say it's more reasonable to switch to V2 first -- from the point of view of regression testing, that would mean adding --v2 to the bjam invocation, and nothing else. Once that's done, we can think about further enhancements. Doing the switch together with some enhancements will only create problems, IMO. - Volodya

Vladimir Prus writes:
I wasn't specifically asking for it as a prerequisite for "officially" declaring V2 our production build system. I was just pointing out that, at the very least, for the latter to happen the regression testing has to be switched over to V2. At the moment I have no idea what's involved in it. My major concern is this: does V2 guarantee exactly the same format of the output ("bjam log") as V1?
The V2 toolsets infrastructure is different, and it matters for the reports. Other than that, if --v2 plays well with process_jam_log, I agree with the rest of your post below.
-- Aleksey Gurtovoy MetaCommunications Engineering

Maybe you want to look at what Kitware does; they use CMake as a cross-platform generator, and Dart for testing. The process seems reasonably automated. CMake is a cross-platform generator similar to bjam with the advantage that it generates IDE projects; it also handles testing fairly automatically. It is also PC friendly (aside from being *nix and Mac friendly). You guys could do wonders using CMake and Dart linked with Boost. Look at: http://www.vtk.org/Testing/Dashboard/20050309-0300-Nightly/Dashboard.html for the dashboard and http://www.cmake.org/HTML/Index.html Andrew -----Original Message----- From: Aleksey Gurtovoy [mailto:agurtovoy@meta-comm.com] Sent: Wednesday, 9 March 2005 21:05 To: boost@lists.boost.org Cc: boost-testing@lists.boost.org Subject: [boost] Re: [1.33.0] Let's start preparations... David Abrahams writes:
Exactly. Is Boost.Build v2 going to give us that? -- Aleksey Gurtovoy MetaCommunications Engineering

Martin Wille <mw8329@yahoo.com.au> writes:
That is worrisome.
- incremental testing doesn't work flawlessly That's IMO another "top 10" issue that hurts a lot.
It's a bit of a problem that when bjam scans for header dependencies it can't preprocess the files, so dependencies created by #include SOME_MACRO() don't get registered.
We can get resources from OSL if we need them... once this discussion discovers what we need ;-) By the way, this discussion should really be moved to boost-testing. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Aleksey Gurtovoy <agurtovoy@meta-comm.com> writes:
Well, if we could, it would be better to always do real preprocessing, or the equivalent, to eliminate false dependencies. The problem is that fitting it into the structure of bjam is nontrivial. It's certainly not doable in the current Boost.Jam language. [followups to boost-testing] -- Dave Abrahams Boost Consulting www.boost-consulting.com

Martin Wille wrote:
Whatever tone might be appropriate or not ...
IMO an offensive tone against some who have invested an incredible amount of work during the last months is never appropriate.
Maybe it's the fact that I don't run incremental tests and therefore don't encounter as many problems with the tests as you, but I've never got the impression of being neglected by Aleksey and his team, except for the 'usual' newsgroup delays.
No doubt. [...]
I agree 100%! IMO there is one major issue that was raised by me and others several times and that has <cynicism>successfully</cynicism> been set aside so far: boost is getting larger and larger, but nobody wants to talk about the side effects this brings along:
- the size of the binaries has grown incredibly!
- the time and disk space needed for test runs are higher than ever before
- the boost library itself is on the way to becoming a blob of more or less unrelated code/library fragments.
IMO it's overdue to think about splitting boost into components. For me it's not clear why a user who'll never use python must install and build boost.python on his machine. The same goes for many other boost libraries like graph, spirit, serialization, wave, etc. etc. It would be _much_ easier for us testers to run tests on boost _components_ than on the complete boost blob! If boost continues growing as it did in the past, this will be the only way to continue regression testing at today's quality. Sorry to say that, but if we don't start thinking about this ASAP, there will definitely be a 'test breakdown' in the future.
[...] Very good idea! Thanks for putting them together here! Cheers, Stefan

Stefan Slapeta <stefan@slapeta.com> writes:
I actually have noticed and do want to talk about it, and I am trying to prepare some thoughts on the subject. I have been planning to raise these issues first with the Boost moderators, as we're primarily responsible for Boost administration. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams writes:
Many people who do a fair amount of work on "administration" are not moderators. I don't see why a discussion like this should be started in a private circle that doesn't include all the interested parties. -- Aleksey Gurtovoy MetaCommunications Engineering

Aleksey Gurtovoy <agurtovoy@meta-comm.com> writes:
Normally we do some things that way to keep them manageable in the early stages of discussion. It was part of my plan to suggest inviting the other major players. Obviously I didn't think it was all that crucial to do in private or I wouldn't have mentioned it here at all. Just to be clear, I think this discussion should go way beyond testing to look at all of Boost's infrastructure and management. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Stefan Slapeta wrote:
Martin Wille wrote:
[...]
I wasn't specifically talking about Aleksey or Misha here. E.g. I've several times asked to split certain tests into smaller parts and got ignored. I've also reported bugs in libraries and got ignored. I'm aware of the fact that Aleksey and Misha also have duties in their jobs and that they can't work for Boost 24/7. In fact, I found them very responsive when they had enough time. Regards, m

Victor A. Wagner Jr. writes:
You'd _have_ to switch between branches at some point (unless we branch for release immediately once the timeframe has been decided on), and regardless of everything else, the tarballs still need to be tested as we come close to the finish line, because that's what actually gets released. But I guess the latter can be automated, and on the whole, I agree. For tarball testing, how about we modify 'regression.py' to check some specific location for new tarballs and give preference to them over branch testing for one round? -- Aleksey Gurtovoy MetaCommunications Engineering

At Monday 2005-03-07 22:05, you wrote:
I'm suggesting that the work continue on the mainline towards release and that anyone working on something OTHER than the release, work on a branch. That way as long as we're playing with CVS the testers don't have to do ANYTHING to their already functioning scripts. The only people who will need to worry about weird labels will be those working on stuff unrelated to the release.
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

Victor A. Wagner Jr. wrote:
FWIW, I agree with Victor. I'd go further and state that the CVS should _always_ be kept in a release-ready form. Any deviations from this ideal should be fixed as quickly as possible. I, personally, always use the CVS version of Boost, I never wait for an "official" release. This model is not well supported. Most energy is invested in releases, which makes them very hard to manage (because the effort is not amortized over time.) Scheduled releases are still a good thing because they provide a deadline, though.

Peter Dimov wrote:
I'd *so* much like to see that. (Not only for running the test, which would become significantly easier.) The actual release process would become so much easier! The sheer number of libraries that form Boost more and more requires a more disciplined way of using CVS. Things like somebody checking in a change that breaks other libraries without supplying the documentation about how to deal with the breaking change should never happen. Of course, progress needs to be saved even when the work isn't complete yet. That's what you can use branches for. Regards, m

Peter Dimov wrote:
FWIW - I also agree with the idea that we should strive to have the mainline always "release ready" and that experimentation be undertaken on either local versions or a separate branch. There are a couple of fundamental problems with the current situation:
a) testing is not scalable - it now takes so many resources that not many can do it. As Boost gets larger, the problem gets worse.
b) testing is a little "too experimental". There are issues with incremental build and testing that, if resolved, could make the process less resource intensive. This would be helpful.
c) Library developers make their enhancements and run on the compilers that they have. Then they have to upload to the mainline to "see what happens" - now they can address issues related to other compilers - but now: i) the mainline isn't "ready for release"; ii) other developers whose packages depend on the uploaded "experimental" code are stymied if the experimental code has a bug. This might cause a ripple effect and make libraries fail that depend on the "experimental" upload, which wastes a lot of testing resources.
So to really "fix" this, a couple of things would have to change:
a) The mainline would always have to be considered a "candidate for release". Tarballs would be tested when and only when code were merged from a development branch into the mainline. I would expect this to occur relatively infrequently - whenever a library maintainer thinks it's really ready (guesstimate - every one or two weeks). The question to be answered by this testing is "is this ready for release?" rather than "what is the current state of development?". Whenever a release came up clean, it would be labeled boost 1.32.?? (beta).
b) Users who want to download it would be encouraged to do so, file bug reports, and upload test results to a new "pending issues" test matrix. This would require that the "Getting Started" section of the documentation explain the tests as part of the installation process. In fact, we might even want to make the default install run the tests, since a side effect of the tests is building all the libraries anyhow. That is - spread the testing around to make it scalable.
c) Separate development tests on the "development tree" would work much as they do now, except: i) incremental update, build and test would be used - this presumes some fixes, including the detection that Jamfiles have been changed, and perhaps library authors being permitted to request a total rebuild.
d) It's time to approach some industry players for more support. A company that touts "Boost compatibility" on its packaging and/or promotional material has to expect to be hit up for some support. Such support could/should consist of: i) running at least the release testing on their own platforms; ii) perhaps providing a platform/compiler for running developmental tests; iii) providing software to developers/maintainers of libraries accepted into boost.
Note that for the above to work, the developmental and release testing has to be TOTALLY automatic once it's set up. I realize that this might sound ambitious - but I think that boost is on the verge of bigger things. Robert Ramey
participants (18)
- Aleksey Gurtovoy
- Andrew Maclean
- Beman Dawes
- David Abrahams
- Doug Gregor
- Douglas Gregor
- James Fowler
- Jeff Garland
- John Maddock
- Jonathan Turkanis
- Martin Wille
- Misha Bergal
- Peter Dimov
- Rene Rivera
- Robert Ramey
- Stefan Slapeta
- Victor A. Wagner Jr.
- Vladimir Prus