
On Sun, 22 Apr 2012 21:51:06 -0400, Edward Diener wrote:
On 4/22/2012 8:43 PM, Stephan T. Lavavej wrote:
If someone could prepare *minimized* test cases demonstrating the VC bugs that are holding you back here, I could file compiler bugs.
One of the main issues is that Microsoft follows the C89 standard when it comes to the preprocessor. That has been superseded in both C and C++ a long time ago. So what does one say to a company that follows a very old standard and defends the inadequacy of its preprocessor by saying that according to that standard they are processing preprocessor symbols correctly ? It is like a company saying that they implement a computer language of 23 years ago and despite the fact that the language has changed drastically since that time, they are proud to not keep up with those changes.
It isn't that. VC++ doesn't even implement the preprocessor correctly for C89. Essentially, the macro expansion algorithm appears to be fundamentally broken. One small example: #define A() 123 #define B() () A B() // should expand to A() (and does) #define C() A B() C() // should *still* expand to A() (but instead expands to 123) The entire methodology is borked here. Leaving blue-paint aside (and argument processing aside), macro expansion should work like a stream editor. + + C ( ) + + ^ + + C ( ) + + ^ + + C ( ) + + ^ + + A B ( ) + + ^ THIS is "rescanning and further replacement." Rescanning and further replacement is *not* a recursive process that occurs "in the replacement list." + + A B ( ) + + ^ + + A ( ) + + ^ // note: not at 'A', scanning doesn't go backward + + A ( ) + + ^ + + A ( ) + + ^ + + A ( ) + + ^ + + A ( ) + + ^ In the above, everything to the left of the caret is output and everything to the right is input. For the top level source, that output is the underlying language parser (or preprocess-only output). The only time it isn't is when an argument to a macro--which is *actually used* at least once in the replacement list without being an operand of # or ##--is processed. In that case, a recursive scan of the sequence of tokens making up the argument is necessary. In that case, the output of the scan is redirected to a buffer (depending on exact implementation methodology) for future substitution into a copy of the replacement list. Things get more complicated when dealing with blue paint and disabling contexts. A disabling context is a range associated with a macro name (that exists only during a scan for macro expansion--i.e. if that scan is redirected to buffer for substitution, the result in the buffer does not contain annotations representing the disabling context). If an identifier token which matches the macro name is found in the range during the scan, it is "painted blue." A painted token cannot be used as the name of a macro in a macro invocation. However, while the context that causes blue paint is transient, blue paint on a token is not. It remains even when the scan output is redirected for substitution. Note that substitution ultimately means that the sequence of tokens in the argument are going to get scanned again. When that happens, the contexts from the first scan are gone, but any blue paint on particular tokens is not. As a concrete example, #define A(x) B(x) C(x) D(x) #define B(x) #define C(x) x #define D(x) #x A(B(1)) Where M' = painted M token, { ... } is the hideset H (which may be flags in the symbol table), +M is a virtual token which means H := H | M, and -M is a virtual token which means H := H \ M... [begin] A ( B ( 1 ) ) EOF ^ {} [begin] B ( 1 ) EOF ^ {} [begin] 1 EOF ^ {} 1 EOF ^ {} [end] +B 1 -B EOF ^ {} 1 -B EOF ^ { B } 1 -B EOF ^ { B } 1 EOF ^ {} [end] +A "B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF ^ {} "B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF ^ { A } "B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF ^ { A } [begin] 1 EOF ^ { A } 1 EOF ^ { A } [end] "B(1)" +B 1 -B C ( 1 ) D ( 1 ) -A EOF ^ { A } "B(1)" 1 -B C ( 1 ) D ( 1 ) -A EOF ^ { A, B } "B(1)" 1 -B C ( 1 ) D ( 1 ) -A EOF ^ { A, B } "B(1)" 1 C ( 1 ) D ( 1 ) -A EOF ^ { A } "B(1)" 1 +C -C D ( 1 ) -A EOF ^ { A } "B(1)" 1 -C D ( 1 ) -A EOF ^ { A, C } "B(1)" 1 D ( 1 ) -A EOF ^ { A } [begin] 1 EOF ^ { A } 1 EOF ^ { A } [end] "B(1)" 1 +D "1" A ( 1 ) B ( 1 ) -D -A EOF ^ { A } "B(1)" 1 "1" A ( 1 ) B ( 1 ) -D -A EOF ^ { A, D } "B(1)" 1 "1" A ( 1 ) B ( 1 ) -D -A EOF ^ { A, D } "B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF ^ { A, D } "B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF ^ { A, D } "B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF ^ { A, D } "B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF ^ { A, D } [begin] 1 EOF ^ { A, D } 1 EOF ^ { A, D } [end] "B(1)" 1 "1" A' ( 1 ) +B 1 -B -D -A EOF ^ { A, D } "B(1)" 1 "1" A' ( 1 ) 1 -B -D -A EOF ^ { A, B, D } "B(1)" 1 "1" A' ( 1 ) 1 -D -A EOF ^ { A, D } "B(1)" 1 "1" A' ( 1 ) 1 -A EOF ^ { A } "B(1)" 1 "1" A' ( 1 ) 1 EOF ^ {} [end] VC++ gets the result correct here (though it doesn't get it by the correct method). The algorithm is bizarre in relation to typical code interpreters, but it is actually shockingly simple. The only times that it gets tricky are when a -M occurs within the sequence of tokens that makes up an invocation which I glossed over above (because they don't occur in the above). E.g. scenarios like: MACRO -M ( ARG ) or MACRO ( ARG, -M ARG ) In both of these cases, the -M must get executed prior to MACRO(...) being replaced by MACRO's replacement list (and resuming scanning). The first of these is not very tricky--and both boost-pp and chaos-pp require this. The algorithm merely has to execute the virtual token when finding the left parentheses. The second is more tricky (and I don't think even chaos-pp requires this one to work correctly) because the algorithm has to *correctly* deal with possibly four different scenarios: 1. The argument is not used (any -M must still be executed therein). 2. The argument is used as an operand of # (any -M must still be executed therein). 3. The argument is used as an operand of ## (any -M must still be executed and tokens must be painted as required). 4. The argument is used as a non-operand of # or ## (recursive scan) Worse, 2, 3, and 4 can all be used at the same time: #define A(x) #x x ## x x However, in 2 and 3, you cannot encounter a +M, so you can just push a +M into a "reset" stack for every -M you find and then execute the reset stack prior to processing an argument (provided the recursive scan case is done last). VC++'s macro expansion algorithm appears to be a combination of optimizations and hacks that yield correct results for simple cases, but often yields incorrect results where the input requires the actual rules to be followed. Obvious examples are the way __VA_ARGS__ is "optimized" into a single argument and the extra scan applied in the following causing it to yield 123 instead of A()--which is presumably a hack of some kind. #define A() 123 #define B() () #define C() A B() C() However, these are only the tip of the iceberg. I have a suspicion that the algorithm used is actually totally unlike what it is supposed to be. I.e. I'd guess it's a re-write-the-algorithm rather than a fix-a-few-bugs scenario.
A second issue is that quite often simple test cases are unavailable since the sort of failures exhibited by Microsoft's preprocessor occur when one is attempting to use the facilities of the Boost PP library and that is often at the cutting edge of what one can do with the preprocessor, aside from Chaos which assumes a 100% compliant preprocessor.
Edward is correct here. The problems often occur in a combinatorial fashion. This is almost certainly due to the actual algorithm operating in way very foreign to how it should, so when simple examples are used, the output appears correct except that you cannot see the algorithm state (i.e. hidesets, blue paint, etc.) which is often *not* correct. So, when things are used in combination, what appears to be correct in isolation is actually carrying forward with algorithm state that affects further processing incorrectly. These types of scenarios are inordinately difficult to find workarounds for, and any workarounds are inherently unstable (because you don't know exactly what they are "fixing"). All it really takes is some different input or different combination and things might break. Chaos has over 500 primary interface macros and over two thousand primary and secondary interface macros (i.e. SEQ_FOR_EACH is primary, SEQ_FOR_EACH_S is secondary (derivative))--not to mention the number of implementation macros (which is probably in the 10000 range, though I haven't counted). Full circle, even if it was a matter of testing permutations and combinations of primary interfaces used only once, eΓ(500 + 1, 1) is still about 3.3 x 10^1134 possible combinations. Aka, not tenable. The way boost-pp deals with this is essentially by having a complete separate implementation for MSVC which tries to do everything it can to force things to happen when they should or before it is too late--which frequently doesn't work. <rant> Since I've been doing this pp-stuff, which is well over a decade, I've seen numerous compiler and tool vendors fix or significantly improve their preprocessors--several of which have informally consulted with me. However, I have seen absolutely zero effort by MS. The stock response for a bug report is "closed as won't fix" for anything related to the preprocessor. I've heard--sometimes on this list--numerous people who represent MS say that "they'll see what they can do," but nothing ever happens. Fix your preprocessor, MS, which will probably gain you nothing financially, and stop implementing extensions to the language in lieu of implementing the language itself, and I'll actually maybe believe that you care about C++, standards, and ethics. </rant> Regards, Paul Mensonides