Re: [boost] [preprocessor] Sequences vs. All Other Data Structures

23 Apr 2012

      On Sun, 22 Apr 2012 21:51:06 -0400, Edward Diener wrote:
...
On 4/22/2012 8:43 PM, Stephan T. Lavavej wrote:
...
...
If someone could prepare *minimized* test cases demonstrating the VC
bugs that are holding you back here, I could file compiler bugs.
One of the main issues is that Microsoft follows the C89 standard when
it comes to the preprocessor. That has been superseded in both C and C++
a long time ago. So what does one say to a company that follows a very
old standard and defends the inadequacy of its preprocessor by saying
that according to that standard they are processing preprocessor symbols
correctly ? It is like a company saying that they implement a computer
language of 23 years ago and despite the fact that the language has
changed drastically since that time, they are proud to not keep up with
those changes.
It isn't that.  VC++ doesn't even implement the preprocessor correctly for 
C89.  Essentially, the macro expansion algorithm appears to be 
fundamentally broken.  One small example:

#define A() 123

#define B() ()

A B() // should expand to A() (and does)

#define C() A B()

C() // should *still* expand to A() (but instead expands to 123)

The entire methodology is borked here.  Leaving blue-paint aside (and 
argument processing aside), macro expansion should work like a stream 
editor.

+ + C ( ) + +
^

+ + C ( ) + +
  ^

+ + C ( ) + +
    ^

+ + A B ( ) + +
    ^

THIS is "rescanning and further replacement."  Rescanning and further 
replacement is *not* a recursive process that occurs "in the replacement 
list."

+ + A B ( ) + +
      ^

+ + A ( ) + +
      ^       // note: not at 'A', scanning doesn't go backward

+ + A ( ) + +
        ^

+ + A ( ) + +
          ^

+ + A ( ) + +
            ^

+ + A ( ) + +
              ^

In the above, everything to the left of the caret is output and everything 
to the right is input.

For the top level source, that output is the underlying language parser 
(or preprocess-only output).  The only time it isn't is when an argument 
to a macro--which is *actually used* at least once in the replacement list 
without being an operand of # or ##--is processed.  In that case, a 
recursive scan of the sequence of tokens making up the argument is 
necessary.  In that case, the output of the scan is redirected to a buffer 
(depending on exact implementation methodology) for future substitution 
into a copy of the replacement list.

Things get more complicated when dealing with blue paint and disabling 
contexts.  A disabling context is a range associated with a macro name 
(that exists only during a scan for macro expansion--i.e. if that scan is 
redirected to buffer for substitution, the result in the buffer does not 
contain annotations representing the disabling context).  If an identifier 
token which matches the macro name is found in the range during the scan, 
it is "painted blue."  A painted token cannot be used as the name of a 
macro in a macro invocation.  However, while the context that causes blue 
paint is transient, blue paint on a token is not.  It remains even when 
the scan output is redirected for substitution.  Note that substitution 
ultimately means that the sequence of tokens in the argument are going to 
get scanned again.  When that happens, the contexts from the first scan 
are gone, but any blue paint on particular tokens is not.

As a concrete example,

#define A(x) B(x) C(x) D(x)
#define B(x)
#define C(x) x
#define D(x) #x

A(B(1))

Where M' = painted M token, { ... } is the hideset H (which may be flags 
in the symbol table), +M is a virtual token which means H := H | M, and -M 
is a virtual token which means H := H \ M...

[begin]
A ( B ( 1 ) ) EOF
^ {}
    [begin]
    B ( 1 ) EOF
    ^ {}
        [begin]
        1 EOF
        ^ {}
        1 EOF
          ^ {}
        [end]
    +B 1 -B EOF
    ^ {}
    1 -B EOF
    ^ { B }
    1 -B EOF
      ^ { B }
    1 EOF
      ^ {}
    [end]
+A "B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF
^ {}
"B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF
^ { A }
"B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF
       ^ { A }
    [begin]
    1 EOF
    ^ { A }
    1 EOF
      ^ { A }
    [end]
"B(1)" +B 1 -B C ( 1 ) D ( 1 ) -A EOF
       ^ { A }
"B(1)" 1 -B C ( 1 ) D ( 1 ) -A EOF
       ^ { A, B }
"B(1)" 1 -B C ( 1 ) D ( 1 ) -A EOF
         ^ { A, B }
"B(1)" 1 C ( 1 ) D ( 1 ) -A EOF
         ^ { A }
"B(1)" 1 +C -C D ( 1 ) -A EOF
         ^ { A }
"B(1)" 1 -C D ( 1 ) -A EOF
         ^ { A, C }
"B(1)" 1 D ( 1 ) -A EOF
         ^ { A }
    [begin]
    1 EOF
    ^ { A }
    1 EOF
      ^ { A }
    [end]
"B(1)" 1 +D "1" A ( 1 ) B ( 1 ) -D -A EOF
         ^ { A }
"B(1)" 1 "1" A ( 1 ) B ( 1 ) -D -A EOF
         ^ { A, D }
"B(1)" 1 "1" A ( 1 ) B ( 1 ) -D -A EOF
             ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                  ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                    ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                      ^ { A, D }
    [begin]
    1 EOF
    ^ { A, D }
    1 EOF
      ^ { A, D }
    [end]
"B(1)" 1 "1" A' ( 1 ) +B 1 -B -D -A EOF
                      ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) 1 -B -D -A EOF
                      ^ { A, B, D }
"B(1)" 1 "1" A' ( 1 ) 1 -D -A EOF
                        ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) 1 -A EOF
                        ^ { A }
"B(1)" 1 "1" A' ( 1 ) 1 EOF
                        ^ {}
[end]

VC++ gets the result correct here (though it doesn't get it by the correct 
method).

The algorithm is bizarre in relation to typical code interpreters, but it 
is actually shockingly simple.  The only times that it gets tricky are 
when a -M occurs within the sequence of tokens that makes up an invocation 
which I glossed over above (because they don't occur in the above).  E.g. 
scenarios like:

MACRO -M ( ARG )

or

MACRO ( ARG, -M ARG )

In both of these cases, the -M must get executed prior to MACRO(...) being 
replaced by MACRO's replacement list (and resuming scanning).

The first of these is not very tricky--and both boost-pp and chaos-pp 
require this.  The algorithm merely has to execute the virtual token when 
finding the left parentheses.  The second is more tricky (and I don't 
think even chaos-pp requires this one to work correctly) because the 
algorithm has to *correctly* deal with possibly four different scenarios:

1. The argument is not used (any -M must still be executed therein).
2. The argument is used as an operand of # (any -M must still be executed 
therein).
3. The argument is used as an operand of ## (any -M must still be executed 
and tokens must be painted as required).
4. The argument is used as a non-operand of # or ## (recursive scan)

Worse, 2, 3, and 4 can all be used at the same time:

#define A(x) #x x ## x x

However, in 2 and 3, you cannot encounter a +M, so you can just push a +M 
into a "reset" stack for every -M you find and then execute the reset 
stack prior to processing an argument (provided the recursive scan case is 
done last).

VC++'s macro expansion algorithm appears to be a combination of 
optimizations and hacks that yield correct results for simple cases, but 
often yields incorrect results where the input requires the actual rules 
to be followed.  Obvious examples are the way __VA_ARGS__ is "optimized" 
into a single argument and the extra scan applied in the following causing 
it to yield 123 instead of A()--which is presumably a hack of some kind.

#define A() 123
#define B() ()
#define C() A B()

C()

However, these are only the tip of the iceberg.  I have a suspicion that 
the algorithm used is actually totally unlike what it is supposed to be.  
I.e. I'd guess it's a re-write-the-algorithm rather than a fix-a-few-bugs 
scenario.
...
A second issue is that quite often simple test cases are unavailable
since the sort of failures exhibited by Microsoft's preprocessor occur
when one is attempting to use the facilities of the Boost PP library and
that is often at the cutting edge of what one can do with the
preprocessor, aside from Chaos which assumes a 100% compliant
preprocessor.
Edward is correct here.  The problems often occur in a combinatorial 
fashion.  This is almost certainly due to the actual algorithm operating 
in way very foreign to how it should, so when simple examples are used, 
the output appears correct except that you cannot see the algorithm state 
(i.e. hidesets, blue paint, etc.) which is often *not* correct.  So, when 
things are used in combination, what appears to be correct in isolation is 
actually carrying forward with algorithm state that affects further 
processing incorrectly.  These types of scenarios are inordinately 
difficult to find workarounds for, and any workarounds are inherently 
unstable (because you don't know exactly what they are "fixing").  All it 
really takes is some different input or different combination and things 
might break.  Chaos has over 500 primary interface macros and over two 
thousand primary and secondary interface macros (i.e. SEQ_FOR_EACH is 
primary, SEQ_FOR_EACH_S is secondary (derivative))--not to mention the 
number of implementation macros (which is probably in the 10000 range, 
though I haven't counted).  Full circle, even if it was a matter of 
testing permutations and combinations of primary interfaces used only 
once, eΓ(500 + 1, 1) is still about 3.3 x 10^1134 possible combinations.  
Aka, not tenable.

The way boost-pp deals with this is essentially by having a complete 
separate implementation for MSVC which tries to do everything it can to 
force things to happen when they should or before it is too late--which 
frequently doesn't work.

<rant>
Since I've been doing this pp-stuff, which is well over a decade, I've 
seen numerous compiler and tool vendors fix or significantly improve their 
preprocessors--several of which have informally consulted with me.  
However, I have seen absolutely zero effort by MS.  The stock response for 
a bug report is "closed as won't fix" for anything related to the 
preprocessor.  I've heard--sometimes on this list--numerous people who 
represent MS say that "they'll see what they can do," but nothing ever 
happens.  Fix your preprocessor, MS, which will probably gain you nothing 
financially, and stop implementing extensions to the language in lieu of 
implementing the language itself, and I'll actually maybe believe that you 
care about C++, standards, and ethics.
</rant>

Regards,
Paul Mensonides