
-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Tobias Schwinger
Here are my comments on the text. Note that I use imperative for the sake of simplicity.
Okay. Thanks for your comments.
The process of macro expansion is best viewed as the process of scanning for macro expansion (rather than the process of a single macro expansion alone). When the preprocessor encounters a sequence of preprocessing tokens and whitespace separations that needs to be scanned for macro expansion, it has to perform a number of steps. These steps are examined in this document in detail.
Strike that paragraph. It uses terms not yet defined and doesn't say much more than the title (assuming it's still "how macro expansion works").
The paragraph might not be in the right spot, but what the paragraph says is important. The process of macro expansion includes more than just expanding a single macro invocation. Rather, it is the process of scanning for macro expansions, and the whole thing is defined that way.
[Conventions]
Several conventions are used in this document for the sake of simplicity. First, except in contexts where the distinction matters, this document uses the terms "token" and "preprocessing token" interchangeably. Where the distinction does matter, this document will explicitly differentiate between them by using the terms "regular token" and "preprocessing token." From a technical standpoint, the primary difference between the two token types relates to permissiveness. Preprocessing tokens are much more permissive than regular tokens.
How can tokens be permissive? Their definition is probably more permissive (if you can say that).
Yes, I'll rewrite this. Their definitions are more permissive because they don't have the semantic constraints that regular tokens do.
Ignoring classifications (e.g. keyword, identifier), all regular tokens are trivially converted to preprocessing tokens. Not all preprocessing tokens can be converted to regular tokens. For example, 0.0.0.0 is a single, valid preprocessing token, but it cannot be converted to a regular token. Thus, preprocessing tokens can be thought of as a superset of regular tokens.
Second, when this document uses the phrase "sequence of tokens" (i.e. "sequence of preprocessing tokens"), the possibility of whitespace separations is implied. In other words, "sequence of tokens" really means "sequence of tokens and whitespace separations." Both the C and C++ standards do this as well, though they don't explicitly say they are doing it. Rather, it is implied by the phases of translation and by certain parts of the text such as §16.3.2/2 of the C++ standard and §6.10.3.2/2 of the C standard.
Third, this document uses the term "scanning" to mean "scanning for macro expansion." There are other kinds of scanning (in the general sense) that are performed during the macro expansion process. For example, a replacement list is scanned for instances of formal parameters to substitute. Other kinds of scanning besides scanning for macro expansion are referred to explicitly.
Lastly, this document uses a graphical notation to illustrate various parts of the process in examples. For clarity, all whitespace separations are ignored in these diagrams. Given a sequence of tokens, the current location of the scanner is indicated with a circumflex (^). For example,
+ + +
^
A range of tokens is indicated by using a frame such as:
+ + +
|^    |
|_____|
   |
   LABEL
These frames are labeled with the meaning of the frame or the sequence of tokens bounded by the frame. An identifier token that has been painted is marked with an apostrophe (').
The reader probably has no idea what "painted" means at this point. Indicate the forward-declaration by "see below" or something like that.
I do in the very next sentence.
A' token
^
This document will later discuss what it means for an identifier token to be "painted." To avoid ambiguity with this notation, the document does not use character literals ('a'). It is important to note that the apostrophe is not literally added to the sequence of tokens. Rather, it is only a notation that indicates a property of the identifier to which it is "attached."
[Locations]
There are several points where the preprocessor must scan a sequence of tokens looking for macro invocations to expand. The most obvious of these is between preprocessing directives (subject to conditional compilation). For example,
I had to read this sentence multiple times to make sense of it...
What part was difficult to follow? It seems pretty straightforward to me (but then, I know what I'm looking for).
#include "file.h"
// ...
#define A()
// ...
#undef A
Both of the sections containing comments must be scanned for macro expansion.
The preprocessor must also scan a few other sequences of tokens. These include the expression that controls an #if or #elif directive. Subject to certain conditions, the operands of the #include directive and the #line directive are also scanned for macro expansion.
The places where a sequence of tokens is scanned for macro expansion in a source file are not really the subject of this document. As such, the remainder of this document concentrates on just sequences of tokens that must be scanned, though sometimes the #define directive is used.
^^^^^ simplify this section by using negative logic (in other words: enumerate the contexts where /no/ macro expansion is done).
!! I don't think this is a good idea. A "List O' Negatives" almost invariably forgets some things--e.g. the SFINAE "List O' Negatives".
Before continuing, however, note that an argument to a macro that contains what could be a preprocessing directive results in undefined behavior. For example,
#define MACRO(x) x
MACRO( #include "file.h" )
Indicate more clearly that this code is not OK.
The next sentence says that it is undefined behavior. I'm not sure how to make it more clear than that.
This is undefined behavior because it gives a preprocessor implementation latitude. Specifically, it allows for a preprocessor to parse an entire file into directives and sequences of tokens between directives prior to semantic evaluation. It also allows for a preprocessor to semantically evaluate the result of preprocessing in a single pass.
Put the implementation details in parentheses or use a footnote.
First, this article is about how macro expansion works, and is therefore very implementation-oriented. Second, this isn't really implementation detail so much as rationale for the undefined behavior classification for this particular case. I considered striking this entire section because the article is not about what the preprocessor does as a whole--just macro expansion.
[Scanning]
Consider the following macro definitions and the subsequent sequence of tokens to be scanned:
#define OBJECT OBJ ## ECT F()
#define F() G(G(G))
#define G(x) x
+ X OBJECT +
These definitions and the subsequent sequence of tokens will be used as a running example in describing the steps of the expansion process.
Given a sequence of tokens to be scanned, the preprocessor successively looks at each token looking for identifiers. If a given token is not an identifier, the preprocessor outputs the token and moves on to the next.
+ X OBJECT +
^
+ X OBJECT +
  ^
The term "output" is being used generally. Sometimes it means that the preprocessor is finished with the token and that the token can be handed off to the underlying language parser or output to a text stream (in the case of a compiler's preprocess-only mode). Other times, it means that it is output to a buffer that is later reused by the preprocessor for another purpose. In all cases, however, it does mean that _this_ scan for macro expansion is finished with the token and that the token results from _this_ scan.
^^^^^ shouldn't this be part of the "Conventions" section?
I'm not sure. This is the first point where it is used, and so I describe how it is being used here. I think that if I try to define it in the "Conventions" section, I'll end up having to pull a bunch of the surrounding context up with it.
If the token _is_ an identifier, the preprocessor must check to see if the identifier corresponds to the name of a macro that is currently defined. If it is not, the preprocessor outputs the token and moves on to the next. In the running example, the current token is the identifier X, which does not correspond to a macro name.
+ X OBJECT +
  ^
+ X OBJECT +
    ^
[Blue Paint]
If the current token is an identifier that refers to a macro, the preprocessor must check to see if the token is painted. If it is painted, it outputs the token and moves on to the next.
When an identifier token is painted, it means that the preprocessor will not attempt to expand it as a macro (which is why it outputs it and moves on). In other words, the token itself is flagged as disabled, and it behaves like an identifier that does not correspond to a macro. This disabled flag is commonly referred to as "blue paint," and if the disabled flag is set on a particular token, that token is called "painted." (The means by which an identifier token can become painted is described below.)
Remove redundancy in the two paragraphs above.
It might be redundant, but it is a result of the underlying double-structure that I'm trying to maintain over most of the article. When I first wrote the article (after being asked to do so by several people from Boost a few weeks ago), I wrote it so it was very algorithmic. E.g. if <predicate> then <A> else <B>. After writing it this way, I decided that it was too much like a technical specification, so I decided that I needed a concrete (if contrived) running example. Then, instead of having a pure algorithm structure, I could describe the processing of this example from start to finish. The example that I decided on was designed specifically to address most of the major points about the macro expansion process and to introduce complexity as gradually as possible. For example, the example deals with an object-like macro invocation first, then a nullary macro invocation, then a non-nullary macro invocation. Each of these adds significant complexity to the process. However, I also didn't want to lose the technical specification completely. The net result is that you get paragraphs like the above. The first is a technical "if <predicate> then <A> else <B>". The second is commentary on the predicate used in the first. All in all, this was the most difficult part. I'm trying to maintain two structures at the same time because I think that both are necessary.
I like the "behaves like an identifier that does not correspond to a macro name"-part.
Okay.
In the running example, the current token is the identifier OBJECT, which _does_ correspond to a macro name. It is not painted, however, so the preprocessor moves on to the next step.
[Disabling Contexts]
If the current token is an identifier token that corresponds to a macro name, and the token is _not_ painted, the preprocessor must check to see if a disabling context that corresponds to the macro referred to by the identifier is active. If a corresponding disabling context is active, the preprocessor paints the identifier token, outputs it, and moves on to the next token.
A "disabling context" corresponds to a specific macro and exists over a range of tokens during a single scan. If an identifier that refers to a macro is found inside a disabling context that corresponds to the same macro, it is painted.
Disabling contexts apply to macros themselves over a given geographic sequence of tokens, while blue paint applies to particular identifier tokens. The former causes the latter, and the latter is what prevents "recursion" in macro expansion. (The means by which a disabling context comes into existence is discussed below.)
In the running example, the current token is still the identifier OBJECT. It is not painted, and there is no active disabling context that would cause it to be painted. Therefore, the preprocessor moves on to the next step.
The introduction of these terms feels structurally too abrupt to me. Introduce these terms along the way, continuing with the example.
They appear at the first point where their definition must appear. The preprocessor must perform these steps at this specific time. This again is an occurrence of the double-structure. The algorithm is iterative and loops back to this same situation but where tokens might be painted or disabling contexts might be active. Thus, the article contains a degree of iteration also. I don't know how to avoid this without discarding one structure or the other.
[Object-like Macro Invocations]
If an identifier token that corresponds to a macro name is not painted, and there is no active disabling context that would cause it to be painted, the preprocessor must check to see what type of macro is being referenced--object-like or function-like. If the macro is defined as an object-like macro, which is the case with OBJECT in the running example, the identifier alone forms an invocation of the macro, and the preprocessor begins the replacement process of that invocation.
+ X OBJECT +
   |^     |
   |______|
      |
      OBJECT invocation (INV)
For an object-like macro, the first step in the replacement process is the retrieval of the replacement list of the macro from the symbol table. In the running example, the OBJECT macro's replacement list is
OBJ ## ECT F()
The preprocessor then performs any token-pasting operations in the replacement list. (Note that there is no stringizing operator in object-like macros.) The result of token-pasting in OBJECT's replacement list is
OBJECT F()
After all token-pasting has been performed, the resulting sequence of tokens is used to replace the invocation of the macro. At the same time, a disabling context that corresponds to the macro being replaced is activated. This disabling context surrounds the tokens that came from the replacement list.
+ X OBJECT F() +
   |          |
   |__________|
        |
        OBJECT disabling context (DC)
<-- explain what a disabling context is, and then what blue paint is, here
Do you mean that they should be defined here for the first time, or that they should be defined here again (but maybe with less detail)?
Finally, scanning resumes beginning with the first token that came from the replacement list (or the next token after the invocation if the replacement list was empty). In the running example, scanning resumes at OBJECT.
+ X OBJECT F() +
   |^         |
   |__________|
        |
        OBJECT DC
Note that this OBJECT identifier is a _different_ identifier token than the one that formed an invocation of the OBJECT macro. It came from the replacement list of the OBJECT macro (indirectly by way of token-pasting).
[Iterative Replacement vs. Functional Hierarchy]
Before continuing, it is important to note that a replacement list is not scanned for macro expansion prior to the replacement of an invocation. For example,
#define A B
#define B 123
A
It is tempting to believe that the top-level scan calls A, A calls B, B returns 123 to A, and A returns 123 to the top-level. In other words, it is tempting to believe that the invocations form a functional hierarchy similar to the procedural programming model that is at the root of most programming paradigms. That is _not_ the way that macro expansion works, however. Instead, the top-level scan replaces A with B and resumes. It then replaces B with 123 and resumes.
This iterative replacement model may seem foreign, but it has a strong similarity to inlined functions. For example,
inline int g() { return 123; }
inline int f() { return g(); }
const int x = f();
Instead of calling f, the call of f is effectively replaced with the body of f. The call of g contained therein, is effectively replaced with the body of g. Inlined functions are not quite that simple, of course. They operate under the rules of lexical scope, for example, which macro expansion does not. Nevertheless, inlined functions have a similarity to the iterative replacement model of macro expansion, and one can view macro expansion as a sort of "continual inlining."
[Resumption of Scanning]
After the OBJECT invocation has been replaced in the running example, scanning resumes on the first token that came from the replacement list.
+ X OBJECT F() +
   |^         |
   |__________|
        |
        OBJECT DC
This time, however, when the preprocessor looks at the OBJECT token, a disabling context that corresponds to OBJECT is active. Because of that, this particular OBJECT token is painted, and the preprocessor moves on to the next token.
+ X OBJECT' F() +
   |        ^  |
   |___________|
         |
         OBJECT DC
[Function-like Macro Invocations]
If an identifier token that corresponds to a macro name is not painted, and there is no active disabling context that would cause it to be painted, the preprocessor must check to see what type of macro is being referenced--object-like or function-like. If the macro is function-like, things are a little more complex than with object-like macros. The process is similar, but several more steps are performed by the preprocessor.
After finding that an identifier token refers to a function-like macro, the preprocessor must look ahead to the next token in the sequence of tokens to see if it is a left-parenthesis. If it is not, the preprocessor outputs the identifier and moves on to the next token (i.e. the one that was not a left-parenthesis).
If the next token after the identifier _is_ a left-parenthesis, the preprocessor must check the arity of the macro referenced by the identifier. In the running example, the preprocessor has found the F identifier and the ( token immediately following it (whitespace between the identifier and the left-parenthesis is ignored when looking ahead to the next token). The preprocessor sees that F is defined as a nullary macro--that is, it takes no arguments.
[Nullary Invocations]
If scanning finds an identifier (followed by a left-parenthesis) that refers to a nullary function-like macro (as is the case with F in the running example), the preprocessor must find the corresponding right-parenthesis. The preprocessor _must_ find the right-parenthesis. If it cannot (e.g. it finds EOF instead), the result is an "incomplete invocation" error.
While it does so, it must ensure that there are no tokens between the left-parenthesis and the right-parenthesis. Specifically, between the two parentheses, there must be either one or more whitespace separations or nothing at all. If any tokens are present between them, the result is a "too many arguments" error.
The identifier and the left- and right-parentheses form an invocation of the macro, and the preprocessor begins the replacement process of that invocation.
+ X OBJECT' F() +
   |        |^ ||
   |        |__||
   |          | |
   |      F INV |
   |____________|
          |
          OBJECT DC
As with an object-like macro, the first step in the replacement process is the retrieval of the replacement list of the macro from the symbol table. In the running example, the F macro's replacement list is
G(G(G))
The preprocessor then performs any token-pasting operations in the replacement list. (Note that a nullary macro is a function-like macro, so the stringizing operator exists. However, the stringizing operator must be applied to an instance of a formal parameter in the replacement list. A
<-- add missing "nullary"
Thanks. There are a few other typos as well.
nullary function-like macro has no formal parameters, and therefore any use of the stringizing operator is automatically an error.) The result of token-pasting in F's replacement list is
It's not clear to me why the stringizing operator leads to an error rather than a '#' character. Probably too much of a sidenote, anyway.
I don't know the rationale for why it is the way it is. It simply is that way. An occurrence of # in the replacement list of a macro definition is the stringizing operator, yet the stringizing operator has the semantic constraint that it must be applied to an instance of a formal parameter. I mention it because it is an automatic error (at the time of definition) and thus doesn't have to be dealt with here. I.e. I'm making it clear that I'm not skipping a step that otherwise has to be performed for a function-like macro.
G(G(G))
In this case, there are no token-pasting operations to perform, so the result is a no-op.
The sequence of tokens resulting from token-pasting is used to replace the invocation of the macro. At the same time, a disabling context that corresponds to the macro being replaced is activated. This disabling context surrounds the tokens that came from the replacement list.
+ X OBJECT' G(G(G)) +
   |       |       ||
   |       |_______||
   |           |    |
   |           F DC |
   |________________|
          |
          OBJECT DC
Finally, scanning resumes beginning with the first token that came from the replacement list (or the next token after the invocation if the replacement list was empty).
+ X OBJECT' G(G(G)) +
   |       |^      ||
   |       |_______||
   |           |    |
   |           F DC |
   |________________|
          |
          OBJECT DC
Nullary function-like macro invocations are nearly identical to object-like macro invocations. The primary difference is that an invocation of a function-like macro requires multiple tokens.
[Interleaved Invocations]
It is important to note that disabling contexts only exist during a _single_ scan. Moreover, when scanning passes the end of a disabling context, that disabling context no longer exists. In other words, the output of a scan results only in tokens and whitespace separations. Some of those tokens might be painted (and they remain painted), but disabling contexts are not part of the result of scanning. (If they were, there would be no need for blue paint.)
Misses (at least) a reference to 16.3.4-1 (the wording "with the remaining tokens of the source" (or so) is quite nice there, so consider using something similar).
I'm not so sure. The reason that I even need to write this article is because the standard uses terminology (like "rescanning") in overly general ways. Likewise, the standard doesn't make it clear that it is describing the process of scanning for macro expansion as opposed to just macro expansion. That section of the standard has confused many people--implementors included, so I'm intentionally trying to avoid using the terminology that the standard uses because it is misleading to some people. "Rescanning" is a great example of that. It doesn't mean re-(scanning the sequence of tokens for more macros to replace), it means re-scanning the sequence of tokens again, where this scan is scanning for more macros to replace. There are other types of scans (scanning for concatenations to perform, formal parameter instances to substitute, removal of placemarkers in C99) that happen to this particular sequence of tokens. Before this point, this particular sequence of tokens was never scanned for macro replacement with the exception of those tokens that came from an argument to the macro (without being an operand...). So, the terms "scanning" and "rescanning" are being used in a very general way despite the latter being used as a heading. However, after you already know what the process does (because of an article like this one, for example), the standard makes a lot more sense. (Another thing that I'm not doing is bringing up the original pseudo-code algorithm whose behavior was the basis for the text in the standard.)
I believe I wouldn't really understand what you are talking about here without knowing that part of the standard. "A single scan" -- the concept of rescanning was introduced too peripherally to make much sense to someone unfamiliar with the topic.
This all comes back to the beginning--the process is scanning a sequence of tokens for macros to expand (i.e. the first paragraph that you said I should strike). This entire process is recursively applied to arguments to macros (without being an operand...) and thus this entire scan for macros to expand can be applied to the same sequence of tokens more than once. It is vitally important that disabling contexts don't continue to exist beyond the scan in which they were created, but that blue paint does. As I mentioned, there would be no need for blue paint--what the standard calls "non-replaced macro name preprocessing tokens"--if the disabling contexts weren't transient. I'm pretty sure that I don't use the term "rescanning" anywhere in the whole article (yep, I checked, and I don't).
In the diagrams used in this document, the tokens that have been output by a scan are left in the diagram to provide context for the reader, such as:
+ + +
    ^
Here, scanning is looking at the third + token. This particular scan is done with the two preceding tokens. Thus, if a disabling context existed around these tokens...
+ + +
|^    |
|_____|
   |
   DC
...the disabling context effectively "closes" as the scan moves on from token to token because the disabling context is not part of the result of the scan.
+ + +
 |^  |
 |___|
   |
   DC
+ + +
   |^|
   |_|
    |
    DC
+ + +
      ^
Because function-like macro invocations require multiple tokens, it is possible for such an invocation to span the end of a replacement list. For example,
#define A() B
#define B() 123
A()()
The A() invocation is replaced with B, and a B invocation is formed using the tokens that follow the A invocation.
A() ()
|^  |
|___|
  |
  A INV
 B ()
||^|  |
||_|__|
|  |
|  B INV
|
A DC
Thus, the invocation of B is geographically _interleaved_ with the disabling context corresponding to A. When the invocation of B gets replaced by 123 and scanning resumes at 123, the disabling context corresponding to A _no longer exists_.
In order for an "outer" disabling context to remain active, invocations must be nested (rather than interleaved) within that disabling context. For example,
#define C() D()
#define D() 123
C()
Here, the D invocation will be nested within the disabling context corresponding to C.
C()
|^ |
|___|
  |
  C INV
D()
||^ ||
||___||
|  |  |
|  D INV
|_____|
   |
   C DC
123
||^ ||
||___||
|  |  |
|  D DC
|_____|
   |
   C DC
Disabling contexts (or lack thereof) have no effect on the result of either of these two examples. However, that isn't always the case.
#define A() B
#define B() A
A()()()
Here, the result is ultimately an unpainted B token--because no A or B token ever becomes painted.
A() ()()
|^  |
|___|
  |
  A INV
 B () ()
||^|  |
||_|__|
|  |
|  B INV
|
A DC
 A ()
||^|  |
||_|__|
|  |
|  A INV
|
B DC
B
|^|
|_|
 |
 A DC
Things are different in the following example:
#define C() D()
#define D() C()
C()
Here, the result is ultimately C'().
C()
|^ |
|___|
  |
  C INV
D()
||^ ||
||___||
|  |  |
|  D INV
|_____|
   |
   C DC
C()
||^ ||
||___||
|  |  |
|  D DC
|_____|
   |
   C DC
C'()
|| ^ ||
||____||
|  |   |
|  D DC
|______|
   |
   C DC
Note that interleaved invocations do _not_ allow for infinite expansion. More tokens must physically be present after the replacement list to complete an interleaved invocation, and this sequence of tokens is ultimately limited to the finite sequence of tokens contained in the source file.
The above part seems very good to me.
Okay.
[Non-nullary Invocations]
If scanning finds an identifier (followed by a left-parenthesis) that refers to a non-nullary function-like macro (as is the case with G in the running example) the preprocessor must find the corresponding right-parenthesis. The preprocessor _must_ find the right-parenthesis. If it cannot (e.g. it finds EOF instead), the result is an "incomplete invocation" error.
While it does so, it must delineate the actual arguments between the left- and right-parentheses. This delineation is a distinct step that separates the arguments from one another before any other processing of them is done. Arguments are separated by comma tokens (,), but commas between matched pairs of left- and right-parentheses do not separate arguments, and the right-parenthesis of such matched pairs is not used as the right-parenthesis that terminates the argument list. For example, in
MACRO(a, (b, c), d)
the argument list to MACRO is delineated into three arguments.
After the arguments have been delineated, the preprocessor compares the number of actual arguments to the arity of the macro. If there are more or fewer, then the result is a "too many arguments" or "too few arguments" error. (If the macro is variadic, the number of arguments must be at least one greater than the number of named formal parameters. This implies that a macro defined as:
#define V(x, y, ...) // ...
Mention that variadic macros are not a C++ feature.
Yes, good idea. This was an oversight that I should have caught. Note, however, that variadic macros and placemarkers are all but guaranteed to be part of the next C++ standard.
has a minimum arity of three--not two.)
In C++, if any argument is empty or contains only whitespace separations, the behavior is undefined. In C, an empty argument is allowed, but gets special treatment. (That special treatment is described below.)
It requires at least C99, right? If so, say it (it's likely there are C compilers that don't support that version of the language).
As far as I am concerned, the 1999 C standard defines what C is until it is replaced by a newer standard. Likewise, the 1998 standard defines what C++ is until it is replaced by a newer standard. I.e. an unqualified C implies C99, and unqualified C++ implies C++98. If I wished to reference C90, I'd say C90. Luckily, I don't wish to reference C90 because I don't want to maintain an article that attempts to be backward compatible with all revisions of a language. This is precisely why I should have a note above that variadic macros are not part of C++, BTW, even though I know they will be part of C++0x. Furthermore, deficiencies of compilers (like not implementing the language as it is currently defined or not doing what they should during this process) is not a subject of this article. OTOH, it wouldn't hurt to mention in the "Conventions" section that, at the time of this writing, C is C99 and C++ is C++98.
Assuming that the number of actual arguments is correct, the identifier, the left- and right-parentheses, and all of the tokens between them form an invocation of the macro, and the preprocessor begins the replacement process of that invocation.
+ X OBJECT' G(G(G)) +
   |       ||^      |||
   |       ||_______|||
   |       |    |    ||
   |       |  G INV  ||
   |       |_________||
   |            |     |
   |            F DC  |
   |__________________|
          |
          OBJECT DC
Here, the preprocessor has delineated the arguments to G, which results in the correct number of arguments for G (which is one, because G is unary). That argument is
G(G)
As with both object-like and nullary function-like invocations, the first step in the replacement process of a function-like macro invocation is the retrieval of the replacement list of the macro from the symbol table. In the running example, the G macro's replacement list is
x
which is an instance of the formal parameter.
After retrieval of the replacement list, the preprocessor replaces all instances of formal parameters in the replacement list with actual arguments. However, it does this in a variety of ways depending on how an instance of the formal parameter is used in the replacement list. If a particular instance of a formal parameter in the replacement list is an operand of either the token-pasting operator or the stringizing operator, the actual argument is substituted in place of the formal parameter instance. (In C99, if the argument is empty or contains only whitespace separations, a placemarker is substituted instead.) If a particular instance of a formal parameter is _not_ an operand of either the token-pasting operator or the stringizing operator, the actual argument is scanned for macro expansion, and the result is substituted in place of the formal parameter instance. (The scanning of an argument is the only true instance of recursion in the macro expansion process.) Though it is scanned independently, any disabling contexts still exist.
Note that an argument is scanned for macro expansion if and _only_ if at least one instance of a corresponding formal parameter exists in the replacement list without being an operand of one of the operators. Thus, in
#define EAT(x)
#define BINARY(a, b) // ...
EAT( BINARY() )
no error should occur--even though an error would occur if the argument to EAT were scanned for macro expansion.
After all instances of formal parameters in the replacement list have been substituted, the preprocessor performs all token-pasting and stringizing operations in the resulting sequence of tokens. Note that the precedence between the token-pasting and stringizing operators is unspecified, as is the associativity of the token-pasting operator. However, the operands of the token-pasting operator are single tokens (those immediately to its left and right), while the operand of the stringizing operator is the entire argument. (Recall that it must be applied to an instance of a formal parameter.) Thus, even though the order of evaluation between token-pasting and stringizing is unspecified, one can think of a use of the stringizing operator as being replaced with a third, "stringized" version of the actual argument. (Otherwise, the preprocessor would have to mark the extent of the substituted argument in some way.) Of course, because the order of evaluation is unspecified, code should avoid relying on any particular order.
It is important to note that, even though the token-pasting operator operates on exactly two tokens (rather than entire arguments), it does affect entire arguments in the sense that the token-pasting operator prevents the entire argument from being scanned for macro expansion prior to substitution.
In C99, placemarkers are used as a temporary pseudo-token so that token-pasting and stringizing work "correctly." For example,
#define M(x) 1 x ## 3
M()
Here, the argument for x is empty, so a placemarker is substituted in place of the x in the replacement list.
1 <placemarker> ## 3
This causes the token-pasting operator to operate on 3 and the placemarker (instead of 3 and 1). Token-pasting of a token to a placemarker results in the token. Token-pasting of two placemarkers results in a placemarker. Stringizing of a placemarker results in the empty string:
#define STR(x) #x
STR() // ""
After all token-pasting and stringizing operations have been performed, all remaining placemarkers are removed, and the resulting sequence of tokens is used to replace the invocation of the macro. At the same time, a disabling context that corresponds to the macro being replaced is activated. This disabling context surrounds the tokens that came from the replacement list.
In the running example, the x formal parameter instance in G's replacement list is not an operand of either the token-pasting or stringizing operator. Therefore, the actual argument corresponding to x is scanned for macro expansion separately, but within any active disabling contexts. In this example, during the recursive scan of G's argument, another invocation of G is found, and its argument is recursively scanned as well.
+ X OBJECT' G(G(G)) +
            ^
            |_______|
                |
              G INV
          |_________|
               |
             F DC
 |__________________|
          |
      OBJECT DC
// recursive scan of argument #1 to G:
G(G)
|^  |
|____|
  |
G INV
_ _ ______ _ _
      |
    F DC
_ _ ______ _ _
      |
  OBJECT DC
// recursive scan of argument #1 to G:
G
^
_ _ _ _ _
    |
  F DC
_ _ _ _ _
    |
OBJECT DC
G
 ^
_ _ _ _ _
    |
  F DC
_ _ _ _ _
    |
OBJECT DC
// recursive scan results in: G
 G
|^|
|_|
 |
G DC
_ _ ___ _ _
     |
   F DC
_ _ ___ _ _
     |
 OBJECT DC
 G'
   ^
_ _ ___ _ _
     |
   F DC
_ _ ___ _ _
     |
 OBJECT DC
// recursive scan results in: G'
+ X OBJECT' G' +
            ^
            |__|
             |
            G DC
          |____|
             |
           F DC
 |_____________|
        |
    OBJECT DC
+ X OBJECT' G' +
            ^
+ X OBJECT' G' +
               ^
Thus, the running example finally results in
+ X OBJECT' G' +
Note that the disabling context that is created when a macro invocation is replaced is not created until _after_ all arguments that need to be scanned for macro expansion have been. This is why both G's expand in the example.
You should repeat the "painting rules" here. You already said that paint applies to single tokens but it's a good idea to more explicitly point out to the reader that no parentheses are needed for G to end up painted.
I thought that I already did that (it is definitely worth noting). Perhaps a "recall that..." back-reference?
[Arguments to Interleaved Invocations]
Because arguments not used as operands of token-pasting or stringizing are scanned separately but within any active disabling contexts, the possibility exists for an argument (or part of an argument) to be scanned in inside a disabling
An "in" too much?
Yes. I found a few mispeligs too.
context that no longer exists after the macro invocation has been replaced. Consider,
#define A(m) m( B(f)
#define B(x) A(x)
#define C(x) [ x ]
A(C) ) *
The following diagrams the expansion of this sequence of tokens:
A(C) ) *
|^ |
|____|
  |
A INV
// recursive scan of argument #1 to A results in:
C( B(f) ) *
||^      | |
||_______|_|
 |    |
 |  C INV
 |________|
     |
   A DC
// recursive scan of argument #1 to C:
B(f)
|^ ||
|____||
  |
B INV
_ _ ______|
     |
   A DC
// recursive scan of argument #1 to B results in: f
A(f)
|^ ||
|____||
  |
 B DC
_ _ ______|
     |
   A DC
A'(f)
| ^ ||
|_____||
   |
  B DC
_ _ _______|
      |
    A DC
// recursive scan results in: A'(f)
[ A'(f) ] *
|^      |
|_________|
     |
   C DC
The result of scanning for expansion is
[ A'(f) ] *
Note that the argument to C was scanned inside the disabling context corresponding to A, but that disabling context is no longer active when scanning resumes after the replacement of C's invocation. (Scenarios such as this one are tricky, and preprocessors rarely handle this kind of thing correctly.)
Point out explicitly that A remains painted.
Okay.
[Virtual Tokens]
In the implementation of a scanning routine, the single most design-affecting part of this whole process is the handling of the disabling contexts. This section discusses a possible way in which they can be implemented. Recall that disabling contexts can be viewed as closing contexts because they are only active during a single scan and because scanning never moves backwards. Because of this property, disabling contexts can be implemented using flags in the symbol table and virtual tokens. Instead of maintaining expensive data structures similar to the above diagrams, when a macro invocation is replaced, the preprocessor can flag the macro in the symbol table as "disabled" and insert a virtual token immediately following the sequence of tokens used to replace an invocation. When scanned, this virtual token clears the flag in the symbol table and is removed. Using @A to indicate a virtual token that clears the disabled flag on A in the symbol table...
#define A() A()
A() A() *
|^ |
|___|
  |
A INV
// set disabled flag on A in symbol table
A() @A A() *
^
A'() @A A() *
     ^
// clear disabled flag on A
A'() A() *
     |^ |
     |___|
       |
     A INV
// set disabled flag on A
A'() A() @A *
     ^
A'() A'() @A *
          ^
// clear disabled flag on A
A'() A'() *
          ^
This implementation strategy is fairly straightforward (and it produces the correct results). It does, however, get a little more complex when parameters are involved. For any given formal parameter, how it is handled during substitution is a property of the formal parameter itself. If the parameter is unused, no variations of the parameter are required. If the parameter is used without being an operand of the token-pasting or stringizing operator, a scanned variation is required. If the parameter is used as an operand of either of the two operators, the actual argument itself is required. These last two variations are not mutually exclusive. For a given formal parameter, both may be required, as in
#define F(x) #x x
Which variations are required for a particular formal parameter is known at the point of definition of the macro.
When a macro invocation is expanded, the preprocessor processes each argument in the order in which it is passed (rather than the order in which the corresponding parameters appear in the replacement list). This ensures that a virtual token that exists inside one argument is processed at the correct point.
If a formal parameter is not used in the replacement list of a macro, the argument will be discarded. However, before it is discarded, the preprocessor must execute any virtual tokens that it contains. For example,
#define EAT(x) x #define M() EAT(1
M() 2)
|^ |
|___|
  |
M INV
// set disabled flag on M
EAT(1 @M 2)
|^        |
|___________|
      |
  EAT INV
Here, even though the argument to EAT is discarded, the virtual token @M must be executed. Otherwise, the disabled flag in the symbol table will never be cleared.
If only a scanned variation of a formal parameter is required, the preprocessor can simply recursively scan the actual argument. That recursive scan will automatically handle the virtual tokens correctly.
If only the actual argument itself is required, the preprocessor must iterate through the tokens contained in the argument to paint tokens and execute virtual tokens. It must paint all identifier tokens that correspond to macros that are flagged in the symbol table as disabled and execute all virtual tokens. For example, if the argument to a macro is
A @A A
and the actual argument itself is required for argument substitution, the sequence of tokens used for substitution must be
A' A
If both a scanned variation and an actual argument variation are required, the preprocessor has to handle both of the above scenarios. This situation is a little more complex, because it has to handle both scenarios correctly with regard to painting and virtual tokens. The simplest strategy is to do the actual argument case first, but to keep track of all virtual tokens executed therein. After the actual argument variation is generated, the preprocessor can undo all of the state changes caused by the execution of those virtual tokens. Then, the scanned variation can be generated in the same way as above--simply by recursively scanning it for macro expansion. It is important that the actual argument variation is handled first, because during the recursive scanning of the scanned variation, any number of disabled flags could be set and later cleared by virtual tokens. At the start of processing either type of argument, the state must be the same for both cases, and because virtual tokens may be generated and executed during the recursive scan, it would be very difficult to keep track of which ones originally existed in the argument.
Interesting! Is it implemented somewhere?
I implemented it in pseudo-code prior to writing the article. As I mentioned above, I wrote the first draft of the article according to the algorithm, and I wanted a more concrete pseudo-code algorithm for me to reference when "translating" it into text. BTW, the pseudo-code algorithm that I wrote was almost a complete implementation (including pedantically correct handling of whitespace). It wasn't abstract at all. I call it pseudo-code because it relies on a bunch of infrastructure that isn't actually there (such as the lexer, a slew of functions like 'is_defined', etc.). There are a variety of things that would have to change in converting the pseudo-code algorithm to real code (for example, there are a variety of opportunities to reduce dynamic allocation that real code would need to take).
BTW, I'm done with the comments so far.
I really do appreciate your viewpoint even if it might seem that I'm arguing. Hopefully my responses will give you an idea of what I was thinking when I was writing it and what goals I was trying to achieve. I think a lot of your comments about the structure either directly relate to or are fallout from trying to have both technical specification + commentary and following the running example.
I believe there is more that can be done to enhance the structure. Anyway, all in all it reads quite nice.
Thanks. More than anything else I want to make sure that the result is understandable and useful. I don't expect it to be perfect. I'm not a writer (and I don't like writing)--I'm just writing it to help others (and, indirectly, to help myself). So, I want to do a good enough job that it will be useful and clear enough to be understandable. IOW, my aim is not to produce the ideal example of technical writing, but to get the concepts across.
I hope it's of some use.
Definitely. Thanks again. Regards, Paul Mensonides