
On Sep 1, 2007, at 2:14 PM, David Abrahams wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of chars
You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this. Consider three possible design points:

1. Templatize the entire lexer and preprocessor on the input encoding.
2. Translate UTF-32 (and every other encoding) to UTF-8 in a prepass over the buffer, then lex the resulting buffer as UTF-8.
3. Make the low-level "getchar" routine in the lexer do a dynamic switch on the input encoding type, translating the input encoding into the internal encoding on the fly.

These three design points all have strengths and weaknesses. For example:

#1 is the best solution if your compiler is statically known to care about only a single encoding: the single instantiation is obviously going to be performance-optimal for that encoding, and the code size/compile time will be comparable to a non-templated implementation. However, if it has to handle multiple encodings (as is typical for most real compilers), it requires one instantiation of a large amount of code for each encoding type. This is a serious compile-time (of the lexer) and code-size problem.

#2 is the best solution if your compiler almost always reads ASCII or UTF-8 but needs to be compatible with a wide range of encodings. It features low compile time (of the lexer) and is optimal for UTF-8 and compatible encodings. The downside is that non-compatible encodings require an expensive prepass over their code, which can be a significant performance problem.

#3 is the best solution for a compiler that reads a wide variety of encodings with no common case. It features low compile time, but obviously has worse performance than #1. One interesting and often under-appreciated aspect is that common processors will perfectly predict the encoding-related branches in the inner scanning loop, so the performance impact is much lower than might be expected (see the sketch below).

My point isn't to tell you what I think is the optimal solution - I'm just trying to point out that a very important part of software engineering is choosing the right design point for a problem. I love C++ as a language because it gives me many tools to choose from to solve a specific problem such as this. Compiler design in particular doesn't have any "magic": it's all just a series of design decisions based on various engineering tradeoffs. Being aware of and carefully considering all of the options is the best way to make a high-quality product, IMO.
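To make #3 concrete, here is a minimal sketch of such a dynamically-dispatched "getchar" (hypothetical names, not clang's actual code; full UTF-8 decoding and error handling are elided):

  #include <cstdint>
  #include <cstring>

  enum class Encoding { UTF8, UTF32LE /* ... */ };

  struct InputCursor {
    const unsigned char *Cur;
    Encoding Enc;

    // Return the next code point, translating from the input encoding
    // to the internal encoding on the fly.
    std::uint32_t getChar() {
      // The encoding never changes within one buffer, so branch
      // predictors handle this switch almost perfectly in the inner
      // scanning loop.
      switch (Enc) {
      case Encoding::UTF8:
        // ASCII fast path; a real decoder would also handle multi-byte
        // UTF-8 sequences here.
        return *Cur++;
      case Encoding::UTF32LE: {
        std::uint32_t CP;
        std::memcpy(&CP, Cur, sizeof(CP));
        Cur += sizeof(CP);
        return CP;
      }
      }
      return 0; // Unreachable.
    }
  };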
for example, and efficient buffer management (at least in our context) means that the input to the lexer isn't useful as an iterator interface.
Well, the kind of input sequence is exactly one thing I would templatize.
To what benefit? In practice, clang requires its input to come from a nul-terminated memory buffer (yes, we do correctly handle embedded nuls in the input buffer as whitespace). Here are the pros and cons:

Pros: clang is designed for what we perceive to be the common case. In particular, mmap'ing in files almost always implicitly nul-terminates the buffer (if a file is not an even multiple of the page size, most major OSes nul-fill to the end of the page), so we get this invariant for free in most cases. Memory buffers and many other sources are also easy to handle in this scheme. Further, knowing that we have a sequential memory buffer as input makes various optimizations trivial: for example, our block comment skipper is vectorized on hosts that support SSE or Altivec. Having the nul terminator at the end of the file means that the lexer doesn't have to check for an "end of buffer" condition in *many* highly performance-sensitive lexing loops (e.g. lexing identifiers, which cannot contain a nul).

Cons: for files that are exactly a multiple of the system page size, sometimes you have to do a dynamic allocation and copy the buffer to nul-terminate the input. Likewise, if you want to support arbitrary input iterators, then you have to copy the input range into a memory buffer before you start.

Personally, I can't think of a situation where lexing C code from an arbitrary input iterator range would be useful. I'll assume you can, however: in this case, copying the input into a buffer before you start seems quite likely to give *better* performance than a fully parameterized lexer would. By restricting the lexer to input iterators, you force it to check for end of stream after it reads *every* character, you effectively prevent vectorization and other optimizations, and you are more at the mercy of the compiler optimizer to produce good code (a sketch of the difference follows below).

-Chris
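To make the end-of-stream cost concrete, here is a minimal sketch (hypothetical code, not clang's actual lexer) of an identifier-lexing loop over a nul-terminated buffer versus one over an iterator range:

  #include <cctype>

  // With a guaranteed nul terminator, no end-of-buffer check is needed:
  // a nul can never appear inside an identifier, so it terminates the
  // loop like any other non-identifier character.
  const char *lexIdentifier(const char *CurPtr) {
    while (std::isalnum(static_cast<unsigned char>(*CurPtr)) ||
           *CurPtr == '_')
      ++CurPtr; // One load and one test per character.
    return CurPtr;
  }

  // The iterator-range version must test for end-of-stream on every
  // character, adding a second branch to the hot loop and hindering
  // vectorization.
  template <typename Iter>
  Iter lexIdentifier(Iter Cur, Iter End) {
    while (Cur != End &&
           (std::isalnum(static_cast<unsigned char>(*Cur)) || *Cur == '_'))
      ++Cur;
    return Cur;
  }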