
On Sep 1, 2007, at 2:14 PM, David Abrahams wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of chars
You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this. Consider three possible design points:

1. Templatize the entire lexer and preprocessor on the input encoding.
2. Translate UTF-32 (and every other encoding) to UTF-8 in a prepass over the buffer, then lex the resulting buffer as UTF-8.
3. Make the low-level "getchar" routine in the lexer do a dynamic switch on the input encoding type, translating the input encoding into the internal encoding on the fly.

These three design points all have strengths and weaknesses. For example:

#1 is the best solution if your compiler is statically known to care about only a single encoding: the single instantiation is obviously going to be performance-optimal for that encoding, and the code size/compile time will be comparable to a non-templated implementation. However, if it has to handle multiple encodings (as is typical for most real compilers), it requires one instantiation of a large amount of code for each encoding type. This is a serious compile-time (of the lexer) and code-size problem.

#2 is the best solution if your compiler almost always reads ASCII or UTF-8 but needs to be compatible with a wide range of encodings. It features low compile time (of the lexer) and is optimal for UTF-8 and compatible encodings. The downside is that non-compatible encodings require an expensive prepass over their code, which can be a significant performance problem.

#3 is the best solution for a compiler that reads a wide variety of encodings with no common case. It features low compile time, but obviously has worse performance than #1. One interesting and often under-appreciated aspect is that common processors will perfectly predict the encoding-related branches in the inner scanning loop, so the performance impact is much lower than might be expected (see the sketch below).

My point isn't to tell you what I think is the optimal solution - I'm just trying to point out that a very important part of software engineering is choosing the right design point for a problem. I love C++ as a language because it gives me many tools to choose from to solve a specific problem such as this. Compiler design in particular doesn't have any "magic": it's all just a series of design decisions based on various engineering tradeoffs. Being aware of and carefully considering all of the options is the best way to make a high-quality product, IMO.
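To make #3 concrete, here is a minimal sketch of such a dynamically-dispatched "getchar" (hypothetical names, not clang's actual code; full UTF-8 decoding and error handling are elided):

  #include <cstdint>
  #include <cstring>

  enum class Encoding { UTF8, UTF32LE /* ... */ };

  struct InputCursor {
    const unsigned char *Cur;
    Encoding Enc;

    // Return the next code point, translating from the input encoding
    // to the internal encoding on the fly.
    std::uint32_t getChar() {
      // The encoding never changes within one buffer, so branch
      // predictors handle this switch almost perfectly in the inner
      // scanning loop.
      switch (Enc) {
      case Encoding::UTF8:
        // ASCII fast path; a real decoder would also handle multi-byte
        // UTF-8 sequences here.
        return *Cur++;
      case Encoding::UTF32LE: {
        std::uint32_t CP;
        std::memcpy(&CP, Cur, sizeof(CP));
        Cur += sizeof(CP);
        return CP;
      }
      }
      return 0; // Unreachable.
    }
  };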
for example, and efficient buffer management (at least in our context) means that the input to the lexer isn't useful as an iterator interface.
Well, the kind of input sequence is exactly one thing I would templatize.
To what benefit? In practice, clang requires its input to come from a nul-terminated memory buffer (yes, we do correctly handle embedded nuls in the input buffer as whitespace). Here are the pros and cons:

Pros: clang is designed for what we perceive to be the common case. In particular, mmap'ing in files almost always implicitly nul-terminates the buffer (if a file is not an even multiple of the page size, most major OSes nul-fill to the end of the page), so we get this invariant for free in most cases. Memory buffers and many other sources are also easy to handle in this scheme. Further, knowing that we have a sequential memory buffer as input makes various optimizations trivial: for example, our block comment skipper is vectorized on hosts that support SSE or Altivec. Having the nul terminator at the end of the file means that the lexer doesn't have to check for an "end of buffer" condition in *many* highly performance-sensitive lexing loops (e.g. lexing identifiers, which cannot contain a nul).

Cons: for files that are exactly a multiple of the system page size, sometimes you have to do a dynamic allocation and copy the buffer to nul-terminate the input. Likewise, if you want to support arbitrary input iterators, then you have to copy the input range into a memory buffer before you start.

Personally, I can't think of a situation where lexing C code from an arbitrary input iterator range would be useful. I'll assume you can, however: in this case, copying the input into a buffer before you start seems quite likely to give *better* performance than a fully parameterized lexer would. By restricting the lexer to input iterators, you force it to check for end of stream after it reads *every* character, you effectively prevent vectorization and other optimizations, and you are more at the mercy of the compiler optimizer to produce good code (a sketch of the difference follows below).

-Chris
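To make the end-of-stream cost concrete, here is a minimal sketch (hypothetical code, not clang's actual lexer) of an identifier-lexing loop over a nul-terminated buffer versus one over an iterator range:

  #include <cctype>

  // With a guaranteed nul terminator, no end-of-buffer check is needed:
  // a nul can never appear inside an identifier, so it terminates the
  // loop like any other non-identifier character.
  const char *lexIdentifier(const char *CurPtr) {
    while (std::isalnum(static_cast<unsigned char>(*CurPtr)) ||
           *CurPtr == '_')
      ++CurPtr; // One load and one test per character.
    return CurPtr;
  }

  // The iterator-range version must test for end-of-stream on every
  // character, adding a second branch to the hot loop and hindering
  // vectorization.
  template <typename Iter>
  Iter lexIdentifier(Iter Cur, Iter End) {
    while (Cur != End &&
           (std::isalnum(static_cast<unsigned char>(*Cur)) || *Cur == '_'))
      ++Cur;
    return Cur;
  }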