[Wave] Does it do phases?

Wave is our C++ preprocessor, but preprocessing is the third phase of translating a file (see section 2.1 in the standard). I have a gut feeling that all the compilers out there mush the first three phases together when parsing a file. Glancing over the Wave docs gives me the same impression about it. Are either of these feelings accurate (this requires a separate answer for each parser)? If the answer for Wave is "yes", could we separate them, at least as an option? I feel this is important so we can gain a full understanding of each phase. It may be more complicated[1], and most likely slower, but it could represent a clean implementation. (BTW, which phases does Wave act like?)

The first two[2] phases are:

1. Native characters that match basic source characters are converted as such (including line breaks). Trigraphs are expanded to basic source characters[3]. Other characters are turned into internal Unicode expansions (i.e. they act like "\uXXXX" or "\Uxxxxxxxx"[4]).

2. Each backslash-newline soft line-break combination is collapsed, folding multiple native lines into one logical line. We should spit out an error if the folding creates Unicode escapes. For non-empty files, we need to spit out an error if the last line does not end with a hard line break; ending with either a non-newline character or a backslash-newline combination is forbidden.

[1] Our "Wave-1" would convert the original text (iterators) into phase-1 tokens. Our "Wave-2" would convert phase-1 tokens (iterators) into phase-2 tokens, etc. Remember that any file name and line/column positions will have to be passed through each phase.

[2] I thought Wave just did phase 3, with phases 1 and 2 thrown in at the same time. But now I'm not sure which phase Wave stops at. I don't think it can go past phase 4, because doing phase 5 needs knowledge of the destination platform.

[3] Only '?' characters that are part of a valid trigraph sequence are converted; all others are left unchanged.
[4] But actual "\uXXXX" resolution doesn't happen until phase 5!

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com

Sorry for the late answer; your questions required some investigation to be answered correctly.

Daryle Walker wrote:
Wave is our C++ preprocessor, but preprocessing is the third phase of translating a file (see section 2.1 in the standard). I have a gut feeling that all the compilers out there mush the first three phases together when parsing a file. Glancing over the Wave docs gives me the same impression about it. Are either of these feelings accurate (this requires a separate answer for each parser)? If the answer for Wave is "yes", could we separate them, at least as an option? I feel this is important so we can gain a full understanding of each phase. It may be more complicated[1], and most likely slower, but it could represent a clean implementation. (BTW, which phases does Wave act like?)
Wave internally doesn't distinguish explicitly between the different phases mandated by the Standard. This is probably similar to most other compilers out there. Wave is currently built out of two separate layers (I don't call them phases, to avoid confusion with the Standard phases): a lexing layer and a preprocessing layer.

What you have to know when you're going to use Wave is that it does not act on preprocessing tokens but directly on the C++ token level. This might have been a wrong design decision in the beginning, but it allows exposing the full set of C++ tokens as defined by the Standard from Wave's iterator interface without having to rescan (retokenize) the generated preprocessed output. Additionally, I wanted the lexing components to generate C++ tokens to make them usable in other contexts where no preprocessing is required. The drawback of this design is that Wave
- doesn't conform to the Standard in this regard
- currently doesn't fully support the handling of preprocessing numbers as mandated by the Standard
This is rarely a practical issue, though, since many uses of preprocessing numbers are handled correctly anyway.

The first (bottom) layer in Wave generates the C++ tokens. These are generated by a lexing component exposing them through an iterator interface. This lexing component implements compilation phases 1 and 2. There are two different lexing components supporting the full set of C++ tokens, both usable separately without the preprocessing layer described below. As I've said, you get C++ tokens at this level already. The difference between these lexing components is implementation-wise only: their implementations use different lexer generator toolkits (re2c and slex). I have an xpression-based lexer here as well, but it needs some additional work.

Phase 3 is not implemented in Wave as outlined above; Wave generates C++ tokens instead.

The second layer in Wave is the preprocessing layer.
It uses the C++ tokens generated by the lexer to
- recognise the preprocessing directives and execute them
- recognise identifiers representing macro invocations and expand them
That corresponds to phase 4. Phases 5 and above are not implemented in Wave.
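The phase-4 behaviour described here (scanning a token stream for identifiers that name macros and splicing in their replacement lists) can be sketched roughly as follows. Note that `Token`, `expand_object_macros`, and the map-based macro table are illustrative assumptions for this sketch, not Wave's actual types, and real phase 4 also handles function-like macros, rescanning, and the ban on recursive self-expansion:

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative token type; Wave's real tokens also carry a token id
// and a file/line/column position.
struct Token { std::string text; };

// One pass of object-like macro expansion over a token sequence.
std::vector<Token> expand_object_macros(
    std::vector<Token> const& in,
    std::map<std::string, std::vector<Token>> const& macros)
{
    std::vector<Token> out;
    for (Token const& t : in) {
        auto it = macros.find(t.text);
        if (it != macros.end())
            // Identifier names a macro: splice in its replacement list.
            out.insert(out.end(), it->second.begin(), it->second.end());
        else
            out.push_back(t);
    }
    return out;
}
```

Executing the preprocessing directives themselves (`#define`, `#if`, `#include`, ...) would sit in the same layer, updating the macro table as directives are recognised.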
The first two[2] phases are:
1. Native characters that match basic source characters are converted as such (including line breaks). Trigraphs are expanded to basic source characters[3]. Other characters are turned into internal Unicode expansions (i.e. they act like "\uXXXX" or "\Uxxxxxxxx"[4]).

2. Each backslash-newline soft line-break combination is collapsed, folding multiple native lines into one logical line. We should spit out an error if the folding creates Unicode escapes. For non-empty files, we need to spit out an error if the last line does not end with a hard line break; ending with either a non-newline character or a backslash-newline combination is forbidden.
This is done, except for the test for invalid Unicode characters resulting from line collapsing. Generally, Wave is not Unicode aware; I'd like to use a future Boost library for that.
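The two phase-1/phase-2 transformations discussed above (trigraph expansion and backslash-newline splicing) can be sketched in standard C++ as plain string-to-string passes. The function names are made up for this sketch; a real implementation, like Wave's lexer, would do this on the fly while tokenizing and would track line/column positions:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Phase 1 (partial): expand the nine trigraph sequences.
// A '?' that is not part of a valid trigraph is left unchanged.
std::string expand_trigraphs(std::string const& s) {
    static std::pair<char, char> const map[] = {
        {'=', '#'}, {'/', '\\'}, {'\'', '^'}, {'(', '['}, {')', ']'},
        {'!', '|'}, {'<', '{'}, {'>', '}'}, {'-', '~'}
    };
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 2 < s.size() && s[i] == '?' && s[i + 1] == '?') {
            bool matched = false;
            for (auto const& m : map) {
                if (s[i + 2] == m.first) {
                    out += m.second;
                    i += 2;            // consume the whole trigraph
                    matched = true;
                    break;
                }
            }
            if (matched) continue;
        }
        out += s[i];
    }
    return out;
}

// Phase 2 (partial): delete each backslash-newline pair, folding
// several physical lines into one logical line.
std::string splice_lines(std::string const& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '\\' && i + 1 < s.size() && s[i + 1] == '\n') {
            ++i;                       // drop both characters
            continue;
        }
        out += s[i];
    }
    return out;
}
```

This sketch omits the error cases discussed above (splicing that creates a Unicode escape, and a file not ending in a hard line break), which a conforming implementation would have to diagnose.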
[1] Our "Wave-1" would convert the original text (iterators) into phase-1 tokens. Our "Wave-2" would convert phase-1 token (iterators) into phase-2 tokens, etc. Remember that any file-name and line/column positions will have to be passed through each phase.
Wave currently follows exactly this described design, except that it applies it not to the compilation phases as outlined by the Standard but to the layers described above. Both the lexing and the preprocessing layer provide tokens through an iterator interface.
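The key constraint in this layered design is footnote [1]'s last sentence: every layer must forward the file name and line/column position untouched. A minimal sketch of one such layer, with made-up names (`Position`, `Token`, `transform_layer`) rather than Wave's actual types:

```cpp
#include <string>
#include <vector>

// Position data that every layer must pass through unchanged.
struct Position { std::string file; unsigned line; unsigned column; };
struct Token    { std::string value; Position pos; };

// A generic layer: rewrite each token's value, copy its position verbatim.
// A real layer would be an iterator adaptor over the previous layer's
// iterators rather than an eager pass over a vector.
template <typename Fn>
std::vector<Token> transform_layer(std::vector<Token> const& in, Fn fn) {
    std::vector<Token> out;
    out.reserve(in.size());
    for (Token const& t : in)
        out.push_back(Token{fn(t.value), t.pos});  // position passed through
    return out;
}
```

Chaining such layers gives the "Wave-1 feeds Wave-2" pipeline from footnote [1], with diagnostics at any stage still able to point at the original source location.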
[2] I thought Wave just did phase 3, with phases 1 and 2 thrown in at the same time. But now I'm not sure which phase Wave stops at. I don't think it can go past phase 4, because doing phase 5 needs knowledge of the destination platform.
Yes. Wave stops at phase 4.
[3] Only '?' characters that are part of a valid trigraph sequence are converted; all others are left unchanged.
Yes. This is done as expected.
[4] But actual "\uXXXX" resolution doesn't happen until phase 5!
Wave treats \uxxxx and \Uxxxxxxxx as single characters but doesn't care about their semantics. The only things it verifies are:
- that they have valid values (as described in Annex E of the Standard)
- that token concatenation does not produce invalid \uxxxx or \Uxxxxxxxx character values
Generally speaking, I agree with you that, now that Wave is part of Boost, it should conform to the Standard in this regard as well. It will require a major rewrite of some parts of Wave, though.

Regards
Hartmut
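As a rough illustration of the kind of value check described here: C++03 [lex.charset]/2 makes a universal-character-name ill-formed if its value is below 0x20, in 0x7F-0x9F, or names a character of the basic source character set. The sketch below implements only that rule; the function name is made up, and it deliberately omits the Annex E identifier ranges that Wave's actual check covers:

```cpp
#include <cstdint>

// Returns true if value v is acceptable for \uxxxx / \Uxxxxxxxx
// under the C++03 [lex.charset]/2 restriction (sketch only).
bool is_valid_ucn_value(std::uint32_t v) {
    if (v < 0x20) return false;                 // control characters
    if (v >= 0x7F && v <= 0x9F) return false;   // DEL and C1 controls
    // Members of the basic source character set must be spelled directly,
    // not as universal-character-names.
    static char const basic[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"' \t\v\f\n";
    for (char const* p = basic; *p; ++p)
        if (v == static_cast<std::uint32_t>(*p))
            return false;
    return true;
}
```

Note that characters such as '$' (0x24) pass this check, since they are not in the basic source character set even though they are ASCII.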
participants (2)
- Daryle Walker
- Hartmut Kaiser