[Wave] Wrong phase for trigraphs? Generator interface?

[Sorry if this has already been reported, fixed, and/or superseded.]

Looking at <http://www.boost.org/libs/wave/doc/token_ids.html>, I see various kinds of tokens. There are tokens for the preprocessor that are seen by the lexer and don't make it to the preprocessing iterator level. The other sets of tokens do make it to that level, modulo any transformations. The trigraphs are put in the operator token list. However, trigraphs should not be there. They are processed before anything else, even before the preprocessor tokens. So there should be another level of lexer working here. As is, it doesn't seem that you could use the trigraph for "#", "??=", for preprocessor directives.

    ??=include <cstdio>  // this should work

On a related note, I thought maybe Wave should use a generator interface:

    template < typename Iterator, typename FileID >
    class phase1
    {
    public:
        phase1( Iterator b, Iterator e, FileID id );

        operator bool() const;  // TRUE while not done
        cpp_p1_char_type operator ()();
    };

    template < typename Iterator, typename FileID >
    class phase2
    {
    public:
        explicit phase2( phase1<Iterator, FileID> const &p );

        operator bool() const;  // TRUE while not done
        cpp_p2_line_string_type operator ()();
    };

    //...

You generally can't rewind, of course. The cpp_p1_char_type would contain the expanded character's identity AND some indicator of its location (starting iterator, file ID, and line, row, and un-lined offset numbers). The cpp_p2_line_string_type would carry the locations for each character in its string. Then the tokens of later phases would know the location of their first characters.

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
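For readers who want something concrete, here is a minimal sketch of what such a phase-1 generator could look like. It is not Wave code. The names p1_location, p1_char, trigraph_replacement and run_phase1 are hypothetical, the location record is reduced to file ID, offset, line and column, and the sketch assumes the iterator is at least a forward iterator (it looks two characters ahead).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical location record (field names are assumptions).
template <typename FileID>
struct p1_location {
    FileID file;
    std::size_t offset;  // 0-based offset of the first source character
    std::size_t line;    // 1-based line number
    std::size_t column;  // 1-based column number
};

// Hypothetical stand-in for cpp_p1_char_type.
template <typename FileID>
struct p1_char {
    char value;                  // character after trigraph replacement
    p1_location<FileID> where;   // where it started in the source
};

// Maps the third character of "??x" to its replacement, or '\0' if none.
inline char trigraph_replacement(char c) {
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

template <typename Iterator, typename FileID>
class phase1 {
public:
    phase1(Iterator b, Iterator e, FileID id)
        : cur_(b), end_(e), loc_{id, 0, 1, 1} {}

    explicit operator bool() const { return cur_ != end_; }  // true while not done

    // Precondition: the generator is not done.
    p1_char<FileID> operator()() {
        p1_char<FileID> out{*cur_, loc_};
        std::size_t consumed = 1;
        if (*cur_ == '?') {
            Iterator second = cur_;
            ++second;
            if (second != end_ && *second == '?') {
                Iterator third = second;
                ++third;
                if (third != end_) {
                    char r = trigraph_replacement(*third);
                    if (r != '\0') { out.value = r; consumed = 3; }
                }
            }
        }
        for (std::size_t i = 0; i < consumed; ++i) advance();
        return out;
    }

private:
    void advance() {
        if (*cur_ == '\n') { ++loc_.line; loc_.column = 1; }
        else { ++loc_.column; }
        ++loc_.offset;
        ++cur_;
    }

    Iterator cur_, end_;
    p1_location<FileID> loc_;
};

// Drains a generator over a string and returns the replaced text.
inline std::string run_phase1(const std::string& src) {
    phase1<std::string::const_iterator, int> gen(src.begin(), src.end(), 0);
    std::string out;
    while (gen) out += gen().value;
    return out;
}

inline void example() {
    assert(run_phase1("??=include <cstdio>") == "#include <cstdio>");
}
```

Each emitted character keeps the location of the first source character it came from, so a later phase can report positions in terms of the original file.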

Daryle Walker wrote:
> Looking at <http://www.boost.org/libs/wave/doc/token_ids.html>, I see various kinds of tokens. There are tokens for the preprocessor that are seen by the lexer and don't make it to the preprocessing iterator level. The other sets of tokens do make it to that level, modulo any transformations. The trigraphs are put in the operator token list. However, trigraphs should not be there.
The trigraph token types are generally in the same token set as the corresponding tokens they represent. And yes, these are in the operator token set; this is because of section 2.12 [lex.operators] of the Standard.
> They are processed before anything else, even before the preprocessor tokens. So there should be another level of lexer working here. As is, it doesn't seem that you could use the trigraph for "#", "??=", for preprocessor directives.
>
>     ??=include <cstdio>  // this should work
Wave correctly interprets this. As I pointed out already, Wave currently doesn't strictly follow the mandated translation phases. The trigraph tokens are processed at the lexer level, i.e. before anything else. Wave has a runtime option to convert the trigraph token values to their equivalent token representation (i.e. '??=' to '#'), but always leaves the trigraph token id in place. Please let me elaborate.

1. The trigraph token ids are essentially equivalent to their corresponding non-trigraph token ids modulo a single bit, i.e. you're able to get at the 'real' token id by using the BASEID_FROM_TOKEN(t) macro. Because of this, Wave correctly interprets the ??=include directive.

2. The token _values_ are converted only optionally to their non-trigraph representation, to allow the library user to access the original token value (which may be useful in some contexts). I must admit, though, that the current default (namely, not to convert the values) is a bug and should be fixed (I'll do that asap).
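To illustrate the "modulo a single bit" idea, here is a small self-contained sketch. It does not use Wave's headers. The constants, the token struct and the helpers base_id, is_trigraph, starts_directive and spelling are all hypothetical and only mimic the role that BASEID_FROM_TOKEN plays in the description above. The numeric values are invented for the example and are not Wave's real token ids.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical flag bit marking "this token was spelled as a trigraph".
constexpr std::uint32_t trigraph_flag = 0x00200000u;

// Hypothetical base token ids (NOT Wave's actual values).
constexpr std::uint32_t tok_pound       = 0x0101u;  // '#'
constexpr std::uint32_t tok_leftbracket = 0x0102u;  // '['

// The trigraph variants differ from their base ids by the flag bit only.
constexpr std::uint32_t tok_pound_trigraph       = tok_pound | trigraph_flag;        // "??="
constexpr std::uint32_t tok_leftbracket_trigraph = tok_leftbracket | trigraph_flag;  // "??("

// Plays the role described for BASEID_FROM_TOKEN: strip the flag bit.
constexpr std::uint32_t base_id(std::uint32_t id) { return id & ~trigraph_flag; }

constexpr bool is_trigraph(std::uint32_t id) { return (id & trigraph_flag) != 0; }

// A directive starts with '#', however it was spelled.
constexpr bool starts_directive(std::uint32_t id) { return base_id(id) == tok_pound; }

struct token {
    std::uint32_t id;
    std::string value;  // original spelling as it appeared in the source
};

// Canonical spelling for the base ids known to this sketch.
inline std::string canonical_spelling(std::uint32_t base) {
    if (base == tok_pound) return "#";
    if (base == tok_leftbracket) return "[";
    return "";
}

// Returns the token value, optionally converted to its non-trigraph form.
// The id is left untouched either way, as described in the message above.
inline std::string spelling(const token& t, bool convert_trigraphs) {
    if (convert_trigraphs && is_trigraph(t.id)) return canonical_spelling(base_id(t.id));
    return t.value;
}

inline void example() {
    token t{tok_pound_trigraph, "??="};
    assert(starts_directive(t.id));     // recognised as '#'
    assert(spelling(t, true) == "#");   // converted value
    assert(spelling(t, false) == "??=");  // original value preserved
}
```

Keeping the flag in the id while converting only the value lets a consumer still find out that the source used a trigraph, even when it sees the '#' spelling.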
> On a related note, I thought maybe Wave should use a generator interface:
>
>     template < typename Iterator, typename FileID >
>     class phase1
>     {
>     public:
>         phase1( Iterator b, Iterator e, FileID id );
>
>         operator bool() const;  // TRUE while not done
>         cpp_p1_char_type operator ()();
>     };
>
>     template < typename Iterator, typename FileID >
>     class phase2
>     {
>     public:
>         explicit phase2( phase1<Iterator, FileID> const &p );
>
>         operator bool() const;  // TRUE while not done
>         cpp_p2_line_string_type operator ()();
>     };
>
>     //...
>
> You generally can't rewind, of course. The cpp_p1_char_type would contain the expanded character's identity AND some indicator of its location (starting iterator, file ID, and line, row, and un-lined offset numbers). The cpp_p2_line_string_type would carry the locations for each character in its string. Then the tokens of later phases would know the location of their first characters.
Yes, I agree. Wave should be rewritten (and hopefully will be rewritten) in a layered way, cleanly implementing each of the mandated translation phases on top of the previous one. But I'm inclined to expose iterator interfaces, not generators, to allow for rewinds, error handling etc. Such a layered implementation would be generally useful for tool builders, allowing them to use each of the translation steps separately, if necessary. But that's for V2 and not done in a week or so.

Thanks!
Regards Hartmut
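As a rough illustration of why an iterator interface makes rewinding easy, here is a sketch of a forward-style iterator that performs trigraph replacement on the fly. It is not part of Wave; p1_iterator and rest_of are hypothetical names, and the sketch works over a plain character range only. A copy of the iterator acts as a checkpoint, so rewinding is just an assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iterator>
#include <string>

// Hypothetical phase-1 iterator over [pos, end) with trigraph replacement.
class p1_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = char;  // characters are returned by value

    p1_iterator(const char* pos, const char* end) : pos_(pos), end_(end) {}

    char operator*() const { return decode().value; }
    p1_iterator& operator++() { pos_ += decode().width; return *this; }
    p1_iterator operator++(int) { p1_iterator old(*this); ++*this; return old; }

    friend bool operator==(const p1_iterator& a, const p1_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const p1_iterator& a, const p1_iterator& b) {
        return !(a == b);
    }

private:
    struct decoded { char value; std::ptrdiff_t width; };

    // Decodes the character at pos_: either a trigraph (width 3) or a
    // plain character (width 1). Must not be called at the end position.
    decoded decode() const {
        static const char from[] = "=/'()!<>-";
        static const char to[]   = "#\\^[]|{}~";
        if (end_ - pos_ >= 3 && pos_[0] == '?' && pos_[1] == '?' && pos_[2] != '\0') {
            if (const char* hit = std::strchr(from, pos_[2]))
                return {to[hit - from], 3};
        }
        return {*pos_, 1};
    }

    const char* pos_;
    const char* end_;
};

// Collects everything from it up to end into a string.
inline std::string rest_of(p1_iterator it, p1_iterator end) {
    std::string out;
    for (; it != end; ++it) out += *it;
    return out;
}

inline void example() {
    const std::string src = "??=if X";
    const char* b = src.data();
    const char* e = b + src.size();
    p1_iterator it(b, e);
    p1_iterator checkpoint = it;  // saving a position is just a copy
    ++it;
    ++it;
    assert(*it == 'f');
    it = checkpoint;              // rewind
    assert(*it == '#');
}
```

A generator with only operator() has no equivalent of the checkpoint copy unless it buffers its output, which is the trade-off being discussed here.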

Hartmut Kaiser wrote:
> Yes, I agree. Wave should be rewritten (and hopefully will be rewritten) in a layered way, cleanly implementing each of the mandated translation phases on top of the previous one. But I'm inclined to expose iterator interfaces, not generators, to allow for rewinds, error handling etc.
Yes!! Here is another use case worth considering (while we are at it ;-) ): Imagine a C++ parser interfacing with such a backtracking lexer. It often needs to parse tentatively, rolling back to a previous state if the tentative parse was not successful. However, some parts of the tentative parse may have been successful nevertheless, such as the parsing of a nested-name-specifier. It would be very nice if the parser could somehow mark up such a token sequence by injecting synthetic tokens back into the lexer, so it doesn't need to parse them again. A similar use case is described in the 'Decorating tokens...' paper (http://www.cs.clemson.edu/~malloy/papers/papers.html), where the parsing requires some (not strictly layered) interaction between parser, symbol lookup, and lexer.

Just some food for thought...

Regards, Stefan
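The use case above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the lexer is replaced by an in-memory token buffer, token kinds are plain strings, and a nested-name-specifier is taken to be one or more identifier/"::" pairs. The names token, token_buffer and parse_nested_name_specifier are hypothetical and come from neither Wave nor the cited paper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct token {
    std::string kind;  // e.g. "identifier", "::", ";"
    std::string text;
};

// Hypothetical stand-in for a backtracking lexer.
class token_buffer {
public:
    explicit token_buffer(std::vector<token> toks) : toks_(std::move(toks)), pos_(0) {}

    bool done() const { return pos_ >= toks_.size(); }
    const token& peek() const { return toks_[pos_]; }
    void advance() { ++pos_; }
    std::size_t size() const { return toks_.size(); }

    std::size_t mark() const { return pos_; }    // remember a position
    void rewind(std::size_t m) { pos_ = m; }     // roll back to it

    // Replaces tokens [first, last) with one synthetic token and puts the
    // cursor on it. Marks at or before first stay valid; later ones do not.
    void collapse(std::size_t first, std::size_t last, const std::string& kind) {
        std::string text;
        for (std::size_t i = first; i < last; ++i) text += toks_[i].text;
        toks_.erase(toks_.begin() + first, toks_.begin() + last);
        toks_.insert(toks_.begin() + first, token{kind, text});
        pos_ = first;
    }

private:
    std::vector<token> toks_;
    std::size_t pos_;
};

// Parses (identifier "::")+ and, on success, leaves a synthetic
// "nested-name-specifier" token behind so a re-parse can skip the work.
inline bool parse_nested_name_specifier(token_buffer& in) {
    if (!in.done() && in.peek().kind == "nested-name-specifier") {
        in.advance();  // already recognised by an earlier tentative parse
        return true;
    }
    const std::size_t start = in.mark();
    std::size_t matched_end = start;
    while (!in.done() && in.peek().kind == "identifier") {
        in.advance();
        if (in.done() || in.peek().kind != "::") break;
        in.advance();
        matched_end = in.mark();
    }
    if (matched_end == start) {
        in.rewind(start);  // nothing matched, restore the position
        return false;
    }
    in.collapse(start, matched_end, "nested-name-specifier");
    in.advance();  // step past the synthetic token
    return true;
}

inline void example() {
    std::vector<token> toks = {{"identifier", "A"}, {"::", "::"},
                               {"identifier", "c"}, {";", ";"}};
    token_buffer in(toks);
    const std::size_t m = in.mark();
    assert(parse_nested_name_specifier(in));  // consumes "A::"
    in.rewind(m);                             // outer tentative parse failed
    assert(in.peek().kind == "nested-name-specifier");  // work is kept
}
```

After the rewind, the second attempt sees a single pre-digested token instead of the original sequence, which is the kind of markup the message asks for.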
participants (3)
- Daryle Walker
- Hartmut Kaiser
- Stefan Seefeld