[Boost-users] Tokenizer usage, combining escaped_list_ and char_?

24 Jan 2006


I'm attempting to process a scripting language from a file with Boost.Tokenizer (I only recently found out about Spirit; more on that later), and I'm having difficulty processing input like this:
"[ ExampleRoutine input;
    switch( input.messageNumber ) 
    { 
        1: "Number 1"; 
        2: "Stand. On one foot 
                    (and jump around)"; 
        3: ! a comment to be ignored
            input.messageNumber = 1;
            break;
    }
];" 
The outermost quotes mark the input as a string literal; the inner quotes are part of the file.
The problem is that I need certain separators returned as tokens (braces, parentheses, periods, line endings, etc.), but I also need quoted strings, including any escape characters inside them, to come back as a single token each.
My current method of processing this is:
typedef char_separator<char>    CharSep;
typedef tokenizer<CharSep>     CharTokenizer;
typedef CharTokenizer::iterator   CTokenIter;
const CharSep g_RoutineSep(" \t\n,", "\"';[]{}()<>.!");
#define S_QUOTE "\""
...
// meanwhile, inside a function body
CharTokenizer tok( getRoutineFromFile(), g_RoutineSep );
CTokenIter curTok( tok.begin() );
for(; ( curTok != tok.end() ) && ( *curTok != S_CBRACKET ); ++curTok )
{
    string curWord;
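    if( *curTok == S_QUOTE )
    {
        // test for the end of input before dereferencing the iterator
        for(++curTok; ( curTok != tok.end() ) && ( *curTok != S_QUOTE ); ++curTok)
        {
            curWord += *curTok + ' ';
            // sure, we could check to see if *curTok is a punctuation mark, and if so
            // not include that last space, but a better way must exist!
        }
        // unterminated literal: stop before the outer loop increments past the end
        if( curTok == tok.end() ) break;
    }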
    else
    {
        // assume it's a command word, and process it here
    }
} 
   
Of course, this is prone to gross misinterpretation, as "switch( input.messageNumber )" would be handled one token per iteration as "switch", "(", "input", ".", "messageNumber", ")".
Yet, this same functionality breaks the string literal into unnatural spacings, as the above code would turn it into "Stand . On one foot ( and jump around ) ", which isn't desired.
However, as it stands, escaped_list_separator doesn't return the separators; it only acts on them. So all that fancy operator parsing isn't possible out of the box, and I would have to break things like "switch(" and "input.messageNumber" apart myself. Possible, yes, but extra work.
So, the question: could my needs be satisfied by defining my own TokenizerFunction, and if so, is there a simpler or more exhaustive reference besides the documentation page?
Or, conversely, is it time to look into Spirit?
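To make the TokenizerFunction option concrete, here is a minimal sketch of the kind of functor I have in mind. It follows the shape the TokenizerFunction concept asks for (a reset() member plus operator()(next, end, tok)), drops whitespace-style delimiters, returns punctuation as one-character tokens, and returns a quoted string as a single token with its quotes kept so the caller can tell literals from words. The names QuoteAwareSeparator and tokenize are hypothetical, the backslash escape is an assumption about the scripting language, and the '!' comments are not handled. The driver loop below avoids Boost so the sketch can be tried on its own.

```cpp
#include <string>
#include <vector>

// Hypothetical functor shaped like a Boost TokenizerFunction:
// reset() plus operator()(next, end, tok). Not part of Boost itself.
class QuoteAwareSeparator {
public:
    QuoteAwareSeparator(const std::string& dropped, const std::string& kept,
                        char quote = '"', char escape = '\\')
        : dropped_(dropped), kept_(kept), quote_(quote), escape_(escape) {}

    void reset() {}  // no state is carried between tokens

    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        tok = Token();
        // Skip delimiters that should never be returned.
        while (next != end && isDropped(*next)) ++next;
        if (next == end) return false;

        if (*next == quote_) {
            // Quoted literal: copy everything up to the closing quote verbatim.
            tok += *next;
            ++next;
            while (next != end && *next != quote_) {
                if (*next == escape_) {      // assumed escape: take next char literally
                    ++next;
                    if (next == end) break;
                }
                tok += *next;
                ++next;
            }
            if (next != end) {               // closing quote found
                tok += *next;
                ++next;
            }
            return true;
        }
        if (isKept(*next)) {
            // Punctuation comes back as its own one-character token.
            tok += *next;
            ++next;
            return true;
        }
        // Ordinary word: read until any delimiter or a quote.
        while (next != end && !isDropped(*next) && !isKept(*next) && *next != quote_) {
            tok += *next;
            ++next;
        }
        return true;
    }

private:
    bool isDropped(char c) const { return dropped_.find(c) != std::string::npos; }
    bool isKept(char c) const { return kept_.find(c) != std::string::npos; }

    std::string dropped_;
    std::string kept_;
    char quote_;
    char escape_;
};

// Hypothetical Boost-free driver: collects every token the functor produces.
std::vector<std::string> tokenize(const std::string& text, QuoteAwareSeparator sep) {
    std::vector<std::string> out;
    std::string::const_iterator it = text.begin();
    std::string tok;
    sep.reset();
    while (sep(it, text.end(), tok)) out.push_back(tok);
    return out;
}
```

If this really does satisfy the concept, I would expect boost::tokenizer<QuoteAwareSeparator> to accept it in place of CharSep, constructed with the same two delimiter strings as g_RoutineSep minus the quote character, but I have not verified that against the library.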
-  Veni, Vidi, Vemaili. I came, I saw, I replied. -- Jeremy Tudisco, circa now.