Tokenizer usage, combining escaped_list_ and char_?
I'm attempting to process a scripting language from a file through tokenizer (I only recently found out about Spirit; more on that later), and am having difficulties, namely in processing input like this:

    "[ ExampleRoutine input;
    switch( input.messageNumber )
    {
    1: "Number 1";
    2: "Stand. On one foot (and jump around)";
    3: ! a comment to be ignored
    input.messageNumber = 1;
    break;
    }
    ];"

The outer quotes mark the input as a string literal; the inner ones are part of the file.

The problem comes from the fact that I need to process certain separators as tokens, such as braces, parentheses, periods, end lines, etc., but also allow escape characters in the case of the quotes, where the whole quoted string should be returned as one token. My current method of processing this is:

    typedef char_separator<char> CharSep;
    typedef tokenizer<CharSep> CharTokenizer;
    typedef CharTokenizer::iterator CTokenIter;

    const CharSep g_RoutineSep(" \t\n,", "\"';[]{}()<>.!");
    #define S_QUOTE "\""
    ...
    // meanwhile, inside a function body
    CharTokenizer tok( getRoutineFromFile(), g_RoutineSep );
    CTokenIter curTok( tok.begin() );
    for(; ( curTok != tok.end() ) && ( *curTok != S_CBRACKET ); ++curTok )
    {
        string curWord;
        if( *curTok == S_QUOTE )
        {
            // test for the end before dereferencing the iterator
            for(++curTok; ( curTok != tok.end() ) && ( *curTok != S_QUOTE ); ++curTok)
            {
                curWord += *curTok + ' ';
                // sure, we could check to see if *curTok is a punctuation mark, and if so
                // not include that last space, but a better way must exist!
            }
        }
        else
        {
            // assume it's a command word, and process it here
        }
    }

Of course, this is prone to gross misinterpretation, as "switch( input.messageNumber )" would be handled, one token per iteration, as "switch", "(", "input", ".", "messageNumber", ")". Yet this same functionality breaks the string literal into unnatural spacing: the above code would turn it into "Stand . On one foot ( and jump around ) ", which isn't desired.
However, as it stands, escaped_list_separator doesn't return the separators, it just acts upon them, so all that fancy operator parsing isn't possible out of the box, requiring breaking things like "switch(" and "input.messageNumber" into separate things. Possible, yes, but extra work.

So, the question. Could my needs be satisfied by defining my own TokenizerFunction, and if so, is there a simpler/more exhaustive reference besides the page? Or, conversely, is it time to look into Spirit?

- Veni, Vidi, Vemaili. I came, I saw, I replied.
-- Jeremy Tudisco, circa now.
Jeremy Tudisco wrote:
I'm attempting to process a scripting language from a file through tokenizer (only recently found out about Spirit, more on that later), and am having difficulties, namely, in processing input like this: "[ ExampleRoutine input;
IMHO Tokenizer should not be used for parsing such syntaxes.
However, as it stands, escaped_list_separator doesn't return the separators, it just acts upon them, so all that fancy operator parsing isn't possible out of the box, requiring breaking things like "switch(" and "input.messageNumber" into separate things. Possible, yes, but extra work.
Yes, it does not return the separators it acts upon while tokenizing.
So, the question. Could my needs be satisfied by defining my own TokenizerFunction, and if so, is there a simpler/more exhaustive reference besides the page?
IMHO it would not be worth doing so. There is no documentation available.
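
For anyone who wants to try it anyway, the shape of a TokenizerFunction is small: a functor with a reset() member and an operator() that takes the current iterator by reference, the end iterator, and a token to fill, returning false once the input is exhausted. Below is a minimal sketch under that assumption. ScriptSeparator and tokenize are hypothetical names, not part of Boost, and the escape rule (a backslash inside a quoted string makes the next character literal) is also an assumption. The sketch is driven by hand so that it stands alone; it has not been tried against boost::tokenizer here.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical functor modelled on the TokenizerFunction concept:
// operator() extracts one token and advances `next`; reset() clears state.
class ScriptSeparator {
public:
    // Same dropped/kept delimiter sets as in the original post, except that
    // the double quote is handled separately instead of being a kept delimiter.
    explicit ScriptSeparator(std::string dropped = " \t\n,",
                             std::string kept = "';[]{}()<>.!")
        : dropped_(std::move(dropped)), kept_(std::move(kept)) {}

    void reset() {}  // nothing to clear: the functor keeps no parse state

    template <typename InputIterator, typename Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        tok = Token();
        // Skip delimiters that are dropped silently.
        while (next != end && isDropped(*next)) ++next;
        if (next == end) return false;

        if (*next == '"') {
            // Quoted literal: returned as ONE token, quotes included.
            tok += *next++;
            while (next != end) {
                char c = *next++;
                if (c == '\\' && next != end) {  // assumed escape rule
                    tok += *next++;
                    continue;
                }
                tok += c;
                if (c == '"') break;  // closing quote ends the literal
            }
            return true;
        }
        if (isKept(*next)) {
            // Punctuation is returned as a one-character token.
            tok += *next++;
            return true;
        }
        // Otherwise collect a word up to the next delimiter or quote.
        while (next != end && *next != '"' && !isDropped(*next) && !isKept(*next))
            tok += *next++;
        return true;
    }

private:
    bool isDropped(char c) const { return dropped_.find(c) != std::string::npos; }
    bool isKept(char c) const { return kept_.find(c) != std::string::npos; }

    std::string dropped_;
    std::string kept_;
};

// Hypothetical helper that drives the functor by hand, without Boost.
std::vector<std::string> tokenize(const std::string& text) {
    ScriptSeparator sep;
    std::vector<std::string> out;
    std::string tok;
    std::string::const_iterator it = text.begin();
    while (sep(it, text.end(), tok)) out.push_back(tok);
    return out;
}
```

With this, "switch( input.messageNumber )" still comes back as separate tokens, while "Stand. On one foot (and jump around)" comes back as a single token with its spacing intact. The "!" comments from the example input are not handled; "!" is simply returned as a token, so skipping to the end of the line would have to be added in the caller or in the functor.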
Or, conversely, is it time to look into Spirit?
I think Spirit would be an elegant fit for this kind of thing. It provides better features for parsing complex syntaxes.

Thank you,
Nitin Motgi
participants (2): Jeremy Tudisco, Nitin Motgi