tokenizing with xpressive
I'm trying to tokenize lines of a file using the included static regex. I only care about the tokens indicated by s* = ... When I use the sregex_token_iterator to parse the lines, I only get the last match for s2 and s3.

How should I change things so that I can get every match for s2 and s3 rather than just the last match?

    sregex whitespace_regex = *_s;
    sregex line_regex =
        whitespace_regex >> (s1 = +_d) >> whitespace_regex >>
        (
            +(
                '\"' >> (s2 = *~as_xpr('\"')) >> '\"' >> whitespace_regex >>
                ':' >> whitespace_regex >>
                '\"' >> (s3 = *~as_xpr('\"')) >> '\"' >> whitespace_regex
            )
          | +(
                '\"' >> (s2 = *~as_xpr('\"')) >> '\"' >> whitespace_regex
            )
        );
Dilts, Daniel D. wrote:
I'm trying to tokenize lines of a file using the included static regex. I only care about the tokens indicated by s* = ... When I use the sregex_token_iterator to parse the lines, I only get the last match for s2 and s3.
How should I change things so that I can get every match for s2 and s3 rather than just the last match?
    sregex whitespace_regex = *_s;
    sregex line_regex =
        whitespace_regex >> (s1 = +_d) >> whitespace_regex >>
        (
            +(
                '\"' >> (s2 = *~as_xpr('\"')) >> '\"' >> whitespace_regex >>
                ':' >> whitespace_regex >>
                '\"' >> (s3 = *~as_xpr('\"')) >> '\"' >> whitespace_regex
            )
          | +(
                '\"' >> (s2 = *~as_xpr('\"')) >> '\"' >> whitespace_regex
            )
        );
Hi, Daniel. If you are using the latest version of xpressive from the Boost File Vault, I would solve the problem this way:

    #include <string>
    #include <vector>
    #include <iostream>
    #include <boost/foreach.hpp>
    #include <boost/xpressive/xpressive.hpp>
    #include <boost/xpressive/regex_actions.hpp>

    using namespace boost;
    using namespace xpressive;

    int main()
    {
        local<std::vector<ssub_match> > strings;

        sregex line_regex =
            skip(_s) // skip whitespace
            (
                (s1 = +_d) >>
                +(
                    '\"' >> (s2 = *~as_xpr('\"'))[push_back(strings, s2)] >> '\"' >>
                    ':' >>
                    '\"' >> (s3 = *~as_xpr('\"'))[push_back(strings, s3)] >> '\"'
                )
              | +(
                    '\"' >> (s2 = *~as_xpr('\"'))[push_back(strings, s2)] >> '\"'
                )
            )
            ;

        std::string input(" 42 \"The answer to\" : \"Life\" \"The Universe\" : \"And Everything!\" ");

        if(regex_match(input, line_regex))
        {
            BOOST_FOREACH(ssub_match s, strings.get())
            {
                std::cout << s << std::endl;
            }
        }
    }

The above uses semantic actions (the parts in []) to push sub-matches into a vector for reference later. (It also uses skip(_s) to skip whitespace.)

If you are using xpressive 1.0, which is part of Boost 1.34.1, it would be a little trickier. There is no skip(), and no semantic actions. If that's the case, you can define a nested sregex quoted_string=*~as_xpr('\"');, and use that in your line_regex. Then every quoted string that matches will cause a nested result to be added to your match_results. See below:

    sregex quoted_string = *~as_xpr('\"');

    sregex line_regex =
        keep(*_s) >> (s1 = +_d) >> keep(*_s) >>
        (
            +(
                '\"' >> quoted_string >> '\"' >> keep(*_s) >>
                ':' >> keep(*_s) >>
                '\"' >> quoted_string >> '\"' >> keep(*_s)
            )
          | +(
                '\"' >> quoted_string >> '\"' >> keep(*_s)
            )
        );

    std::string input(" 42 \"The answer to\" : \"Life\" \"The Universe\" : \"And Everything!\" ");

    smatch what;
    if(regex_match(input, what, line_regex))
    {
        BOOST_FOREACH(smatch const &str, what.nested_results())
        {
            std::cout << str[0] << std::endl;
        }
    }

This is less efficient, but gets the job done.
HTH,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
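For readers without xpressive at hand, the "only the last match" behaviour from the question can be reproduced with the standard library. The sketch below is not from the thread: it swaps in std::regex for xpressive, and the helper names last_capture_only and all_quoted are hypothetical. It assumes the line format from the example input (a number followed by quoted strings, optionally separated by ':'). A capture group inside a repetition keeps only what its final iteration matched, so one way around it is a second pass that iterates over every quoted string individually.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: match the whole line with a repeated capture group
// and return what group 1 holds afterwards (only the last iteration).
std::string last_capture_only(const std::string& line) {
    static const std::regex whole_line(
        R"re(\s*\d+\s*(?:"([^"]*)"\s*(?::\s*)?)+)re");
    std::smatch what;
    if (!std::regex_match(line, what, whole_line)) {
        return std::string();  // line does not have the assumed format
    }
    return what[1].str();
}

// Hypothetical helper: collect every quoted string by iterating over all
// non-overlapping matches of a single-quoted-string pattern.
std::vector<std::string> all_quoted(const std::string& line) {
    static const std::regex quoted(R"re("([^"]*)")re");
    std::vector<std::string> out;
    for (std::sregex_iterator it(line.begin(), line.end(), quoted), end;
         it != end; ++it) {
        out.push_back((*it)[1].str());  // contents between the quotes
    }
    return out;
}
```

Calling last_capture_only on the example input from the reply yields only the final quoted string, while all_quoted yields all four in order. Note that this simplified second pass does not record which strings were paired by ':'; the semantic-action and nested-results approaches above keep that structure inside a single match.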
participants (2)
- Dilts, Daniel D.
- Eric Niebler