[Xpressive] Baffled about results of search

<alert comment="xpressive newbie">
vc7.1
xpressive from "vault" as of today (April 11)
With the statements below (more or less based on Example 1 and Example
2 in the sample code provided with xpressive), I'm attempting to
detect which "group" of subexpressions matched, and getting results
that I don't understand (not unexpected for a newbie <g>).
std::string altDow( "Alternate days of the week are Tue and Thursday
and Sat and Monday." );
sregex rex = sregex::compile( "((Sunday|Sun)"
"|(Monday|Mon)"
"|(Tuesday|Tue)"
"|(Wednesday|Wed)"
"|(Thursday|Thu)"
"|(Friday|Fri)"
"|(Saturday|Sat))" );
The first regex_search sort of works, but I'm unclear what field in
the smatch:what variable to use to tell what matched, and some
oddities are happening.
regex_search( altDow, what, rex )
Length(0): 3
Position(0): 31
what[0]: Tue
what[1]: Tue
what[2]:
what[3]:
what[4]: Tue
The subsequent searches give valid information about the length and
position, but other fields in the smatch:what variable seem invalid.
regex_search( altDow.substr(32), what, rex)
regex_search( altDow.substr(45), what, rex)
Regardless, I'm unclear how to tell which of the "groups" of
subexpressions was "hit". With boost::regex there was a way to loop
thru the boolean field:
what[i].matched
and see what "hit", but I am not seeing a comparable accessor.
Am I doing something wrong, leaving something out, completely clueless
on using xpressive, or is xpressive not intended for this kind of
usage?
(and if xpressive is appropriate, what statements will provide the
intended information?)
The intention is to be able to to replace/encode the altDay string to
be:
Alternate days of the week are

Lynn Allan wrote:
The subsequent searches give valid information about the length and position, but other fields in the smatch:what variable seem invalid. regex_search( altDow.substr(32), what, rex) regex_search( altDow.substr(45), what, rex)
Watch out! altDow.substr(32) is returning a temporary string object. After regex_search returns, the temporary is destroyed and the results object is left holding iterators into a destroyed string.
Regardless, I'm unclear how to tell which of the "groups" of subexpressions was "hit". With boost::regex there was a way to loop thru the boolean field: what[i].matched and see what "hit", but I am not seeing a comparable accessor.
Xpressive is no different that Boost.Regex in this regard. what[i].matched is correct. -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
Lynn Allan wrote:
The subsequent searches give valid information about the length and position, but other fields in the smatch:what variable seem invalid. regex_search( altDow.substr(32), what, rex) regex_search( altDow.substr(45), what, rex)
Watch out! altDow.substr(32) is returning a temporary string object. After regex_search returns, the temporary is destroyed and the results object is left holding iterators into a destroyed string.
Arggggg. Trying to get "up to speed" with Boost is probably a flawed thing to try to do at the same time as learning rudimentary aspects of STL. "We're not in MFC-land any more, Toto." <g> std::string altDow32 = altDow.substr(32); if( regex_search( altDow32, what, rex ) ) { unpack } The above (ugly) code clears up the "self inflicted wound". I'm "holding my nose" and using it to "unpack" the contents of the match_results variables. Thanks for the clarification.
Regardless, I'm unclear how to tell which of the "groups" of subexpressions was "hit". With boost::regex there was a way to loop thru the boolean field: what[i].matched and see what "hit", but I am not seeing a comparable accessor.
Xpressive is no different that Boost.Regex in this regard. what[i].matched is correct.
Works well. I think an earlier "trial and error stab at it" used what[i].matched() and back-tracked too far when the compiler complained. Your documentation is superior, but I've got a long ways to go to understand template notation adequately. Thanks again for your patience ....

Lynn Allan wrote:
Eric Niebler wrote:
Lynn Allan wrote:
The subsequent searches give valid information about the length and position, but other fields in the smatch:what variable seem invalid. regex_search( altDow.substr(32), what, rex) regex_search( altDow.substr(45), what, rex)
Watch out! altDow.substr(32) is returning a temporary string object. After regex_search returns, the temporary is destroyed and the results object is left holding iterators into a destroyed string.
Arggggg. Trying to get "up to speed" with Boost is probably a flawed thing to try to do at the same time as learning rudimentary aspects of STL. "We're not in MFC-land any more, Toto." <g>
std::string altDow32 = altDow.substr(32); if( regex_search( altDow32, what, rex ) ) { unpack }
A better approach would be to use iterators. // Begin searching 32 characters into altDow regex_search( altDow.begin() + 32, altDow.end(), what, rex ) Note #1: the match results will now contain iterators into altDow, but position offsets will be relative to the start of the search, not the start of altDow. Note #2: this code is only valid if altDow has at least 32 characters. Error checking is left as an exercise. :-) Incidentally, your earlier mistake was distressingly easy to make. I can think of a way I might be able to catch this mistake at compile time in a future versions of xpressive. -- Eric Niebler Boost Consulting www.boost-consulting.com

A better approach would be to use iterators.
// Begin searching 32 characters into altDow regex_search( altDow.begin() + 32, altDow.end(), what, rex )
Note #1: the match results will now contain iterators into altDow, but position offsets will be relative to the start of the search, not the start of altDow.
Is the following getting closer to "best practice" to use xpressive-static for the kinds of specialized searches I'm attempting. When I'm further along on the learning curve, I'll try to prepare some "apples and apples" timing comparisons between boost::regex, xpressive, spirit, a hand-tuned state-machine searcher, and an automatic FSM generator utility. I think I'm conforming to the guidelines on this page: X:\DevTools\Boost\libs\xpressive\doc\html\boost_xpressive\user_s_guide\tips_n_tricks.html except I don't understand how to specify syntax_option_type::optimize (does it apply to xpressive-static, or only xpressive-dynamic)? boost::xpressive::regex_constants::syntax_option_type flags = boost::xpressive::regex_constants::optimize; sregex rexStatic = (s1 = (as_xpr("Sunday") | "Sun")) | (s2 = (as_xpr("Monday") | "Mon")) | (s3 = (as_xpr("Tuesday") | "Tue")) | (s4 = (as_xpr("Wednesday") | "Wed")) | (s5 = (as_xpr("Thursday") | "Thu")) | (s6 = (as_xpr("Friday") | "Fri")) | (s7 = (as_xpr("Saturday") | "Sat")) ; //??? rexStatic.compile(rexStatic, flags); smatch what; std::string testStr = "Alternate days of the week are Tue and Thursday and Sat and Monday. " "And then Monday and Wed and Friday and Sun. "; std::string::const_iterator start = testStr.begin(); std::string::const_iterator finish = testStr.end(); int foundCount = 0; while (regex_search(start, finish, what, rexStatic)) { foundCount++; size_t limit = what.size(); size_t matchIndex = 1; while (matchIndex <= limit) { if (what[matchIndex].matched == true) { break; } ++matchIndex; } #ifdef _DEBUG cout << "FoundCount: " << foundCount << " Index: " << static_cast<int>(matchIndex) << " what[0]: " << what[0] << endl; #endif start = what[0].second; }

Lynn Allan wrote:
Is the following getting closer to "best practice" to use xpressive-static for the kinds of specialized searches I'm attempting. When I'm further along on the learning curve, I'll try to prepare some "apples and apples" timing comparisons between boost::regex, xpressive, spirit, a hand-tuned state-machine searcher, and an automatic FSM generator utility.
I think I'm conforming to the guidelines on this page: X:\DevTools\Boost\libs\xpressive\doc\html\boost_xpressive\user_s_guide\tips_n_tricks.html except I don't understand how to specify syntax_option_type::optimize (does it apply to xpressive-static, or only xpressive-dynamic)?
Static regexes assume the optimize flag.
boost::xpressive::regex_constants::syntax_option_type flags = boost::xpressive::regex_constants::optimize; sregex rexStatic = (s1 = (as_xpr("Sunday") | "Sun")) | (s2 = (as_xpr("Monday") | "Mon")) | (s3 = (as_xpr("Tuesday") | "Tue")) | (s4 = (as_xpr("Wednesday") | "Wed")) | (s5 = (as_xpr("Thursday") | "Thu")) | (s6 = (as_xpr("Friday") | "Fri")) | (s7 = (as_xpr("Saturday") | "Sat")) ;
Yep, you got it.
//??? rexStatic.compile(rexStatic, flags); smatch what;
std::string testStr = "Alternate days of the week are Tue and Thursday and Sat and Monday. " "And then Monday and Wed and Friday and Sun. "; std::string::const_iterator start = testStr.begin(); std::string::const_iterator finish = testStr.end(); int foundCount = 0; while (regex_search(start, finish, what, rexStatic)) { foundCount++; size_t limit = what.size(); size_t matchIndex = 1; while (matchIndex <= limit) { if (what[matchIndex].matched == true) { break; } ++matchIndex; } #ifdef _DEBUG cout << "FoundCount: " << foundCount << " Index: " << static_cast<int>(matchIndex) << " what[0]: " << what[0] << endl; #endif start = what[0].second; }
Ah, you're trying to find all the matches. Check out regex_iterator. sregex_iterator begin(testStr.begin(), testStr.end(), rexStatic), end; for(; begin != end; ++begin, ++foundCount) { smatch const &what = *begin; size_t limit = what.size(); size_t matchIndex = 1; for(; matchIndex <= limit; ++matchIndex) if(what[matchIndex].matched == true) break; #ifdef _DEBUG cout << "FoundCount: " << foundCount << " Index: " << static_cast<int>(matchIndex) << " what[0]: " << what[0] << endl; #endif } HTH, -- Eric Niebler Boost Consulting www.boost-consulting.com

FWIW, here are some timings for different "flavors" of xpressive, and some other boost and non-boost search algo's. YMMW. Caveats: * contrived data (BMH probably doesn't help) std::string testStr = "Alternate days of the week are Tue and Thursday and Sat and Monday. " "And then Monday and Wed and Friday and Sun. " "Abbrev days in alpha order are Fri Mon Sat Sun Thu Tue Wed " "Full days in alpha order are Friday Monday Saturday Sunday Thursday Tuesday Wednesday " "Abbrev days in reverse alpha order are Wed Tue Thu Sun Sat Mon Fri " "Full days in reverse alpha order are Wednesday Tuesday Thursday Sunday Saturday Monday Friday " "Embedded abbrev days in reverse alpha order are 1Wed 2Tue 3Thu 4Sun 5Sat 6Mon 7Fri " "Embedded Full days in alpha order are 1Friday 2Monday 3Saturday 4Sunday 5Thursday 6Tuesday 7Wednesday " "Abbrev dot days in alpha order are Fri. Mon. Sat. Sun. Thu. Tue. Wed. " "Abbrev dot days in reverse alpha order are Wed. Tue. Thu. Sun. Sat. Mon. Fri. " "Near misses are Wed.X Tue.X Thu.X Sun.X Sat.X Mon.X Fri.X " "Near misses are WednesDay TuesDay ThursDay SunDay SaturDay MonDay FriDay " "Near misses are WeD TuE ThU SuN SaT MoN FrI " ; * not real confident of the numbers .... seem reasonable but ... Boost::xpressive-static: Elapsed for 10000 loops Ms: 422.72 Boost::xpressive-static: Elapsed for 100000 loops Ms: 4145.9 Boost::xpressive-dynamic-not-optimized: Elapsed for 10000 loops Ms: 429.847 Boost::xpressive-dynamic-not-optimized: Elapsed for 100000 loops Ms: 4392.54 Boost::xpressive-dynamic-optimized: Elapsed for 10000 loops Ms: 435.617 Boost::xpressive-dynamic-optimized: Elapsed for 100000 loops Ms: 4282.6 Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 448.425 Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4379.07 Boost::regex: Elapsed for 10000 loops Ms: 622.377 Boost::regex: Elapsed for 100000 loops Ms: 5965.24 (can probably be tweaked and improved .... just q/d from example) FSM generator Elapsed for 10000 loops Ms: 2487.97 FSM generator Elapsed for 100000 loops Ms: 24870.5 Hand-tuned parser: Elapsed for 10000 loops Ms: 110.583 Hand-tuned parser: Elapsed for 100000 loops Ms: 1107.81 (no numbers for spirit .... barely started on its learning curve)

* not real confident of the numbers .... seem reasonable but ...
That's roughly what I'd expect: Regex and xpressive use very similar algorithms internally, with Eric and I occationally playing a game of leapfrog :-) And probably we've both already grabbed the low hanging fruit, unless Eric has any cunning plans he's keeping to himself :-) Performance between the two depends upon the exact expression used, and the data searched, but they probably should be within a factor of 2 of each other. Historically Regex tended to do better with sparse matches in big files, and xpressive better with short dense matches, but I haven't tested a really recent version of xpressive, so as ever your mileage may vary. John.

On 4/13/06, Lynn Allan
FWIW, here are some timings for different "flavors" of xpressive, and some other boost and non-boost search algo's. YMMW.
<snip> Lynn - it might be interesting to use re2c (http://re2c.org/) to generate regex machines and see what their performance is like - after all, they do clain 're2c generates warning free code that is equal to hand-written code in terms of size, speed and quality.' (I'm offering no opinion on the veracity of that statement....), so (allegedly) they'd give a reasonable hand-code analogue.... Stuart Dootson

Stuart Dootson wrote:
On 4/13/06, Lynn Allan
wrote: FWIW, here are some timings for different "flavors" of xpressive, and some other boost and non-boost search algo's. YMMW.
Lynn - it might be interesting to use re2c (http://re2c.org/) to generate regex machines and see what their performance is like - after all, they do clain 're2c generates warning free code that is equal to hand-written code in terms of size, speed and quality.' (I'm offering no opinion on the veracity of that statement....), so (allegedly) they'd give a reasonable hand-code analogue....
I don't think re2c does backreferences, so it might not be usable for Lynn's purposes. It also means any comparison would be apples to oranges, since regex matching with backreferences is a much harder problem. -- Eric Niebler Boost Consulting www.boost-consulting.com

I don't think re2c does backreferences, so it might not be usable for Lynn's purposes. It also means any comparison would be apples to oranges, since regex matching with backreferences is a much harder problem.
Some of this is simply curiousity on my part. (... and too much time on my hands?) Also, I've been quite impressed by the performance of xpressive and regex (haven't looked much at spirit) ... my initial expectation was that something general purpose and powerful would be worse than an order of magnitude slower than "rolling your own" hand-tuned recognizer. I think it might be valuable for a person in the position of evaluating whether to put in the time to learn boost::regex and/or boost::xpressive could find information on the performance trade-offs involved: "So, how slow is it?" (and then, "How bloated is it?" "How hard is it to learn?" "How hard is it to build and integrate?") Epecially that it probably won't be nearly as negative as a first guess might be. (but of course, "it depends" applies to everything ). BTW, my first impression is that re2c might be useful for some of the things I work on .... it could get the code to the "area of interest", where more elaborate replacements could be done. I don't think I understand "backreferences" ... my mind hasn't absorbed past DFA's and "greedy" always seems to be a liability. Another BTW: my impression is that pcre doesn't do substitutions .... based on an email from its author, Dr. Philip Hazel. Did I misunderstand? I didn't see anything in its api to indicate substitutions were possible.
participants (4)
-
Eric Niebler
-
John Maddock
-
Lynn Allan
-
Stuart Dootson