[Xpressive] regex-like boost libraries vs re2c [was: Some timings [was: Baffled about results of search]]

Stuart Dootson wrote:
Lynn - it might be interesting to use re2c (http://re2c.org/) to generate regex machines and see what their performance is like - after all, they do clain 're2c generates warning free code that is equal to hand-written code in terms of size, speed and quality.' (I'm offering no opinion on the veracity of that statement....), so (allegedly) they'd give a reasonable hand-code analogue....
I took a look at it, and it looks like it could possibly be quite useful if executable size was a concern .... but it is pretty obscure (at least to me) .... the only example this dummy could make any sense out of at all was to recognize a number .... and that was missing/obscuring the important illustration of how to invoke and use the generated scanner code. Were you able to get it to work to do anything even close to recognizing Days-of-the-Week (full and abbreviations)? Did you find actual compilable examples with "main" invoking the generated scanners with a test-string to crunch? <alert comment="whining"> I really wish developers would provide COMPLETE "baby steps with training wheels" examples to flatten the learning curve. Eric N. did a great job of documenting and providing examples for xpressive .... that could be a model for other projects (not just boost). I've subscribed to their sourceforge list and will swallow my pride and ask the newbie question: "This dummy doesn't expect to be able to figure out your project in 30 minutes ... how do you get it to something basic like recognize:"? ((Sunday|Sun)|(Monday|Mon)....(Saturday|Sat)) </alert>

Stuart Dootson wrote: Lynn - it might be interesting to use re2c (http://re2c.org/) to generate regex machines and see what their performance is like - after all, they do clain 're2c generates warning free code that is equal to hand-written code in terms of size, speed and quality.' (I'm offering no opinion on the veracity of that statement....), so (allegedly) they'd give a reasonable hand-code analogue....
Lynn Allan wrote: I took a look at it, and it looks like it could possibly be quite useful if executable size was a concern .... but it is pretty obscure
re2c seems very fast, and once you get the hang of it, not that difficult to use. For the contrived test of looking for DayOfWeek within a 1kb buffer, here are some results ... ymmv (WinXp-Sp2 Microsoft ToolKit-2003 C++ Compiler with /O2 optimize flag) regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)"; By-hand parser: Elapsed for 10000 loops Ms: 110.583 By-hand parser: Elapsed for 100000 loops Ms: 1107.81 re2c generator Elapsed for 10000 loops Ms: 69.4683 re2c generator Elapsed for 100000 loops Ms: 700.546 Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 410.492 Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4164.45 Boost::regex: Elapsed for 10000 loops Ms: 622.377 Boost::regex: Elapsed for 100000 loops Ms: 5965.24

Lynn Allan wrote:
Stuart Dootson wrote: Lynn - it might be interesting to use re2c (http://re2c.org/) to generate regex machines and see what their performance is like - after all, they do clain 're2c generates warning free code that is equal to hand-written code in terms of size, speed and quality.' (I'm offering no opinion on the veracity of that statement....), so (allegedly) they'd give a reasonable hand-code analogue....
Lynn Allan wrote: I took a look at it, and it looks like it could possibly be quite useful if executable size was a concern .... but it is pretty obscure
re2c seems very fast, and once you get the hang of it, not that difficult to use. For the contrived test of looking for DayOfWeek within a 1kb buffer, here are some results ... ymmv (WinXp-Sp2 Microsoft ToolKit-2003 C++ Compiler with /O2 optimize flag)
regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)";
By-hand parser: Elapsed for 10000 loops Ms: 110.583 By-hand parser: Elapsed for 100000 loops Ms: 1107.81
re2c generator Elapsed for 10000 loops Ms: 69.4683 re2c generator Elapsed for 100000 loops Ms: 700.546
Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 410.492 Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4164.45
Boost::regex: Elapsed for 10000 loops Ms: 622.377 Boost::regex: Elapsed for 100000 loops Ms: 5965.24
Interesting! Can you confirm that re2c is not handling backreferences? That is, after a match, is there a way to access what the Nth group matched? Also, do you think you could send around the code that re2c is generating for this expression? I'm guessing that re2c is generating a DFA. Boost.Regex and Xpressive generate NFAs because DFAs aren't suited to doing backreferences[*]. I've considered adding DFA support to Xpressive, and use DFAs for those regexes that don't need the full power of NFAs. Clearly, the performance win would be worth the trouble. This would not be a trivial undertaking, however. [*] Technically, DFAs only have a problem with patterns such as "(.)\\1"; that is, when the result of the backreference is used within the pattern itself. -- Eric Niebler Boost Consulting www.boost-consulting.com

Interesting! Can you confirm that re2c is not handling backreferences? That is, after a match, is there a way to access what the Nth group matched? Also, do you think you could send around the code that re2c is generating for this expression?
This regex newbie is ignorant about what it means to "back-reference." My usage (so far) has been recognizing one type of expression at a time, like: ((Sunday|Sun)|(Monday|Mon) ...etc.(Saturday|Sat)) or ZipCode ##### or #####-#### It seems to be able to get part way thru the longer possibility, and then "settle for" the shorter "abbreviation". That probably isn't back-referencing. ZipCode 00000-000 is "almost" #####-####, but isn't, so it recognizes #####. MondX is "almost" Monday, but isn't, so it recognizes Mon, which is also group=2. It is relatively straightforward to get the results of a single match and do what you want with it ... I don't have experience with untangling something more complicated like a multi-piece date/time: MMM DDD, YYYY hh:mm:ss [ap].m
I'm guessing that re2c is generating a DFA. Boost.Regex and Xpressive generate NFAs because DFAs aren't suited to doing backreferences[*]. I've considered adding DFA support to Xpressive, and use DFAs for those regexes that don't need the full power of NFAs. Clearly, the performance win would be worth the trouble. This would not be a trivial undertaking, however.
[*] Technically, DFAs only have a problem with patterns such as "(.)\\1"; that is, when the result of the backreference is used within the pattern itself.
Re2c generates VERY gnarly code, full of goto's and labels ... but the compiler is happy and the optimizer seems to straighten everything out into fast object code. Re2c is apparently part of PHP. yy4: yych = *++YYCURSOR; goto yy3; yy5: yych = *++YYCURSOR; if(yych <= '/') goto yy6; if(yych <= '9') goto yy7; yy6: YYCURSOR = YYMARKER; switch(yyaccept){ case 1: goto yy10; case 0: goto yy3; } yy7: yych = *++YYCURSOR; if(yych <= '/') goto yy6; if(yych >= ':') goto yy6; yych = *++YYCURSOR; if(yych <= '/') goto yy6;

regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)";
By-hand parser: Elapsed for 10000 loops Ms: 110.583 By-hand parser: Elapsed for 100000 loops Ms: 1107.81
re2c generator Elapsed for 10000 loops Ms: 69.4683 re2c generator Elapsed for 100000 loops Ms: 700.546
Boost::xpressive-static-iterator: 10000 loops Ms: 410.492 Boost::xpressive-static-iterator: 100000 loops Ms: 4164.45
Eric Niebler wrote:
Interesting!
I did some HiResTimer comparisons to strstr and wonder if the results are credible ... the re2c numbers are much closer to strstr than I expected. (also, these reflect some "tweaking" since the previous email that recognized DayOfWeek rather than ZipCode, and the numbers below are about 40% faster than previous) const char* pzStrToScan_Re2cSearch = "12345 at pos=0 and " "another zip-code 12345-6789 at pos=36 and " "another 98765-4321 at pos=69 and " "another at end=113 11223-3445"; const char* pzStrToScan_strstr = "12345 at pos=0 and " "another zip-code 12345-6789 at pos=36 and " "another 12345-4321 at pos=69 and " "another at end=113 12345-3445"; WinXp-Sp2 vc7.1 on AMD-3700 10,000 loops thru above to find 40,000 matches strstr just looking for 12345: 3.2 milliseconds Re2cSearch looking for [0-9]{5}(-[0-9]{4})? : 5.3 ms 100,000 loops thru above to find 400,000 matches (mostly to verify that optimizer isn't distorting the results) strstr just looking for 12345: 31.8 milliseconds Re2cSearch looking for [0-9]{5}(-[0-9]{4})? : 53.8 ms The re2c generated code also passes a relatively extensive cppunit-like test.
Also, do you think you could send around the code that re2c is generating for this expression?
Here is a link to a .zip with the vc6 and vc7.1 projects (vc8 to follow): http://cleanspeech.sf.net/misc/re2c_ZipCodeRe_060419.zip Feedback appreciated, especially if I've messed up and the numbers are flawed (not unlikely). (Also, there is some "hold your nose" code in this C prototype. The intent is to eventually be able to use the relatively simple ZipCodeRe as a "getting up to speed" example for re2c newbies (like myself), and perhaps as a template/clone for other 'recognizers'.)

On 4/13/06, Lynn Allan
Stuart Dootson wrote:
Lynn - it might be interesting to use re2c (http://re2c.org/) to generate regex machines and see what their performance is like - after all, they do clain 're2c generates warning free code that is equal to hand-written code in terms of size, speed and quality.' (I'm offering no opinion on the veracity of that statement....), so (allegedly) they'd give a reasonable hand-code analogue....
Were you able to get it to work to do anything even close to recognizing Days-of-the-Week (full and abbreviations)? Did you find actual compilable examples with "main" invoking the generated scanners with a test-string to crunch?
Nope - never used it - just know of it. I've only needed to discard Boost.Regex in a few (profiler indicated) cases, at which point I've hand-coded a suitable routine. Stuart Dootson
participants (3)
-
Eric Niebler
-
Lynn Allan
-
Stuart Dootson