boost::regex_iterator illogical behaviour?
Hi,

I am using boost::regex_iterator to list every match of a user-configured regular expression in the contents of a file. The file is read into a single string, which is passed to the regex_iterator constructor.

Take the example expression "a|\Ab": every letter 'a' should be a match, and the letter 'b' should be a match only if it is the first character of the file. But the algorithm does not work like this! The letter 'b' is matched every time it directly follows an 'a', because '\A' does not mean the beginning of the original buffer supplied by the client code. After the first match, '\A' only means the end of the previous match.

I can imagine situations where a metacharacter with that meaning is needed, but I need a different behaviour. Diving into the Boost source code:

    template <snip>
    class regex_iterator_implementation
    {
       <snip>
       bool next()
       {
          if(what.prefix().first != what[0].second)
             flags |= match_prev_avail;
          BidirectionalIterator next_start = what[0].second;
          match_flag_type f(flags);
          if(!what.length())
             f |= regex_constants::match_not_initial_null;
          bool result = regex_search(next_start, end, what, *pre, f);
          if(result)
             what.set_base(base);
          return result;
       }
       <snip>
    }

What I would think logical is (note the two new lines):

    template <snip>
    class regex_iterator_implementation
    {
       <snip>
       bool next()
       {
          if(what.prefix().first != what[0].second)
             flags |= match_prev_avail;
          BidirectionalIterator next_start = what[0].second;
          match_flag_type f(flags);
          if(!what.length())
             f |= regex_constants::match_not_initial_null;
          if(base != next_start)                      // new
             f |= regex_constants::match_not_bob;     // new
          bool result = regex_search(next_start, end, what, *pre, f);
          if(result)
             what.set_base(base);
          return result;
       }
       <snip>
    }

Or at least provide a new flag to enable this behaviour. It is quite tedious to reimplement regex_iterator in client code just to add this small feature.

What do you think?

- Sandor
Confirmed as a bug; it'll be fixed in CVS very shortly. Thanks for the report.

John.
participants (2)
- John Maddock
- Sandor