Thank you for your quick reply. It is precisely because of the leftmost-longest rule that I am using the POSIX extended style (Boost 1.33.1 with Borland C++Builder 6). To make sure that I didn't do anything wrong in my software, I have now written a separate command-line program:

///////////////////////////////////////////
#pragma hdrstop

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;
using namespace boost;

//---------------------------------------------------------------------------
int main(int argc, char* argv[])
{
    // my usual option, but makes no difference here, I think
    // (note: syntax_flags is not passed to the expressions below)
    regex_constants::syntax_option_type syntax_flags =
        regex_constants::extended & ~regex_constants::no_escape_in_lists;

    regex expression1("(<[^>]*>)|(<body([^>]*)>)", regex_constants::extended);
    regex expression2("(<body([^>]*)>)|(<[^>]*>)", regex_constants::extended);

    string testtext = "<body bgcolor=\"white\">";

    std::string::const_iterator start, end;
    start = testtext.begin();
    end = testtext.end();
    match_results<std::string::const_iterator> what;
    match_flag_type flags = match_default;

    cout << "testing: " << expression1 << endl;
    if(regex_search(start, end, what, expression1, flags))
    {
        int iCount = 0;
        for(unsigned int i = 0 ; i < what.size() ; ++i )
            cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
    }

    cout << endl;

    cout << "testing: " << expression2 << endl;
    if(regex_search(start, end, what, expression2, flags))
    {
        int iCount = 0;
        for(unsigned int i = 0 ; i < what.size() ; ++i )
            cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
    }

    return 0;
}
/////////////////////////////////////////////////////////

That's the result:

testing: (<[^>]*>)|(<body([^>]*)>)
sub-expression0: "<body bgcolor="white">"   // (<[^>]*>)|(<body([^>]*)>)
sub-expression1: "<body bgcolor="white">"   // <[^>]*>
sub-expression2: ""                         // <body([^>]*)>
sub-expression3: ""                         // [^>]*

testing: (<body([^>]*)>)|(<[^>]*>)
sub-expression0: "<body bgcolor="white">"   // (<body([^>]*)>)|(<[^>]*>)
sub-expression1: "<body bgcolor="white">"   // <body([^>]*)>
sub-expression2: " bgcolor="white""         // [^>]*
sub-expression3: ""                         // <[^>]*>

I had hoped that the matching parts would be the same for both arrangements of the alternatives. I would be very happy if I had made a mistake ;-)

Best regards

Detlef Meyer-Eltz

--

On Monday, 18 December 2006 at 18:31 you wrote:

Detlef Meyer-Eltz wrote:
I have difficulty predicting which part of a regular expression will match.
Example: I have a regular expression for a general HTML tag:
<[^>]*>
combined with an expression for the body tag:
<body([^>]*)>
to:
(<[^>]*>)|(<body([^>]*)>)
This expression matches the text:
<body bgcolor="white">
As both alternatives can match the input with the same length, I expected that the fourth part of the "Leftmost Longest" rule, repeated here, would determine which alternative is chosen:
4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on(e) such match possible then return it.
// note the missing 'e'
As the tag expression has no sub-expression at all, the body expression should win. Its sub-expression could match, but doesn't. It seems to me that the sequence of the alternatives determines the match.
Now I guess that I misinterpreted 4.: it's not a means to predict the matching alternative, but only a way to find the one that happened to match? My software constructs lexers from elementary expressions automatically, so it's important for me to direct and predict the expected matching alternative. Are there any other rules? Does the sequence of the alternatives determine the match unambiguously?
Which Boost.Regex version are you using, and how are you compiling the expression?
Recent versions default to the Perl matching rules: *which do not use the leftmost longest rule*. They match based on a "first match found" rule, so if the first alternative leads to a match then subsequent alternatives are never examined.
If you really want leftmost-longest semantics, then compile the expression as a POSIX extended regex, but of course then you lose the ability to use Perl-like regex extensions.
HTH, John.
PS, your analysis of the leftmost-longest rule looks correct however.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users