boost regex with ICU issue on OS X

Here's my situation.

We've got some software that makes use of the Boost regex libraries, and we've compiled and linked with the ICU libraries enabled. We need this for UTF-8 support. Our platforms are Windows (XP and Vista) and OS X.

I have a regular expression that parses a UTF-8 encoded HTML page. It's a rather complex expression, but I've made sure to add anchors so that I don't run into catastrophic backtracking. Another note: I'm using u32regex_search and my flags are match_default | match_partial. I have to use match_partial because the input data is long and I've seen memory exhausted even when I've increased BOOST_REGEX_MAX_BLOCKS (I probably have more work to do on the expression to reduce the complexity).

On Windows I have no problems. Partial matches are returned, I eventually get full matches, and my code runs great.

On OS X, u32regex_search returns and indicates it has found a partial match (what[0].matched == false). But what[0].second is set to the end of my input string, so no further matching takes place and my loop ends. This is very different behavior from what I'm seeing on Windows.

I know this email is vague, but I can provide any details that would be helpful in solving this issue. Why the difference between Windows and OS X, when I've compiled the Boost libraries with the same compiler options on both platforms? Is it an ICU compile issue?

Thanks,
Tommy McClung
IMSafer, Inc. Founder
tmcclung@imsafer.com

Tommy McClung wrote:
Here's my situation.
We've got some software that makes use of the boost regex libraries and we've compiled and linked with the ICU libraries enabled. We need this for utf-8 support.
Our platforms are Windows (XP and Vista) and OS X.
I have a regular expression that is parsing an html page that is utf-8 encoded and it's a rather complex expression, but I've made sure to add anchors such that I don't run into catastrophic backtracking. Another note, I'm using u32regex_search and my flags are match_default | match_partial. I have to use match_partial because the input data is long and I've seen memory exhausted even when I've increased BOOST_REGEX_MAX_BLOCKS (I probably have more work to do on the expression to reduce the complexity).
There may be a misunderstanding here - or rather I need to improve the error messages :-) If regex_search throws, it usually does so because the time complexity of matching the expression has grown too large too fast, so it gives up rather than risk going on indefinitely. You *can* get exceptions from very large expressions if BOOST_REGEX_MAX_BLOCKS is set too low, but it has to be a truly humongous expression for that to happen :-) So the normal fix is to try and make the expression as "precise" as possible and avoid things that end up looking like (.+)+
On Windows I have no problems. Partial matches are returned and I eventually get full matches and my code runs great.
On OS X, u32regex_search returns and indicates it has found a partial match (what[0].matched == false). But what[0].second is set to the end of my input string, so no further matching takes place and my loop ends. This is very different behavior than what I'm seeing on Windows.
If you get a partial match then what[0].second should always point at the end of the input: that's what a partial match means - that maybe if there was more input we could have had a full match.
I know this email is vague, but I can provide any details that would be helpful in solving this issue. Why the difference between Windows and OS X, when I've compiled both boost libraries with the same compiler options on both platforms? Is it an ICU compile issue?
Could be either: if you are using VC 6, 7 or 7.1 on Win32 (but not VC 8) then the regex engine uses a different, recursive (i.e. quicker) algorithm on Win32 than it does on platforms/compilers that don't support recovery from stack overflows. You can change the behaviour on either platform by defining either BOOST_REGEX_RECURSIVE or BOOST_REGEX_NON_RECURSIVE in boost/regex/user.hpp.

Probably the easiest thing is for you to let me have a test case I can play with.

HTH, John.
participants (2)
- John Maddock
- Tommy McClung