pathological regex?

newer
Copy of a C++ object extended with...

older
Define a callback member function...

Michał Nowotka

27 Aug 2009 27 Aug '09

9:22 a.m.

Hello, I use boost regex v 1.37 on Ubuntu 9.04, g++-4.3. Because of performance issues i replaced defaut matching algorithm to BOOST_REGEX_RECURSIVE but then my program dont't work properly. So, is this expression pathological: (https?://)?([^/@]*[\\.@])?google.com[/\\?&#]* ? -- Regards Michał Nowotka

Show replies by date

OvermindDL1

27 Aug 27 Aug

9:58 p.m.

2009/8/27 Michał Nowotka <mmmnow@gmail.com>:

...

Hello, I use boost regex v 1.37 on Ubuntu 9.04, g++-4.3. Because of performance issues i replaced defaut matching algorithm to BOOST_REGEX_RECURSIVE but then my program dont't work properly.

So, is this expression pathological: (https?://)?([^/@]*[\\.@])?google.com[/\\?&#]*

?

If you want the best performance and you are always matching something similar then you should use Spirit2.1 from the boost trunk, it is a *great* deal faster, and I *think* it works on Boost 1.37 and higher, but I know it works on 1.38 and higher.

Michał Nowotka

10:50 p.m.

Ok, first of all this isn't answer to my question. Secondly - do you have any performance test to campere these libraries? -- Regards Michał Nowotka

Mike Marchywka

11:40 p.m.

----------------------------------------

...

Date: Fri, 28 Aug 2009 00:50:35 +0200 From: To: boost-users@lists.boost.org Subject: Re: [Boost-users] pathological regex?

Ok, first of all this isn't answer to my question. Secondly - do you have any performance test to campere these libraries?

I've never used Spirit but using Boost regex and greta for repetitive stuff I finally had to write my own very limited compiler- however, it is very important if you do that to have a known-good library against which you can test your optimized stuff. I'm not sure what Spirit does at compile time as opposed to runtime but I guess a code generator may be something to consider. If you really need speed for a fixed expression, I think you are stuck with hard coded logic. Parsing is never really free. Even without a fixed literal expression, you may be able to define a class of regex's which your custom version could accommodate and be fast enough for what you need but not requiring new code each time.

...

-- Regards

Michał Nowotka _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

_________________________________________________________________ Windows Live: Make it easier for your friends to see what you're up to on Facebook. http://windowslive.com/Campaign/SocialNetworking?ocid=PID23285::T:WLMTAGL:ON...

OvermindDL1

28 Aug 28 Aug

12:08 a.m.

2009/8/27 Michał Nowotka <mmmnow@gmail.com>:

...

Ok, first of all this isn't answer to my question. Secondly - do you have any performance test to campere these libraries?

I cannot answer if that is pathological or not as I do not know regex that well, but you did mention performance issues, and Spirit2.1 is faster then regex in just about every possible way. Spirit2.1 creates a parse tree at compile time that is fully optimized for what you are wanting to parse, and thus far it is faster at every testing thing, from optimized old price parsers to even being faster then the built-in atoi and similar kin. I do not know the regex syntax perfectly, but based on "(https?://)?([^/@]*[\\.@])?google.com[/\\?&#]*", a similar Spirit2.1 parser might be: -("http" >> -lit('s') >> "://") // equal to the (https?://)? part above

...

...
-(*char_("^/@") >> char_(".@")) // equal to the ([^/@]*[\\.@])? part above "google.com" *char_("/?&#") // equal to the [/\\?&#]* part above

That would create the Spirit2.1 parse tree at compile time, and if you wished it could parse out anything using their proper type faster then anything else that has yet been tested. But yes, Spirit2.1 does contain a few benchmarks with it in the docs, and the spirit mailing lists contain a rather large hoard of a lot more of other tests that people have done themselves. The latest Spirit2.1 docs are at: http://svn.boost.org/svn/boost/trunk/libs/spirit/doc/html/index.html The docs are consistently being updated, they are the only part of Spirit2.1 not yet complete, but are being worked on all the time. I need to help with that... >.>

John Maddock

8:55 a.m.

...

I use boost regex v 1.37 on Ubuntu 9.04, g++-4.3. Because of performance issues i replaced defaut matching algorithm to BOOST_REGEX_RECURSIVE but then my program dont't work properly.

How so? Do you have a test case? And more to the point, did you rebuild the regex library with BOOST_REGEX_RECURSIVE set? If not then the program will very likely crash. BTW the difference between the recursive and non-recursive algorithms is usually too small to notice.

...

So, is this expression pathological: (https?://)?([^/@]*[\\.@])?google.com[/\\?&#]*

Superficially it look OK to me... but there is no such thing as a pathological expression on it's own: it's the combination of the pattern and the text being matched that can become pathological. The only problem case that springs to mind in that pattern would be long sequences of .'s but I could be wrong. Do you have a test case you can post so I can debug things here? HTH, John.

Michał Nowotka

10:40 a.m.

...

And more to the point, did you rebuild the regex library with BOOST_REGEX_RECURSIVE set?

I've installed boost libs just by typing sudo apt-get install libboost1.37-dev so I've never compiled the library. I thought all I should do is to uncomment #define BOOST_REGEX_RECURSIVE in boost/regex/user.hpp and rebuild my project. I'll try to dwonload sources and biuld regex from it. BTW - can you give me some tips how can I improve performance when matching regex I've mentioned? Maybe there is possibility to optimize it by using some more sophisticated expressions form perl syntax? -- Regards Michał Nowotka

John Maddock

11:19 a.m.

...

...
And more to the point, did you rebuild the regex library with BOOST_REGEX_RECURSIVE set?

I've installed boost libs just by typing sudo apt-get install libboost1.37-dev so I've never compiled the library. I thought all I should do is to uncomment #define BOOST_REGEX_RECURSIVE in boost/regex/user.hpp and rebuild my project. I'll try to dwonload sources and biuld regex from it.

BTW - can you give me some tips how can I improve performance when matching regex I've mentioned? Maybe there is possibility to optimize it by using some more sophisticated expressions form perl syntax?

The basic idea is to make each alternative mutually exclusive, so for example given: [something]*[something-else]* If "something" and "something-else" are mutually exclusive then you should never get pathological behavior, if they're not exclusive then long runs of whatever they have in common can trigger problems. Otherwise it's more down to your code - avoid temporary strings and match_results, and avoid creating regular expressions "on the fly". That's why I suggested that you post a test case. HTH, John.

5840

Age (days ago)

5841

Last active (days ago)

List overview

Download

7 comments

4 participants

participants (4)

John Maddock
Michał Nowotka
Mike Marchywka
OvermindDL1