[xpressive] Support for multi-capture and balancing groups

Hi! My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine that can handle the same stuff as the .NET implementation can do, to no avail. During my searches I've stumbled on several forum threads where others have been searching for the same thing but it doesn't seem to exist a regular expression library in C/C++ that handles both named captures and multicapture. Found boost.xpressive and it had almost everything we need. It's open source, fast, got flexible api and named captures. But alas, just as all other C and C++ based implementations I have found, it lacked multiple captures. So, I added it. On top of that I added support for balancing groups (http://blog.stevenlevithan.com/archives/balancing-groups). But the syntax for the pop capture and capture conditional is slightly different then the .NET version to better fit xpressive. Syntax for pop capture: dynamic: (?P<-name>stuff) static: (name -= stuff) Syntax for capture conditional: dynamic: (?P(name)stuff) static: (name &= stuff) There is no support for the (?<name-othername>stuff) construct. All captures made by a group is stored in sub_match::captures which is a vector of sub_match_capture objects. A sub_match_capture behaves like a stripped down sub_match. It can be put in an ostream and has a length and helper function for returning a string. The changes are in the vault and can be found here: http://tinyurl.com/3aak7mp It can be unpacked against trunk from 2010-10-02 or the 1.44.0 release. I've run the dynamic regression tests without errors and I have added some tests for the new functionality. The code it only tested on Visual Studio 2010 since I don't have access to any other compiler. Please give feedback on my changes since I would love to see them in an official release. Thanks in advance. Regards, Erik

On 10/2/2010 5:52 PM, Erik Rydgren wrote:
Hi!
My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine that can handle the same stuff as the .NET implementation can do, to no avail. During my searches I've stumbled on several forum threads where others have been searching for the same thing but it doesn't seem to exist a regular expression library in C/C++ that handles both named captures and multicapture.
Found boost.xpressive and it had almost everything we need. It's open source, fast, got flexible api and named captures. But alas, just as all other C and C++ based implementations I have found, it lacked multiple captures.
It sort of has them, just not in the form you happen to be looking for. In xpressive, you can call named regexes from other regexes. When you do that, you end up with nested match_results. If you quantify a named regex, you end up with a sequence of match_results, kind of like multicapture. Sadly, it's not very efficient to create a tree of match_results, and xpressive gives you no help in navigating this tree. It's a bit of an ugly hack. FWIW, Boost.Regex has multicapture if you compile with a certain flag, IIRC.
So, I added it.
Whoa, cool!
On top of that I added support for balancing groups (http://blog.stevenlevithan.com/archives/balancing-groups). But the syntax for the pop capture and capture conditional is slightly different then the .NET version to better fit xpressive.
Syntax for pop capture: dynamic: (?P<-name>stuff) static: (name -= stuff)
Syntax for capture conditional: dynamic: (?P(name)stuff) static: (name &= stuff)
There is no support for the (?<name-othername>stuff) construct.
I'll need to read up on what those constructs do. Can you send some pointers?
All captures made by a group is stored in sub_match::captures which is a vector of sub_match_capture objects. A sub_match_capture behaves like a stripped down sub_match. It can be put in an ostream and has a length and helper function for returning a string.
The changes are in the vault and can be found here: http://tinyurl.com/3aak7mp
It can be unpacked against trunk from 2010-10-02 or the 1.44.0 release. I've run the dynamic regression tests without errors and I have added some tests for the new functionality. The code it only tested on Visual Studio 2010 since I don't have access to any other compiler.
Please give feedback on my changes since I would love to see them in an official release. Thanks in advance.
This sounds really great and I have every intention of taking this change once I grok it and look over the code. Can you open a feature request ticket at http://svn.boost.org so I don't forget, because I'm a little busy at the moment. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On 10/2/2010 5:52 PM, Erik Rydgren wrote:
Hi!
My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine that can handle the same stuff as the .NET implementation can do, to no avail. During my searches I've stumbled on several forum threads where others have been searching for the same thing but it doesn't seem to exist a regular expression library in C/C++ that handles both named captures and multicapture.
Found boost.xpressive and it had almost everything we need. It's open source, fast, got flexible api and named captures. But alas, just as all other C and C++ based implementations I have found, it lacked multiple captures.
It sort of has them, just not in the form you happen to be looking for. In xpressive, you can call named regexes from other regexes. When you do that, you end up with nested match_results. If you quantify a named regex, you end up with a sequence of match_results, kind of like multicapture. Sadly, it's not very efficient to create a tree of match_results, and xpressive gives you no help in navigating this tree. It's a bit of an ugly hack.
Yea, I realized that but it wasn't practical for our needs. I also did a static solution that used actions to make the captures just to try them out.
FWIW, Boost.Regex has multicapture if you compile with a certain flag, IIRC.
Ok, I didn't know that. Will take a second look at Boost.Regex then.
So, I added it.
Whoa, cool!
That is the respose I was hoping for :)
On top of that I added support for balancing groups (http://blog.stevenlevithan.com/archives/balancing-groups). But the syntax for the pop capture and capture conditional is slightly different then the .NET version to better fit xpressive.
Syntax for pop capture: dynamic: (?P<-name>stuff) static: (name -= stuff)
Syntax for capture conditional: dynamic: (?P(name)stuff) static: (name &= stuff)
There is no support for the (?<name-othername>stuff) construct.
I'll need to read up on what those constructs do. Can you send some pointers?
I already did, this blogpost explains them without fuss http://blog.stevenlevithan.com/archives/balancing-groups. But the very short version is that a (?P<-tag>exp) first matches exp then removes the last capture from an earlier group named tag. If the tag group haven't captured anything yet it fails and backtracks. The (?P(tag)exp) is a shorthand if-then-else where the else part always matches. Pseudo code: if (tag has matched) { exp must match } else { true }. To demonstrate here is the regression definitions I've made ; multi capture [test175] str=aabb pat=(..)* br0=aabb cp0_0=aabb br1=bb cp1_0=aa cp1_1=bb [end] ; multi capture several groups [test176] str=abba pat=(.){2}(.){2} br0=abba cp0_0=abba br1=b cp1_0=a cp1_1=b br2=a cp2_0=b cp2_1=a [end] ; multi capture, pop capture with backreference, check capture [test177] str=startabccbarest pat=^(.*?)(?P<n>.)+(?P<-n>(?P=n))+(?P(n)(?!))(.*)$ br0=startabccbarest cp0_0=startabccbarest br1=start cp1_0=start br2= br3=rest cp3_0=rest [end] ; match count [test178] str=aabb pat=^(?P<n>a)*(?P<-n>b)*(?P(n)(?!))$ br0=aabb br1= [end] ; match count, fail on pop [test179] str=aabbb pat=^(?P<n>a)*(?P<-n>b)*$ [end] ; match count, fail on check [test180] str=aab pat=^(?P<n>a)*(?P<-n>b)*(?P(n)(?!))$ [end]
All captures made by a group is stored in sub_match::captures which is a vector of sub_match_capture objects. A sub_match_capture behaves like a stripped down sub_match. It can be put in an ostream and has a length and helper function for returning a string.
The changes are in the vault and can be found here: http://tinyurl.com/3aak7mp
It can be unpacked against trunk from 2010-10-02 or the 1.44.0 release. I've run the dynamic regression tests without errors and I have added some tests for the new functionality. The code it only tested on Visual Studio 2010 since I don't have access to any other compiler.
Please give feedback on my changes since I would love to see them in an official release. Thanks in advance.
This sounds really great and I have every intention of taking this change once I grok it and look over the code. Can you open a feature request ticket at http://svn.boost.org so I don't forget, because I'm a little busy at the moment.
Will do.
-- Eric Niebler BoostPro Computing http://www.boostpro.com _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 10/2/2010 5:52 PM, Erik Rydgren wrote:
Hi!
My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine [SNIP]
This sounds really great and I have every intention of taking this change once I grok it and look over the code. Can you open a feature request ticket at http://svn.boost.org so I don't forget, because I'm a little busy at the moment.
Ticket #4704: Support for multicapture and balancing groups
-- Eric Niebler BoostPro Computing http://www.boostpro.com
/ Erik

On Sun, Oct 3, 2010 at 3:16 AM, Erik Rydgren <erik@rydgrens.net> wrote:
On 10/2/2010 5:52 PM, Erik Rydgren wrote:
Hi!
My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine
[SNIP]
This sounds really great and I have every intention of taking this change once I grok it and look over the code. Can you open a feature request ticket at http://svn.boost.org so I don't forget, because I'm a little busy at the moment.
Ticket #4704: Support for multicapture and balancing groups
For note, depending on what exactly you need to parse, but if you can statically define the parse then Boost.Spirit can do all you want and a whole lot more with a lot more simple code and it will execute a lot faster. If you have an example of what you want done, I (or others) can give an example Spirit code to do the same if you want a comparison.

On Sun, Oct 3, 2010 at 3:16 AM, Erik Rydgren <erik@rydgrens.net> wrote:
On 10/2/2010 5:52 PM, Erik Rydgren wrote:
Hi!
My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine
[SNIP]
This sounds really great and I have every intention of taking this change once I grok it and look over the code. Can you open a feature request ticket at http://svn.boost.org so I don't forget, because I'm a little busy at the moment.
Ticket #4704: Support for multicapture and balancing groups
For note, depending on what exactly you need to parse, but if you can statically define the parse then Boost.Spirit can do all you want and a whole lot more with a lot more simple code and it will execute a lot faster. If you have an example of what you want done, I (or others) can give an example Spirit code to do the same if you want a comparison.
That is a kind offer but unfortunately I don't have the luxury of static expressions. The regex patterns is stored in a database and other sources and we use them for a wide range of tasks. But I surly will take a look at Spirit anyway, just so I might have another tool in my belt ready when needed. Thanks Erik

On Oct 3, 2:16 am, "Erik Rydgren" <e...@rydgrens.net> wrote:
On 10/2/2010 5:52 PM, Erik Rydgren wrote:
Hi!
My company have been using pcre for a long while but there has been gripes about it only returning the last matched value from a capture group. Because of this I have been searching for a C++ regex engine [SNIP]
This sounds really great and I have every intention of taking this change once I grok it and look over the code. Can you open a feature request ticket athttp://svn.boost.orgso I don't forget, because I'm a little busy at the moment.
Ticket #4704: Support formulticaptureand balancing groups
Hi Erik, I've been looking at your changes and have some comments. Please see the bug report: https://svn.boost.org/trac/boost/ticket/4704. In particular, the last one is a deal-breaker until it's fixed: """ Oh wait! I just realized from looking at your change to mark_end_matcher that you are always populating the captures member of the sub_match struct (for non-hidden captures). This will have disastrous performance implications. There cannot potentially be an allocation every time you match a marked sub-expression. I can't accept this change. I suggest you find a way to make this behavior opt-in. """ I hope we can find a solution to this problem. -- Eric Niebler BoostPro Computing www.boostpro.com
participants (3)
-
Eric Niebler
-
Erik Rydgren
-
OvermindDL1