[RegEx] Does the match_extra flag follow the specification?

The Boost documentation at http://www.boost.org/libs/regex/doc/match_flag_type.html says:
"match_extra Instructs the matching engine to retain all available capture information; if a capturing group is repeated then information about every repeat is available via match_results::captures() or sub_match_captures()."
For me this was THE great feature: a way to link related pieces of information together.
But the behavior of this flag with the search algorithm was not what I expected.
Instead of holding information about every repeat, sub_match_captures() contains all the captures obtained for the corresponding sub-expression (as the documentation of sub_match's captures member at http://www.boost.org/libs/regex/doc/sub_match.html says).
A repeat of a capturing group is not the same thing as a capture, and this behavior prevents me from linking pieces of information that were captured in the same repeat.
For example (using named-capture syntax, which Boost does not support today, to clarify the regular expression):
^(?<time>[^ ]+)(?: (?<attr>[A-Za-z]+)=(?:"(?<qvalue>[^"]+)"|(?<svalue>[^ ]+)))+
which is intended to parse lines like this one:
12/05/2006_12:04:25 id=5 msg="this is a problem" user=paul
The captures for this example are:
time={'12/05/2006_12:04:25'}
attr={'id','msg','user'}
qvalue={'this is a problem'}
svalue={'5','paul'}
and I was expecting:
time={'12/05/2006_12:04:25'}
attr={'id','msg','user'}
qvalue={null,'this is a problem',null}
svalue={'5',null,'paul'}
I got "useless" data because the data structure is lost: there is no way to link 'paul' to 'user', or 'msg' to 'this is a problem'.
Sorry for my English, which may be a source of misunderstanding, but it would be good if the documentation matched the actual behavior, or if the library behaved as I expected.
I understand that the behavior I expected has a limitation: it does not account for the underlying structure when a repeated group is nested inside another repeated group.
There are several ways to avoid losing these relationships between data (with different degrees of relevance):
- Build a hierarchical tree of captures (a syntax tree).
- Provide an iterator over all captures that keeps track of the order in which they appear.
- Allow named captures with duplicate group names.
So, is it the documentation that needs fixing, or is it a bug?
Alquier Luc

Luc LA. ALQUIER wrote:
The Boost documentation at http://www.boost.org/libs/regex/doc/match_flag_type.html says:
"match_extra Instructs the matching engine to retain all available capture information; if a capturing group is repeated then information about every repeat is available via match_results::captures() or sub_match_captures()."
For me this was THE great feature: a way to link related pieces of information together.
But the behavior of this flag with the search algorithm was not what I expected.
Instead of holding information about every repeat, sub_match_captures() contains all the captures obtained for the corresponding sub-expression (as the documentation of sub_match's captures member at http://www.boost.org/libs/regex/doc/sub_match.html says).
I'm sorry you didn't find it clear: it contains information about every sub-expression that was *matched*.
For example (using named-capture syntax, which Boost does not support today, to clarify the regular expression):
^(?<time>[^ ]+)(?: (?<attr>[A-Za-z]+)=(?:"(?<qvalue>[^"]+)"|(?<svalue>[^ ]+)))+
which is intended to parse lines like this one:
12/05/2006_12:04:25 id=5 msg="this is a problem" user=paul
The captures for this example are:
time={'12/05/2006_12:04:25'}
attr={'id','msg','user'}
qvalue={'this is a problem'}
svalue={'5','paul'}
and I was expecting:
time={'12/05/2006_12:04:25'}
attr={'id','msg','user'}
qvalue={null,'this is a problem',null}
svalue={'5',null,'paul'}
Nod: understood. However there are a couple of problems here:
1) The underlying engine has no knowledge of whether one capturing group is embedded "inside" another. This pretty much rules out tree-like structures without a major rewrite.
2) If a capturing group is unmatched then the engine can't output an empty string to the result array because it never "sees" the unmatched capture so it has no knowledge that it has been skipped over.
I got "useless" data because the data structure is lost: there is no way to link 'paul' to 'user', or 'msg' to 'this is a problem'.
Sorry for my English, which may be a source of misunderstanding, but it would be good if the documentation matched the actual behavior, or if the library behaved as I expected.
I understand that the behavior I expected has a limitation: it does not account for the underlying structure when a repeated group is nested inside another repeated group.
There are several ways to avoid losing these relationships between data (with different degrees of relevance):
- Build a hierarchical tree of captures (a syntax tree).
- Provide an iterator over all captures that keeps track of the order in which they appear.
That's the only option that might be possible.
- Allow named captures with duplicate group names.
So, is it the documentation that needs fixing, or is it a bug?
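To illustrate the "captures in order of appearance" idea from the list above: if each recorded capture also exposes where it starts in the input, the flat per-group lists can be merged by position, and each value then follows the attribute it belongs to. The sketch below does not call Boost.Regex; the Capture struct and the functions in_input_order and pair_up are hypothetical names introduced here, and the assumption is that real captures carry their position (for example through the iterators held by a sub_match).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one recorded capture: the group that produced it,
// its start offset in the input, and the captured text.
struct Capture {
    std::string group;
    std::size_t offset;
    std::string text;
};

// Merge the per-group capture lists into a single list sorted by input position.
std::vector<Capture> in_input_order(const std::vector<std::vector<Capture>>& per_group) {
    std::vector<Capture> all;
    for (const auto& group : per_group) {
        all.insert(all.end(), group.begin(), group.end());
    }
    std::stable_sort(all.begin(), all.end(),
                     [](const Capture& a, const Capture& b) { return a.offset < b.offset; });
    return all;
}

// Walk the ordered list: every "attr" capture opens a new pair, and the next
// non-attr capture (qvalue or svalue) becomes its value.
std::vector<std::pair<std::string, std::string>> pair_up(const std::vector<Capture>& ordered) {
    std::vector<std::pair<std::string, std::string>> pairs;
    for (const auto& c : ordered) {
        if (c.group == "attr") {
            pairs.emplace_back(c.text, std::string());
        } else if (!pairs.empty()) {
            pairs.back().second = c.text;
        }
    }
    return pairs;
}
```

Fed with the attr, qvalue and svalue lists from the example line (plus their offsets), pair_up would give id/5, msg/this is a problem, user/paul. This only works for a flat repeat like the one in the example; nested repeats would need more bookkeeping.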
I don't think it's either, but I'll see if I can make the docs clearer.
Aside: one way to tackle the "maybe a string, or maybe not" problem would be to use:
(")?(?(1)[^"]*|[^ ]*)(?(1)")
Hopefully I've typed that in correctly!
Regards, John.
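A different way to keep each attribute tied to its value, independent of match_extra, is to match in two passes: take the timestamp first, then search for one attr=value pair at a time so that every match holds exactly one repeat. The following is a minimal sketch using std::regex rather than Boost.Regex (so it does not use the conditional expression above); ParsedLine and parse_line are hypothetical names, and the second pass searches for pairs instead of requiring them to be contiguous, which is slightly looser than the original single expression.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: the timestamp plus (attribute, value) pairs
// in the order they appear on the line.
struct ParsedLine {
    std::string time;
    std::vector<std::pair<std::string, std::string>> attrs;
};

ParsedLine parse_line(const std::string& line) {
    ParsedLine out;

    // Pass 1: the timestamp is everything up to the first space.
    static const std::regex time_re(R"(^([^ ]+))");
    std::smatch m;
    if (!std::regex_search(line, m, time_re)) {
        return out;  // empty result when the line has no leading token
    }
    out.time = m[1].str();

    // Pass 2: one match per " attr=value" repeat, so the attribute and its
    // value always come from the same match. Group 2 is the quoted form,
    // group 3 the unquoted form.
    static const std::regex pair_re(R"re( ([A-Za-z]+)=(?:"([^"]+)"|([^ ]+)))re");
    auto first = std::sregex_iterator(line.begin() + m.length(0), line.end(), pair_re);
    for (auto it = first; it != std::sregex_iterator(); ++it) {
        const std::smatch& p = *it;
        out.attrs.emplace_back(p[1].str(), p[2].matched ? p[2].str() : p[3].str());
    }
    return out;
}
```

Called on the example line, parse_line should yield the timestamp and the three pairs id/5, msg/this is a problem, and user/paul, without needing placeholders for unmatched groups.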

John Maddock wrote:
Luc LA. ALQUIER wrote:
"match_extra Instructs the matching engine to retain all available capture information; if a capturing group is repeated then information about every repeat is available via match_results::captures() or sub_match_captures()."
For me this was THE great feature: a way to link related pieces of information together.
But the behavior of this flag with the search algorithm was not what I expected.
<snip>
Nod: understood. However there are a couple of problems here:
1) The underlying engine has no knowledge of whether one capturing group is embedded "inside" another. This pretty much rules out tree-like structures without a major rewrite.
You might look into xpressive, a new regex engine that will be part of Boost 1.34. With xpressive, you can build grammars out of regexes, and the result of matching such a grammar is a tree of results.
You can read about xpressive's regex grammars and nested results here: http://tinyurl.com/9f6x3
The code lives here: http://boost-consulting.com/vault/index.php?directory=Strings%20-%20Text%20P...
--
Eric Niebler
Boost Consulting
www.boost-consulting.com
participants (3): Eric Niebler, John Maddock, Luc LA. ALQUIER