Re: [boost] [Review] xpressive

Daryle Walker wrote:
On 9/15/05 1:10 PM, "Eric Niebler" <eric_at_[hidden]> wrote:
In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See?
<snip>
It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse. The first item of the list doesn't match that pattern (since it's the whole parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly"). An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.
I agree, if we were only concerned about satisfying people familiar with C++ culture. But we are also trying to satisfy people familiar with regex culture. Every regex package out there I know of that supports back-references begins numbering captures at 1. I don't know why. But I do know that to break with that tradition now would cause massive confusion. Besides, I'm trying to minimize the differences between xpressive's interface and TR1 regex. Is it a wart? OK, I agree. But frankly, I don't feel that this is an ugly enough wart for me to break with established practice. -- Eric Niebler Boost Consulting www.boost-consulting.com
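For readers following the numbering argument, here is a minimal sketch of the convention under discussion. It uses std::regex from C++11 (the standardized descendant of TR1 regex) rather than xpressive, so that it compiles without Boost; the helper name sub_matches is made up for illustration and is not part of either library.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Illustrative helper (hypothetical name): collects every sub-match of the
// first match of `pattern` in `text`.
// Index 0 holds the whole match; indices 1..N hold the captures,
// numbered by the position of their opening parenthesis.
std::vector<std::string> sub_matches(const std::string& text,
                                     const std::string& pattern)
{
    std::vector<std::string> out;
    std::smatch what;
    if (std::regex_search(text, what, std::regex(pattern))) {
        for (std::size_t i = 0; i < what.size(); ++i)
            out.push_back(what[i].str());
    }
    return out;
}
```

With the pattern "(\\d{4}-\\d{2}-\\d{2}) (\\w+) (\\S+@\\S+)" and the text "on 2005-09-15 mail eric@example.com", element 1 is the date and element 3 is the email address, while element 0 is the entire matched span. That is the "random access by meaning" use Eric describes.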

From: Eric Niebler <eric@boost-consulting.com>
Daryle Walker wrote:
On 9/15/05 1:10 PM, "Eric Niebler" <eric_at_[hidden]> wrote:
In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See? <snip>
It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse. The first item of the list doesn't match that pattern (since it's the whole
That's completely STL-friendly: there are iterators. When using STL-style algorithms, one must determine the applicable range. For many uses, I'll grant that you'd want to skip the first element, but that hardly constitutes being unfriendly to the STL.
parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly"). An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.
You could fatten the interface that way, but it really wouldn't gain much and can certainly lead to confusion because of the differing indices based upon which interface one uses. Since each user of the library could choose a different interface, maintenance would be more difficult due to requiring knowledge of all of the interfaces and knowing which was employed in a given case.
I agree, if we were only concerned about satisfying people familiar with C++ culture. But we are also trying to satisfy people familiar with regex culture. Every regex package out there I know of that supports back-references begins numbering captures at 1. I don't know why. But I do know that to break with that tradition now would cause massive confusion. Besides, I'm trying to minimize the differences between xpressive's interface and TR1 regex.
There are numerous examples of using the 0th element to be "the whole thing" and then the parts being elements 1 through N. For example awk uses $0 for the entire line and $1 through $(NF) for the fields matched by the field separator. IIRC, JavaScript's RE support provides the entire matched string in element 0 of the result, with the captures in elements 1 through N.
Is it a wart? OK, I agree. But frankly, I don't feel that this is an ugly enough wart for me to break with established practice.
I think it was a wise decision. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

On 9/16/05 12:49 PM, "Rob Stewart" <stewart@sig.com> wrote:
From: Eric Niebler <eric@boost-consulting.com>
Daryle Walker wrote:
On 9/15/05 1:10 PM, "Eric Niebler" <eric_at_[hidden]> wrote:
In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See? <snip>
It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse. The first item of the list doesn't match that pattern (since it's the whole
That's completely STL-friendly: there are iterators. When using STL-style algorithms, one must determine the applicable range. For many uses, I'll grant that you'd want to skip the first element, but that hardly constitutes being unfriendly to the STL.
Making something like: std::copy( pieces.begin(), pieces.end(), destination ); completely useless isn't unfriendly? You'll have to use a (mutable) object to store "pieces.begin()" so you can increment it before the copy, or use a "+ 1" if the "what" list supports random iteration.
parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly"). An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.
You could fatten the interface that way, but it really wouldn't gain much and can certainly lead to confusion because of the differing indices based upon which interface one uses.
I'm guessing that C++ people would use the single-step interface and not bother with numeric indices, and Regex people would do the reverse. And I suspect that the C++ format is internally generated anyway and just hidden before the whole-string piece is prepended to it. The only "flaw" is that numeric indices require random-access iteration, which brings a single-step interface because it's a superset of forward iteration.
Since each user of the library could choose a different interface, maintenance would be more difficult due to requiring knowledge of all of the interfaces and knowing which was employed in a given case.
Restating what I said, I think most people would pick a C++ culture interface at every step or a Regex culture interface at every step.
I agree, if we were only concerned about satisfying people familiar with C++ culture. But we are also trying to satisfy people familiar with regex culture. Every regex package out there I know of that supports back-references begins numbering captures at 1. I don't know why. But I do know that to break with that tradition now would cause massive confusion. Besides, I'm trying to minimize the differences between xpressive's interface and TR1 regex.
There are numerous examples of using the 0th element to be "the whole thing" and then the parts being elements 1 through N. For example awk uses $0 for the entire line and $1 through $(NF) for the fields matched by the field separator. IIRC, JavaScript's RE support provides the entire matched string in element 0 of the result, with the captures in elements 1 through N.
I'm guessing that these may not be independent examples, but simple borrowing of an interface. In other words, doing it just to follow precedent. Maybe JavaScript's RE does it this way only because regular "regex" does. And maybe "awk" does it because "regex" does. (Or since I don't know too much about Unix history, the order could be reversed so "regex" copied the idea from "awk" instead. And then "awk" would have done it to save resources, which were tight back then.)
Is it a wart? OK, I agree. But frankly, I don't feel that this is an ugly enough wart for me to break with established practice.
I think it was a wise decision.
-- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com
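The std::copy point above can be made concrete with a short sketch. It assumes std::regex from C++11 in place of xpressive (so it builds without Boost), and the function name captures_only is invented for this example. Passing the full [begin, end) range would also copy the whole-match entry, so the range is narrowed by one element first.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Illustrative helper (hypothetical name): copies only the captures
// (indices 1..N) of a match into a vector, skipping the whole-match
// entry stored at index 0.
std::vector<std::string> captures_only(const std::smatch& what)
{
    std::vector<std::string> out;
    if (what.empty())
        return out;  // no match, so there is no first element to skip
    // std::next returns an advanced copy, so no mutable iterator object
    // is needed and random access is not required.
    std::copy(std::next(what.begin()), what.end(), std::back_inserter(out));
    return out;
}
```

For a match of "(\\w+)=(\\w+)" against "key=value", this yields the two strings "key" and "value". The empty check matters: advancing begin() on a failed match would step past end().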

Daryle Walker wrote:
That's completely STL-friendly: there are iterators. When using STL-style algorithms, one must determine the applicable range. For many uses, I'll grant that you'd want to skip the first element, but that hardly constitutes being unfriendly to the STL.
Making something like:
std::copy( pieces.begin(), pieces.end(), destination );
completely useless isn't unfriendly?
<snip> I can see this is something you feel strongly about. If change is what you're after, you should send a Defect Report to comp.std.c++. This is not an area where I think it makes sense for xpressive to differ from TR1 regex. If the interface is changed for TR1, xpressive will follow suit. -- Eric Niebler Boost Consulting www.boost-consulting.com

From: Daryle Walker <darylew@hotmail.com>
On 9/16/05 12:49 PM, "Rob Stewart" <stewart@sig.com> wrote:
From: Eric Niebler <eric@boost-consulting.com>
Daryle Walker wrote:
On 9/15/05 1:10 PM, "Eric Niebler" <eric_at_[hidden]> wrote:
It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse. The first item of the list doesn't match that pattern (since it's the whole
That's completely STL-friendly: there are iterators. When using STL-style algorithms, one must determine the applicable range. For many uses, I'll grant that you'd want to skip the first element, but that hardly constitutes being unfriendly to the STL.
Making something like:
std::copy( pieces.begin(), pieces.end(), destination );
completely useless isn't unfriendly? You'll have to use a (mutable) object to store "pieces.begin()" so you can increment it before the copy, or use a "+ 1" if the "what" list supports random iteration.
Martin Bonner had the same reaction as I did when I read that: why not just pass ++pieces.begin()? Furthermore, as I said, one must always determine the applicable range to pass to an algorithm. Sure it's nice to pass the whole range, but you can't always do that. The downside here is that you'd rarely want to do that (maybe to copy the strings for some post processing or doing I/O).
parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly"). An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.
You could fatten the interface that way, but it really wouldn't gain much and can certainly lead to confusion because of the differing indices based upon which interface one uses.
I'm guessing that C++ people would use the single-step interface and not bother with numeric indices, and Regex people would do the reverse. And I suspect that the C++ format is internally generated anyway and just hidden before the whole-string piece is prepended to it. The only "flaw" is that numeric indices require random-access iteration, which brings a single-step interface because it's a superset of forward iteration.
What about "C++ people" that are also "Regex people?" Which do they use? Note also what I said here:
Since each user of the library could choose a different interface, maintenance would be more difficult due to requiring knowledge of all of the interfaces and knowing which was employed in a given case.
Restating what I said, I think most people would pick a C++ culture interface at every step or a Regex culture interface at every step.
If a "C++ person" chose the C++ interface and a maintainer was a "Regex person," confusion would ensue.
There are numerous examples of using the 0th element to be "the whole thing" and then the parts being elements 1 through N. For example awk uses $0 for the entire line and $1 through $(NF) for the fields matched by the field separator. IIRC, JavaScript's RE support provides the entire matched string in element 0 of the result, with the captures in elements 1 through N.
I'm guessing that these may not be independent examples, but simple borrowing of an interface. In other words, doing it just to follow precedent. Maybe JavaScript's RE does it this way only because regular "regex" does. And maybe "awk" does it because "regex" does. (Or since I don't know too much about Unix history, the order could be reversed so "regex" copied the idea from "awk" instead. And then "awk" would have done it to save resources, which were tight back then.)
You may be right, but is it wise to part with decades of precedent? Besides, I think Eric pointed out the biggest reason to keep the 1-based capture interface: the captures in the RE are 1-based, so those accessed from C++ should be, too. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Rob Stewart <stewart@sig.com> writes:
From: Daryle Walker <darylew@hotmail.com>
On 9/16/05 12:49 PM, "Rob Stewart" <stewart@sig.com> wrote:
From: Eric Niebler <eric@boost-consulting.com>
Martin Bonner had the same reaction as I did when I read that: why not just pass ++pieces.begin()?
Actually boost::next(pieces.begin()) would be more general.
Furthermore, as I said, one must always determine the applicable range to pass to an algorithm. Sure it's nice to pass the whole range, but you can't always do that. The downside here is that you'd rarely want to do that (maybe to copy the strings for some post processing or doing I/O).
parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly"). An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.
You could fatten the interface that way, but it really wouldn't gain much and can certainly lead to confusion because of the differing indices based upon which interface one uses.
I'm guessing that C++ people would use the single-step interface and not bother with numeric indices, and Regex people would do the reverse. And I suspect that the C++ format is internally generated anyway and just hidden before the whole-string piece is prepended to it. The only "flaw" is that numeric indices require random-access iteration, which brings a single-step interface because it's a superset of forward iteration.
What about "C++ people" that are also "Regex people?" Which do they use? Note also what I said here:
Since each user of the library could choose a different interface, maintenance would be more difficult due to requiring knowledge of all of the interfaces and knowing which was employed in a given case.
Restating what I said, I think most people would pick a C++ culture interface at every step or a Regex culture interface at every step.
If a "C++ person" chose the C++ interface and a maintainer was a "Regex person," confusion would ensue.
There are numerous examples of using the 0th element to be "the whole thing" and then the parts being elements 1 through N. For example awk uses $0 for the entire line and $1 through $(NF) for the fields matched by the field separator. IIRC, JavaScript's RE support provides the entire matched string in element 0 of the result, with the captures in elements 1 through N.
I'm guessing that these may not be independent examples, but simple borrowing of an interface. In other words, doing it just to follow precedent. Maybe JavaScript's RE does it this way only because regular "regex" does. And maybe "awk" does it because "regex" does. (Or since I don't know too much about Unix history, the order could be reversed so "regex" copied the idea from "awk" instead. And then "awk" would have done it to save resources, which were tight back then.)
You may be right, but is it wise to part with decades of precedent?
Besides, I think Eric pointed out the biggest reason to keep the 1-based capture interface: the captures in the RE are 1-based, so those accessed from C++ should be, too.
-- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;
-- Dave Abrahams Boost Consulting www.boost-consulting.com
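A short sketch of why a next-style helper is more general than ++pieces.begin(): when the iterator type is a raw pointer, begin() yields a temporary of built-in type, and applying ++ to it does not compile. The example uses std::next from C++11, which plays the same role as boost::next, and a hypothetical function sum_after_first operating on plain ints rather than on match results.

```cpp
#include <iterator>
#include <list>
#include <numeric>

// Illustrative helper (hypothetical name): sums every element except the
// first, for any range whose iterators are at least forward iterators.
template <class Range>
int sum_after_first(const Range& r)
{
    if (std::begin(r) == std::end(r))
        return 0;  // empty range: nothing to skip
    // std::next advances a copy of the iterator. Writing
    // ++std::begin(r) here would fail to compile for a built-in array,
    // because the pointer returned by std::begin is not a modifiable lvalue.
    return std::accumulate(std::next(std::begin(r)), std::end(r), 0);
}
```

The same template accepts a built-in array (pointer iterators) and a std::list (class-type bidirectional iterators), which is the generality being pointed out.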

Hi All, ----- Original Message ----- From: "Rob Stewart" <stewart@sig.com> To: <boost@lists.boost.org> Cc: <boost@lists.boost.org> Sent: Tuesday, September 20, 2005 4:23 AM Subject: Re: [boost] [Review] xpressive [snip]
It looks like the current setup is not STL-friendly. Most of the "what"
[snip]
That's completely STL-friendly: there are iterators. When using
[snip]
completely useless isn't unfriendly? You'll have to use a (mutable) object
[snip]
why not just pass ++pieces.begin()? Furthermore, as I said, one
in a 0-based array culture, like C++ (or C). C++ people would expect
[snip] the etc. Several interesting points but nothing to convince me that the 1-based access to the results vector is anything other than a legacy attribute of the xpressive interface. In fact, quite the opposite. The scale of the relevant history was a small surprise. Directing the 0-based activists to the TR1 and C++0x standards efforts sounded reasonable to me. As one of those "activists" I wouldn't feel comfortable consuming further time and resource smoothing such a small wrinkle. Someone's cute, original idea (K. Thompson?) is going to be around for some time yet, in spite of the changes in software development (OO, generic coding) over the intervening decades. An idea did bubble to the surface while reading this thread. If the justification for 1-based access relates to the notion that the index is a logical id for a member of some composite object (e.g. a class), can this be taken further? Are we not dealing with a lexical representation of a "record", "row" or "message"? Could the results machinery be augmented with a typelist somehow? Access through the results object might invoke an appropriate boost cast. NYI? Cheers.

On 9/16/05 11:44 AM, "Eric Niebler" <eric@boost-consulting.com> wrote:
Daryle Walker wrote:
On 9/15/05 1:10 PM, "Eric Niebler" <eric_at_[hidden]> wrote:
In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See?
<snip>
It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse. The first item of the list doesn't match that pattern (since it's the whole parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly"). An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.
I agree, if we were only concerned about satisfying people familiar with C++ culture. But we are also trying to satisfy people familiar with regex culture. Every regex package out there I know of that supports back-references begins numbering captures at 1. I don't know why. But I do know that to break with that tradition now would cause massive confusion. Besides, I'm trying to minimize the differences between xpressive's interface and TR1 regex.
I wasn't considering back-references. Does the list have to be tied to back-references at this level?
Is it a wart? OK, I agree. But frankly, I don't feel that this is an ugly enough wart for me to break with established practice.
But we should strive for an optimal interface, and then go back for any "necessary" warts. Since the nature of the 0-element is different from the rest of the elements, I'm guessing that you generate that element in a different manner than the others. Is that correct? If so, then you're handling the elements differently anyway in your internal computations and just merging them together at the user level. You already did the work for what I asked for; you're just doing extra work in hiding it. Didn't the "old" way just use numeric indices, never going through the list via single-step iteration? So there was never a mixed-metaphor problem that we have now. -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com
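One possible reading of the "separate member functions" proposal, sketched as a thin wrapper over std::smatch (C++11, used here instead of xpressive so the code stands alone). The class split_match and its members whole() and pieces() are hypothetical names chosen for this illustration; nothing like them exists in xpressive or TR1 regex, and this version copies the strings rather than exposing the library's internal storage.

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical wrapper: separates the whole match from the list of
// captures, so that pieces() is a homogeneous, 0-based sequence.
class split_match {
public:
    explicit split_match(const std::smatch& what)
    {
        if (what.empty())
            return;  // failed match: whole() and pieces() stay empty
        whole_ = what[0].str();
        for (std::smatch::const_iterator it = std::next(what.begin());
             it != what.end(); ++it)
            pieces_.push_back(it->str());
    }

    const std::string& whole() const { return whole_; }
    const std::vector<std::string>& pieces() const { return pieces_; }

private:
    std::string whole_;                // what[0] in regex numbering
    std::vector<std::string> pieces_;  // what[1..N], re-indexed from 0
};
```

Note the cost the replies warn about: pieces()[0] here is the capture that the pattern itself calls \1, so code using this wrapper and code using the regex-style indices disagree by one.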

I wasn't considering back-references. Does the list have to be tied to back-references at this level?
Yes. IMO it would be extremely confusing to refer to the same element in code as the zeroth element, and in the expression as \1. There is a very long history behind using 1 as the first index for the first sub-expression and 0 as the index for the whole match. Every programming language with native regex support does it that way, so does every regex lib I've ever seen (and Eric and I have seen - and written - a few between us). Sorry, but changing this is a really bad idea. John.
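The correspondence between \1 in the expression and index 1 in code can be seen in a small sketch. It assumes std::regex from C++11 with its default ECMAScript grammar rather than xpressive, and first_doubled_word is a made-up name for this example.

```cpp
#include <regex>
#include <string>

// Illustrative helper (hypothetical name): returns the first word that is
// immediately repeated in `text`, or an empty string if there is none.
std::string first_doubled_word(const std::string& text)
{
    // \1 in the pattern refers back to the first parenthesized group.
    static const std::regex doubled("\\b(\\w+) \\1\\b");
    std::smatch what;
    if (std::regex_search(text, what, doubled))
        return what[1].str();  // the same group, read with the same number
    return std::string();
}
```

For the input "if we we were" the function returns "we": the group written as \1 in the pattern is read back as what[1], while what[0] would be the whole matched span "we we".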
participants (6)
- Daryle Walker
- David Abrahams
- Eric Niebler
- John Maddock
- Rob Stewart
- Scott Woods