Regex ease-of-use ideas


David Abrahams

5 Apr 2004

7:22 p.m.

I was just writing up a simple tutorial example; finding the subject in a set of email headers. Here's what I got:

    std::string line;
    boost::regex pat("^Subject: (Re: )?(.*)");
    boost::smatch matches;

    while (std::cin)
    {
        std::getline(std::cin, line);
        if (boost::regex_match(line, matches, pat))
            std::cout << matches[2];
    }

1. There's no way to search a stream for a match because a regex requires bidirectional iterators, so I have to do this totally frustrating line-by-line search. I think Spirit has some kind of iterator that turns an input iterator into something forward by holding a cache of the data starting with the earliest copy of the original iterator. Could something like that be added?

2. Seems to me that if match objects could be converted to bool, we might be able to:

    std::string line;
    boost::regex pat("^Subject: (Re: )?(.*)");

    while (std::cin)
    {
        std::getline(std::cin, line);
        if (boost::smatch m = boost::regex_match(line, pat))
            std::cout << m[2];
    }

which would be much smoother to the touch. Are match objects expensive to construct?

-- Dave Abrahams Boost Consulting www.boost-consulting.com
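As a point of reference, the line-by-line approach from the message above can be written as a small self-contained function. This is only a sketch under assumptions: it uses the standard library's std::regex instead of Boost.Regex so that it builds without extra dependencies, the function name subjects is made up for illustration, and the loop is driven by the result of std::getline rather than by testing the stream before reading (testing first can act on a stale line once the final read fails).

```cpp
#include <cassert>
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect the subject text (without a leading "Re: ")
// from every "Subject:" header line read from `in`.
std::vector<std::string> subjects(std::istream& in)
{
    static const std::regex pat("^Subject: (Re: )?(.*)");
    std::vector<std::string> found;
    std::string line;
    std::smatch matches;                 // one match object, reused for every line
    while (std::getline(in, line))       // the loop ends when a read fails
        if (std::regex_match(line, matches, pat))
            found.push_back(matches[2].str());
    return found;
}
```

Called with a std::istringstream (or std::cin), it returns one entry per matching header line.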


Allan Odgaard

5 Apr

9:55 p.m.

On 5. Apr 2004, at 21:22, David Abrahams wrote:

...

1. There's no way to search a stream for a match because a regex requires bidirectional iterators, so I have to do this totally frustrating line-by-line search. I think Spirit has some kind of iterator that turns an input iterator into something forward by holding a cache of the data starting with the earliest copy of the original iterator. Could something like that be added?

Added only to the regex library? sounds like it would be a very useful general purpose iterator adaptor, as there are also other (non standard) algorithms which need to backtrack over the input.

...

2. Seems to me that if match objects could be converted to bool, we might be able to:

I can only second that, I am currently using my own regex library (some of my reasoning to be found in this c.l.c++.m thread: <http://tinyurl.com/2xnbd>), here I also allow implicit conversion to the iterator type, which allow code like:

    iterator it = regex:find(first, last, ptrn);

Although I already did propose it for boost, but was told that it poses a problem with the ambiguity of an "empty" match at the end of the string and "no match at all" -- my argument here is that if one knows that the pattern might generate such a match (and one is interested in knowing about it), one just declares the result to be the match object. The former generally allows to code w/o all those if's to see if something was actually matched -- at least it has made much of my code simpler/shorter.

...

if (boost::smatch m = boost::regex_match(line, pat))

or: if (boost::smatch const& m = boost::regex_match(line, pat))

...

[...] Are match objects expensive to construct?

At least they do not have to be :)

David Abrahams

6 Apr

2:55 a.m.

Allan Odgaard <ML@Top-House.DK> writes:

...

On 5. Apr 2004, at 21:22, David Abrahams wrote:

...
1. There's no way to search a stream for a match because a regex requires bidirectional iterators, so I have to do this totally frustrating line-by-line search. I think Spirit has some kind of iterator that turns an input iterator into something forward by holding a cache of the data starting with the earliest copy of the original iterator. Could something like that be added?

Added only to the regex library? sounds like it would be a very useful general purpose iterator adaptor, as there are also other (non standard) algorithms which need to backtrack over the input.

Of course. I just meant, "added for the benefit of the regex library".

...

...
2. Seems to me that if match objects could be converted to bool, we might be able to:

I can only second that, I am currently using my own regex library (some of my reasoning to be found in this c.l.c++.m thread: <http://tinyurl.com/2xnbd>), here I also allow implicit conversion to the iterator type, which allow code like:

iterator it = regex:find(first, last, ptrn);

Although I already did propose it for boost, but was told that it poses a problem with the ambiguity of an "empty" match at the end of the string and "no match at all" -- my argument here is that if one knows that the pattern might generate such a match (and one is interested in knowing about it), one just declares the result to be the match object. The former generally allows to code w/o all those if's to see if something was actually matched -- at least it has made much of my code simpler/shorter.

Sounds good to me. John? -- Dave Abrahams Boost Consulting www.boost-consulting.com

John Maddock

10:59 a.m.

...

I was just writing up a simple tutorial example; finding the subject in a set of email headers. Here's what I got:

std::string line; boost::regex pat("^Subject: (Re: )?(.*)"); boost::smatch matches;

while (std::cin) { std::getline(std::cin, line); if (boost::regex_match(line,matches, pat)) std::cout << matches[2]; }

1. There's no way to search a stream for a match because a regex requires bidirectional iterators, so I have to do this totally frustrating line-by-line search. I think Spirit has some kind of iterator that turns an input iterator into something forward by holding a cache of the data starting with the earliest copy of the original iterator. Could something like that be added?

Yes, but it's a more general iterator type rather than just regex specific, incidentally I also have a use for a "fileview" class which presents a file's contents as a pair of random access iterators. If you want me to provide these though, you'll need to wait until I've finished the next round of regex internal changes / refactoring.

...

2. Seems to me that if match objects could be converted to bool, we might be able to:

std::string line; boost::regex pat("^Subject: (Re: )?(.*)");

while (std::cin) { std::getline(std::cin, line); if (boost::smatch m = boost::regex_match(line, pat)) std::cout << m[2]; }

which would be much smoother to the touch. Are match objects expensive to construct?

Currently, expensive'ish. Originally these were reference counted, and cheap to copy, but I ran into problems with thread safety (it's not uncommon to obtain a match with one thread, then hand off a copy to another thread for processing). Now that we have a thread safe shared_ptr though I need to revisit this, it just makes my head hurt trying to analyse concurrent code :-|

One other thing - the current regex_match overload that doesn't take a match_results as a parameter currently returns bool - the intent is that if the user doesn't need the info generated in the match_results, then some time can be saved by not storing it. Boost.Regex doesn't currently take advantage of that, but I was planning to in the next revision (basically you can cut out memory allocation altogether, and that's an order of magnitude saving).

...

...
...
2. Seems to me that if match objects could be converted to bool, we might be able to:

I can only second that, I am currently using my own regex library (some of my reasoning to be found in this c.l.c++.m thread: <http://tinyurl.com/2xnbd>), here I also allow implicit conversion to the iterator type, which allow code like:

iterator it = regex:find(first, last, ptrn);

Although I already did propose it for boost, but was told that it poses a problem with the ambiguity of an "empty" match at the end of the string and "no match at all" -- my argument here is that if one knows that the pattern might generate such a match (and one is interested in knowing about it), one just declares the result to be the match object. The former generally allows to code w/o all those if's to see if something was actually matched -- at least it has made much of my code simpler/shorter.

Sounds good to me. John?

So we make match_results implicitly convertible to its iterator type? I'm not necessarily against that, but there are dangers: mainly, as Allan stated, that you can easily miss corner cases (when the regex matches a zero-length string).

John.
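The corner case mentioned here can be reproduced with a small sketch. The helper find_start below is hypothetical (it is not part of Boost.Regex, nor taken from Allan's library) and uses std::regex to stay self-contained. It assumes the convention that a search returns the start of the match, or last when nothing matches. Under that assumption, a failed search and an empty match at the end of the input yield the same iterator, while the match object still tells them apart.

```cpp
#include <cassert>
#include <regex>
#include <string>

using iter = std::string::const_iterator;

// Hypothetical iterator-returning search: start of the match, or `last` if none.
iter find_start(iter first, iter last, const std::regex& pat)
{
    std::match_results<iter> m;
    return std::regex_search(first, last, m, pat) ? m[0].first : last;
}

// The same search, but reporting success through the match object.
bool found(const std::string& text, const std::regex& pat)
{
    std::smatch m;
    return std::regex_search(text, m, pat);
}
```

For the text "abc", the patterns "[0-9]+" (no match) and "$" (empty match at the end) both make find_start return the end iterator, whereas found distinguishes the two cases.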

Pavol Droba

12:55 p.m.

On Tue, Apr 06, 2004 at 11:59:00AM +0100, John Maddock wrote: Hi,

...

So we make match_results implicitly convertible to its iterator type? I'm not necessarily against that, but there are dangers: mainly, as Allan stated, that you can easily miss corner cases (when the regex matches a zero-length string).

I just want to add 0.02p.

Maybe it would be good, to make regex match_results convertible to the iterator_range defined in the string algo library. iterator_range is used as a result of find operations, it is convertible to safe-bool and its purpose is to delimit a part of a collection.

It was requested during the review of the string algo library, that this facility should be extracted and documented as a general purpose utility. It can be standardized then as a generic result of find operation.

What would be the advantage? User can write

    iterator_range res=regex_match(...);
    if(res)
    {
        std::copy(res.begin(), res.end(), It); // or whatever
    }

iterator_range copying is cheap, it is safe from the problem of empty match (empty is a well defined state) and there are algorithms that it can be used with. If the user does not need all the info provided by the regex match_results (which is not so uncommon) then iterator_range is sufficient. regex match_results would make a logical extension to iterator_range.

Regards, Pavol
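A minimal sketch of the usage described above, under stated assumptions: found_range is a hypothetical stand-in written for this example and is NOT boost::iterator_range, it carries an explicit success flag so that an empty match and a failed search stay distinguishable (a design choice of this sketch), and std::regex is used in place of Boost.Regex. The function name search_range is also made up.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical range type for this sketch; not the string algo library's class.
struct found_range {
    std::string::const_iterator first, last;
    bool found;                                        // true even for an empty match
    explicit operator bool() const { return found; }
    std::string::const_iterator begin() const { return first; }
    std::string::const_iterator end() const { return last; }
};

// Search `text` and keep only the bounds of the whole match.
// `text` must outlive the returned range, because the iterators point into it.
found_range search_range(const std::string& text, const std::regex& pat)
{
    std::smatch m;                                     // discarded on return
    if (std::regex_search(text, m, pat))
        return {m[0].first, m[0].second, true};
    return {text.end(), text.end(), false};
}
```

With this, `if (found_range r = search_range(text, pat)) std::copy(r.begin(), r.end(), out);` reads much like the snippet in the message, at the price of dropping every sub-match except the whole match.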

David Abrahams

6:43 p.m.

Pavol Droba <droba@topmail.sk> writes:

...

On Tue, Apr 06, 2004 at 11:59:00AM +0100, John Maddock wrote: Hi,

...
So we make match_results implicitly convertible to its iterator type? I'm not necessarily against that, but there are dangers: mainly, as Allan stated, that you can easily miss corner cases (when the regex matches a zero-length string).

I just want to add 0.02p.

Maybe it would be good, to make regex match_results convertible to the iterator_range defined in the string algo library. iterator_range is used as a result of find operations, it is convertible to safe-bool and its purpose is to delimit a part of a collection.

It was requested during the review of the string algo library, that this facility should be extracted and documented as a general purpose utility. It can be standardized then as a generic result of find operation.

What would be the advantage? User can write

iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators? -- Dave Abrahams Boost Consulting www.boost-consulting.com

Pavol Droba

8:56 p.m.

On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote: [snip]

...

...
What would be the advantage? User can write

iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range. Pavol

David Abrahams

9:51 p.m.

Pavol Droba <droba@topmail.sk> writes:

...

On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

[snip]

...
...
What would be the advantage? User can write

iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Eric Niebler

11:39 p.m.

David Abrahams wrote:

...

Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

...
...
What would be the advantage? User can write iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched.

A sub-match is just a glorified pair of iterators. The iterators refer to the sequence being searched, which has a lifetime independent of the match object. There is no risk of iterator invalidation here. -- Eric Niebler Boost Consulting www.boost-consulting.com
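The lifetime point can be illustrated with a short sketch. It assumes std::regex rather than Boost.Regex, and the function name second_group is hypothetical. The match object is a local that is destroyed when the function returns, yet the copied sub-match remains usable because its two iterators point into the caller's string. What would not survive is an iterator into the match object itself (for example one obtained from its begin()), which the sketch never hands out.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: return a copy of the second sub-match.
// `line` must outlive the result; the match object does not need to.
std::ssub_match second_group(const std::string& line, const std::regex& pat)
{
    std::smatch m;                       // local collection of sub-matches
    if (std::regex_match(line, m, pat))
        return m[2];                     // copies the iterator pair and the matched flag
    return std::ssub_match();            // default-constructed: matched == false
}
```

The returned std::ssub_match can be turned into a string with str() for as long as the searched line is alive.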

David Abrahams

7 Apr

12:33 a.m.

"Eric Niebler" <eric@boost-consulting.com> writes:

...

David Abrahams wrote:

...
Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

...
...
What would be the advantage? User can write iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched.

A sub-match is just a glorified pair of iterators. The iterators refer to the sequence being searched, which has a lifetime independent of the match object. There is no risk of iterator invalidation here.

I'm not worried about submatch's iterators over the original input sequence, but about the match_results's iterators over submatches. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Pavol Droba

6:19 a.m.

On Tue, Apr 06, 2004 at 08:33:56PM -0400, David Abrahams wrote:

...

"Eric Niebler" <eric@boost-consulting.com> writes:

...
David Abrahams wrote:

...
Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

...
...
What would be the advantage? User can write iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched.

A sub-match is just a glorified pair of iterators. The iterators refer to the sequence being searched, which has a lifetime independent of the match object. There is no risk of iterator invalidation here.

I'm not worried about submatch's iterators over the original input sequence, but about the match_results's iterators over submatches.

These are destroyed and forgotten. That's the idea of the example. If you are interested only in the success/failure of the search operation, or you want to know which part of sequence has been matched, then you don't need full match_results. Those two iterators (contained in the iterator_range) are more than sufficient, and you don't need to pay an extra price implied by match_results. Maybe I got it wrong, but the discussion before seemed to deal with a problem like this. Pavol

David Abrahams

7:37 a.m.

Pavol Droba <droba@topmail.sk> writes:

...

On Tue, Apr 06, 2004 at 08:33:56PM -0400, David Abrahams wrote:

...
"Eric Niebler" <eric@boost-consulting.com> writes:

...
David Abrahams wrote:

...
Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

...
What would be the advantage? User can write iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched.

A sub-match is just a glorified pair of iterators. The iterators refer to the sequence being searched, which has a lifetime independent of the match object. There is no risk of iterator invalidation here.

I'm not worried about submatch's iterators over the original input sequence, but about the match_results's iterators over submatches.

These are destroyed and forgotten. That's the idea of the example. If you are interested only in the success/failure of the search operation, or you want to know which part of sequence has been matched, then you don't need full match_results. Those two iterators (contained in the iterator_range) are more than sufficient, and you don't need to pay an extra price implied by match_results.

Maybe I got it wrong, but the discussion before seemed to deal with a problem like this.

Maybe I got it wrong; I want to be able to look at the 2nd submatch when the match succeeded. If all I get is an iterator to the whole substring that matched, it's of no use to me. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Pavol Droba

8:34 a.m.

On Wed, Apr 07, 2004 at 03:37:46AM -0400, David Abrahams wrote:

...

Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 08:33:56PM -0400, David Abrahams wrote:

...
"Eric Niebler" <eric@boost-consulting.com> writes:

...
David Abrahams wrote:

...
Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

What would be the advantage? User can write iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched.

A sub-match is just a glorified pair of iterators. The iterators refer to the sequence being searched, which has a lifetime independent of the match object. There is no risk of iterator invalidation here.

I'm not worried about submatch's iterators over the original input sequence, but about the match_results's iterators over submatches.

These are destroyed and forgotten. That's the idea of the example. If you are interested only in the success/failure of the search operation, or you want to know which part of sequence has been matched, then you don't need full match_results. Those two iterators (contained in the iterator_range) are more than sufficient, and you don't need to pay an extra price implied by match_results.

Maybe I got it wrong, but the discussion before seemed to deal with a problem like this.

Maybe I got it wrong; I want to be able to look at the 2nd submatch when the match succeeded. If all I get is an iterator to the whole substring that matched, it's of no use to me.

Sorry, I have overlooked your need for the second match. Then the whole story I have written is probably of little use to you. You can of course apply the same idea to submatches, i.e. write something like this:

    iterator_range res=regex_search(....)[2]

Anyway, this would work if you need only a specific part of the match, not if you need to play with the whole match_results object.

Regards, Pavol

Pavol Droba

6:14 a.m.

On Tue, Apr 06, 2004 at 05:51:07PM -0400, David Abrahams wrote:

...

Pavol Droba <droba@topmail.sk> writes:

...
On Tue, Apr 06, 2004 at 02:43:20PM -0400, David Abrahams wrote:

[snip]

...
...
What would be the advantage? User can write

iterator_range res=regex_match(...);

Doesn't the match object get destroyed here, thereby invalidating the iterators?

Not really as far as I know. Iterators are bound to the collection that is being searched, not to the match itself. Therefore, their validity should not be bound to the lifetime of the match. They will be copied to the iterator_range.

The match object *is* the collection (of submatches) being searched.

But what is a submatch? From the definition it is a pair of iterators. Those iterators are for sure not pointing into the submatch itself, but rather into the container that was searched. Unless I'm mistaken, but then it would make no sense. In the example above, the iterator_range will contain the pair of iterators from the top level submatch. Pavol

Giovanni Bajo

6 Apr

1:13 p.m.

John Maddock wrote:

...

Yes, but it's a more general iterator type rather than just regex specific, incidentally I also have a use for a "fileview" class which presents a file's contents as a pair of random access iterators.

Like what, boost::spirit::file_iterator (boost/spirit/iterator/file_iterator.hpp)? -- Giovanni Bajo

David Abrahams

6:47 p.m.

"Giovanni Bajo" <giovannibajo@libero.it> writes:

...

John Maddock wrote:

...
Yes, but it's a more general iterator type rather than just regex specific, incidentally I also have a use for a "fileview" class which presents a file's contents as a pair of random access iterators.

Like what, boost::spirit::file_iterator (boost/spirit/iterator/file_iterator.hpp)?

Probably, but a rhetorical question: "Where is the documentation for this component?" http://www.boost.org/libs/spirit/doc/file_iterator.html is really inadequate. I can't even tell what iterator category it provides. FWIW, regex requires bidirectional iterators. IMHO, Spirit sorely needs some formal reference docs. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Giovanni Bajo

7 Apr

1:08 p.m.

David Abrahams wrote:

...

...
Like what, boost::spirit::file_iterator (boost/spirit/iterator/file_iterator.hpp)?

Probably, but a rhetorical question:

"Where is the documentation for this component?"

http://www.boost.org/libs/spirit/doc/file_iterator.html is really inadequate.

I think it serves its purpose for using the iterator within Spirit. I agree it should be reworked if file_iterator were to be generalized in Boost.

...

I can't even tell what iterator category it provides.

It's a random access iterator. You can experiment with Regex if you want, it should work as-is. IIRC, it's used in Wave too. -- Giovanni Bajo

David Abrahams

2:54 p.m.

"Giovanni Bajo" <giovannibajo@libero.it> writes:

...

David Abrahams wrote:

...
...
Like what, boost::spirit::file_iterator (boost/spirit/iterator/file_iterator.hpp)?

Probably, but a rhetorical question:

"Where is the documentation for this component?"

http://www.boost.org/libs/spirit/doc/file_iterator.html is really inadequate.

I think it serves its purpose for using the iterator within Spirit. I agree it should be reworked if file_iterator were to be generalized in Boost.

Hmmph. I have to say that while I love what Spirit is doing, actually using it with confidence requires way too much use of the source, Luke. Reference docs are annoying to write, but necessary.

...

...
I can't even tell what iterator category it provides.

It's a random access iterator. You can experiment with Regex if you want, it should work as-is. IIRC, it's used in Wave too.

That's great! Thank you! -- Dave Abrahams Boost Consulting www.boost-consulting.com

Joel de Guzman

9 Apr

1:20 a.m.

David Abrahams wrote:

...

"Giovanni Bajo" <giovannibajo@libero.it> writes:

...

...
...
Probably, but a rhetorical question:

"Where is the documentation for this component?"

http://www.boost.org/libs/spirit/doc/file_iterator.html is really inadequate.

I think it serves its purpose for using the iterator within Spirit. I agree it should be reworked if file_iterator were to be generalized in Boost.

Hmmph. I have to say that while I love what Spirit is doing, actually using it with confidence requires way too much use of the source, Luke. Reference docs are annoying to write, but necessary.

Point well taken. Formal documentation is high on the TODO list. -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Fredrik Blomqvist

6 Apr

11:20 p.m.

Giovanni Bajo wrote:

...

John Maddock wrote:

...
Yes, but it's a more general iterator type rather than just regex specific, incidentally I also have a use for a "fileview" class which presents a file's contents as a pair of random access iterators.

Like what, boost::spirit::file_iterator (boost/spirit/iterator/file_iterator.hpp)?

[Slightly OT for this thread] This was interesting, any chance of making this a separate, free standing boost.component? Memory mapped file access has been briefly discussed before and a posix/windows implementation of mem-mapped files by Craig Henderson has been in the sandbox for quite some time. // Fredrik Blomqvist

Giovanni Bajo

7 Apr

1:06 p.m.

Fredrik Blomqvist wrote:

...

This was interesting, any chance of making this a separate, free standing boost.component?

If there is interest, we can probably ask a review for it. Given the tight review schedule, it might take a while.

...

Memory mapped file access has been briefly discussed before and a posix/windows implementation of mem-mapped files by Craig Henderson has been in the sandbox for quite some time.

spirit::file_iterator has different implementations (through a policy), and works on posix/windows, using memory mapped files when available. Plus, there is a freestanding version (std::fopen and friends). It uses Boost.IteratorAdaptor (new version). -- Giovanni Bajo

Neal D. Becker

3:41 p.m.

Giovanni Bajo wrote:

...

Fredrik Blomqvist wrote:

...
This was interesting, any chance of making this a separate, free standing boost.component?

If there is interest, we can probably ask a review for it. Given the tight review schedule, it might take a while.

...
Memory mapped file access has been briefly discussed before and a posix/windows implementation of mem-mapped files by Craig Henderson has been in the sandbox for quite some time.

spirit::file_iterator has different implementations (through a policy), and works on posix/windows, using memory mapped files when available. Plus, there is a freestanding version (std::fopen and friends). It uses Boost.IteratorAdaptor (new version).

I'm looking at your spirit::file_iterator impl. Looks pretty cool. One question. Why do failures, such as failure to open mmap posix file, return instead of throw?

David Abrahams

4:32 p.m.

"Neal D. Becker" <ndbecker2@verizon.net> writes:

...

...
spirit::file_iterator has different implementations (through a policy), and works on posix/windows, using memory mapped files when available. Plus, there is a freestanding version (std::fopen and friends). It uses Boost.IteratorAdaptor (new version).

I'm looking at your spirit::file_iterator impl. Looks pretty cool.

Yeah, very! Unfortunately it doesn't quite do what I want AFAICT. Suppose I want a file iterator over std::cin? I think I want a caching iterator that can adapt some other base iterator, e.g. istreambuf_iterator<char>. -- Dave Abrahams Boost Consulting www.boost-consulting.com
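One way to picture such a caching adaptor is sketched below. Everything in it is hypothetical (the names stream_cache and cached_iterator are invented here and correspond to no Boost or Spirit component), and it makes simplifying assumptions: the cache never discards anything, the iterator only moves forward, and dereferencing returns a char by value. A version usable with a regex engine would additionally need bidirectional movement and a policy for trimming the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical cache: pulls characters from a stream on demand and keeps
// all of them, so that earlier positions can be read again.
class stream_cache {
public:
    explicit stream_cache(std::istream& in) : in_(in) {}

    // True if position `i` exists; reads more input when necessary.
    bool has(std::size_t i)
    {
        while (buf_.size() <= i) {
            const std::istream::int_type c = in_.get();
            if (c == std::istream::traits_type::eof()) return false;
            buf_.push_back(static_cast<char>(c));
        }
        return true;
    }
    char at(std::size_t i) const { return buf_[i]; }

private:
    std::istream& in_;
    std::string buf_;
};

// Forward-moving iterator over a stream_cache. Copies can be advanced
// independently; an end iterator is represented by a null cache pointer.
class cached_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = char;              // returned by value in this sketch

    cached_iterator() = default;
    explicit cached_iterator(stream_cache& c) : cache_(&c) { settle(); }

    char operator*() const { return cache_->at(pos_); }
    cached_iterator& operator++() { ++pos_; settle(); return *this; }
    cached_iterator operator++(int) { cached_iterator old = *this; ++*this; return old; }

    friend bool operator==(const cached_iterator& a, const cached_iterator& b)
    {
        if (!a.cache_ || !b.cache_) return a.cache_ == b.cache_;
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const cached_iterator& a, const cached_iterator& b)
    {
        return !(a == b);
    }

private:
    // Become the end iterator once the stream has no character at pos_.
    void settle() { if (cache_ && !cache_->has(pos_)) cache_ = nullptr; }

    stream_cache* cache_ = nullptr;
    std::size_t pos_ = 0;
};
```

A copy of the start iterator taken before a first pass can be used for a second pass over the same characters, which a plain istreambuf_iterator<char> cannot offer.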

David Abrahams

6 Apr

6:40 p.m.

"John Maddock" <john@johnmaddock.co.uk> writes:

...

...
1. There's no way to search a stream for a match because a regex requires bidirectional iterators, so I have to do this totally frustrating line-by-line search. I think Spirit has some kind of iterator that turns an input iterator into something forward by holding a cache of the data starting with the earliest copy of the original iterator. Could something like that be added?

Yes, but it's a more general iterator type rather than just regex specific,

Of course.

...

incidentally I also have a use for a "fileview" class which presents a file's contents as a pair of random access iterators. If you want me to provide these though, you'll need to wait until I've finished the next round of regex internal changes / refactoring.

Note that we need some additional guarantees from libraries in order to use such an iterator. It isn't possible to move such an iterator N positions forward and then N positions backward unless there's another iterator pointing at or before the iterator's position in the sequence. Libraries have to guarantee that they won't do that, or describe their invalidation expectations.

...

...
2. Seems to me that if match objects could be converted to bool, we might be able to:

std::string line; boost::regex pat("^Subject: (Re: )?(.*)");

while (std::cin) { std::getline(std::cin, line); if (boost::smatch m = boost::regex_match(line, pat)) std::cout << m[2]; }

which would be much smoother to the touch. Are match objects expensive to construct?

Currently, expensive'ish. Originally these were reference counted, and cheap to copy

Actually, I was asking about initial construction cost, in particular of an object representing a failed match. The acceptance of N1610 means that copy costs should be insignificant for cases like this one, provided that the smatch author puts in the required effort to make it moveable. ;-)

...

but I ran into problems with thread safety (it's not uncommon to obtain a match with one thread, then hand off a copy to another thread for processing). Now that we have a thread safe shared_ptr though I need to revisit this, it just makes my head hurt trying to analyse concurrent code :-|

Well if we could solve problem #1, the expense of the initial construction becomes a non-issue for my case, because I'd only have to search once. And regardless of all that, often convenience is *way* more important than efficiency. That said, as long as the match object is immutable, there's little to worry about w.r.t. thread safety.

...

One other thing - the current regex_match overload that doesn't take a match_results as a parameter currently returns bool - the intent is that if the user doesn't need the info generated in the match_results, then some time can be saved by not storing it. Boost.Regex doesn't currently take advantage of that, but I was planning to in the next revision (basically you can cut out memory allocation altogether, and that's an order of magnitude saving).

But I do need the match results, when the match succeeds.

...

...
...
...
2. Seems to me that if match objects could be converted to bool, we might be able to:

I can only second that, I am currently using my own regex library (some of my reasoning to be found in this c.l.c++.m thread: <http://tinyurl.com/2xnbd>), here I also allow implicit conversion to the iterator type, which allow code like:

iterator it = regex:find(first, last, ptrn);

Although I already did propose it for boost, but was told that it poses a problem with the ambiguity of an "empty" match at the end of the string and "no match at all" -- my argument here is that if one knows that the pattern might generate such a match (and one is interested in knowing about it), one just declares the result to be the match object. The former generally allows to code w/o all those if's to see if something was actually matched -- at least it has made much of my code simpler/shorter.

Sounds good to me. John?

So we make match_results implicitly convertible to its iterator type? I'm not necessarily against that, but there are dangers: mainly, as Allan stated, that you can easily miss corner cases (when the regex matches a zero-length string).

I guess my original suggestion of making it implicitly convertible to some safe_bool solves that problem. I guess I prefer that idea, though Allan probably has more experience with this than I do. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Eric Niebler

7:03 p.m.

David Abrahams wrote:

...

Well if we could solve problem #1, the expense of the initial construction becomes a non-issue for my case, because I'd only have to search once. And regardless of all that, often convenience is *way* more important than efficiency.

That said, as long as the match object is immutable, there's little to worry about w.r.t. thread safety.

Here there be dragons. An immutable match object is appealing, but the performance would be problematic. The match object is essentially a std::vector of sub-matches, dynamically allocated because you don't know how many sub-matches there will be until runtime. Often, you want to match the same pattern repeatedly. In that case, you reuse the match object and avoid the extra allocations. Forcing the creation of a new match object each time would be a terrible pessimization. I'm all for ease-of-use, as long as accommodation is made for those who care about speed. -- Eric Niebler Boost Consulting www.boost-consulting.com

David Abrahams

7:22 p.m.

"Eric Niebler" <eric@boost-consulting.com> writes:

...

David Abrahams wrote:

...
Well if we could solve problem #1, the expense of the initial construction becomes a non-issue for my case, because I'd only have to search once. And regardless of all that, often convenience is *way* more important than efficiency. That said, as long as the match object is immutable, there's little to worry about w.r.t. thread safety.

Here there be dragons. An immutable match object is appealing, but the performance would be problematic. The match object is essentially a std::vector of sub-matches, dynamically allocated because you don't know how many sub-matches there will be until runtime.

Well, making the match object immutable is not my preferred approach anyway; I was just trying to help John out. -- Dave Abrahams Boost Consulting www.boost-consulting.com

John Maddock

7 Apr 7 Apr

10:54 a.m.

...

Actually, I was asking about initial construction cost, in particular of an object representing a failed match. The acceptance of N1610 means that copy costs should be insignificant for cases like this one, provided that the smatch author puts in the required effort to make it moveable. ;-)

Sounds like a hint - maybe if we could make shared_ptr moveable then we could all delegate the work to that :-)

As for the initial construction cost - yes there is a cost - it has to allocate memory to store the sub-expression matches, the matcher needs some working space and therefore starts storing the submatches before it knows whether there will be a match. Consider your current code:

std::string line;
boost::regex pat("^Subject: (Re: )?(.*)");
boost::smatch matches;
while (std::cin)
{
    std::getline(std::cin, line);
    if (boost::regex_match(line, matches, pat))
        std::cout << matches[2];
}

The first time regex_match gets called it allocates the storage it needs in the match_results class, subsequent calls then re-use this storage. This is efficient - in fact the cost of a single memory allocation is about 10 times that of a simple regex_match attempt - so this is very important IMO. In fact I've spent a lot of last year eliminating unnecessary memory allocations from regex, and there are some more I intend to stamp on this year. Believe me it makes a difference, and other libraries like GRETA and PCRE have all been through the same process and for the same reasons.

In contrast if regex_match returns a match_results structure then you effectively "pessimise" the performance for a small improvement in ease of use (although I admit that there are options similar to the small-string optimisation that *might* be applicable here).

BTW, just to be hyper critical, your alternative code:

std::string line;
boost::regex pat("^Subject: (Re: )?(.*)");
while (std::cin)
{
    std::getline(std::cin, line);
    if (boost::smatch m = boost::regex_match(line, pat))
        std::cout << m[2];
}

contains an assignment inside a while loop, which while "neat", I have often seen criticised for being potentially error prone, there are even some compilers that throw out a helpful(!) warning if you do that (along the lines of "didn't you want to use operator==").
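The reuse pattern described above can be sketched as follows. This version uses std::regex rather than Boost.Regex so that it is self-contained; whether a given implementation actually keeps and reuses the storage inside the match object across calls is an assumption here, taken from the description of Boost.Regex above. The function name `subjects` is made up for the example, and the loop tests the result of `std::getline` directly instead of testing the stream before reading.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Collects the subject text of every "Subject:" header line in the stream.
std::vector<std::string> subjects(std::istream& in)
{
    const std::regex pat("^Subject: (Re: )?(.*)");
    std::string line;
    std::smatch matches;  // constructed once, outside the loop, then reused
    std::vector<std::string> out;

    while (std::getline(in, line)) {
        if (std::regex_match(line, matches, pat)) {
            // Whole match plus the two capture groups.
            assert(matches.size() == 3);
            out.push_back(matches[2].str());
        }
    }
    return out;
}
```

Feeding it a `std::istringstream` holding a few header lines returns one entry per Subject line, with any leading "Re: " stripped by the optional first group.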

...

...
One other thing - the current regex_match overload that doesn't take a match_results as a parameter currently returns bool - the intent is that if the user doesn't need the info generated in the match_results, then some time can be saved by not storing it. Boost.Regex doesn't currently take advantage of that, but I was planning to in the next revision (basically you can cut out memory allocation altogether, and that's an order of magnitude saving).

But I do need the match results, when the match succeeds.

I understand that, but there is a group of users who don't - one example is a (commercial) email spam-filter that uses Boost.Regex. It only needs a true/false result "does this message have this pattern or not", and it wants the answer as fast as possible. For uses like this even a small change in performance can make the difference between "coping" and "not coping" with the email traffic they're seeing these days.
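For the yes/no use case, a minimal sketch looks like the following. It calls the overload that takes no match_results argument, shown here with std::regex_search standing in for the Boost.Regex equivalent. The pattern and the function names are invented for illustration, and whether a particular implementation really skips storing sub-matches on this path is not something the sketch can guarantee.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical filter rule: true if the message contains the phrase,
// ignoring case. No match_results object is requested.
bool looks_like_spam(const std::string& message)
{
    static const std::regex pat("free money", std::regex::icase);
    return std::regex_search(message, pat);
}

// Small usage example for the filter.
void usage_example()
{
    assert(looks_like_spam("Get FREE MONEY now"));
    assert(!looks_like_spam("Minutes of Tuesday's meeting"));
}
```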

...

I guess my original suggestion of making it implicitly convertible to some safe_bool solves that problem. I guess I prefer that idea, though Allan probably has more experience with this than I do.

OK, let me mull this over, maybe we can find a way to keep everyone happy, maybe not ... John.

David Abrahams

2:52 p.m.

"John Maddock" <john@johnmaddock.co.uk> writes:

...

BTW, just to be hyper critical, your alternative code:

std::string line; boost::regex pat("^Subject: (Re: )?(.*)");

while (std::cin) { std::getline(std::cin, line); if (boost::smatch m = boost::regex_match(line, pat)) std::cout << m[2]; }

contains an assignment inside a while loop,

No, that's an initialization. Note the leading declaration.

...

which while "neat", I have often seen criticised for being potentially error prone, there are even some compilers that throw out a helpful(!) warning if you do that (along the lines of "didn't you want to use operator==).

I don't think that's true of initializations.

...

...
...
One other thing - the current regex_match overload that doesn't take a match_results as a parameter currently returns bool - the intent is that if the user doesn't need the info generated in the match_results, then some time can be saved by not storing it. Boost.Regex doesn't currently take advantage of that, but I was planning to in the next revision (basically you can cut out memory allocation altogether, and that's an order of magnitude saving).

But I do need the match results, when the match succeeds.

I understand that, but there is a group of users who don't - one example is a (commercial) email spam-filter that uses Boost.Regex. It only needs a true/false result "does this message have this pattern or not", and it wants the answer as fast as possible. For uses like this even a small change in performance can make the difference between "coping" and "not coping" with the email traffic they're seeing these days.

Sure, I understand. I'm just saying that there's not necessarily anything wrong with providing the *option* of some convenience at the expense of speed.

...

...
I guess my original suggestion of making it implicitly convertible to some safe_bool solves that problem. I guess I prefer that idea, though Allan probably has more experience with this than I do.

OK, let me mull this over, maybe we can find a way to keep everyone happy, maybe not ...

We shall see, shan't we? -- Dave Abrahams Boost Consulting www.boost-consulting.com


participants (9)

Allan Odgaard
David Abrahams
Eric Niebler
Fredrik Blomqvist
Giovanni Bajo
Joel de Guzman
John Maddock
Neal D. Becker
Pavol Droba