regex_search: Lazy evaluation of iterators
In previous emails I complained that if one used an extended syntax regex on an InputIterator (or a BidirectionalIterator) then regex_search sought to the end of the iterator before returning the first match. This has obvious disadvantages if one wants to implement lazy matching of regexes on a potentially large input stream. My hypothesis about what is happening is: The seeking to the end is due to the extended syntax wanting to know which is the leftmost match. This is done by asking the distance between the matched substring and the 'end' of the sequence. If the iterator is an InputIterator then STL (at least the version I'm using from gcc 3.3.2) says, the only way I can do this, is to seek to the end of the sequence, to find how far away I am from the end. Thus, the problem. If I redefine the iterator as a RandomAccessIterator, and define operator-, and operator +=, then STL uses operator- to find the distance between the match and the end. Which is better, but still not ideal, as I'm assuming that I dont know at that point where the end is. I'll know when I get there, but not before. I suggest that a cleaner conceptual way of doing this would be to ask how far we are from the start of the iterator. My apologies for any inaccuracies in the above and my ignorance of earlier debates on this or related issues. David
This has obvious disadvantages if one wants to implement lazy matching of regexes on a potentially large input stream.
Yes, it's a bug, or at least an unfortunate design choice.
My hypothesis about what is happening is:
I suggest that a cleaner conceptual way of doing this would be to ask how far we are from the start of the iterator.
That's basically what it does, but when the first potential match is found it "accidentally" ends up computing the length of the whole sequence, I'm testing the patch below now, all being well it should be in cvs shortly. John. Index: boost/regex/v4/match_results.hpp =================================================================== RCS file: /cvsroot/boost/boost/boost/regex/v4/match_results.hpp,v retrieving revision 1.18 diff -r1.18 match_results.hpp 303c303,304 < BidiIterator base = (*this)[-1].first; ---
BidiIterator base = this->prefix().first; BidiIterator end = this->suffix().second; 312a314,320 if((p1->first == end) && (p2->first != end)) { // don't compute distances if we don't have to: base1 = 1; base2 = 0; break; }
As an aside: with the following match values boost::regex_constants::match_flag_type mflags = boost::match_default | boost::match_not_dot_newline | boost::match_continuous ; The following is an irritating feature of the regex package expression.assign("[ \t\n]", boost::regex::extended); does not match tab or newline expression.assign("[ \t\\n]", boost::regex::extended); does not match tab or newline expression.assign("[ \\t\\n]", boost::regex::extended); does not match tab or newline The full expression I'm using is expression.assign("(keyword)|([a-zA-Z][a-zA-Z0-9_-]*)|([ \\t\\n]+)|([0-9]+)|(\"[^\"]*\")|(.)", sflags); ^^^^^^^^^ Presumably I need to set regbase::escape_in_lists but why arent TAB and NL allowed as raw characters in character ranges? David
John Maddock wrote:
That's basically what it does, but when the first potential match is found it "accidentally" ends up computing the length of the whole sequence, I'm testing the patch below now, all being well it should be in cvs shortly.
Thank you for looking at this issue. I tried your patch and it worked on the regex "(aaa)|(bbb)|(.)|(\n)" But I suspect it isnt the whole story, as I have the same 'seeking to end' problem with the more realistic tokeniser regex "(aaa)|([a-zA-Z][a-zA-Z0-9_-]*)|(.)|(\n)" where I'm looking for the longest identifier, but want 'aaa' treated as a keyword provided it isnt followed by other alphanums. In the foo4.cc file I sent, I changed expression.assign("(aaa)|([a-zA-Z][a-zA-Z0-9_-]*)|(.)|(\n)", sflag); and MyCharIter& MyCharIter::operator--(){ // prefix --X cout << "NT: --X called bp=" << bp << " c= '" << s[bp] <<"'" << endl; --bp; return *this; } as this function is now needed. I also wondered if the fact that match_continuous is set is relevant since all (toplevel) matches will start at the same place. David McKelvie
But I suspect it isnt the whole story, as I have the same 'seeking to end' problem with the more realistic tokeniser regex
"(aaa)|([a-zA-Z][a-zA-Z0-9_-]*)|(.)|(\n)"
where I'm looking for the longest identifier, but want 'aaa' treated as a keyword provided it isnt followed by other alphanums.
Doh! I forgot about subsequent (possibly unmatched) sub-expressions being used to break ties, there's another patch below that I'm testing now, I'm fairly sure it covers all eventualities this time.
I also wondered if the fact that match_continuous is set is relevant since all (toplevel) matches will start at the same place.
Correct, it's not relevant in this issue (it might be necessary in your "real" code however, depending on what you want to do). John. Index: boost/regex/v4/match_results.hpp =================================================================== RCS file: /cvsroot/boost/boost/boost/regex/v4/match_results.hpp,v retrieving revision 1.18 diff -r1.18 match_results.hpp 303c303,304 < BidiIterator base = (*this)[-1].first; ---
BidiIterator base = this->prefix().first; BidiIterator end = this->suffix().second; 312a314,341 if(p1->first == end) { if(p2->first != end) { // p2 must be better than p1, and no need to calculate // actual distances: base1 = 1; base2 = 0; break; } else { // *p1 and *p2 are either unmatched or match end-of sequence, // either way no need to calculate distances: base1 = 0; base2 = 0; len1 = 0; len2 = 0; break; } } else if((p2->first == end) && (p1->first != end)) { // p1 better than p2, and no need to calculate distances: base1 = 0; base2 = 1; break; }
Doh! I forgot about subsequent (possibly unmatched) sub-expressions being used to break ties, there's another patch below that I'm testing now, I'm fairly sure it covers all eventualities this time.
Or maybe not: forget that patch, it causes regressions, there will be one soon that really does work, honest ;-) John.
Hi;
I forgot about subsequent (possibly unmatched) sub-expressions being used to break ties, there's another patch below that I'm testing now, I'm fairly sure it covers all eventualities this time.
That's really cool. Thanks for your second patch to match_results.hpp. That now seems to work nicely with a bidirectional iterator. I'll report back if I see any further issues, but I'm pretty sure the remaining problems I'm having are now in my code. cheers, David
That now seems to work nicely with a bidirectional iterator. I'll report back if I see any further issues, but I'm pretty sure the remaining problems I'm having are now in my code.
In case you didn't notice my second message: there are some serious problems
with that as a proposed patch (it causes regressions in the test suite).
I've had to do a complete rewrite of that function to get things working
correctly - this also addresses another issue - if the first match found is
near the end of the sequence, then all the distances measured are relative
to the start of the sequence, and that could be a big hit for non-random
access iterators. In this case the workaround is to measure distances from
the start of the first match found (because no match further left than this
can be found by the matcher anyway). I'm still in the process of bolstering
up the test cases to check all this, but all being well the version below
really is the last one! Use this if you're in a rush, otherwise hold off
for a few days while I finish off the new test cases...
John.
template
Great, thanks. I got your second email, but even the previous buggy patch was enough to let me make progress. I'll wait with the new patch until I hear that you're happy with it. If this is going to be a big job and a problem, it may be worth considering if what I'm asking for isnt something outside of the concept of a bidirectional iterator or whether boost::regex isnt asking for more than that. I think the original code is relatively fine as long as the operator++ doesnt have to do a lot of work or dereference the iterators. The iterator I'm using has the property that I dont know the distance to the end or how to answer the question iter == end without dereferencing(reading) the whole input. David
Great, thanks. I got your second email, but even the previous buggy patch was enough to let me make progress. I'll wait with the new patch until I hear that you're happy with it.
It's in cvs now, as far as I can tell (and I've added quite a few new test cases), the third patch is finally correct.
I think the original code is relatively fine as long as the operator++ doesnt have to do a lot of work or dereference the iterators. The iterator I'm using has the property that I dont know the distance to the end or how to answer the question
iter == end
without dereferencing(reading) the whole input.
Yikes, that's going to really kill things, the code has to be able to tell whether it's reached the end of the sequence or not, and iter == end is evaluated frequently. One option may be to change your iterator implementation to use a special "singular" value for end-of-sequence, so that particular comparison then becomes trivial. HTH, John.
It's in cvs now, as far as I can tell (and I've added quite a few new test cases), the third patch is finally correct.
Great. Thanks for your help in this. Is what you emailed me the same as what is now in CVS? (I havent used the CVS before - but it may be time to learn)
Yikes, that's going to really kill things, the code has to be able to tell whether it's reached the end of the sequence or not, and
One option may be to change your iterator implementation to use a special "singular" value for end-of-sequence, so that particular comparison then becomes trivial.
Yes, that's what I do do. I probably didnt explain myself very well. But it's the operator++ that has to read input. TokenSource::iterator & TokenSource::iterator::operator++(){ // prefix ++X ++bp; get_stuff(); return *this; } void TokenSource::iterator::get_stuff(){ string a; if( bp >= fb->my_finish ){ // Run off end - read more if( fb->p->eof() ){ bp = -1; // special value for end() iterator } else { getline(*(fb->p),a); fb->inc_line_number(); fb->add_stuff(a+"\n"); } } } bool TokenSource::iterator::operator==(const TokenSource::iterator &x) const { return ( fb == x.fb ) && ( bp == x.bp ); } Anyway, I could implement it differently and put get_stuff in operator* and even in the case of a straight-forward string::iterator your patch now avoids doing many additions in the case of a very long string. David
Great. Thanks for your help in this.
Is what you emailed me the same as what is now in CVS? (I havent used the CVS before - but it may be time to learn)
Yes.
Yes, that's what I do do. I probably didnt explain myself very well. But it's the operator++ that has to read input.
Understood: that should be OK, the code now shouldn't call operator++ on something it isn't going to read (or hasn't read already). John.
participants (2)
-
David McKelvie
-
John Maddock