regex_search? Which subexpression smatch'ed?
<alert comment="boost and regex newbie"> I'm using regex_search to detect which subexpression matched. I adapted the "Captures" example at: http://www.boost.org/libs/regex/doc/captures.html In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.) std::string contents("String has two, three, one, and four in it. "); boost::regex reg("(one)|(two)|(three)|(four)");
From the above, I want the output to be:
two three one four
void test_search()
{
boost::smatch m;
std::string contents("String has two, three, one, and four in it. ");
boost::regex reg("(one)|(two)|(three)|(four)");
std::string::const_iterator it = contents.begin();
std::string::const_iterator end = contents.end();
while (boost::regex_search(it, end, m, reg)) {
for (int num = 1; num <= 4; ++num) {
if (m[num].matched == true) {
std::cout << "
In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.)
70!!! It wasn't really designed that kind of useage, and there isn't a simple way to get say the lowest index sub-expression that was matched: even with a full knowledge of the implementation it's not clear to me that it's possible to do better than an O(N) algorithm. You could I suppose group the sub-expressions together so you could use something more like a binary search to find the one that matched: ((a|b|c)|(d|e|f)) Then if(my_match[1].matched) { // must be a, b or c } else { // must be d e or f. } Sorry I can't be more helpful, but it's tempting to suggest that you use the spirit parser libary rather than regexes. John.
"John Maddock"
In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.)
70!!! It wasn't really designed that kind of useage, and there isn't a simple way to get say the lowest index sub-expression that was matched: even with a full knowledge of the implementation it's not clear to me that it's possible to do better than an O(N) algorithm.
Thanks for the reply.
It's actually a lot more than 70 total sub-expressions, but 66 "groups" ....
the application is recognizing Bible verse references and emitting a code.
Genesis 1:1 becomes
You could I suppose group the sub-expressions together so you could use something more like a binary search to find the one that matched:
((a|b|c)|(d|e|f))
Then
if(my_match[1].matched) { // must be a, b or c } else { // must be d e or f. }
My impression is that I'm already doing something like you suggest. The code doesn't need to know which of the three variants of Genesis was matched, just that it was from the Genesis "group". The adaption of the "Captures" code works well. Thanks for providing it. Obviously easier said than done, but if the library can turn the appropriate m[index].matched to "true", then perhaps there could be another variable in the class to reflect the lowest index of the first group that is true. I realize this gets more complicated because there can be multiple groups.
Sorry I can't be more helpful, but it's tempting to suggest that you use the spirit parser libary rather than regexes.
I'll take a look at it .... I'm really impressed with your boost::regex library, and am inclined to continue using it. Thanks for making it available.
Obviously easier said than done, but if the library can turn the appropriate m[index].matched to "true", then perhaps there could be another variable in the class to reflect the lowest index of the first group that is true. I realize this gets more complicated because there can be multiple groups.
It's backtracking that gets you: each time the "lowest sub-expression set" gets updated you'd need to store a complete history of previous settings so you can backtrack out again. However, now I think about that some more, I actually think it might be possible after all :-) John.
Whatever you could work out would be greatly appreciated.
----- Original Message -----
From: "John Maddock"
It's backtracking that gets you: each time the "lowest sub-expression set" gets updated you'd need to store a complete history of previous settings so you can backtrack out again. However, now I think about that some more, I actually think it might be possible after all :-)
Lynn Allan wrote:
Whatever you could work out would be greatly appreciated.
----- Original Message ----- From: "John Maddock"
It's backtracking that gets you: each time the "lowest sub-expression set" gets updated you'd need to store a complete history of previous settings so you can backtrack out again. However, now I think about that some more, I actually think it might be possible after all :-)
Forgive me for jumping in late. This seems like a pretty silly optimization to me. Matching a regex, especially a complicated one like this, is going to take soooo much longer than a simple O(N) traversal of the resulting sub-matches that it just doesn't make sense to track this information during match time. In fact, it'll probably run *slower*. Don't do during match time what you can do before or after. You'll turn a simple O(N) operation into something far worse, and you'll make everybody pay for it. If you don't want to muddy up user code with a bunch of "if(results[n].matched)" checks, just write an iterator adaptor (using boost::filter_iterator<>) that iterates over only those sub-matches which matched, and also provides the index. My $.02, -- Eric Niebler Boost Consulting www.boost-consulting.com
Forgive me for jumping in late. This seems like a pretty silly optimization to me. Matching a regex, especially a complicated one like this, is going to take soooo much longer than a simple O(N) traversal of the resulting sub-matches that it just doesn't make sense to track this information during match time. In fact, it'll probably run *slower*. Don't do during match time what you can do before or after. You'll turn a simple O(N) operation into something far worse, and you'll make everybody pay for it.
If you don't want to muddy up user code with a bunch of "if(results[n].matched)" checks, just write an iterator adaptor (using boost::filter_iterator<>) that iterates over only those sub-matches which matched, and also provides the index.
My $.02,
Good point: even enumerating 100 sub-expressions to find out which was matched is going to be pretty quick compared to the alternatives. The balance only really changes if you have thousands of marked sub-expressions, and I really hope we don't see regexes like that any time soon :-) John.
Just seemed that at the point when the code sets mySmatch[curIndex].matched
to true, then it could "tuck away" that index in an accessible variable ....
but apparently not.
Any suggestions about alternative on how to best accomplish the search for
Bible verse references? I'd have to think the participants in this thread
would have some deep insights on the best way to proceed.
I've written different "flavors" of state-machines that recognize these kind
of patterns, but a regex seems like the way to go unless performance has to
be emphasized.
Thanks.
----- Original Message -----
From: "John Maddock"
Forgive me for jumping in late. This seems like a pretty silly optimization to me. Matching a regex, especially a complicated one like this, is going to take soooo much longer than a simple O(N) traversal of the resulting sub-matches that it just doesn't make sense to track this information during match time. In fact, it'll probably run *slower*. Don't do during match time what you can do before or after. You'll turn a simple O(N) operation into something far worse, and you'll make everybody pay for it.
If you don't want to muddy up user code with a bunch of "if(results[n].matched)" checks, just write an iterator adaptor (using boost::filter_iterator<>) that iterates over only those sub-matches which matched, and also provides the index.
My $.02,
Good point: even enumerating 100 sub-expressions to find out which was matched is going to be pretty quick compared to the alternatives. The balance only really changes if you have thousands of marked sub-expressions, and I really hope we don't see regexes like that any time soon :-)
John.
Just seemed that at the point when the code sets mySmatch[curIndex].matched to true, then it could "tuck away" that index in an accessible variable .... but apparently not.
No it has to maintain a stack of all past values, in order to backtrack out of a match.
Any suggestions about alternative on how to best accomplish the search for Bible verse references? I'd have to think the participants in this thread would have some deep insights on the best way to proceed.
This is a deep insight free zone :-) If you just want the first index to have matched then simply a linear search is as good as anything: int lowest_match = 1; for(; lowest_match < mymatch.size(); ++lowest_match) if(mymatch[lowest_match].matched) break; which assumes that your regex is along the lines of: (a)|(b)|(c)|... If that turns out too slow, then you might then have to collect together common prefixes for some of the expressions into single alternatives, but I'm not knowledgeable on bible references. John.
Thanks for providing and maintaining regex. The following code is used in a number of the example snippets: #void load_file(std::string& s, std::istream& is) #{ # s.erase(); # if(is.bad()) return; # s.reserve(is.rdbuf()->in_avail()); # char c; # while(is.get(c)) # { # if(s.capacity() == s.size()) # s.reserve(s.capacity() * 3); # s.append(1, c); # } #} My impression is that this code is used because some compilers won't accept a simpler and more direct approach. I was wondering if the following would work "cross-platform" (It seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant of other compilers): #void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #}
Lynn Allan wrote:
I was wondering if the following would work "cross-platform" (It seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant of other compilers):
#void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #} Well, there's an immediate problem if the file contains any NULL bytes, which most non-text files do - the code will instantly stop on them. There may be a faster way using a getline() loop, based on the fact that many bytes are not \0, but I'm not sure that you'd actually lose much overhead, given that the stream is likely to be buffered?
I think that the code given is probably the simplest that is actually guaranteed to work. Please tell me if I'm being an idiot of course! Rupert
Lynn Allan wrote:
#void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #}
Well, there's an immediate problem if the file contains any NULL bytes, which most non-text files do - the code will instantly stop on them. There may be a faster way using a getline() loop, based on the fact that many bytes are not \0, but I'm not sure that you'd actually lose much overhead, given that the stream is likely to be buffered?
Granted, the proposed code only works with non-binary data that doesn't have embedded NULL bytes. That seems a reasonable assumption for the snippets provided, as well as most situations where boost::regex would be used. And o/s buffering/caching may make it a moot point.
My impression is that this code is used because some compilers won't accept a simpler and more direct approach.
I was wondering if the following would work "cross-platform" (It seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant of other compilers):
#void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #}
Looks like a good idea. However one of the main reasons for code being as it is, are rather sub-optimal getline implementations that append one character at a time to the string, without reserving any space first: this results in an O(N^2) algorithm for many string implementations. Previously I used istream iterators and std::copy to copy the stream's contents to the string. It's lovely simple code, and good use of the std lib, but it takes an age to run :-( Basically there are really simple solutions that work well for small files, and then crap out big time as soon as the file size grows. Of course you could always replace std::string with std::vector and get O(N log N) performance, but then you're still open to "why did you do that" questions. John.
participants (4)
-
Eric Niebler
-
John Maddock
-
Lynn Allan
-
Rupert Swarbrick