regex_search? Which subexpression smatch'ed?

Lynn Allan

3 Apr 2006 3 Apr '06

2:39 p.m.

<alert comment="boost and regex newbie"> I'm using regex_search to detect which subexpression matched. I adapted the "Captures" example at: http://www.boost.org/libs/regex/doc/captures.html In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.) std::string contents("String has two, three, one, and four in it. "); boost::regex reg("(one)|(two)|(three)|(four)");

...

From the above, I want the output to be: two three one four

void test_search() { boost::smatch m; std::string contents("String has two, three, one, and four in it. "); boost::regex reg("(one)|(two)|(three)|(four)"); std::string::const_iterator it = contents.begin(); std::string::const_iterator end = contents.end(); while (boost::regex_search(it, end, m, reg)) { for (int num = 1; num <= 4; ++num) { if (m[num].matched == true) { std::cout << "" << m[0] << std::endl; break; } } it = m[0].second; } } </alert>

Attachments:

attachment.html (text/html — 2.0 KB)

Show replies by date

John Maddock

6 Apr 6 Apr

9:28 a.m.

...

In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.)

70!!! It wasn't really designed that kind of useage, and there isn't a simple way to get say the lowest index sub-expression that was matched: even with a full knowledge of the implementation it's not clear to me that it's possible to do better than an O(N) algorithm. You could I suppose group the sub-expressions together so you could use something more like a binary search to find the one that matched: ((a|b|c)|(d|e|f)) Then if(my_match[1].matched) { // must be a, b or c } else { // must be d e or f. } Sorry I can't be more helpful, but it's tempting to suggest that you use the spirit parser libary rather than regexes. John.

Lynn Allan

7 Apr 7 Apr

6:14 a.m.

"John Maddock" wrote in message news:010801c6595d$74466980$901c0952@fuji...

...

...
In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.)

70!!! It wasn't really designed that kind of useage, and there isn't a simple way to get say the lowest index sub-expression that was matched: even with a full knowledge of the implementation it's not clear to me that it's possible to do better than an O(N) algorithm.

Thanks for the reply. It's actually a lot more than 70 total sub-expressions, but 66 "groups" .... the application is recognizing Bible verse references and emitting a code. Genesis 1:1 becomes Genesis 1:1</xb> and Gen 1 becomes Gen 1</xb> The chapter number needs to be specified, but the verse number is optional (after the colon). Here is the actual MOAR (mother of all regexes): ((Gen|Ge|Genesis)| (Exo|Ex|Exodus|Exod)| (Le|Leviticus|Lev)| (Nu|Num|Numbers)| (De|Deuteronomy|Dt|Deut)| (Jos|Josh|Joshua)| (Jdg|Judges|Judg)| (Ru|Ruth)| (1 Sa|1Sa|1 Samuel|1 Sam|1Sam|1 Kingdoms|I Samuel|First Samuel|1st Samuel)| (2 Sa|2Sa|2 Samuel|2 Sam|2Sam|2 Kingdoms|II Samuel|Second Samuel|2nd Samuel)| (1 Ki|1Ki|1 Kings|1 Kgs|1 Kin|1 Ki|1Kings|3 Kingdoms|I Kings|First Kings|1st Kings)| (2 Ki|2Ki|2 Kings|2 Kgs|2 Kin|2 Kgs|II Kings|IIKings|2Kings|Second Kings|2nd Kings|4 Kingdoms)| (1 Ch|1Ch|1 Chronicles|1 Chr|1 Chron|1Chron|I Chronicles|IChronicles|First Chronicles|1st Chronicles)| (2 Ch|2Ch|2 Chronicles|2 Chr|2 Chron|2Chron|II Chronicles|IIChronicles|Second Chronicles|2nd Chronicles)| (Ezr|Ezra)| (Ne|Neh|Nehemiah)| (Es|Esther|Esth)| (Jb|Job)| (Ps|Psalm|Psa|Psalms)| (Pr|Proverbs|Pro|Prov)| (Ec|Ecc|Ecclesiastes|Eccl|Eccles|Ecclesiastics)| (Cant|Song|Song of Songs|Song of Solomon|Song of Sol)| (Is|Isaiah|Isa)| (Je|Jeremiah|Jer)| (La|Lamentations|Lam)| (Ez|Ezekiel|Ezk|Ezek)| (Da|Daniel|Dan)| (Ho|Hosea|Hs|Hos)| (Jl|Joel)| (Am|Amos)| (Ob|Obadiah|Obad)| (Jnh|Jonah)| (Mi|Micah|Mic)| (Na|Nahum|Nah)| (Hab|Habakkuk)| (Zep|Zephaniah|Zeph)| (Hag|Haggai)| (Zech|Zec|Zechariah)| (Mal|Malachi)| (Mat|Matthew|Matt|Mt)| (Mar|Mark|Mk)| (Lk|Luke|Luk)| (Jn|John|Joh)| (Ac|Acts|Act)| (Rm|Romans|Rom|Roman)| (1Cor|1 Co|1Co|1 Corinthians|1 Corin|1Corin|1 Cor|I Corinthians|I. Corinthians|First Corinthians|1st Corinthians)| (2Cor|2 Co|2Co|2 Corinthians|2 Corin|2Corin|2 Cor|II Corinthians|II. Corinthians|Second Corinthians|2nd Corinthians)| (Ga|Galatians|Gal)| (Ep|Ephesians|Eph)| (Phil|Philippians|Php|Philipians|Phillipians)| (Col|Colossians)| (1 Th|1Th|1 Thessalonians|1Thessalonians|1 Thess|1 Thes|1Thess|1Thes|I Thessalonians|First Thessalonians|1st Thessalonians)| (2 Th|2Th|2 Thessalonians|2Thessalonians|2 Thess|2 Thes|2Thess|2Thes|II Thessalonians|Second Thessalonians|2nd Thessalonians)| (1 Ti|1Ti|1 Timothy|1Timothy|1Tim|1 Tim|I Timothy|First Timothy|1st Timothy)| (2 Ti|2Ti|2 Timothy|2Timothy|2Tim|2 Tim|II Timothy|Second Timothy|2nd Timothy)| (Ti|Titus|Tit)| (PHILEMON|Phile|Philemon|Phlm|Philm)| (Hb|Hebrews|Heb|Hebrew)| (Ja|James|Jas|Jms)| (1 Pe|1Pe|1 Peter|1 Pet|I Peter|IPeter|First Peter|1st Peter)| (2 Pe|2Pe|2 Peter|2 Pet|II Peter|IIPeter|Second Peter|2nd Peter)| (1 Jn|1Jn|1 John|1 Joh|1Joh|1John|I John|IJohn|First John|1st John)| (2 Jn|2Jn|2 John|2 Joh|2Joh|2John|II John|IIJohn|Second John|2nd John)| (3 Jn|3Jn|3 John|3 Joh|3Joh|3John|III John|IIIJohn|Third John|3rd John)| (Jde|Jude)| (Apoc|Revelation|Rev|Revel|Apocrypha|Revelation of John|Rev of John|Rev. of John)) \.? ?([0-9]{1,3})(:([0-9]{1,3}))?

...

You could I suppose group the sub-expressions together so you could use something more like a binary search to find the one that matched:

((a|b|c)|(d|e|f))

Then

if(my_match[1].matched) { // must be a, b or c } else { // must be d e or f. }

My impression is that I'm already doing something like you suggest. The code doesn't need to know which of the three variants of Genesis was matched, just that it was from the Genesis "group". The adaption of the "Captures" code works well. Thanks for providing it. Obviously easier said than done, but if the library can turn the appropriate m[index].matched to "true", then perhaps there could be another variable in the class to reflect the lowest index of the first group that is true. I realize this gets more complicated because there can be multiple groups.

...

Sorry I can't be more helpful, but it's tempting to suggest that you use the spirit parser libary rather than regexes.

I'll take a look at it .... I'm really impressed with your boost::regex library, and am inclined to continue using it. Thanks for making it available.

John Maddock

9:48 a.m.

...

Obviously easier said than done, but if the library can turn the appropriate m[index].matched to "true", then perhaps there could be another variable in the class to reflect the lowest index of the first group that is true. I realize this gets more complicated because there can be multiple groups.

It's backtracking that gets you: each time the "lowest sub-expression set" gets updated you'd need to store a complete history of previous settings so you can backtrack out again. However, now I think about that some more, I actually think it might be possible after all :-) John.

Lynn Allan

12:29 p.m.

Whatever you could work out would be greatly appreciated. ----- Original Message ----- From: "John Maddock"

...

It's backtracking that gets you: each time the "lowest sub-expression set" gets updated you'd need to store a complete history of previous settings so you can backtrack out again. However, now I think about that some more, I actually think it might be possible after all :-)

Eric Niebler

5 p.m.

Lynn Allan wrote:

...

Whatever you could work out would be greatly appreciated.

----- Original Message ----- From: "John Maddock"

...
It's backtracking that gets you: each time the "lowest sub-expression set" gets updated you'd need to store a complete history of previous settings so you can backtrack out again. However, now I think about that some more, I actually think it might be possible after all :-)

Forgive me for jumping in late. This seems like a pretty silly optimization to me. Matching a regex, especially a complicated one like this, is going to take soooo much longer than a simple O(N) traversal of the resulting sub-matches that it just doesn't make sense to track this information during match time. In fact, it'll probably run *slower*. Don't do during match time what you can do before or after. You'll turn a simple O(N) operation into something far worse, and you'll make everybody pay for it. If you don't want to muddy up user code with a bunch of "if(results[n].matched)" checks, just write an iterator adaptor (using boost::filter_iterator<>) that iterates over only those sub-matches which matched, and also provides the index. My $.02, -- Eric Niebler Boost Consulting www.boost-consulting.com

John Maddock

5:56 p.m.

...

Forgive me for jumping in late. This seems like a pretty silly optimization to me. Matching a regex, especially a complicated one like this, is going to take soooo much longer than a simple O(N) traversal of the resulting sub-matches that it just doesn't make sense to track this information during match time. In fact, it'll probably run *slower*. Don't do during match time what you can do before or after. You'll turn a simple O(N) operation into something far worse, and you'll make everybody pay for it.

If you don't want to muddy up user code with a bunch of "if(results[n].matched)" checks, just write an iterator adaptor (using boost::filter_iterator<>) that iterates over only those sub-matches which matched, and also provides the index.

My $.02,

Good point: even enumerating 100 sub-expressions to find out which was matched is going to be pretty quick compared to the alternatives. The balance only really changes if you have thousands of marked sub-expressions, and I really hope we don't see regexes like that any time soon :-) John.

Lynn Allan

9:59 p.m.

Just seemed that at the point when the code sets mySmatch[curIndex].matched to true, then it could "tuck away" that index in an accessible variable .... but apparently not. Any suggestions about alternative on how to best accomplish the search for Bible verse references? I'd have to think the participants in this thread would have some deep insights on the best way to proceed. I've written different "flavors" of state-machines that recognize these kind of patterns, but a regex seems like the way to go unless performance has to be emphasized. Thanks. ----- Original Message ----- From: "John Maddock" Newsgroups: gmane.comp.lib.boost.user Sent: Friday, April 07, 2006 11:56 AM Subject: Re: regex_search? Which subexpression smatch'ed?

...

...
Forgive me for jumping in late. This seems like a pretty silly optimization to me. Matching a regex, especially a complicated one like this, is going to take soooo much longer than a simple O(N) traversal of the resulting sub-matches that it just doesn't make sense to track this information during match time. In fact, it'll probably run *slower*. Don't do during match time what you can do before or after. You'll turn a simple O(N) operation into something far worse, and you'll make everybody pay for it.

If you don't want to muddy up user code with a bunch of "if(results[n].matched)" checks, just write an iterator adaptor (using boost::filter_iterator<>) that iterates over only those sub-matches which matched, and also provides the index.

My $.02,

Good point: even enumerating 100 sub-expressions to find out which was matched is going to be pretty quick compared to the alternatives. The balance only really changes if you have thousands of marked sub-expressions, and I really hope we don't see regexes like that any time soon :-)

John.

John Maddock

8 Apr 8 Apr

9:35 a.m.

...

Just seemed that at the point when the code sets mySmatch[curIndex].matched to true, then it could "tuck away" that index in an accessible variable .... but apparently not.

No it has to maintain a stack of all past values, in order to backtrack out of a match.

...

Any suggestions about alternative on how to best accomplish the search for Bible verse references? I'd have to think the participants in this thread would have some deep insights on the best way to proceed.

This is a deep insight free zone :-) If you just want the first index to have matched then simply a linear search is as good as anything: int lowest_match = 1; for(; lowest_match < mymatch.size(); ++lowest_match) if(mymatch[lowest_match].matched) break; which assumes that your regex is along the lines of: (a)|(b)|(c)|... If that turns out too slow, then you might then have to collect together common prefixes for some of the expressions into single alternatives, but I'm not knowledgeable on bible references. John.

Lynn Allan

7 Apr 7 Apr

1:54 p.m.

New subject: [Regex] Simpler load_file code to use in snippet examples?

Thanks for providing and maintaining regex. The following code is used in a number of the example snippets: #void load_file(std::string& s, std::istream& is) #{ # s.erase(); # if(is.bad()) return; # s.reserve(is.rdbuf()->in_avail()); # char c; # while(is.get(c)) # { # if(s.capacity() == s.size()) # s.reserve(s.capacity() * 3); # s.append(1, c); # } #} My impression is that this code is used because some compilers won't accept a simpler and more direct approach. I was wondering if the following would work "cross-platform" (It seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant of other compilers): #void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #}

Rupert Swarbrick

2:37 p.m.

New subject: [Regex] Simpler load_file code to use in snippet examples?

Lynn Allan wrote:

...

I was wondering if the following would work "cross-platform" (It seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant of other compilers):

#void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #} Well, there's an immediate problem if the file contains any NULL bytes, which most non-text files do - the code will instantly stop on them. There may be a faster way using a getline() loop, based on the fact that many bytes are not \0, but I'm not sure that you'd actually lose much overhead, given that the stream is likely to be buffered?

I think that the code given is probably the simplest that is actually guaranteed to work. Please tell me if I'm being an idiot of course! Rupert

Lynn Allan

3:42 p.m.

New subject: [Regex] Simpler load_file code to use in snippetexamples?

...

Lynn Allan wrote:

...
#void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #}

...

Well, there's an immediate problem if the file contains any NULL bytes, which most non-text files do - the code will instantly stop on them. There may be a faster way using a getline() loop, based on the fact that many bytes are not \0, but I'm not sure that you'd actually lose much overhead, given that the stream is likely to be buffered?

Granted, the proposed code only works with non-binary data that doesn't have embedded NULL bytes. That seems a reasonable assumption for the snippets provided, as well as most situations where boost::regex would be used. And o/s buffering/caching may make it a moot point.

John Maddock

4:14 p.m.

New subject: [Regex] Simpler load_file code to use in snippetexamples?

...

My impression is that this code is used because some compilers won't accept a simpler and more direct approach.

I was wondering if the following would work "cross-platform" (It seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant of other compilers):

#void load_file(std::string& s, std::istream& is) #{ # if(is.bad()) return; # // Use 0 as CrLf delimiter to cause entire file to be read # getline(is, s, '\0'); #}

Looks like a good idea. However one of the main reasons for code being as it is, are rather sub-optimal getline implementations that append one character at a time to the string, without reserving any space first: this results in an O(N^2) algorithm for many string implementations. Previously I used istream iterators and std::copy to copy the stream's contents to the string. It's lovely simple code, and good use of the std lib, but it takes an age to run :-( Basically there are really simple solutions that work well for small files, and then crap out big time as soon as the file size grows. Of course you could always replace std::string with std::vector and get O(N log N) performance, but then you're still open to "why did you do that" questions. John.

6968

Age (days ago)

6973

Last active (days ago)

List overview

Download

12 comments

4 participants

participants (4)

Eric Niebler
John Maddock
Lynn Allan
Rupert Swarbrick