Re: [Boost-users] regex_search? Which subexpression smatch'ed?

7 Apr 2006

      "John Maddock" <john@johnmaddock.co.uk> wrote in message 
news:010801c6595d$74466980$901c0952@fuji...
...
...
In the snippet below, it works ok to loop thru m[num].matched, but I
was wondering if there was a more direct way of getting the
information about which match was "hit". (The real application is
going to have about 70 possible subexpressions to match against,
instead of just four.)
70!!!  It wasn't really designed that kind of useage, and there isn't a
simple way to get say the lowest index sub-expression that was matched: 
even
with a full knowledge of the implementation it's not clear to me that it's
possible to do better than an O(N) algorithm.
Thanks for the reply.

It's actually a lot more than 70 total sub-expressions, but 66 "groups" .... 
the application is recognizing Bible verse references and emitting a code.

Genesis 1:1 becomes <xb=1.1.1>Genesis 1:1</xb> and
Gen 1 becomes <xb=1.1.1>Gen 1</xb>
The chapter number needs to be specified, but the verse number is optional 
(after the colon).

Here is the actual MOAR (mother of all regexes):

((Gen|Ge|Genesis)|
(Exo|Ex|Exodus|Exod)|
(Le|Leviticus|Lev)|
(Nu|Num|Numbers)|
(De|Deuteronomy|Dt|Deut)|
(Jos|Josh|Joshua)|
(Jdg|Judges|Judg)|
(Ru|Ruth)|
(1 Sa|1Sa|1 Samuel|1 Sam|1Sam|1 Kingdoms|I Samuel|First Samuel|1st Samuel)|
(2 Sa|2Sa|2 Samuel|2 Sam|2Sam|2 Kingdoms|II Samuel|Second Samuel|2nd 
Samuel)|
(1 Ki|1Ki|1 Kings|1 Kgs|1 Kin|1 Ki|1Kings|3 Kingdoms|I Kings|First Kings|1st 
Kings)|
(2 Ki|2Ki|2 Kings|2 Kgs|2 Kin|2 Kgs|II Kings|IIKings|2Kings|Second Kings|2nd 
Kings|4 Kingdoms)|
(1 Ch|1Ch|1 Chronicles|1 Chr|1 Chron|1Chron|I Chronicles|IChronicles|First 
Chronicles|1st Chronicles)|
(2 Ch|2Ch|2 Chronicles|2 Chr|2 Chron|2Chron|II 
Chronicles|IIChronicles|Second Chronicles|2nd Chronicles)|
(Ezr|Ezra)|
(Ne|Neh|Nehemiah)|
(Es|Esther|Esth)|
(Jb|Job)|
(Ps|Psalm|Psa|Psalms)|
(Pr|Proverbs|Pro|Prov)|
(Ec|Ecc|Ecclesiastes|Eccl|Eccles|Ecclesiastics)|
(Cant|Song|Song of Songs|Song of Solomon|Song of Sol)|
(Is|Isaiah|Isa)|
(Je|Jeremiah|Jer)|
(La|Lamentations|Lam)|
(Ez|Ezekiel|Ezk|Ezek)|
(Da|Daniel|Dan)|
(Ho|Hosea|Hs|Hos)|
(Jl|Joel)|
(Am|Amos)|
(Ob|Obadiah|Obad)|
(Jnh|Jonah)|
(Mi|Micah|Mic)|
(Na|Nahum|Nah)|
(Hab|Habakkuk)|
(Zep|Zephaniah|Zeph)|
(Hag|Haggai)|
(Zech|Zec|Zechariah)|
(Mal|Malachi)|
(Mat|Matthew|Matt|Mt)|
(Mar|Mark|Mk)|
(Lk|Luke|Luk)|
(Jn|John|Joh)|
(Ac|Acts|Act)|
(Rm|Romans|Rom|Roman)|
(1Cor|1 Co|1Co|1 Corinthians|1 Corin|1Corin|1 Cor|I Corinthians|I. 
Corinthians|First Corinthians|1st Corinthians)|
(2Cor|2 Co|2Co|2 Corinthians|2 Corin|2Corin|2 Cor|II Corinthians|II. 
Corinthians|Second Corinthians|2nd Corinthians)|
(Ga|Galatians|Gal)|
(Ep|Ephesians|Eph)|
(Phil|Philippians|Php|Philipians|Phillipians)|
(Col|Colossians)|
(1 Th|1Th|1 Thessalonians|1Thessalonians|1 Thess|1 Thes|1Thess|1Thes|I 
Thessalonians|First Thessalonians|1st Thessalonians)|
(2 Th|2Th|2 Thessalonians|2Thessalonians|2 Thess|2 Thes|2Thess|2Thes|II 
Thessalonians|Second Thessalonians|2nd Thessalonians)|
(1 Ti|1Ti|1 Timothy|1Timothy|1Tim|1 Tim|I Timothy|First Timothy|1st 
Timothy)|
(2 Ti|2Ti|2 Timothy|2Timothy|2Tim|2 Tim|II Timothy|Second Timothy|2nd 
Timothy)|
(Ti|Titus|Tit)|
(PHILEMON|Phile|Philemon|Phlm|Philm)|
(Hb|Hebrews|Heb|Hebrew)|
(Ja|James|Jas|Jms)|
(1 Pe|1Pe|1 Peter|1 Pet|I Peter|IPeter|First Peter|1st Peter)|
(2 Pe|2Pe|2 Peter|2 Pet|II Peter|IIPeter|Second Peter|2nd Peter)|
(1 Jn|1Jn|1 John|1 Joh|1Joh|1John|I John|IJohn|First John|1st John)|
(2 Jn|2Jn|2 John|2 Joh|2Joh|2John|II John|IIJohn|Second John|2nd John)|
(3 Jn|3Jn|3 John|3 Joh|3Joh|3John|III John|IIIJohn|Third John|3rd John)|
(Jde|Jude)|
(Apoc|Revelation|Rev|Revel|Apocrypha|Revelation of John|Rev of John|Rev. of 
John))
\.? ?([0-9]{1,3})(:([0-9]{1,3}))?
...
You could I suppose group the sub-expressions together so you could use
something more like a binary search to find the one that matched:
((a|b|c)|(d|e|f))
Then
if(my_match[1].matched)
{
// must be a, b or c
}
else
{
// must be d e or f.
}
My impression is that I'm already doing something like you suggest. The code 
doesn't need to know which of the three variants of Genesis was matched, 
just that it was from the Genesis "group". The adaption of the "Captures" 
code works well. Thanks for providing it.

Obviously easier said than done, but if the library can turn the appropriate 
m[index].matched to "true", then perhaps there could be another variable in 
the class to reflect the lowest index of the first group that is true.  I 
realize this gets more complicated because there can be multiple groups.
...
Sorry I can't be more helpful, but it's tempting to suggest that you use 
the
spirit parser libary rather than regexes.
I'll take a look at it .... I'm really impressed with your boost::regex 
library, and am inclined to continue using it. Thanks for making it 
available.

Re: [Boost-users] regex_search? Which subexpression smatch'ed?

Lynn Allan