[regex] wildcard matching byte not character

28 Feb 2010

Hi all,

I'm having trouble with the behaviour of the wildcard character when
using Boost.Regex with Unicode strings. I would expect a . to match one
character, not one byte, but that's not the behaviour I'm seeing. I
would have thought a single wildcard would match any one character, but
for multi-byte characters in UTF-8 I have to use multiple wildcards to
match them.

I would appreciate it if someone could explain whether this is expected 
behaviour or not, or if there are flags that control this.
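
To make the byte/character distinction concrete, here is a minimal sketch in standard C++ only (no Boost.Regex or ICU) that compares the byte length of a UTF-8 string with its number of code points. The helper name count_code_points is hypothetical, and the code assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-8 string.
// Every code point has exactly one byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx), so counting those
// bytes counts code points.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the three-character string with bytes C2 A3, C3 98, C2 B2 (the input used in the test program), size() reports 6 while count_code_points returns 3, so a wildcard that consumes a single byte covers only half of the middle character.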

What I'm trying to accomplish is to match a pattern (in UTF-8) against
a string (also in UTF-8). I'm creating ICU UnicodeStrings because I'm
having other problems with plain UTF-8 char* strings and my platform
doesn't support wchar_t. I can show examples of the non-UnicodeString
problems if desired.

I'm using Boost 1.42.
The test program follows. Its output is:
$ g++ regex2.cc -licui18n -licuuc -licudata -lboost_regex -o example \
    && ./example
unicodeString tests
failed
Success!

----

#include <iostream>
#include <boost/regex.hpp>
#include <boost/regex/icu.hpp>

using namespace boost;
using namespace std;

int main() {
    // UTF-8 bytes for U+00A3 (pound sign), U+00D8 (O with stroke)
    // and U+00B2 (superscript two). String literals with hex escapes
    // avoid narrowing int values above 127 into char.
    static const char input[] = "\xC2\xA3\xC3\x98\xC2\xB2";
    UnicodeString uInput(input);

    // Patterns: first and last characters given literally, with the
    // middle character replaced by wildcards.
    const char match1[] = "\xC2\xA3.\xC2\xB2";   // one .
    const char match2[] = "\xC2\xA3..\xC2\xB2";  // two .s
    UnicodeString uMatch1(match1);
    UnicodeString uMatch2(match2);
    u16match what;

    cout << "unicodeString tests" << endl;
    if (u32regex_search(uInput, what,  // one . fails
                        make_u32regex(uMatch1, boost::regex::extended))) {
        cout << "Success!" << endl;
    } else {
        cout << "failed" << endl;
    }
    if (u32regex_search(uInput, what,  // two . succeeds
                        make_u32regex(uMatch2, boost::regex::extended))) {
        cout << "Success!" << endl;
    } else {
        cout << "failed" << endl;
    }
}
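
One possible explanation, offered as an assumption rather than a confirmed diagnosis: UnicodeString(const char*) converts its argument with ICU's default converter, so if that default is not UTF-8, each input byte may end up as its own 16-bit code unit. The sketch below uses only the standard library to contrast the two conversions. Both helper names are hypothetical, and the decoder handles only well-formed one- to three-byte sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: widen every byte to one 16-bit unit, which is what
// a single-byte codepage such as ISO-8859-1 would do.
std::vector<std::uint16_t> widen_bytes(const std::string& bytes) {
    std::vector<std::uint16_t> units;
    for (unsigned char b : bytes) {
        units.push_back(b);
    }
    return units;
}

// Hypothetical helper: decode UTF-8 into 16-bit units. Only one- to
// three-byte sequences are handled and continuation bytes are not
// validated; anything else becomes U+FFFD.
std::vector<std::uint16_t> decode_utf8_bmp(const std::string& utf8) {
    std::vector<std::uint16_t> units;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {
            units.push_back(static_cast<std::uint16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < n) {
            const unsigned b1 = static_cast<unsigned char>(utf8[i + 1]);
            units.push_back(static_cast<std::uint16_t>(
                ((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < n) {
            const unsigned b1 = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned b2 = static_cast<unsigned char>(utf8[i + 2]);
            units.push_back(static_cast<std::uint16_t>(
                ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
            i += 3;
        } else {
            units.push_back(0xFFFD);
            i += 1;
        }
    }
    return units;
}
```

With the input from the test program, widen_bytes yields six units (00C2 00A3 00C3 0098 00C2 00B2) and decode_utf8_bmp yields three (00A3 00D8 00B2). If the strings in the test program were built the first way, two wildcards would be needed to span the middle character, which would be consistent with the output shown. ICU also provides UnicodeString::fromUTF8, which may be worth trying; whether it resolves this case is untested here.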

Richard Clokie
