[regex] wildcard matching byte not character

Hi all
I'm having trouble with the behaviour of the wildcard character when
using Boost.Regex with Unicode strings. I would expect a . to match a
character, not a byte, but that's not the behaviour I'm seeing. I would
have thought one wildcard would match any single character, but for
multi-byte characters in UTF-8 I have to use multiple wildcards to match
them.
I would appreciate it if someone could explain whether this is expected
behaviour or not, or if there are flags that control this.
What I'm trying to accomplish is to match a pattern (in UTF-8) against
a string (in UTF-8). I'm creating ICU UnicodeStrings since I'm having
other problems with plain UTF-8 char*s and my platform doesn't
support wchar_t. I can show examples of the non-UnicodeString problems
if desired.
I'm using Boost 1.42.
Test program follows - output is:
$ g++ regex2.cc -l icui18n -l icuuc -l icudata -lboost_regex -o example
&& ./example
unicodeString tests
failed
Success!
----
#include <iostream>
#include

You're constructing invalid UnicodeStrings: the const char* constructor does not convert from UTF-8. If the strings are constructed as:
UnicodeString s(buf, "UTF8");
then the output changes to
Success! Failed
which is what you expected.
HTH, John.

Yup, spotted it about 20 minutes after posting. Sorry about that, and thanks for the help.
Richard

_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Participants (2): John Maddock, Richard Clokie