[regex] wildcard matching byte not character
Hi all
I'm having trouble with the behaviour of the wildcard character when
using Boost.Regex with Unicode strings. I would expect a . to match a
character, not a byte, but that's not the behaviour I'm seeing. I would
have thought one wildcard would match any single character, but for
multi-byte characters in UTF-8 I have to use multiple wildcards to match
them.
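The symptom described above can be reproduced without Boost or ICU. The following is a minimal sketch, assuming std::regex as a stand-in for a byte-oriented regex engine (it is not the Boost.Regex code from the test program). The helper names matches_bytes and sample_utf8 are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole byte string `text` matches `pattern`.
// std::regex over std::string works on char units, so '.' consumes one byte.
bool matches_bytes(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// "aéb" in UTF-8: 'a', then é (U+00E9) as the two bytes 0xC3 0xA9, then 'b'.
// The literal is split so that 'b' is not read as part of the hex escape.
std::string sample_utf8() {
    return "a\xC3\xA9" "b";
}
```

With these helpers, matches_bytes(sample_utf8(), "a.b") is false and matches_bytes(sample_utf8(), "a..b") is true, because the engine sees four bytes rather than three characters.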
I would appreciate it if someone could explain whether this is expected
behaviour or not, or if there are flags that control this.
What I'm trying to accomplish is to match a pattern (in UTF-8) against
a string (in UTF-8). I'm creating ICU UnicodeStrings since I'm having
other problems with plain UTF-8 char* strings and my platform doesn't
support wchar_t. I can show examples of the non-UnicodeString problems
if desired.
I'm using Boost 1.42.
Test program follows - output is:
$ g++ regex2.cc -l icui18n -l icuuc -l icudata -lboost_regex -o example && ./example
unicodeString tests
failed
Success!
----
#include <iostream>
#include
You're constructing invalid UnicodeStrings: the const char* constructor does not convert from UTF-8. If the strings are constructed as:
UnicodeString s(buf, "UTF8");
then the output changes to
Success! Failed
which is what you expected.
HTH, John.
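To see why the choice of conversion matters, here is a minimal sketch that does not use ICU at all. It assumes, purely for illustration, that a non-UTF-8 conversion turns each byte into its own code unit, as a single-byte codepage would. What ICU actually does depends on its default converter. The helper names widen_bytes and decode_utf8 are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widen each byte to its own code point, the way a
// single-byte codepage such as Latin-1 would interpret the buffer.
std::u32string widen_bytes(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: decode well-formed UTF-8 into code points.
// No validation is done; malformed input is out of scope for this sketch.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Number of continuation bytes implied by the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the two bytes 0xC3 0xA9, widen_bytes yields two elements while decode_utf8 yields the single code point U+00E9. A wildcard that matches one element at a time would therefore need two dots in the first case and one in the second.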
Yup, spotted it about 20 minutes after posting. Sorry about that, and thanks for the help.
Richard
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
participants (2): John Maddock, Richard Clokie