[regex] character classes when using wide chars
Hi

Documentation says that when using wide character strings with boost::wregex a character class like [[:alpha:]] depends on the system's implementation of iswalpha() function.

My system seems to have a working implementation of iswalpha() function, but [[:alpha:]] still only seems to match ASCII alphabet characters. For example the following code:

    #define UNICODE
    #include <boost/regex.h>
    #include <stdio.h>
    #include <wctype.h>
    #include <locale.h>

    int main()
    {
        regex_t r;

        setlocale(LC_ALL, "en_US.utf8");

        regcomp(&r, L"^[[:alpha:]]$", REG_EXTENDED);

        printf("%d\n", iswalpha(L'A'));
        printf("%d\n\n", regexec(&r, L"A", 0, NULL, 0));

        printf("%d\n", iswalpha(L'\x160'));
        printf("%d\n\n", regexec(&r, L"\x160", 0, NULL, 0));

        printf("%d\n", iswalpha(L'1'));
        printf("%d\n", regexec(&r, L"1", 0, NULL, 0));

        regfree(&r);

        return 0;
    }

Returns

    1
    0

    1
    1

    0
    1

In the second pair, iswalpha() correctly recognizes Unicode "S WITH CARON" character, however regular expression with [[:alpha:]] doesn't match it.

I'm using Debian GNU/Linux with Boost 1.33.1. I also tried a similar program using boost::wregex and std::iswalpha() classes instead of the POSIX interface with the same results.

Can anyone give me some advice on what I'm doing wrong here?

Thanks
Tomaž Šolc
Tomaž Šolc wrote:
Documentation says that when using wide character strings with boost::wregex a character class like [[:alpha:]] depends on the system's implementation of iswalpha() function.
My system seems to have a working implementation of iswalpha() function, but [[:alpha:]] still only seems to match ASCII alphabet characters.
I'm using Debian GNU/Linux with Boost 1.33.1. I also tried a similar program using boost::wregex and std::iswalpha() classes instead of the POSIX interface with the same results.
Can anyone give me some advice on what I'm doing wrong here?
Nothing: it looks like a bug. The current implementation was changed to use the C++ locale by default, but I forgot to change the POSIX APIs to explicitly use the C locale. Try setting std::locale::global to the required locale and that should then work (provided your C++ std library supports all the locales that setlocale does).

It's probably a bit late to fix this for 1.35, but will you please open a Trac issue at svn.boost.org so I don't forget about this?

Thanks,
John Maddock.
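The distinction above between the C locale (set with setlocale) and the C++ global locale (set with std::locale::global) can be checked without Boost. The following is only a minimal sketch: the helper names set_global_locale and is_alpha_cpp are hypothetical, and it assumes that the regex traits consult the C++ locale roughly the way std::isalpha does here.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: install `name` as the global C++ locale.
// Returns false if the C++ standard library does not know that locale
// (std::locale's constructor throws std::runtime_error in that case).
bool set_global_locale(const char* name)
{
    try {
        std::locale::global(std::locale(name));
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Hypothetical helper: classify a wide character with the current
// global C++ locale rather than with the C library's iswalpha().
bool is_alpha_cpp(wchar_t c)
{
    return std::isalpha(c, std::locale());
}
```

If set_global_locale("en_US.utf8") returns true, is_alpha_cpp(L'\x160') would be expected to agree with iswalpha() on a system like the one described in the original post; if it returns false, the C++ library does not support that locale name and the suggested workaround would not apply as-is.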
Hi
Nothing: it looks like a bug. The current implementation was changed to use the C++ locale by default, but I forgot to change the POSIX APIs to explicitly use the C locale. Try setting std::locale::global to the required locale and that should then work (provided your C++ std library supports all the locales that setlocale does).
Changing:

    setlocale(LC_ALL, "en_US.utf8");

to:

    std::locale en("en_US.utf8");
    std::locale::global(en);

fixed the problem. Thanks.
It's probably a bit late to fix this for 1.35, but will you please open a Trac issue at svn.boost.org so I don't forget about this?
I opened ticket #1446

Best regards
Tomaz Solc
Tomaz Solc wrote:
Changing:
setlocale(LC_ALL, "en_US.utf8");
to:
std::locale en("en_US.utf8"); std::locale::global(en);
fixed the problem. Thanks.
OK good.
It's probably a bit late to fix this for 1.35, but will you please open a Trac issue at svn.boost.org so I don't forget about this?
I opened ticket #1446
Thanks, John.
participants (2)
- John Maddock
- Tomaž Šolc