Hello, I use FreeBSD 4.11 with gcc 3.3.6 and gcc 3.4.4, and with both I don't have std::wstring defined. I only need it for some internal use, so I defined:
typedef unsigned short uchar16;
typedef basic_string<uchar16> string16;
and it was enough for me (I don't do iostreams with this string).
But I need regex support for this. As I read in a few mails, Boost does not create wregex on some platforms with limited support for wchar_t, so I tried to create one the same way I did with the string:
typedef reg_expression regex16;
When I try to use regex16 I get a big load of errors. Did I miss something?
Yes, the regex library needs a traits class that tells it how to interact with the locale for that character type. In theory you could just use cpp_regex_traits<uchar16>, but that would require that all the std::locale facets are supported for uchar16 (I'm sure that they are not). If you're prepared to depend upon ICU, then the current cvs has (optional) support for 16 and 32-bit Unicode character types. The traits class design there is also rather simplified and better documented, so that would be the best bet if you wanted to define your own minimalist traits class for uchar16 as well. John.
What's ICU? "I see you"? :)
What I have is like UTF-16, but I replace all the chars above 0xFFFF with '?', so it's UTF-16 that doesn't have 4-byte chars.
I don't really understand what character traits etc. are (or how to create them myself). I only wanted my regex16 to do the same job for chars 0-0x00FF as boost::regex does for 0-0xFF, with the rest of the chars (>= 0x0100) treated as non-word characters (\W), so that I could use only \xXXXX-\xXXXX notation for their ranges and patterns...
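For readers who want to see what "UTF-16 without 4-byte chars" could look like in practice, here is a minimal sketch. It is not the poster's code: the function name squash_to_bmp is hypothetical, and it assumes C++11's std::u16string, which was not available on the gcc 3.x compilers discussed in this thread. It replaces every surrogate pair (a character above 0xFFFF) and every stray surrogate with a single '?'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): reduce UTF-16 to exactly one
// 16-bit unit per character by replacing anything above 0xFFFF with '?'.
std::u16string squash_to_bmp(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t c = in[i];
        const bool high = c >= 0xD800 && c <= 0xDBFF;  // first half of a pair
        const bool low  = c >= 0xDC00 && c <= 0xDFFF;  // second half of a pair
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            out.push_back(u'?');  // valid pair: one '?' for the whole character
            ++i;                  // skip the low surrogate as well
        } else if (high || low) {
            out.push_back(u'?');  // unpaired surrogate: also replaced
        } else {
            out.push_back(c);     // ordinary 16-bit character, kept as-is
        }
    }
    return out;
}
```

After this step every element of the string is a complete character, so a regex engine working on 16-bit units never sees half of a pair.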
ICU is IBM's set of Unicode libraries (International Components for Unicode): http://www-306.ibm.com/software/globalization/icu/index.jsp
Unfortunately you still have to write yourself a traits class to do that; a simple wrapper that forwards calls onto c_regex_traits<char> where appropriate would do it. Unfortunately the traits class design is going to change in the next release, which is why I'm nudging you towards the current cvs state rather than the last release. John.
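The forwarding idea can be sketched without touching Boost at all. The helpers below are hypothetical and do not implement the Boost.Regex traits interface; they only illustrate the decision such a wrapper would make: characters 0-0xFF are handed to the narrow-character routines (here the C library's, as a stand-in for c_regex_traits<char>), and everything from 0x0100 up is treated as a non-word character with no case mapping. Note that std::isalnum and std::tolower depend on the current C locale; in the default "C" locale only ASCII letters and digits qualify.

```cpp
#include <cctype>

typedef unsigned short uchar16;  // same typedef as in the original question

// Hypothetical helper: true if the character fits in a narrow char.
inline bool is_narrow(uchar16 c) { return c <= 0xFF; }

// \w test: letters, digits and underscore in 0..0xFF are word characters,
// everything at or above 0x0100 counts as \W.
inline bool is_word16(uchar16 c) {
    return is_narrow(c) &&
           (c == '_' || std::isalnum(static_cast<unsigned char>(c)) != 0);
}

// Case folding for icase: forward narrow characters to tolower,
// leave all wider characters unchanged.
inline uchar16 to_lower16(uchar16 c) {
    return is_narrow(c)
        ? static_cast<uchar16>(std::tolower(static_cast<unsigned char>(c)))
        : c;
}
```

A real traits class would expose the same decisions through whatever member functions the regex library's traits documentation requires.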
I grabbed the cvs version, thanks. I intend to rewrite icu.hpp for my own needs. Is that the right file to tweak? I haven't tried to build with ICU support, but I'd like to ask how much bigger the regex lib becomes (is it megs or just a few kilobytes? :) or does it depend on these mega icu***.dll's?)
The regex lib doesn't get much bigger; it's the dependency on ICU that gets you :-) I suggest that you read the traits class docs, and then use c_regex_traits as an example to work from. John.
Woo-hoo, I managed to compile the new regex with ICU, but I think it's too complicated for the little functionality that I need. I'm not doing anything serious: as an exercise to study Boost.Regex, I wanted to write a simple app that takes regular expressions from JavaScript (only in the form /regex/im) and writes out C++ source code that uses Boost.Regex to match a string and return bool. I assume that the input string is UTF-16 (without the possibility of the extra 2 bytes, just like in JavaScript). Everything was done and tested, until I tried it on FreeBSD, where wregex doesn't exist and where sizeof(wchar_t) is different from VC++ 7.1.

An easier way to get this functionality would be to rip the regex part out of SpiderMonkey (an embeddable JS engine), which borrows its regex part from Perl as far as I know, or, even easier, just to embed it and use it for completely compatible regex matching. But I don't need easy routes :)

I tested with JavaScript, and it does a good job with wide strings too. For example, /^\u03C6+$/i or /^\w+$/i will match "Φφ", correctly recognizing upper and lower case for the Greek phi. I don't really need this *fancy* handling for chars over 0x7F. The entire JavaScript engine in a static lib is less than 2M, so ICU seems a bit heavyweight for such simple functionality. The only extra thing I want to add over the usual boost::regex is to be able to use \xHHHH or \uHHHH, and for it to operate on 16-bit characters.

I looked for c_regex_traits and couldn't find this class; I only found a lot of specializations of this template in different places (like template<> c_regex_traits ...).
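As an illustration of the first step of such an app, here is a minimal sketch of splitting a JavaScript-style literal of the form /regex/im into its pattern and flags. The struct and function names are hypothetical, not taken from the poster's program, and only the i and m flags mentioned above are accepted.

```cpp
#include <string>

// Hypothetical result type: the pattern between the slashes plus the flags.
struct JsRegexLiteral {
    std::string pattern;
    bool ignore_case;  // 'i' flag
    bool multiline;    // 'm' flag
    bool valid;        // false if the text is not of the form /regex/im
};

// Split "/pattern/flags" at the last slash and check the flags.
JsRegexLiteral parse_js_literal(const std::string& text) {
    JsRegexLiteral r = {std::string(), false, false, false};
    if (text.size() < 2 || text[0] != '/') return r;
    const std::string::size_type last = text.rfind('/');
    if (last == 0) return r;  // only the opening slash is present
    r.pattern = text.substr(1, last - 1);
    for (std::string::size_type i = last + 1; i < text.size(); ++i) {
        switch (text[i]) {
            case 'i': r.ignore_case = true; break;
            case 'm': r.multiline = true; break;
            default:  return r;  // unsupported flag: leave valid == false
        }
    }
    r.valid = true;
    return r;
}
```

For example, parsing the text /^\w+$/i gives the pattern ^\w+$ with ignore_case set; the flags would then decide which options the generated C++ code passes to the regex constructor.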
Ok, I made it work the way I describe here: it now understands \u{HHH} notation and operates on 16-bit wide chars. I didn't touch anything from c_regex_traits. I only wrote my own char_traits and then
typedef reg_expression uregex16;
and
struct string16 : public std::basic_string { ... };
I suppose I need to overload c_regex_traits if I need correct interpretation of \w, icase etc. for chars that are outside Latin-1?
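The thread does not show how the \u notation was implemented. One possible approach, sketched here under stated assumptions, is a pre-processing pass that turns each \uHHHH escape (exactly four hex digits) into a single 16-bit character before the pattern reaches the regex engine. The function names are hypothetical, the sketch uses C++11's std::u16string rather than the poster's string16, it widens all other bytes as Latin-1, and it does not re-escape a decoded character that happens to be a regex metacharacter.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if it is not one.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Widen one byte to a 16-bit unit, treating it as Latin-1.
inline char16_t widen(char c) {
    return static_cast<char16_t>(static_cast<unsigned char>(c));
}

// Replace every \uHHHH escape with the 16-bit character it names.
// Other escapes (including an escaped backslash) are copied unchanged.
std::u16string decode_u_escapes(const std::string& pattern) {
    std::u16string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c != '\\' || i + 1 >= pattern.size()) {
            out.push_back(widen(c));
            continue;
        }
        if (pattern[i + 1] == 'u' && i + 5 < pattern.size()) {
            unsigned value = 0;
            bool ok = true;
            for (std::size_t k = i + 2; k <= i + 5; ++k) {
                const int d = hex_value(pattern[k]);
                if (d < 0) { ok = false; break; }
                value = value * 16 + static_cast<unsigned>(d);
            }
            if (ok) {
                out.push_back(static_cast<char16_t>(value));
                i += 5;  // skip 'u' and the four hex digits
                continue;
            }
        }
        // Not a \uHHHH escape: keep the backslash and the escaped character.
        out.push_back(widen(c));
        out.push_back(widen(pattern[i + 1]));
        ++i;
    }
    return out;
}
```

Copying other escapes as a pair keeps an escaped backslash followed by the letters u0041 from being mistaken for an escape.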
Yes, specialise regex_traits<uchar16> using c_regex_traits as a guide. Actually I'm surprised it works at all: regex_traits<uchar16> shouldn't instantiate, but if you're happy then great. John.
In all the routines for char_traits I also added equivalents that accept char (e.g. I have assign(uchar& to, char& from) as well). I haven't tried to run my code on Unix, but on Windows wchar_t is unsigned short, which is why I think it worked. I'll try to run it on FreeBSD later. Thanks
I'm not talking about char_traits, I'm talking about regex_traits; you shouldn't have needed the specialisation for char_traits. John.
participants (2)
-
John Maddock
-
pps