wregex undefined workaround


pps

9 Mar 2005

12:19 p.m.

Hello,

I use FreeBSD 4.11 with gcc 3.3.6 and gcc 3.4.4, and with both I don't have std::wstring defined. I only need it for some internal use, so I defined:

    typedef unsigned short uchar16;
    typedef basic_string<uchar16> string16;

and that was enough for me (I don't do iostreams with this string). But I need regex support for it. As I read in a few mails, boost does not create wregex on some platforms with limited support for wchar_t, so I tried to create one the same way I did with the string:

    typedef reg_expression<uchar16, regex_traits<uchar16>, BOOST_DEFAULT_ALLOCATOR(uchar16)> regex16;

When I try to use regex16 I get a big load of errors. Did I miss something?

Thanks


John Maddock

9 Mar

12:52 p.m.


Yes, the regex library needs a traits class that instructs it how to interact with the locale for that character type. In theory you could just use cpp_regex_traits<uchar16>, but that would require that all the std::locale facets are supported for uchar16 (I'm sure that they are not).

If you're prepared to depend upon ICU, then the current cvs has (optional) support for 16 and 32-bit Unicode character types. The traits class design is also rather simplified and better documented, so that would be the best bet if you wanted to define your own minimalist traits class for uchar16 as well.

John.

pps

4:05 p.m.


> Yes, the regex library needs a traits class that instructs it how to interact with the locale for that character type. In theory you could just use cpp_regex_traits<uchar16>, but that would require that all the std::locale facets are supported for uchar16 (I'm sure that they are not).
>
> If you're prepared to depend upon ICU,

What's ICU == I see you?? :)

> then the current cvs has (optional) support for 16 and 32-bit Unicode character types, the traits

It's like UTF-16, but I replace all the chars above 0xFFFF with '?', so it's UTF-16 that doesn't have 4-byte chars.

> class design is also rather simplified and better documented, so that would be the best bet if you wanted to define your own minimalist traits class for uchar16 as well.

I don't really understand what character traits etc. are (or how to create them myself). I only wanted my regex16 to do the same job for chars 0-0x00FF as boost::regex does for 0-0xFF, with the rest of the chars (>= 0x0100) considered non-word (\W), so that I could only use \xXXXX-\xXXXX notation for their ranges and patterns...
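The '?' substitution described in this message (16-bit units only, no surrogate pairs) could be sketched roughly as follows. This is only an illustration under assumptions: the function name squash_to_bmp and the use of unsigned long for the input code points are hypothetical and do not come from the poster's code, and surrogate values (0xD800-0xDFFF) in the input are simply passed through.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned short uchar16;  // 16-bit unit, as in the thread

// Hypothetical helper: copy code points into 16-bit units, replacing every
// code point above 0xFFFF (which UTF-16 would encode as two units) with '?'.
inline std::vector<uchar16> squash_to_bmp(const std::vector<unsigned long>& code_points)
{
    std::vector<uchar16> out;
    out.reserve(code_points.size());
    for (std::size_t i = 0; i < code_points.size(); ++i) {
        const unsigned long cp = code_points[i];
        out.push_back(cp > 0xFFFFul ? static_cast<uchar16>('?')
                                    : static_cast<uchar16>(cp));
    }
    return out;
}

// Small self-check showing the intended behaviour.
inline void squash_to_bmp_example()
{
    std::vector<unsigned long> in;
    in.push_back(0x0041);   // 'A'
    in.push_back(0x03C6);   // Greek small phi, fits in 16 bits
    in.push_back(0x1F600);  // outside the 16-bit range
    const std::vector<uchar16> out = squash_to_bmp(in);
    assert(out.size() == 3);
    assert(out[0] == 0x0041 && out[1] == 0x03C6 && out[2] == '?');
}
```

Every output element is exactly one 16-bit unit, so string lengths and indices stay one-to-one with characters, which is the property the poster relies on.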

John Maddock

10 Mar

4:34 p.m.


> What's ICU == I see you?? :)

IBM's Unicode libraries: http://www-306.ibm.com/software/globalization/icu/index.jsp

> I don't really understand well what's character_traits etc (and how to create them myself), I only wanted that my regex16 would do the same job for chars 0-0x00FF as boost::regex does for 0-0xff, and the rest of the chars (>=0x0100) would be considered non-words (\W) and so that I could only use \xXXXX-\xXXXX notation for their ranges & patterns...

Unfortunately you still have to write a traits class yourself to do that; a simple wrapper that forwards calls on to c_regex_traits<char> where appropriate would do it. The traits class design is also going to change in the next release, which is why I'm nudging you towards the current cvs state rather than the last release.

John.
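The forwarding idea can be illustrated without Boost at all. The sketch below is NOT the Boost.Regex traits interface (a real traits class has more members, and their exact set differs between the released version and cvs); it only shows the policy being discussed: hand values 0-0xFF to the narrow-character routines and treat everything else as a non-word character with no case mapping. The struct name narrow_forwarding_policy and its members are hypothetical, and the C library functions used here depend on the current C locale (in the default "C" locale only ASCII letters and digits count as alphanumeric).

```cpp
#include <cassert>
#include <cctype>

typedef unsigned short uchar16;  // 16-bit unit, as in the thread

// Hypothetical illustration of "forward where appropriate"; a real regex
// traits class would wrap its own narrow-char traits instead of <cctype>.
struct narrow_forwarding_policy
{
    // True if c fits in one 8-bit char and can be handed to narrow-char code.
    static bool is_narrow(uchar16 c) { return c <= 0xFF; }

    // \w-style test: forward to the C library for 0..0xFF, non-word above.
    static bool is_word(uchar16 c)
    {
        if (!is_narrow(c)) return false;
        return c == '_' || std::isalnum(static_cast<unsigned char>(c)) != 0;
    }

    // Case folding: forward to std::tolower for 0..0xFF, identity above.
    static uchar16 to_lower(uchar16 c)
    {
        if (!is_narrow(c)) return c;
        return static_cast<uchar16>(std::tolower(static_cast<unsigned char>(c)));
    }
};

// Small self-check showing the intended behaviour.
inline void narrow_forwarding_example()
{
    assert(narrow_forwarding_policy::is_word('a'));
    assert(narrow_forwarding_policy::is_word('_'));
    assert(!narrow_forwarding_policy::is_word(' '));
    assert(!narrow_forwarding_policy::is_word(0x03C6));            // Greek phi: treated as \W
    assert(narrow_forwarding_policy::to_lower('A') == 'a');
    assert(narrow_forwarding_policy::to_lower(0x03A6) == 0x03A6);  // no case mapping above 0xFF
}
```

A real traits class would apply the same range check inside each member the library calls (classification, case translation, collation) before delegating to the narrow traits.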

pps

13 Mar

9:53 a.m.


I grabbed the cvs version, thanks. I intend to rewrite icu.hpp for my own needs. Is that the right file to tweak? I haven't tried to build with ICU support, but I'd like to ask how much bigger the regex lib becomes. (Is it megabytes or just a few kilobytes? :) Or does it depend on these mega icu***.dll's?)

John Maddock

11:04 a.m.


The regex lib doesn't get much bigger, it's the dependency on ICU that gets you :-)

I suggest that you read the traits class docs, and then use c_regex_traits as an example to work from.

John.

pps

11:37 a.m.

> I suggest that you read the traits class docs, and then use c_regex_traits as an example to work from.

I already did that, from the STL docs. Thanks

pps

14 Mar

4:24 a.m.

> The regex lib doesn't get much bigger, it's the dependency to ICU that gets you :-)
>
> I suggest that you read the traits class docs, and then use c_regex_traits as an example to work from.

Woo-hoo, I managed to compile the new regex with ICU, but I think it's too complicated for the little functionality that I need. I'm not really doing anything serious: as an exercise to study boost regex I wanted to write a simple app that takes regular expressions from JavaScript (in the form /regex/im only) and writes out C++ source code that uses boost regex to do a string match and return bool. I assume that the input string is UTF-16 (without the possibility of the extra 2 bytes, just like in JavaScript). Everything was done and tested, until I tried it on FreeBSD, where wregex didn't exist and where sizeof(wchar_t) is different from VC 7.1.

An easier way to get this functionality is to rip the regex part out of SpiderMonkey (an embeddable JS engine), which borrows its regex part from Perl as far as I know, or, even easier, just to embed the engine and use it for completely compatible regex matching. But I don't need easy routes :)

I tested with JavaScript and it does a good job with wide strings too. For example, /^\u03C6+$/i or /^\w+$/i will match "Φφ", correctly recognizing upper and lower case for the Greek phi. I don't really need this *fancy* handling for chars over 0x7F. The entire JavaScript engine in a static lib is less than 2M, so ICU seems a bit heavyweight for such simple functionality. The only extra thing I want to add over the usual boost::regex is to be able to use \xHHHH or \uHHHH, and for it to operate on 16-bit characters.

I looked for c_regex_traits and couldn't find this class. I only found a lot of specializations of this template in different places (like template<> c_regex_traits ...).
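For the \uHHHH part, decoding one escape into a 16-bit value is straightforward. Below is a minimal sketch, assuming the pattern source arrives as a narrow string and that an escape is exactly a backslash, 'u' and four hex digits; the function names hex_digit_value and decode_u_escape are hypothetical and not part of Boost.Regex.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

typedef unsigned short uchar16;  // 16-bit unit, as in the thread

// Value of one hex digit, or -1 if c is not a hex digit.
inline int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Tries to decode "\uHHHH" starting at pattern[pos]. On success stores the
// 16-bit value in out and returns true; otherwise returns false and leaves
// out untouched.
inline bool decode_u_escape(const std::string& pattern, std::size_t pos, uchar16& out)
{
    if (pos > pattern.size() || pattern.size() - pos < 6) return false;
    if (pattern[pos] != '\\' || pattern[pos + 1] != 'u') return false;
    unsigned value = 0;
    for (std::size_t i = 2; i < 6; ++i) {
        const int digit = hex_digit_value(pattern[pos + i]);
        if (digit < 0) return false;
        value = value * 16 + static_cast<unsigned>(digit);
    }
    out = static_cast<uchar16>(value);
    return true;
}

// Small self-check showing the intended behaviour.
inline void decode_u_escape_example()
{
    uchar16 unit = 0;
    assert(decode_u_escape("^\\u03C6+$", 1, unit));  // escape starts at index 1
    assert(unit == 0x03C6);
    assert(!decode_u_escape("\\u03G6", 0, unit));    // 'G' is not a hex digit
    assert(!decode_u_escape("\\u03", 0, unit));      // too short
}
```

A generator like the one described could run this over the JavaScript pattern text and emit the resulting 16-bit units directly into the pattern it passes to the regex class.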

pps

6:15 a.m.

> The only extra thing I want to add over usual boost::regex is to be able to use \xHHHH or \uHHHH and that it would operate on 16-bit characters.

OK, I made it work the way I describe here: it now understands \u{HHH} notation and operates on 16-bit wide chars. I didn't touch anything from c_regex_traits. I only wrote my own char_traits and then:

    typedef reg_expression<uchar16, regex_traits<uchar16>, BOOST_DEFAULT_ALLOCATOR(uchar16)> uregex16;

and

    struct string16 : public std::basic_string<uchar16, my_char_traits> { ... };

I suppose I need to overload c_regex_traits if I need correct interpretation of \w, icase etc. for chars that are outside Latin-1?
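The poster's own char_traits is not shown in the thread. For reference, a minimal user-defined character traits class for a 16-bit unsigned type might look like the sketch below. The names uchar16_traits and make_string16 are hypothetical, and this covers only what std::basic_string needs; it does nothing for the regex traits question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

typedef unsigned short uchar16;  // 16-bit unit, as in the thread

// Hypothetical stand-in for the poster's my_char_traits.
struct uchar16_traits
{
    typedef uchar16        char_type;
    typedef unsigned long  int_type;   // holds every uchar16 value plus eof()
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& to, const char_type& from) { to = from; }
    static bool eq(const char_type& a, const char_type& b) { return a == b; }
    static bool lt(const char_type& a, const char_type& b) { return a < b; }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n)
    {
        if (n) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n)
    {
        if (n) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* assign(char_type* s, std::size_t n, char_type c)
    {
        for (std::size_t i = 0; i < n; ++i) s[i] = c;
        return s;
    }
    static int_type eof() { return static_cast<int_type>(-1); }
    static int_type not_eof(const int_type& c) { return eq_int_type(c, eof()) ? 0 : c; }
    static char_type to_char_type(const int_type& c) { return static_cast<char_type>(c); }
    static int_type to_int_type(const char_type& c) { return c; }
    static bool eq_int_type(const int_type& a, const int_type& b) { return a == b; }
};

typedef std::basic_string<uchar16, uchar16_traits> string16;

// Widen a plain ASCII literal into a string16 (illustration only).
inline string16 make_string16(const char* ascii)
{
    string16 s;
    for (; *ascii; ++ascii)
        s.push_back(static_cast<uchar16>(static_cast<unsigned char>(*ascii)));
    return s;
}

// Small self-check showing the intended behaviour.
inline void string16_example()
{
    string16 s = make_string16("abc");
    s.push_back(static_cast<uchar16>(0x03C6));
    assert(s.size() == 4);
    assert(s.find(static_cast<uchar16>(0x03C6)) == 3);
    assert(make_string16("abc") < make_string16("abd"));
}
```

Using a typedef rather than deriving from std::basic_string, as the message above does, avoids inheriting from a class without a virtual destructor; either approach gives the regex code the same iterator and value types.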

John Maddock

15 Mar

11:04 a.m.


Yes, specialise regex_traits<uchar16> using c_regex_traits as a guide. Actually I'm surprised it works at all: regex_traits<uchar16> shouldn't instantiate, but if you're happy then great.

John.

pps

5:05 p.m.


In all the routines for char_traits I added equivalents that accept char as well (e.g. I also have assign(uchar& to, char& from)). I haven't tried to run my code on unix, but on Windows wchar_t is unsigned short, which is why I think it worked. I'll try to run it on FreeBSD later.

Thanks

John Maddock

5:41 p.m.


I'm not talking about char_traits, I'm talking about regex_traits. You shouldn't have needed the specialisation for char_traits.

John.
