regex with multi-byte characters

Hello all,
I am using wide characters from the Xerces-C library. These are defined as always being 2-byte characters by:
typedef short XMLCh; // can also be defined through the #define preprocessor directive, I don't remember exactly
I also saw a post, http://lists.boost.org/boost-users/2003/09/5095.php, where John answered that it is better to convert these character sequences on the fly to char. Somehow I don't like this approach, since I believe that with the wrong encoding set on the system some information might get lost.
Is it possible to use XMLCh as the character type, with suitable character traits, in the regular expression if XMLCh* points to a null-terminated sequence of 2-byte characters?
--
With Kind Regards,
Ovanes

I also saw a post http://lists.boost.org/boost-users/2003/09/5095.php where John answered that it is better to convert these character sequences on-the-fly to char. Somehow I don't like this approach, since I believe that with wrong encoding set on the system some information might get lost.
Is it possible to use XMLCh as character traits in the regular expression if XMLCh* points to a null-terminated 2 bytes character sequence?
There are several options:
1) Convert the characters on the fly to *wchar_t* and use boost::wregex; it's a trivial widening of your 16-bit characters, so nothing will get lost. You could probably use transform_iterator for such a task.
2) In Boost 1.33 there will be more [optional] support for Unicode, but it requires that you use the ICU library (http://www.ibm.com/software/globalization/icu/) to provide some of the basics. You can then correctly scan 16-bit Unicode code sequences, and have surrogate pairs correctly handled, as well as have access to the Unicode property names in regexes etc. However, the character type for 16-bit code points is either unsigned short or wchar_t depending upon the platform (this is a requirement for interoperability with ICU), so you may have to fiddle with your XMLCh setup to get everything working smoothly. See http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...
3) You could define your own regex traits class for the character type that you're using: if you go down this road then make sure that you start with Boost-1.33, as it has better docs in this area, as well as redesigned traits class requirements compared to 1.32.
Hope this helps,
John.

John, many thanks for your answer. I would like to comment on some of your points.
On Wed, July 20, 2005 12:05, John Maddock said:
I also saw a post http://lists.boost.org/boost-users/2003/09/5095.php where John answered that it is better to convert these character sequences on-the-fly to char. Somehow I don't like this approach, since I believe that with wrong encoding set on the system some information might get lost.
Is it possible to use XMLCh as character traits in the regular expression if XMLCh* points to a null-terminated 2 bytes character sequence?
There are several options:
1) Convert the characters on the fly to *wchar_t* and use boost::wregex; it's a trivial widening of your 16-bit characters, so nothing will get lost. You could probably use transform_iterator for such a task.
That's possible; the only problem is that *wchar_t* is not always 2 bytes long. At least that is what I read on the Xerces-C Build Instructions page at http://xml.apache.org/xerces-c/build-misc.html (What should I define XMLCh to be?). Here is an excerpt:
... Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is not based on Unicode at all (HP/UX, AS/400, system 390). ...
In former releases XMLCh was defined as wchar_t, but the Apache developers decided to abandon that because of:
...
- Portability problems with any code that assumes that the types of XMLCh and wchar_t are compatible
- Excessive memory usage, especially in the DOM, on platforms with 32 bit wchar_t.
- utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t on Solaris and Linux. The problem occurs with Unicode characters with values greater than 64k; in ucs-4 the value is stored as a single 32 bit quantity. With utf-16, the value will be stored as a "surrogate pair" of two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still create the utf-16 encoded surrogate pairs, which are illegal in ucs-4 encoded wchar_t strings.
...
2) In Boost 1.33 there will be more [optional] support for Unicode, but it requires that you use the ICU library (http://www.ibm.com/software/globalization/icu/) to provide some of the basics. You can then correctly scan 16-bit Unicode code sequences, and have surrogate pairs correctly handled, as well as have access to the Unicode property names in regexes etc. However, the character type for 16-bit code points is either unsigned short or wchar_t depending upon the platform (this is a requirement for interoperability with ICU), so you may have to fiddle with your XMLCh setup to get everything working smoothly. See http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...
Ok, I understand. But then I may need to make conversions again (depending on the platform). Maybe it would be better to offer a platform-independent way of handling characters, as in the third possibility you mentioned.
3) You could define your own regex traits class for the character type that you're using: if you go down this road then make sure that you start with Boost-1.33, as it has better docs in this area, as well as redesigned traits class requirements compared to 1.32.
Can I read more about it? Can you point me to a document which describes the traits class? What are the key points of this class? I tried to take a look at the sources, but it was hard to understand what is what, since there are not many comments and a lot of typedefs which are hard to trace back.
I am very thankful for such a nice library and your effort. Many thanks for your time.
Hope this helps,
John.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
With Kind Regards, Ovanes
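For readers unfamiliar with the surrogate pairs mentioned in the Xerces-C excerpt above, the following is a minimal C++ sketch of how a code point above U+FFFF maps to two 16-bit UTF-16 code units and back. The function names to_surrogates and from_surrogates are hypothetical and come from neither Xerces-C nor Boost; the arithmetic is the standard UTF-16 encoding rule, and the sketch assumes its inputs are already in the valid ranges.

```cpp
#include <cstdint>
#include <utility>

// Split a code point in the range U+10000..U+10FFFF into a UTF-16
// surrogate pair (high unit first). No range checking is done here.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t code_point)
{
    const std::uint32_t offset = code_point - 0x10000u;   // 20 significant bits
    const std::uint16_t high =
        static_cast<std::uint16_t>(0xD800u + (offset >> 10));     // top 10 bits
    const std::uint16_t low =
        static_cast<std::uint16_t>(0xDC00u + (offset & 0x3FFu));  // low 10 bits
    return std::make_pair(high, low);
}

// Recombine a valid high/low surrogate pair into a single 32-bit value,
// which is what a ucs-4 encoded wchar_t string would store instead.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

This is why widening each 16-bit unit separately into a 32-bit wchar_t does not produce valid ucs-4: the two halves have to be combined into one value.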

There are several options:
1) Convert the characters on the fly to *wchar_t* and use boost::wregex; it's a trivial widening of your 16-bit characters, so nothing will get lost. You could probably use transform_iterator for such a task.
That's possible; the only problem is that *wchar_t* is not always 2 bytes long. At least that is what I read on the Xerces-C Build Instructions page at http://xml.apache.org/xerces-c/build-misc.html (What should I define XMLCh to be?). Here is an excerpt:
Hey, stop right there! I said use an adapter, not a cast:
template <class Iterator>
struct my_adapter
{
   my_adapter(Iterator p) : m_position(p) {}
   // widen each 16-bit character to wchar_t on dereference
   wchar_t operator*() const { return *m_position; }
   my_adapter& operator++() { ++m_position; return *this; }
   // other members to make this a valid iterator go here...
private:
   Iterator m_position;
};
Then pass my_adapter's as the iterator type to the regex algorithms, rather
than a XMLCh*, for example:
bool is_regex_present(XMLCh const* p, int len, boost::wregex const& e)
{
   my_adapter<XMLCh const*> i(p), j(p+len);
   return boost::regex_search(i, j, e);
}
2) In Boost 1.33 there will be more [optional] support for Unicode, but it requires that you use the ICU library (http://www.ibm.com/software/globalization/icu/) to provide some of the basics. You can then correctly scan 16-bit Unicode code sequences, and have surrogate pairs correctly handled, as well as have access to the Unicode property names in regexes etc. However, the character type for 16-bit code points is either unsigned short or wchar_t depending upon the platform (this is a requirement for interoperability with ICU), so you may have to fiddle with your XMLCh setup to get everything working smoothly. See http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...
Ok, I understand. But then I may need to make conversions again (depending on the platform). Maybe it would be better to offer a platform-independent way of handling characters, as in the third possibility you mentioned.
Actually probably not: Xerces can be built with ICU support (see http://xml.apache.org/xerces-c/build-misc.html#ICUPerl), so if you define XMLCh to be the same type as ICU's UChar data type, then no conversions are required.
3) You could define your own regex traits class for the character type that you're using: if you go down this road then make sure that you start with Boost-1.33, as it has better docs in this area, as well as redesigned traits class requirements compared to 1.32.
Can I read more about it? Can you point me to a document which describes the traits class? What are the key points of this class? I tried to take a look at the sources, but it was hard to understand what is what, since there are not many comments and a lot of typedefs which are hard to trace back.
The traits class requirements are here: http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/.... I should warn you it's still quite a bit of work to support a new character type.
John.
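The adapter John outlines for option 1 leaves out "other members to make this a valid iterator". As a rough, self-contained sketch of what those members might look like, the version below fills them in for a bidirectional iterator. To keep it compilable without Boost it uses std::wregex and std::regex_search as stand-ins for boost::wregex and boost::regex_search, which is a substitution for illustration only. The name widening_adapter is hypothetical, and XMLCh is assumed here to be unsigned short.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>

// Assumption: stand-in for Xerces-C's XMLCh, taken to be a 16-bit unsigned type.
typedef unsigned short XMLCh;

// Wraps an iterator over 16-bit characters and yields wchar_t on dereference.
template <class Iterator>
class widening_adapter
{
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef wchar_t                         value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const wchar_t*                  pointer;
    typedef wchar_t                         reference;  // returned by value

    widening_adapter() : m_position() {}
    explicit widening_adapter(Iterator p) : m_position(p) {}

    // The widening happens here, one character at a time; nothing is copied.
    wchar_t operator*() const { return static_cast<wchar_t>(*m_position); }

    widening_adapter& operator++() { ++m_position; return *this; }
    widening_adapter operator++(int) { widening_adapter t(*this); ++m_position; return t; }
    widening_adapter& operator--() { --m_position; return *this; }
    widening_adapter operator--(int) { widening_adapter t(*this); --m_position; return t; }

    friend bool operator==(const widening_adapter& a, const widening_adapter& b)
    { return a.m_position == b.m_position; }
    friend bool operator!=(const widening_adapter& a, const widening_adapter& b)
    { return !(a == b); }

private:
    Iterator m_position;
};

// Same shape as John's is_regex_present, with std::wregex standing in
// for boost::wregex.
bool is_regex_present(XMLCh const* p, std::size_t len, std::wregex const& e)
{
    widening_adapter<XMLCh const*> i(p), j(p + len);
    return std::regex_search(i, j, e);
}
```

Note that this only widens individual 16-bit units; on platforms with a 32-bit wchar_t, surrogate pairs stay as two separate values, which is the limitation option 2 is meant to address.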

Ok, thanks for the answer. What do you think: could Boost.Regex make use of such a traits class, or would you rather not include it in the distribution?
On Wed, July 20, 2005 19:02, John Maddock said:
There are several options:
1) Convert the characters on the fly to *wchar_t* and use boost::wregex; it's a trivial widening of your 16-bit characters, so nothing will get lost. You could probably use transform_iterator for such a task.
That's possible; the only problem is that *wchar_t* is not always 2 bytes long. At least that is what I read on the Xerces-C Build Instructions page at http://xml.apache.org/xerces-c/build-misc.html (What should I define XMLCh to be?). Here is an excerpt:
Hey, stop right there! I said use an adapter, not a cast:
template <class Iterator>
struct my_adapter
{
   my_adapter(Iterator p) : m_position(p) {}
   wchar_t operator*() const { return *m_position; }
   my_adapter& operator++() { ++m_position; return *this; }
   // other members to make this a valid iterator go here...
private:
   Iterator m_position;
};
Then pass my_adapter's as the iterator type to the regex algorithms, rather than a XMLCh*, for example:
bool is_regex_present(XMLCh const* p, int len, boost::wregex const& e)
{
   my_adapter<XMLCh const*> i(p), j(p+len);
   return boost::regex_search(i, j, e);
}
I was not going to use a cast. My point was the following: even if I use *wchar_t*, I still need a platform-dependent conversion from one character type to another, since *wchar_t* is platform dependent.
2) In Boost 1.33 there will be more [optional] support for Unicode, but it requires that you use the ICU library (http://www.ibm.com/software/globalization/icu/) to provide some of the basics. You can then correctly scan 16-bit Unicode code sequences, and have surrogate pairs correctly handled, as well as have access to the Unicode property names in regexes etc. However, the character type for 16-bit code points is either unsigned short or wchar_t depending upon the platform (this is a requirement for interoperability with ICU), so you may have to fiddle with your XMLCh setup to get everything working smoothly. See http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...
Ok, I understand. But then I may need to make conversions again (depending on the platform). Maybe it would be better to offer a platform-independent way of handling characters, as in the third possibility you mentioned.
Actually probably not: Xerces can be built with ICU support (see http://xml.apache.org/xerces-c/build-misc.html#ICUPerl), so if you define XMLCh to be the same type as ICU's UChar data type, then no conversions are required.
There are too many developers involved in the process for us to force everyone to recompile Xerces-C with specific settings, so I don't think this would be an option for us. In our case it can also lead to unpredictable results if someone replaces xerces-c with a freshly compiled xerces-c without ICU support. I am a little bit sceptical about this.
3) You could define your own regex traits class for the character type that you're using: if you go down this road then make sure that you start with Boost-1.33, as it has better docs in this area, as well as redesigned traits class requirements compared to 1.32.
Can I read more about it? Can you point me to a document which describes the traits class? What are the key points of this class? I tried to take a look at the sources, but it was hard to understand what is what, since there are not many comments and a lot of typedefs which are hard to trace back.
The traits class requirements are here: http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/....
I should warn you it's still quite a bit of work to support a new character type.
I think I should give it a try. ;)
John.
With Kind Regards, Ovanes

What do you think: could Boost.Regex make use of such a traits class, or would you rather not include it in the distribution?
I don't know, it depends what it does: how do you plan to handle character classification in a portable manner for unsigned short?
There are too many developers involved in the process for us to force everyone to recompile Xerces-C with specific settings, so I don't think this would be an option for us. In our case it can also lead to unpredictable results if someone replaces xerces-c with a freshly compiled xerces-c without ICU support. I am a little bit sceptical about this.
OK let me try one more time: if you compile regex *only* with ICU support, and use the iterator based u32regex_match/u32regex_search algorithms (or their equivalent regex iterators) then it doesn't matter what character type Xerces or anything else uses as long as:
It's an 8-bit type: then it'll be treated as an [unsigned] UTF-8 encoded string.
Or: It's a 16-bit type, then it'll be treated as an [unsigned] UTF-16 encoded string.
Or: It's a 32-bit type, then it'll be treated as an [unsigned] UTF-32 encoded string.
Is that generic enough for you? :-)
John.
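The rule John describes, where the encoding is inferred purely from the width of the character type, can be pictured with a small compile-time dispatch. The sketch below is only an illustration of that idea in standard C++; it is not how Boost.Regex implements it, and the names utf_encoding, encoding_for_width and encoding_of are hypothetical.

```cpp
#include <climits>
#include <cstddef>

enum utf_encoding { utf8, utf16, utf32 };

// Primary template is declared but not defined, so a character type of
// any other width is rejected at compile time.
template <std::size_t Bits> struct encoding_for_width;

template <> struct encoding_for_width<8>  { static utf_encoding get() { return utf8;  } };
template <> struct encoding_for_width<16> { static utf_encoding get() { return utf16; } };
template <> struct encoding_for_width<32> { static utf_encoding get() { return utf32; } };

// Only the size of CharT matters, not its name or signedness, so
// short, unsigned short and char16_t all select UTF-16.
template <class CharT>
utf_encoding encoding_of()
{
    return encoding_for_width<sizeof(CharT) * CHAR_BIT>::get();
}
```

Under this scheme a 16-bit XMLCh would be treated as UTF-16 however Xerces-C happens to define it, which is the point of John's suggestion.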

On Thu, July 21, 2005 14:54, John Maddock said:
What do you think: could Boost.Regex make use of such a traits class, or would you rather not include it in the distribution?
I don't know, it depends what it does: how do you plan to handle character classification in a portable manner for unsigned short?
I plan to do it the same way Xerces-C does it. As I understand it, they put a 2-byte code into the short and perform various operations on it. I have to investigate exactly how it is done.
There are too many developers involved in the process for us to force everyone to recompile Xerces-C with specific settings, so I don't think this would be an option for us. In our case it can also lead to unpredictable results if someone replaces xerces-c with a freshly compiled xerces-c without ICU support. I am a little bit sceptical about this.
OK let me try one more time: if you compile regex *only* with ICU support, and use the iterator based u32regex_match/u32regex_search algorithms (or their equivalent regex iterators) then it doesn't matter what character type Xerces or anything else uses as long as:
It's an 8-bit type: then it'll be treated as an [unsigned] UTF-8 encoded string.
Or: It's a 16-bit type, then it'll be treated as an [unsigned] UTF-16 encoded string.
Or: It's a 32-bit type, then it'll be treated as an [unsigned] UTF-32 encoded string.
Is that generic enough for you? :-)
Yes, I will do some tests. If they are OK, I will compile regex with ICU support. Otherwise I will write my own traits class for unsigned short characters. Thanks a lot for your help.
John.
With Kind Regards, Ovanes

Ovanes Markarian wrote:
Hello all,
I am using wide characters from the Xerces-C library. These are defined as always being 2-byte characters by:
typedef short XMLCh; // can also be defined through the #define preprocessor directive, I don't remember exactly
Actually, in 2.6.0 it's
typedef unsigned short XMLCh;
The source files in xercesc/util/Compilers contain comments saying that XMLCh is now unsigned short on all platforms.
Jonathan
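Whether XMLCh is signed matters for the widening approach discussed earlier in the thread: converting a negative 16-bit value to a wider type sign-extends it, while an unsigned 16-bit value is zero-extended. The sketch below shows the difference using fixed-width types so the result does not depend on the platform's wchar_t; the function names widen_signed and widen_unsigned are hypothetical.

```cpp
#include <cstdint>

// A signed 16-bit unit is sign-extended when widened: any unit at or
// above 0x8000 is negative as a short and picks up 0xFFFF in the
// upper half of the 32-bit result.
std::uint32_t widen_signed(std::int16_t unit)
{
    return static_cast<std::uint32_t>(static_cast<std::int32_t>(unit));
}

// An unsigned 16-bit unit is zero-extended, so its value is preserved.
std::uint32_t widen_unsigned(std::uint16_t unit)
{
    return unit;
}
```

For example, the bit pattern 0xD83D is -10179 when stored in a signed short, and widens to 0xFFFFD83D rather than 0x0000D83D. With unsigned short, as Jonathan reports for 2.6.0, a plain conversion to a wider character type preserves the value.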
Participants (3): John Maddock, Jonathan Turkanis, Ovanes Markarian