
I also saw a post, http://lists.boost.org/boost-users/2003/09/5095.php, where John answered that it is better to convert these character sequences on the fly to char. I don't like this approach, since I believe that some information might get lost if the wrong encoding is set on the system.
Is it possible to use XMLCh as the character type in the regular expression if XMLCh* points to a null-terminated sequence of 2-byte characters?
There are several options:

1) Convert the characters on the fly to *wchar_t* and use boost::wregex. It's a trivial widening of your 16-bit characters, so nothing will get lost. You could probably use transform_iterator for such a task.

2) In Boost 1.33 there will be more [optional] support for Unicode, but it requires that you use the ICU library (http://www.ibm.com/software/globalization/icu/) to provide some of the basics. You can then correctly scan 16-bit Unicode code sequences and have surrogate pairs handled correctly, as well as have access to the Unicode property names in regexes, etc. However, the character type for 16-bit code units is either unsigned short or wchar_t depending upon the platform (this is a requirement for interoperability with ICU), so you may have to fiddle with your XMLCh setup to get everything working smoothly. See http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...

3) You could define your own regex traits class for the character type that you're using. If you go down this road, make sure that you start with Boost 1.33, as it has better docs in this area, as well as redesigned traits class requirements compared to 1.32.

Hope this helps, John.
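As a rough illustration of option 1, here is a minimal sketch that widens a null-terminated 16-bit sequence to wchar_t and then matches it with a wide regex. Several things in it are assumptions rather than anything from the thread: XMLCh16 is a hypothetical stand-in for XMLCh (assumed to be a 16-bit unsigned type), widen and matches are made-up helper names, and std::wregex is used in place of boost::wregex so that the example compiles without Boost. It copies the text into a std::wstring instead of adapting it lazily with transform_iterator, which is simpler but costs one allocation per call. Where wchar_t is 32 bits wide, surrogate pairs are widened unit by unit and are not combined into a single character.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for XMLCh; assumed to be a 16-bit unsigned type.
typedef unsigned short XMLCh16;

// Widen a null-terminated 16-bit sequence to std::wstring, one unit at a time.
// Each 16-bit value fits in wchar_t on common platforms, so no unit is altered.
std::wstring widen(const XMLCh16* s) {
    std::wstring out;
    for (; *s != 0; ++s) {
        out.push_back(static_cast<wchar_t>(*s));
    }
    return out;
}

// Return true if the whole widened text matches the wide pattern.
bool matches(const XMLCh16* text, const std::wstring& pattern) {
    const std::wregex re(pattern);
    const std::wstring wide = widen(text);
    return std::regex_match(wide, re);
}
```

With Boost available, swapping std::wregex and std::regex_match for boost::wregex and boost::regex_match should need only a change of header and namespace, since the two interfaces are closely related.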