Re: [Boost-users] regex with multi-byte characters

20 Jul 2005

      ...
I also saw a post http://lists.boost.org/boost-users/2003/09/5095.php
where John answered that it is better to convert these character
sequences on the fly to char. Somehow I don't like this approach, since
I believe that with the wrong encoding set on the system some information
might get lost.
Is it possible to use XMLCh as character traits in the regular expression
if XMLCh* points to a null-terminated sequence of 2-byte characters?

There are several options:

1) Convert the characters on the fly to *wchar_t* and use boost::wregex. 
This is a trivial widening of your 16-bit characters, so nothing will get 
lost. You could probably use transform_iterator for such a task.

2) In Boost 1.33 there will be more [optional] support for Unicode, but it 
requires that you use the ICU library 
(http://www.ibm.com/software/globalization/icu/) to provide some of the 
basics. You can then correctly scan 16-bit Unicode code sequences, have 
surrogate pairs handled correctly, and have access to the Unicode property 
names in regexes etc. However, the character type for 16-bit code units is 
either unsigned short or wchar_t depending upon the platform (this is a 
requirement for interoperability with ICU), so you may have to fiddle with 
your XMLCh setup to get everything working smoothly.  See 
http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...

3) You could define your own regex traits class for the character type that 
you're using. If you go down this road then make sure that you start with 
Boost 1.33, as it has better docs in this area, as well as redesigned traits 
class requirements compared to 1.32.
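
To illustrate option 1, here is a rough sketch of widening on the fly. It 
is not Boost code: to keep it buildable with only the standard library it 
uses std::wregex as a stand-in for boost::wregex and a small hand-written 
iterator as a stand-in for transform_iterator. The XMLCh typedef (assumed 
here to be a 16-bit unsigned type), WideningIterator, xmlch_length and 
first_match are all names made up for this example.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Assumption: stand-in for Xerces' XMLCh, taken here to be 16 bits wide.
using XMLCh = unsigned short;

// Minimal bidirectional iterator that widens each XMLCh to wchar_t when it
// is dereferenced, so no converted copy of the input is built up front.
class WideningIterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = wchar_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const wchar_t*;
    using reference = wchar_t;  // returned by value, as a transforming iterator does

    WideningIterator() = default;
    explicit WideningIterator(const XMLCh* p) : p_(p) {}

    wchar_t operator*() const { return static_cast<wchar_t>(*p_); }

    WideningIterator& operator++() { ++p_; return *this; }
    WideningIterator operator++(int) { WideningIterator t(*this); ++p_; return t; }
    WideningIterator& operator--() { --p_; return *this; }
    WideningIterator operator--(int) { WideningIterator t(*this); --p_; return t; }

    friend bool operator==(const WideningIterator& a, const WideningIterator& b) {
        return a.p_ == b.p_;
    }
    friend bool operator!=(const WideningIterator& a, const WideningIterator& b) {
        return a.p_ != b.p_;
    }

private:
    const XMLCh* p_ = nullptr;
};

// Number of characters before the terminating null.
std::size_t xmlch_length(const XMLCh* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Returns the first match of `pattern` in `text`, or an empty string if
// there is no match.
std::wstring first_match(const XMLCh* text, const std::wstring& pattern) {
    const std::wregex re(pattern);
    WideningIterator first(text);
    WideningIterator last(text + xmlch_length(text));
    std::match_results<WideningIterator> m;
    if (std::regex_search(first, last, m, re)) {
        return m.str();
    }
    return std::wstring();
}
```

With Boost the same idea should carry over by swapping in boost::wregex 
and boost::regex_search, and by building the iterators with 
transform_iterator instead of the hand-written class. Note that this only 
widens each 16-bit value individually; it does nothing special for 
surrogate pairs, which is what option 2 addresses.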

Hope this helps,

John.

John Maddock