
There are several options:
1) Convert the characters on the fly to *wchar_t* and use boost::wregex. It's a trivial widening of your 16-bit characters, so nothing will get lost. You could probably use transform_iterator for such a task.
That's possible; the only problem is that *wchar_t* is not always 2 bytes long. At least, that is what I read on the Xerces-C Build Instructions page at http://xml.apache.org/xerces-c/build-misc.html (What should I define XMLCh to be?). Here is an excerpt:
Hey, stop right there! I said use an adapter, not a cast:
template <class Iterator>
struct my_adapter
{
   my_adapter(Iterator p) : m_position(p) {}
   // Widen the underlying 16-bit character to wchar_t on each read.
   wchar_t operator*() const { return *m_position; }
   my_adapter& operator++() { ++m_position; return *this; }
   // other members to make this a valid iterator go here...
private:
   Iterator m_position;
};
Then pass my_adapter objects as the iterators to the regex algorithms, rather
than an XMLCh*, for example:
bool is_regex_present(XMLCh const* p, int len, boost::wregex const&e)
{
my_adapter
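The function above is cut off in the archived message. What follows is a minimal sketch of how such an adapter might be filled out and used, not the original code. It rests on several assumptions: XMLCh is taken to be a 16-bit unsigned type, std::wregex and std::regex_search stand in for boost::wregex and boost::regex_search so that the sketch compiles without Boost, and the name widening_adapter is hypothetical. The regex algorithms need a bidirectional iterator, so the sketch adds the decrement, comparison and typedef members that the comment in my_adapter leaves out.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <regex>

// Assumption: a 16-bit unsigned code unit standing in for Xerces' XMLCh.
typedef std::uint16_t XMLCh;

// Widening is only lossless if wchar_t is at least as wide as XMLCh.
static_assert(sizeof(wchar_t) >= sizeof(XMLCh),
              "wchar_t must be able to hold every XMLCh value");

// Hypothetical bidirectional iterator adapter: reads through the wrapped
// iterator and widens each code unit to wchar_t.
template <class Iterator>
class widening_adapter
{
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef wchar_t                         value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const wchar_t*                  pointer;
    typedef wchar_t                         reference; // returned by value

    // Default construction is needed because match results store iterators.
    widening_adapter() : m_position() {}
    explicit widening_adapter(Iterator p) : m_position(p) {}

    wchar_t operator*() const { return static_cast<wchar_t>(*m_position); }

    widening_adapter& operator++() { ++m_position; return *this; }
    widening_adapter operator++(int)
    { widening_adapter old(*this); ++m_position; return old; }

    widening_adapter& operator--() { --m_position; return *this; }
    widening_adapter operator--(int)
    { widening_adapter old(*this); --m_position; return old; }

    friend bool operator==(const widening_adapter& a, const widening_adapter& b)
    { return a.m_position == b.m_position; }
    friend bool operator!=(const widening_adapter& a, const widening_adapter& b)
    { return !(a == b); }

private:
    Iterator m_position;
};

// Returns true if the expression matches anywhere in the first len
// code units starting at p. No copy of the text is made.
bool is_regex_present(XMLCh const* p, std::size_t len, std::wregex const& e)
{
    typedef widening_adapter<XMLCh const*> iter;
    return std::regex_search(iter(p), iter(p + len), e);
}
```

With Boost, the same adapter type would presumably be passed to boost::regex_search with a boost::wregex in the same way. Note that this approach widens individual 16-bit code units, so surrogate pairs are not combined into single characters.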
2) In Boost 1.33 there will be more [optional] support for Unicode, but it requires that you use the ICU library (http://www.ibm.com/software/globalization/icu/) to provide some of the basics. You can then correctly scan 16-bit Unicode code sequences and have surrogate pairs handled correctly, as well as have access to the Unicode property names in regexes, etc. However, the character type for 16-bit code units is either unsigned short or wchar_t depending upon the platform (this is a requirement for interoperability with ICU), so you may have to fiddle with your XMLCh setup to get everything working smoothly. See http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...
OK, I understand. But then I might need to make conversions again (depending on the platform). Maybe it would be better to offer a platform-independent way of handling characters, as in the third possibility you already mentioned.
Actually, probably not: Xerces can be built with ICU support (see http://xml.apache.org/xerces-c/build-misc.html#ICUPerl), so if you define XMLCh to be the same type as ICU's UChar data type, then no conversions are required.
3) You could define your own regex traits class for the character type that you're using. If you go down this road, make sure that you start with Boost-1.33, as it has better docs in this area as well as redesigned traits class requirements compared to 1.32.
Can I read more about it? Can you point me to a document which describes the traits class? What are the key points of this class? I tried to take a look at the sources, but it was hard to understand what is what, since there are few comments and a lot of typedefs which are hard to trace back.
The traits class requirements are here: http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/.... I should warn you that it's still quite a bit of work to support a new character type. John.