Re: [Boost-users] regex with multi-byte characters

20 Jul 2005

      ...
...
There are several options:
1) Convert the characters on the fly to *wchar_t* and use boost::wregex:
it's a trivial widening of your 16-bit characters, so nothing will get 
lost.
You could probably use transform_iterator for such a task.
That's possible; the only problem is that *wchar_t* is not always 2 bytes 
long. At least that is what I read
on the Xerces-C Build Instructions page at 
http://xml.apache.org/xerces-c/build-misc.html (What
should I define XMLCh to be?). Here is an excerpt:
Hey, stop right there!  I said use an adapter, not a cast:

template <class Iterator>
struct my_adapter
{
  my_adapter(Iterator p) : m_position(p){}
  // widen each character to wchar_t as it is read
  wchar_t operator*()const { return *m_position; }
  my_adapter& operator++() { ++m_position; return *this; }

  // other members to make this a valid iterator go here...

private:
  Iterator m_position;  // the underlying iterator being adapted
};

Then pass my_adapter objects to the regex algorithms, rather than an 
XMLCh*, for example:

bool is_regex_present(XMLCh const* p, int len, boost::wregex const&e)
{
   my_adapter<XMLCh const*> i(p), j(p+len);
   return boost::regex_search(i, j, e);
}
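
For reference, the "other members" might be filled in roughly as follows.
This is only a sketch under a few assumptions: XMLCh is stood in for by
char16_t, the name widening_iterator is made up for illustration, and
std::wregex from the standard library is used in place of boost::wregex so
that the example has no external dependencies. The regex algorithms need at
least a bidirectional iterator, so the adapter supplies the usual typedefs,
a default constructor, decrement, and comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>

// Assumption: stand-in for Xerces' XMLCh, taken here to be a 16-bit type.
typedef char16_t XMLCh;

// Widening is only lossless if wchar_t can hold every XMLCh value.
static_assert(sizeof(wchar_t) >= sizeof(XMLCh), "wchar_t too narrow for XMLCh");

// Hypothetical adapter: reads through the wrapped iterator and returns
// each element converted to wchar_t. No memory is reinterpreted or cast.
template <class Iterator>
class widening_iterator
{
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef wchar_t                         value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const wchar_t*                  pointer;
    typedef wchar_t                         reference; // returned by value

    widening_iterator() : m_position() {}
    explicit widening_iterator(Iterator p) : m_position(p) {}

    wchar_t operator*() const { return static_cast<wchar_t>(*m_position); }

    widening_iterator& operator++() { ++m_position; return *this; }
    widening_iterator operator++(int)
    { widening_iterator tmp(*this); ++m_position; return tmp; }

    widening_iterator& operator--() { --m_position; return *this; }
    widening_iterator operator--(int)
    { widening_iterator tmp(*this); --m_position; return tmp; }

    friend bool operator==(const widening_iterator& a, const widening_iterator& b)
    { return a.m_position == b.m_position; }
    friend bool operator!=(const widening_iterator& a, const widening_iterator& b)
    { return !(a == b); }

private:
    Iterator m_position; // the underlying iterator being adapted
};

// Same shape as the function above, but using std::wregex.
bool is_regex_present(const XMLCh* p, std::size_t len, const std::wregex& e)
{
    assert(p != nullptr);
    widening_iterator<const XMLCh*> first(p), last(p + len);
    return std::regex_search(first, last, e);
}
```

Because the conversion happens one element at a time in operator*, it does
not matter whether wchar_t is 2 or 4 bytes on a given platform, which is the
point of using an adapter rather than a pointer cast. The same iterator type
should in principle be usable with boost::regex_search and a boost::wregex,
though that has not been checked here.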
...
...
2) In Boost 1.33 there will be more [optional] support for Unicode, but it
requires that you use the ICU library
(http://www.ibm.com/software/globalization/icu/) to provide some of the
basics. You can then correctly scan 16-bit Unicode code sequences, and have
surrogate pairs correctly handled, as well as have access to the Unicode
property names in regexes etc.  However the character type for 16-bit code
points is either unsigned short or wchar_t depending upon the platform (this
is a requirement for interoperability with ICU), so you may have to fiddle
with your XMLCh setup to get everything working smoothly.  See
http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/...
OK, I understand. But then I may need to do conversions again 
(depending on the platform).
Maybe it would be better to have a platform-independent way of handling 
characters, such as the third possibility you have already mentioned.
Actually probably not: Xerces can be built with ICU support (see 
http://xml.apache.org/xerces-c/build-misc.html#ICUPerl), so if you define 
XMLCh to be the same type as ICU's UChar data type, then no conversions are 
required.
...
...
3) You could define your own regex traits class for the character type that
you're using: if you go down this road then make sure that you start with
Boost-1.33 as it has better docs in this area, as well as redesigned traits
class requirements compared to 1.32.
Can I read more about it? Can you point me to a document which describes 
the traits class? What
are the key points of this class? I tried to take a look at the 
sources, but it was hard
to understand what is what, since there are not many comments and a lot 
of typedefs which are
hard to trace back.
The traits class requirements are here: 
http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/....

I should warn you it's still quite a bit of work to support a new character 
type.

John.

John Maddock