Re: [Boost-users] Xpressive: UTF-8 and diacritics

25 Aug 2008

Allan Odgaard wrote:
...
> It looks like the traits aspect of Xpressive is geared toward
> characters, so I assume that Xpressive is not directly usable with UTF-8
> encoded text, am I correct?

Correct.
...
> It might work by having the character type be a 32-bit integer and then
> using iterator adapters which expose the sequence as UCS-4 code points
> (after all, the sequence is “encoded”),

Right, and such iterator adaptors already exist in
boost/regex/pending/unicode_iterator.hpp. I've never tried to use them
with xpressive, however.
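
To make the idea of exposing a UTF-8 sequence as code points concrete, here is a minimal sketch. It is a hand-rolled stand-in with no Boost dependency, not the adaptor from unicode_iterator.hpp; the name decode_utf8 is hypothetical, it decodes eagerly into a std::u32string rather than lazily through an iterator, and it does only basic validation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UCS-4 code points.
// Assumption: basic validation only (no overlong or surrogate checks).
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Whether xpressive's traits can be instantiated for a 32-bit character type is untested here, as the reply above says, so this only illustrates the decoding half of the idea.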
...
> but that leads me to the next
> question: diacritics.
> For example, something like é in decomposed Unicode is two code points (e
> followed by a combining ´ mark), so even when the sequence is iterated
> as UCS-4 code points, a regexp of “.” will match just the e, not the
> actual (rendered) character.

I'm afraid your analysis is correct.
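
The problem can be illustrated without any regex engine. The sketch below groups a code point sequence into a base character plus any following combining marks, which is roughly what a “.” would have to consume to match the rendered character. The names is_combining_mark and split_clusters are hypothetical, and the mark test is an assumption that covers only the Combining Diacritical Marks block (U+0300 to U+036F); real grapheme cluster segmentation has many more rules.

```cpp
#include <string>
#include <vector>

// Assumption: only U+0300..U+036F counts as a combining mark here.
bool is_combining_mark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: split code points into clusters of one base
// character followed by zero or more combining marks.
std::vector<std::u32string> split_clusters(const std::u32string& cps)
{
    std::vector<std::u32string> clusters;
    for (char32_t cp : cps) {
        if (!clusters.empty() && is_combining_mark(cp))
            clusters.back().push_back(cp);               // attach mark to its base
        else
            clusters.push_back(std::u32string(1, cp));   // start a new cluster
    }
    return clusters;
}
```

For the decomposed é followed by x (U"e\u0301x"), the input holds three code points but split_clusters returns two clusters, the first containing both e and the combining acute accent. A matcher that advances one code point at a time sees only the e.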
...
> Since I was unable to find any discussion of this while searching for
> Xpressive, I am curious to hear if any thoughts have gone into these
> issues.

Xpressive is not Unicode-aware. It's been on my ToDo list forever, but
it's a huge job and I don't foresee myself having the time to devote to
this in the near future. If you could make a prioritized list of the
features you'd like, it would help.

-- 
Eric Niebler
BoostPro Computing
http://www.boostpro.com