Xpressive: UTF-8 and diacritics
It looks like the traits aspect of Xpressive is geared toward characters, so I assume that Xpressive is not directly usable with UTF-8 encoded text. Am I correct? It might work by having the character type be a 32-bit integer and then using iterator adapters which expose the sequence as UCS-4 code points (after all, the sequence is “encoded”), but that leads me to the next question: diacritics. For example, something like é in decomposed Unicode is two code points (e followed by a combining ´ mark), so even when the sequence is iterated as UCS-4 code points, a regexp of “.” will match just the e, not the actual (rendered) character. Since I was unable to find any discussion of this while searching for Xpressive, I am curious to hear whether any thought has gone into these issues.
Allan Odgaard wrote:
> It looks like the traits aspect of Xpressive is geared toward characters, so I assume that Xpressive is not directly usable with UTF-8 encoded text. Am I correct?
Correct.
> It might work by having the character type be a 32-bit integer and then using iterator adapters which expose the sequence as UCS-4 code points (after all, the sequence is “encoded”),
Right, and such iterator adaptors already exist in boost/regex/pending/unicode_iterator.hpp. I've never tried to use them with xpressive, however.
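To make the idea of "exposing the sequence as UCS-4 code points" concrete, here is a minimal hand-written sketch in standard C++. It is not the Boost adaptor mentioned above and has not been tried with xpressive; `decode_utf8` is a hypothetical helper, and it assumes mostly well-formed input (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost or xpressive): decode a UTF-8 byte
// string into UCS-4 code points. Assumption: input is mostly well formed;
// invalid lead bytes and truncated sequences become U+FFFD.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD;  // replacement character by default
        if (lead < 0x80)               { len = 1; cp = lead; }         // 0xxxxxxx
        else if ((lead >> 5) == 0x06)  { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E)  { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E)  { len = 4; cp = lead & 0x07; }  // 11110xxx

        if (i + len > bytes.size()) {  // truncated sequence at end of input
            out.push_back(0xFFFD);
            break;
        }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this sketch, the precomposed é (bytes C3 A9) decodes to the single code point U+00E9, while the decomposed form (bytes 65 CC 81) decodes to two code points, U+0065 and U+0301. An iterator adaptor does the same work lazily instead of building a vector.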
> but that leads me to the next question: diacritics.
> For example, something like é in decomposed Unicode is two code points (e followed by a combining ´ mark), so even when the sequence is iterated as UCS-4 code points, a regexp of “.” will match just the e, not the actual (rendered) character.
I'm afraid your analysis is correct.
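The mismatch between code points and rendered characters can be illustrated with a rough sketch in standard C++. The helper names `is_combining_mark` and `split_clusters` are hypothetical, and the sketch assumes that only the Combining Diacritical Marks block (U+0300 to U+036F) counts as combining, which is a crude approximation of real grapheme cluster segmentation. It is not something xpressive does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption for illustration: treat only the Combining Diacritical Marks
// block (U+0300..U+036F) as combining. Real segmentation needs far more.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: group each base code point with the combining marks
// that immediately follow it, so one cluster approximates one rendered
// character.
std::vector<std::u32string> split_clusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : text) {
        if (!clusters.empty() && is_combining_mark(cp)) {
            clusters.back().push_back(cp);               // attach mark to base
        } else {
            clusters.push_back(std::u32string(1, cp));   // start a new cluster
        }
    }
    return clusters;
}
```

For the decomposed sequence U+0065 U+0301 followed by x, the code point count is three but `split_clusters` yields two clusters. A "." that consumes one code point stops after the e, whereas a Unicode-aware "." would have to consume the whole first cluster.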
> Since I was unable to find any discussion of this while searching for Xpressive, I am curious to hear whether any thought has gone into these issues.
Xpressive is not Unicode-aware. It's been on my ToDo list forever, but it's a huge job and I don't foresee myself having the time to devote to this in the near future. If you could make a prioritized list of the features you'd like, it would help.
--
Eric Niebler
BoostPro Computing
http://www.boostpro.com