[Boost-users] Xpressive: UTF-8 and diacritics

25 Aug 2008

It looks like Xpressive's traits mechanism treats each element of the
sequence as one character, so I assume that Xpressive is not directly
usable with UTF-8 encoded text. Am I correct?

It might work to make the character type a 32-bit integer and use
iterator adapters that expose the sequence as UCS-4 code points (after
all, the UTF-8 bytes are just an encoding of them), but that leads me
to the next question: diacritics.
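
To make the code-point idea concrete, here is a minimal sketch of the
decoding step, using only the standard library. The name decode_utf8 is
hypothetical and not part of Xpressive. It decodes eagerly into a
std::u32string rather than acting as a lazy iterator adapter, and it
assumes well-formed input. Whether Xpressive's traits would accept a
32-bit character type is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an Xpressive API): decode a UTF-8 byte string
// into UCS-4 code points. Assumes well-formed UTF-8; the only check is
// that a multi-byte sequence is not cut off at the end of the input.
inline std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        assert(i + extra < bytes.size() && "truncated UTF-8 sequence");
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

A regex engine would then iterate over the resulting std::u32string
instead of the raw bytes, so that one element corresponds to one code
point.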

For example, something like é in decomposed Unicode is two code points
(e followed by a combining acute accent, U+0301), so even when the
sequence is iterated as UCS-4 code points, a regexp of “.” will match
just the e, not the actual (rendered) character.
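
As a rough illustration of what “.” would have to consume to match the
rendered character, here is a small sketch. The names is_combining_mark
and cluster_end are hypothetical. It makes the simplifying assumption
that only the Combining Diacritical Marks block (U+0300 to U+036F)
counts as combining; proper segmentation would follow the Unicode
grapheme cluster rules, which cover much more.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: treat only U+0300..U+036F as combining marks.
inline bool is_combining_mark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: given the index of a code point, return the index
// one past the end of the cluster that starts there, i.e. the code point
// at pos plus any combining marks that directly follow it.
inline std::size_t cluster_end(const std::u32string& s, std::size_t pos)
{
    assert(pos < s.size());
    std::size_t end = pos + 1;  // the base code point itself
    while (end < s.size() && is_combining_mark(s[end]))
        ++end;
    return end;
}
```

For the decomposed sequence e, U+0301, x, cluster_end at index 0 returns
2, so a cluster-aware “.” would consume both code points, whereas a
code-point-based “.” stops after the e.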

Since I was unable to find any discussion of this while searching for  
Xpressive, I am curious to hear if any thoughts have gone into these  
issues.

Allan Odgaard