
Scott McMurray wrote:
> Suppose I have "difficult" with the "ffi" ligature codepoint, and I do a perl-style split on /i/.
There is no way for "i" to match as being part of that string unless you replace the "ffi" ligature with the letters "f", "f", "i". That operation is known as a compatibility decomposition (and will be provided by the library in due time, of course, along with compatibility composition, canonical decomposition, canonical composition, and the normalization forms that are defined in terms of them). You could choose to apply split with arguments normalized according to normalization form KC, which allows comparison independently of formatting considerations. But that also means 5 will match ⁵. Alternatively, you could decide that 5 should match ⁵ but ⁵ should not match 5, in which case the pattern should be in NFC and the string to search should be in NFKC.
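A minimal sketch of that idea, using Python's standard unicodedata and re modules purely for illustration (the library discussed here is not Python, so none of the calls below are its API):

    import re
    import unicodedata

    s = "di\ufb03cult"   # "difficult" written with the U+FB03 "ffi" ligature

    # Splitting on /i/ directly cannot see the "i" hidden inside the ligature.
    print(re.split("i", s))                                  # ['d', 'ﬃcult']

    # Compatibility decomposition (here via NFKC) replaces the ligature with
    # the plain letters "f", "f", "i", so the split now finds both i's.
    print(re.split("i", unicodedata.normalize("NFKC", s)))   # ['d', 'ff', 'cult']

    # The same normalization also makes 5 equal to U+2075 SUPERSCRIPT FIVE.
    print(unicodedata.normalize("NFKC", "\u2075") == "5")    # True

    # Asymmetric matching: keep the pattern in NFC, normalize only the text
    # to NFKC, so "5" finds "⁵" but "⁵" does not find "5".
    print(unicodedata.normalize("NFC", "5") in
          unicodedata.normalize("NFKC", "\u2075"))           # True
    print(unicodedata.normalize("NFC", "\u2075") in
          unicodedata.normalize("NFKC", "5"))                # False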
> I should probably be getting "d", the "ff" ligature codepoint, and "cult". I know if I tried to code that by hand in every application I'd miss all kinds of evil corner cases like that.
Unfortunately, Unicode is made of a lot of corner cases, and there is no way around it without understanding it.