
Scott McMurray wrote:
> Suppose I have "difficult" with the "ffi" ligature codepoint, and I do a perl-style split on /i/.
There is no way for "i" to match as being part of that string unless you replace the "ffi" ligature with the letters "f", "f", "i". That operation is known as a compatibility decomposition (and will be provided by the library in due time, of course, along with compatibility composition, canonical decomposition, canonical composition, and the normalization forms that are defined in terms of them). You could choose to apply split with arguments normalized according to normalization form KC, which allows comparison independently of formatting considerations. But that also means 5 will match ⁵. Alternatively, you could decide that 5 should match ⁵ but ⁵ should not match 5, in which case the pattern should be in NFC and the string to search should be in NFKC.
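A minimal sketch of that idea, using Python's standard unicodedata and re modules purely for illustration (the library discussed here is not Python, so none of the calls below are its API):

    import re
    import unicodedata

    s = "di\ufb03cult"   # "difficult" written with the U+FB03 "ffi" ligature

    # Splitting on /i/ directly cannot see the "i" hidden inside the ligature.
    print(re.split("i", s))                                  # ['d', 'ﬃcult']

    # Compatibility decomposition (here via NFKC) replaces the ligature with
    # the plain letters "f", "f", "i", so the split now finds both i's.
    print(re.split("i", unicodedata.normalize("NFKC", s)))   # ['d', 'ff', 'cult']

    # The same normalization also makes 5 equal to U+2075 SUPERSCRIPT FIVE.
    print(unicodedata.normalize("NFKC", "\u2075") == "5")    # True

    # Asymmetric matching: keep the pattern in NFC, normalize only the text
    # to NFKC, so "5" finds "⁵" but "⁵" does not find "5".
    print(unicodedata.normalize("NFC", "5") in
          unicodedata.normalize("NFKC", "\u2075"))           # True
    print(unicodedata.normalize("NFC", "\u2075") in
          unicodedata.normalize("NFKC", "5"))                # False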
> I should probably be getting "d", the "ff" ligature codepoint, and "cult". I know if I tried to code that by hand in every application I'd miss all kinds of evil corner cases like that.
Unfortunately, Unicode is made of a lot of corner cases, and there is no way around it without understanding it.