
On 01/21/2011 09:50 AM, Beman Dawes wrote:
... elision by patrick ....
IMO, any serious Unicode string proposal has to address UTF-8 strings, UTF-16 strings, UTF-32 strings, and probably UTF strings where the particular UTF encoding is established at runtime. Applications that deal with Asian languages, do a lot of random access, or would otherwise pay a performance or storage penalty will demand more than just UTF-8 strings. There might be other variants, too, such as a BMP-string. If a Unicode string library provides a strong design framework that is clearly articulated, then an initial implementation would only have to provide the most needed types: UTF-8 and UTF-16/BMP.
I really doubt any proposal will get taken very seriously if it only supports one of the UTF encodings.
+1, with the caveat that UTF-8 and UTF-32 are considered by many to be the most needed types, with UTF-16 considered evil. (This seems to be a Windows/non-Windows split. I like them all ;) So all three (four if you want to differentiate between fixed-width UTF-16/BMP, which is really UCS-2, and the full UTF-16) would be needed to keep people from saying that it doesn't fill their needs, so why did we bother.

A UTF string whose encoding is chosen at run time would carry a lot of extra code. Wouldn't a programmer know at compile time which encoding he wanted to use internally?

Patrick

p.s. Nice quick description of the differences between, and history of, UCS-2, UCS-4, UTF-8, UTF-16 and UTF-32 at http://en.wikipedia.org/wiki/Universal_Character_Set
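
p.p.s. To make the storage trade-off and the compile-time question a bit more concrete, here is a rough sketch of picking the encoding through a template parameter. Everything in it (utf8_tag, utf16_tag, utf32_tag, encoded_bytes) is a made-up name for illustration and not part of any actual proposal. It assumes C++0x constexpr and that the code point passed in is a valid Unicode scalar value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical encoding tags, for illustration only.
// Each one reports how many code units a code point needs
// and how wide one code unit is in bytes.
struct utf8_tag {
    static constexpr std::size_t unit_bytes = 1;
    static constexpr std::size_t units(std::uint32_t cp) {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
};

struct utf16_tag {
    static constexpr std::size_t unit_bytes = 2;
    // Code points outside the BMP need a surrogate pair (two units).
    static constexpr std::size_t units(std::uint32_t cp) {
        return cp < 0x10000 ? 1 : 2;
    }
};

struct utf32_tag {
    static constexpr std::size_t unit_bytes = 4;
    // Always one unit, which is what makes random access trivial.
    static constexpr std::size_t units(std::uint32_t) { return 1; }
};

// The encoding is fixed at compile time, so no run-time dispatch is needed.
template <class Encoding>
constexpr std::size_t encoded_bytes(std::uint32_t cp) {
    return Encoding::units(cp) * Encoding::unit_bytes;
}

// U+4E2D is a CJK ideograph inside the BMP.
static_assert(encoded_bytes<utf8_tag>(0x4E2D) == 3, "three bytes in UTF-8");
static_assert(encoded_bytes<utf16_tag>(0x4E2D) == 2, "two bytes in UTF-16");
static_assert(encoded_bytes<utf32_tag>(0x4E2D) == 4, "four bytes in UTF-32");
```

A BMP-only (UCS-2) tag would look like utf16_tag with units() always returning 1, which is the fixed-width variant mentioned above. A run-time-selected encoding would have to replace the template parameter with a stored tag and a branch on every operation, which is the extra code I was worried about.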