
Hi Erik, I thought I would jump in with some small observations:

--- Erik Wien <wien@start.no> wrote:
> Most users should not really care what kind of encoding and normalization form is used. They want to work with the string, not fiddle with its internal representation.
You do care about the representation when communicating with system APIs or writing data to networks or files. If, say, UTF-32 were the chosen representation, some programmers would be constantly converting to UTF-16 to call the system, and vice versa if UTF-16 is chosen where the system wants something else.
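To illustrate the kind of boundary-crossing cost I mean, roughly (the std::u32string storage and the to_utf16() helper here are purely hypothetical, and a real conversion would need error checking for invalid code points):

  #include <string>

  // Purely illustrative: internal storage is UTF-32, but the system
  // API (say, a Win32 "W" call) wants UTF-16, so every call across
  // that boundary pays for a conversion like this one.
  std::u16string to_utf16(const std::u32string& s)
  {
      std::u16string out;
      out.reserve(s.size());
      for (char32_t c : s) {
          if (c <= 0xFFFF) {
              out.push_back(static_cast<char16_t>(c));
          } else {
              // Encode the supplementary code point as a surrogate pair.
              c -= 0x10000;
              out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
              out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
          }
      }
      return out;
  }

With UTF-16 chosen instead, the same conversion just moves to the other side of the fence on platforms that want UTF-32 or UTF-8.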
> I would be surprised if any encoding other than UTF-16 ended up as the most efficient one. UTF-8 suffers from the large variation in code unit count per code point, and UTF-32 is just a waste of space for little performance gain for most users. You never know, though.
Here again, the performance measure could easily be dominated by conversions to the underlying system's encoding, depending on the application. Also, on some systems, particularly the Mac, the system not only has an encoding preference, it doesn't particularly like "wchar_t *" either. On the Mac, most text is a CFString (a handle of sorts to the text). On Windows, you encounter BSTRs as well.

In my own (admittedly not fully thought-out) work on this, I decided to make the default encoding platform specific, to eliminate the enormous number of conversions that might otherwise be needed. For example, on the Mac, I had an allocator-like strategy that allowed all unicode_strings to be backed by a CFString. There was a get_native() method that returned a platform-specific value (documented on a per-platform basis) to allow platform-specific code to work more optimally (rough sketch in the P.S. below).

Just some thoughts...

Best,
Don
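P.S. To make the CFString idea a bit more concrete, here is the rough shape of the thing. Only the CFString backing and the get_native() name come from my experiment; the constructor and everything else is just illustrative:

  #include <CoreFoundation/CoreFoundation.h>

  // Rough sketch only: the real thing needs copying, growth,
  // iteration, etc.  The point is just that the string's storage
  // *is* the platform's native text object, so handing it to a
  // Mac API costs nothing.
  class unicode_string
  {
  public:
      unicode_string(const UniChar* chars, CFIndex length)
          : ref_(CFStringCreateWithCharacters(kCFAllocatorDefault,
                                              chars, length)) {}

      ~unicode_string() { if (ref_) CFRelease(ref_); }

      // Platform-specific escape hatch: a CFStringRef here, perhaps
      // a BSTR in a Windows build, documented per platform.
      CFStringRef get_native() const { return ref_; }

  private:
      CFStringRef ref_;

      unicode_string(const unicode_string&);            // not copyable in
      unicode_string& operator=(const unicode_string&); // this sketch
  };

Platform-neutral code only ever sees unicode_string; the Mac-specific bits call get_native() and pass the CFStringRef straight through.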