Re: [Unicode strings] We're off

Hi Erik,

I thought I would jump in with some small observations:

--- Erik Wien <wien@start.no> wrote:
> Most users should not really care what kind of encoding and normalization form is used. They want to work with the string, not fiddle with its internal representation.
You do care about the representation when communicating with system APIs or writing data to networks or files. For example, if UTF-32 were the chosen representation, some programmers would be constantly converting to UTF-16 to call the system, and vice versa if UTF-16 were chosen where the system wants something else.
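To make that churn concrete, here is a rough sketch of the UTF-32-to-UTF-16 conversion that would sit in front of every such system call. C++11 character types and the function name are used purely for illustration:

    #include <string>

    // Rough sketch: encode a UTF-32 string as UTF-16, emitting surrogate
    // pairs for code points above U+FFFF. Assumes the input holds valid
    // Unicode scalar values; "to_utf16" is an illustrative name only.
    std::u16string to_utf16(const std::u32string& in)
    {
        std::u16string out;
        out.reserve(in.size());
        for (char32_t cp : in)
        {
            if (cp <= 0xFFFF)
                out.push_back(static_cast<char16_t>(cp));
            else
            {
                cp -= 0x10000;
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
            }
        }
        return out;
    }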
> I would be surprised if any encoding other than UTF-16 would end up as the most efficient one. UTF-8 suffers from the big variation in code unit count for any given code point, and UTF-32 is just a waste of space for little performance gain for most users. You never know, though.
Here again, the performance measure could easily be dominated by conversions to the underlying system's encoding, depending on the application.

Also, on some systems, particularly the Mac, the system not only has an encoding preference, it doesn't particularly like "wchar_t *" either. On the Mac, most text is a CFString (a handle of sorts to the text). On Windows, you encounter BSTRs as well.

In my not-nearly-so-thought-out work on this, I decided to make the default encoding platform-specific to eliminate the enormous number of conversions that might otherwise be needed. For example, on the Mac, I had an allocator-like strategy that allowed all unicode_strings to be backed by a CFString. There was a get_native() method that returned a platform-specific value (documented on a per-platform basis) to allow platform-specific code to work more optimally.

Just some thoughts...

Best,
Don
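P.S. For concreteness, here is a stripped-down sketch of the CFString-backed idea. The class shape and method names are only illustrative; the CoreFoundation calls are real API:

    #include <CoreFoundation/CoreFoundation.h>

    // Stripped-down sketch of a CFString-backed string: the text lives in
    // a CFString, and get_native() hands the platform value straight to
    // platform-specific code, avoiding any conversion or copy.
    class unicode_string
    {
    public:
        unicode_string(const UniChar* chars, CFIndex length)
            : ref_(CFStringCreateWithCharacters(kCFAllocatorDefault, chars, length)) {}

        unicode_string(const unicode_string& other)
            : ref_((CFStringRef)CFRetain(other.ref_)) {}

        ~unicode_string() { CFRelease(ref_); }

        // Documented per platform; on the Mac the native value is the
        // CFStringRef itself.
        CFStringRef get_native() const { return ref_; }

        CFIndex size() const { return CFStringGetLength(ref_); }

    private:
        unicode_string& operator=(const unicode_string&); // omitted in this sketch
        CFStringRef ref_;
    };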

Don G wrote:
> Hi Erik,
> I thought I would jump in with some small observations:
That's what I'm here for! :)
> You do care about the representation when communicating with system APIs or writing data to networks or files. For example, if UTF-32 were the chosen representation, some programmers would be constantly converting to UTF-16 to call the system, and vice versa if UTF-16 were chosen where the system wants something else.
Yes, this is correct, but conversion to/from the native string type (usually UTF-16) should be abstracted by the library, through some get_native_string() function in the string class. The casual user should not need to do this him- or herself. That's how I feel, anyway.
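Roughly, I picture the user's side of it like the sketch below. The class is just a stand-in, and the conversion is BMP-only to keep it short; a real implementation would emit surrogate pairs for code points above U+FFFF:

    #include <string>

    // Stand-in sketch: the user works with unicode_string and never names
    // an encoding; get_native_string() converts on demand. The internal
    // UTF-32 storage is chosen here only for illustration.
    class unicode_string
    {
    public:
        explicit unicode_string(const std::u32string& text) : text_(text) {}

        // On a UTF-16 platform, convert at the boundary. BMP-only for
        // brevity; a real version would emit surrogate pairs.
        std::u16string get_native_string() const
        {
            return std::u16string(text_.begin(), text_.end());
        }

    private:
        std::u32string text_;
    };

    // Usage: only the system-API call ever sees the native form, e.g.
    //   some_system_call(s.get_native_string().c_str());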
> Here again, the performance measure could easily be dominated by conversions to the underlying system's encoding, depending on the application.
Quite true. We will have to look into how big a problem this actually is.
> Also, on some systems, particularly the Mac, the system not only has an encoding preference, it doesn't particularly like "wchar_t *" either. On the Mac, most text is a CFString (a handle of sorts to the text). On Windows, you encounter BSTRs as well.
Yep. The idea is that this would all be wrapped in the get_native_string() function mentioned above. It will of course take some work to implement that function for every platform in use today, but I think it will be worth it.
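In outline, the per-platform part could be as small as a typedef and one function definition per platform. All names below are illustrative, not a committed interface:

    #if defined(__APPLE__)
        #include <CoreFoundation/CoreFoundation.h>
    #else
        #include <string>
    #endif

    // Illustrative outline only: the native type is selected at compile
    // time, and get_native_string() is defined once per platform.
    class unicode_string
    {
    public:
    #if defined(_WIN32)
        typedef std::wstring native_string_type;   // UTF-16 for Win32 "W" APIs / BSTR bridges
    #elif defined(__APPLE__)
        typedef CFStringRef native_string_type;    // CoreFoundation string handle
    #else
        typedef std::string native_string_type;    // e.g. UTF-8 bytes for POSIX-style APIs
    #endif

        // Converts from the internal representation only when it differs
        // from the native one; implemented separately for each platform.
        native_string_type get_native_string() const;

        // ... the rest of the interface is platform-independent ...
    };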
> In my not-nearly-so-thought-out work on this, I decided to make the default encoding platform-specific to eliminate the enormous number of conversions that might otherwise be needed. For example, on the Mac, I had an allocator-like strategy that allowed all unicode_strings to be backed by a CFString. There was a get_native() method that returned a platform-specific value (documented on a per-platform basis) to allow platform-specific code to work more optimally.
Yep. That's basically what the library does already, except it doesn't use the native type behind the scenes.
> Just some thoughts...
Well appreciated.
> Best, Don
- Erik