
Hi Erik, I thought I would jump in with some small observations:

--- Erik Wien <wien@start.no> wrote:
> Most users should not really care what kind of encoding and normalization form is used. They want to work with the string, not fiddle with its internal representation.
You do care about the representation when communicating with system APIs or writing data to networks or files. If, say, UTF-32 were the chosen representation, some programmers would be constantly converting to UTF-16 to call the system, and vice versa if UTF-16 is chosen where the system wants something else.
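To illustrate the kind of boundary-crossing cost I mean, roughly (the std::u32string storage and the to_utf16() helper here are purely hypothetical, and a real conversion would need error checking for invalid code points):

  #include <string>

  // Purely illustrative: internal storage is UTF-32, but the system
  // API (say, a Win32 "W" call) wants UTF-16, so every call across
  // that boundary pays for a conversion like this one.
  std::u16string to_utf16(const std::u32string& s)
  {
      std::u16string out;
      out.reserve(s.size());
      for (char32_t c : s) {
          if (c <= 0xFFFF) {
              out.push_back(static_cast<char16_t>(c));
          } else {
              // Encode the supplementary code point as a surrogate pair.
              c -= 0x10000;
              out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
              out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
          }
      }
      return out;
  }

With UTF-16 chosen instead, the same conversion just moves to the other side of the fence on platforms that want UTF-32 or UTF-8.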
> I would be surprised if any encoding other than UTF-16 ended up as the most efficient one. UTF-8 suffers from the large variation in code unit count per code point, and UTF-32 is just a waste of space for little performance gain for most users. You never know, though.
Here again, the performance measure could easily be dominated by conversions to the underlying system's encoding, depending on the application. Also, on some systems, particularly the Mac, the system not only has an encoding preference, it doesn't particularly like "wchar_t *" either. On the Mac, most text is a CFString (a handle of sorts to the text). On Windows, you encounter BSTRs as well.

In my own (admittedly not fully thought-out) work on this, I decided to make the default encoding platform specific, to eliminate the enormous number of conversions that might otherwise be needed. For example, on the Mac, I had an allocator-like strategy that allowed all unicode_strings to be backed by a CFString. There was a get_native() method that returned a platform-specific value (documented on a per-platform basis) to allow platform-specific code to work more optimally (rough sketch in the P.S. below).

Just some thoughts...

Best,
Don
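P.S. To make the CFString idea a bit more concrete, here is the rough shape of the thing. Only the CFString backing and the get_native() name come from my experiment; the constructor and everything else is just illustrative:

  #include <CoreFoundation/CoreFoundation.h>

  // Rough sketch only: the real thing needs copying, growth,
  // iteration, etc.  The point is just that the string's storage
  // *is* the platform's native text object, so handing it to a
  // Mac API costs nothing.
  class unicode_string
  {
  public:
      unicode_string(const UniChar* chars, CFIndex length)
          : ref_(CFStringCreateWithCharacters(kCFAllocatorDefault,
                                              chars, length)) {}

      ~unicode_string() { if (ref_) CFRelease(ref_); }

      // Platform-specific escape hatch: a CFStringRef here, perhaps
      // a BSTR in a Windows build, documented per platform.
      CFStringRef get_native() const { return ref_; }

  private:
      CFStringRef ref_;

      unicode_string(const unicode_string&);            // not copyable in
      unicode_string& operator=(const unicode_string&); // this sketch
  };

Platform-neutral code only ever sees unicode_string; the Mac-specific bits call get_native() and pass the CFStringRef straight through.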