Re: [boost] [General] Always treat std::strings as UTF-8

18 Jan 2011

Dave Abrahams wrote:
> At Tue, 18 Jan 2011 13:27:29 +0200,
> Peter Dimov wrote:
> > Dave Abrahams wrote:
> > > I think the reason to use separate types is to provide a type-safety
> > > barrier between your functions that operate on utf-8 and system or
> > > 3rd-party interfaces that don't or may not.  In principle, that should
> > > force you to think about encoding and decoding at all the places where
> > > it may be needed, and should allow you to code naturally and with
> > > confidence where everybody is operating in utf8-land.
> > Yes, in principle. It isn't terribly necessary if everybody is
> > operating in UTF-8 land though.
> But they won't be.  That's not today's reality.

They should be, though. As a practical matter, the difference between 
taking/returning a string and taking/returning a utf8_t is to force people
to write an explicit conversion. This penalizes people who are already in 
UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and 
s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true 
that for people whose strings are not UTF-8, forcing those explicit 
conversions may be considered a good thing. So it depends on what your goals 
are. Do you want to promote the use of UTF-8 for all strings, or do you want 
to enable people to remain in non-UTF-8-land?
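
To make the trade-off concrete, here is a minimal sketch of the two interface
styles. All of the names in it (encoding_utf8_t, utf8_t, greet_plain,
greet_wrapped) are hypothetical and are not taken from any proposed Boost
interface; the wrapper also does no validation, which is an assumption of the
sketch and not a claim about what a real utf8_t would do.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical tag type used to spell out "these bytes are UTF-8".
struct encoding_utf8_t {};
constexpr encoding_utf8_t encoding_utf8{};

// Hypothetical wrapper: a std::string that is only reachable through
// explicit, encoding-tagged conversions. No validation is performed.
class utf8_t {
public:
    utf8_t(std::string s, encoding_utf8_t) : bytes_(std::move(s)) {}
    char const* c_str(encoding_utf8_t) const { return bytes_.c_str(); }

private:
    std::string bytes_;
};

// Interface written against plain std::string, assumed to hold UTF-8.
inline std::string greet_plain(std::string const& name) {
    return "Hello, " + name;
}

// The same interface written against the wrapper type.
inline utf8_t greet_wrapped(utf8_t const& name) {
    return utf8_t(std::string("Hello, ") + name.c_str(encoding_utf8),
                  encoding_utf8);
}

// A caller whose strings are already UTF-8 pays for a conversion on the
// way in and another on the way out, and ends up with the same bytes.
inline void usage_example() {
    std::string s = "caf\xC3\xA9";  // "café" encoded as UTF-8
    std::string a = greet_plain(s);
    utf8_t b = greet_wrapped(utf8_t(s, encoding_utf8));
    assert(a == b.c_str(encoding_utf8));
}
```

For a caller whose strings come from a legacy code page, the same two explicit
conversions are exactly the places where a transcoding step would have to go.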
...
...
> > It's a bit like defining a separate integer type for nonnegative
> > ints for type safety reasons - useful in theory, but nobody does it.
> I refer you to Boost.Units

I'm sure that there are many libraries that use units in their interfaces, I 
just haven't heard of them. :-)

There's also the additional consideration of utf8_t's invariant. Does it 
require valid UTF-8? One possible specification of fopen might be:

FILE* fopen( char const* name, char const* mode );

The 'name' argument must be UTF-8 on Unicode-aware platforms and file 
systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte 
sequence on encoding-agnostic platforms and file systems such as Linux and 
Solaris, but UTF-8 is recommended.

On Windows, the UTF-8 sequence may be invalid due to the presence of UTF-16 
surrogates encoded as single code points, but such use is discouraged.
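
A rough sketch of how such an fopen could be layered on today's APIs follows.
The name utf8_fopen and the helper widen_utf8 are hypothetical. On Windows the
sketch converts with MultiByteToWideChar and calls _wfopen; elsewhere it hands
the bytes to fopen unchanged. Note that it passes MB_ERR_INVALID_CHARS, so it
rejects invalid UTF-8 and does not attempt to handle the encoded-surrogate case
mentioned above.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _WIN32
#include <vector>
#include <windows.h>

// Hypothetical helper: UTF-8 -> UTF-16, failing on invalid input.
inline bool widen_utf8(char const* s, std::vector<wchar_t>& out) {
    // First call asks for the required length (including the terminator).
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1,
                                nullptr, 0);
    if (n <= 0) return false;
    out.resize(static_cast<std::size_t>(n));
    return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1,
                               out.data(), n) > 0;
}
#endif

// Hypothetical fopen that treats 'name' as UTF-8 on Windows and as an
// opaque byte sequence everywhere else.
inline std::FILE* utf8_fopen(char const* name, char const* mode) {
    assert(name != nullptr && mode != nullptr);
#ifdef _WIN32
    std::vector<wchar_t> wname, wmode;
    if (!widen_utf8(name, wname) || !widen_utf8(mode, wmode)) {
        return nullptr;  // invalid UTF-8 is rejected in this sketch
    }
    return _wfopen(wname.data(), wmode.data());
#else
    // Encoding-agnostic file systems: pass the bytes through untouched.
    return std::fopen(name, mode);
#endif
}
```

Callers in UTF-8 land would then write utf8_fopen( s.c_str(), "r" ) with no
conversion of their own on any platform.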