Re: [boost] [General] Always treat std::strings as UTF-8

18 Jan 2011

      ----- Original Message ----
...
From: Peter Dimov <pdimov@pdimov.com>
Dave Abrahams wrote:
...
At Tue, 18 Jan 2011 13:27:29 +0200,
Peter Dimov wrote:
...
Dave Abrahams wrote:
...
I think the reason to use separate types is to provide a type-safety
barrier between your functions that operate on utf-8 and system or
3rd-party interfaces that don't or may not. In principle, that should
force you to think about encoding and decoding at all the places where
it may be needed, and should allow you to code naturally and with
confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is
operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference
between taking/returning a string and taking/returning an
utf8_t is to force people to write an explicit conversion.
This penalizes people who are already in UTF-8 land because
it forces them to use utf8_t( s, encoding_utf8 ) and
s.c_str( encoding_utf8 ) everywhere, without any gain or
need. It's true that for people whose strings are not UTF-8,
forcing those explicit conversions may be considered a good
thing. So it depends on what your goals are. Do you want to
promote the use of UTF-8 for all strings, or do you want to
enable people to remain in non-UTF-8-land?
+1
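
To make the trade-off above concrete, here is a minimal sketch of what
such a wrapper might look like. The names utf8_t and encoding_utf8 come
from this thread; everything else (is_valid_utf8, utf8_byte_count, and
the choice to throw on malformed input) is my own assumption for
illustration, not a proposed Boost interface.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical tag type naming the encoding of the bytes being handed over.
struct encoding_utf8_t {};
const encoding_utf8_t encoding_utf8 = encoding_utf8_t();

// Hypothetical helper: true if s is well-formed UTF-8. Rejects overlong
// forms, surrogate code points, and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;           // stray continuation or invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// Sketch of the separate string type. Assumption: the invariant is
// "always holds valid UTF-8", enforced at construction.
class utf8_t {
public:
    utf8_t(const std::string& bytes, encoding_utf8_t) : bytes_(bytes) {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("utf8_t: input is not valid UTF-8");
    }
    const char* c_str(encoding_utf8_t) const { return bytes_.c_str(); }
private:
    std::string bytes_;
};

// Hypothetical function living in "utf8-land": it only accepts utf8_t.
inline std::size_t utf8_byte_count(const utf8_t& s) {
    return std::strlen(s.c_str(encoding_utf8));
}
```

A caller whose std::string already holds UTF-8 still has to write
utf8_byte_count( utf8_t( s, encoding_utf8 ) ) at every boundary, which
is the cost being discussed; a caller holding some other encoding is
stopped by the exception, which is the benefit.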
...
There's also the additional consideration of utf8_t's invariant. Does it
require valid UTF-8? One possible specification of fopen might be:
FILE* fopen( char const* name, char const* mode );
The 'name' argument must be UTF-8 on Unicode-aware platforms and
file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an
arbitrary byte sequence on encoding-agnostic platforms and file
systems such as Linux and Solaris, but UTF-8 is recommended.
+1 As well.
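
A rough sketch of how an fopen with that specification could be
implemented, assuming a wrapper named fopen_utf8 (a hypothetical name,
as is the widen_utf8 helper). On Windows it converts the UTF-8 name to
UTF-16 with MultiByteToWideChar and calls _wfopen; elsewhere it passes
the bytes through to std::fopen untouched.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>
#endif

#ifdef _WIN32
// Hypothetical helper: UTF-8 -> UTF-16. Returns an empty string if the
// input is empty or is not valid UTF-8.
static std::wstring widen_utf8(const char* s) {
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, &w[0], n);
    w.resize(static_cast<std::size_t>(n) - 1);  // drop the terminating NUL
    return w;
}
#endif

// Hypothetical wrapper: 'name' is UTF-8 where the platform is
// Unicode-aware, and an uninterpreted byte sequence otherwise.
FILE* fopen_utf8(const char* name, const char* mode) {
#ifdef _WIN32
    std::wstring wname = widen_utf8(name);
    std::wstring wmode = widen_utf8(mode);
    if (wname.empty() || wmode.empty()) return NULL;  // empty or invalid UTF-8
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // The bytes go to the system as-is; no validation is performed here.
    return std::fopen(name, mode);
#endif
}
```

Note that the caller keeps passing plain char const* either way; no
separate string type is needed at the call site.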

Also, I would like to add a small note about a general design principle
of C++ as a language: you don't pay for what you don't use.

And 95% of all uses of strings are encoding-agnostic!

Artyom