Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World

29 Jan 2012

On Sun, Jan 29, 2012 at 02:49, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/28/2012 08:48 PM, Yakov Galka wrote:
The user can just write
cout << u8"您好世界";
Even better is:
cout << "您好世界";
which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
and needs some trickery on others (MSVC: save as UTF-8 without BOM).
No, that's just wrong.
That's not the model that C++ uses. By not storing it with the BOM, you're
essentially tricking MSVC into believing it is ANSI (windows-1252 on
western systems), and thus avoiding the conversion from the source
character set to the execution character set, since those happen to be
the same.
The way a C++ compiler is supposed to work is that all of your source is
in the source character set, regardless of the type of string literal you
use.
Then the compiler will convert your source character set to the execution
character set for narrow string literals, to the wide execution character
set for wide string literals, to UTF-8 for u8 literals, etc.
Sorry for not being clear enough. I agree and I've not said otherwise. The
second 'cout' line *is* a hack. I admit it won't work if you mix such
string literals with wide literals or external identifiers containing
Unicode. The intent was to show how it could be done if the effort was
focused on making narrow string literals "Unicode compatible".
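
To make the conversion model discussed above concrete, here is a minimal
sketch. It assumes a C++11 (or later) compiler; the names code_units,
utf8_units, wide_units, narrow_units and verify_utf8_bytes are made up for
this illustration. The characters are spelled as universal character names
so the file stays pure ASCII and the result does not depend on how the
compiler guesses the source encoding.

```cpp
#include <cassert>
#include <cstddef>

// Counts the code units in a string literal, excluding the terminating NUL.
// Works for char, wchar_t and (since C++20) char8_t literals alike.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// \u60A8\u597D\u4E16\u754C spells the four characters of the greeting.

// u8 literal: always UTF-8, three bytes per character for these code points.
constexpr std::size_t utf8_units = code_units(u8"\u60A8\u597D\u4E16\u754C");

// Wide literal: one wchar_t per character, because all four are in the BMP
// (true for both 16-bit and 32-bit wchar_t).
constexpr std::size_t wide_units = code_units(L"\u60A8\u597D\u4E16\u754C");

// Narrow literal: converted to the execution character set, so the count
// is implementation-dependent and deliberately not asserted here.
constexpr std::size_t narrow_units = code_units("\u60A8\u597D\u4E16\u754C");

static_assert(utf8_units == 12, "u8 literals are UTF-8 by definition");
static_assert(wide_units == 4, "one wide code unit per BMP character");

// Checks the actual bytes of U+60A8 in a u8 literal: E6 82 A8.
inline void verify_utf8_bytes() {
    const auto& s = u8"\u60A8";
    assert(static_cast<unsigned char>(s[0]) == 0xE6);
    assert(static_cast<unsigned char>(s[1]) == 0x82);
    assert(static_cast<unsigned char>(s[2]) == 0xA8);
}
```

Only the u8 and wide counts are fixed by the rules described above. The
narrow count is the one that depends on the execution character set, which
is why the plain "..." variant counts as a hack.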

[...] What probably should be done is that compilers should be compelled to
support UTF-8 as the source character set in a unified way.
Yes, it could be nice. It would solve half the problem, which is a huge
step forward given the current mood of the committee. However, embedding
Unicode string literals in source code is still not something you routinely
do. Internationalization usually uses external string tables.

I once asked volodya if it were feasible to implement this in the build
system (add a BOM for MSVC), but he didn't seem to think it was worth it.
I don't understand. MSVC already understands BOM, and GCC has already been
fixed according to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415 (didn't test it).
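
For reference, the "add a BOM" step mentioned above amounts to very little
code. The following is only a sketch of the idea, not something taken from
any build system; has_utf8_bom and with_utf8_bom are hypothetical helper
names, and the input is assumed to be the raw bytes of a source file that
is already valid UTF-8.

```cpp
#include <cassert>
#include <string>

// The UTF-8 BOM is the three-byte encoding of U+FEFF: EF BB BF.
inline bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Returns the contents with a BOM prepended, unless one is already there.
inline std::string with_utf8_bom(const std::string& bytes) {
    if (has_utf8_bom(bytes)) {
        return bytes;
    }
    return std::string("\xEF\xBB\xBF") + bytes;
}

// Small usage check: the operation adds three bytes once and is idempotent.
inline void check_bom_roundtrip() {
    const std::string plain = "int x;";
    const std::string marked = with_utf8_bom(plain);
    assert(marked.size() == plain.size() + 3);
    assert(with_utf8_bom(marked) == marked);
}
```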

On Sun, Jan 29, 2012 at 03:12, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I think you should consider the points being made in N3334.
While that proposal is in my opinion not good enough, it raises an
important issue that is often present with std::string-based or similar
designs.
A function that takes a std::string, or a boost::filesystem::path for that
matter, necessarily causes the caller to copy the data into a
heap-allocated buffer, even if there is no need to.
Use of the range concept would solve that issue, but then that requires
making the function a template. A type-erased range would be possible, but
that has significant performance overhead.
A string_ref or path_ref is maybe the lesser evil.
+1
This topic has been raised here in program-options context:
http://boost.2283326.n4.nabble.com/program-options-Some-methods-take-const-c...
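
To illustrate the string_ref idea, here is a rough sketch of a non-owning
view. It is a hypothetical minimal class written for this message, not the
interface proposed in N3334, and count_separators is just a made-up example
function. The point is that a non-template function can accept both a string
literal and a std::string without copying the characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical non-owning view over a contiguous run of chars.
// It stores a pointer and a length; it never allocates or copies.
class string_ref {
public:
    string_ref(const char* s) : data_(s), size_(std::strlen(s)) {}
    string_ref(const char* s, std::size_t n) : data_(s), size_(n) {}
    string_ref(const std::string& s) : data_(s.data()), size_(s.size()) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }

private:
    const char* data_;
    std::size_t size_;
};

// An ordinary (non-template) function taking the view by value.
inline std::size_t count_separators(string_ref path) {
    std::size_t n = 0;
    for (const char* p = path.begin(); p != path.end(); ++p) {
        if (*p == '/') {
            ++n;
        }
    }
    return n;
}

// Usage check: both argument kinds work, and the view aliases the original
// buffer instead of copying it.
inline void check_string_ref() {
    const std::string s = "a/b/c/d";
    assert(count_separators("usr/local/lib") == 2);
    assert(count_separators(s) == 3);
    assert(string_ref(s).data() == s.data());
}
```

The usual caveat applies: the view must not outlive the buffer it refers
to, which is the price paid for avoiding the copy.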

-- 
Yakov