
On 28.10.2011 14:41, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 13:58, Peter Dimov <pdimov@pdimov.com> wrote:
Alf P. Steinbach wrote:
How do I make the following program work with Visual C++ in Windows, using
narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>      // _O_U8TEXT
#include <io.h>         // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
Output to a console wasn't our topic so far (and is not one of my strong points), but the specific problem with this program is that the embedded literal is not UTF-8, as the warning C4566 tells us, so there is no way for you to get UTF-8 in the output. (You should be able to set VC++'s code page to 65001, but I don't think you can.)
int main() { printf( utf8_encode( L"кошка" ).c_str() ); }
You don't need to configure anything, in fact you cannot do it properly in VS. What you can do is:
1) don't use wide-char literals with non-ASCII characters
2) use UTF-8 literals for narrow-char strings.
All you need is to save the source as UTF-8 WITHOUT BOM. Works like a charm on VS2005 and VS2010. Apparently it's portable. The IDE can detect UTF-8 even without a BOM ("☑ Auto-detect UTF-8 encoding without signature").
This is interesting in a perverse sort of way. In order to make Visual C++ produce UTF-8 encoded compiled narrow strings, one must /lie/ to the compiler: the source code is UTF-8, and one tells the Visual C++ compiler that it's ANSI. And in order to make g++ produce ANSI encoded compiled narrow strings, one must likewise /lie/ to the compiler: the source code is ANSI, and one tells g++ that it's UTF-8. As I see it, there's something wrong here. Notwithstanding the limitation that codepage 65001 is impractical in the Windows command interpreter -- e.g. the 'more' command CRASHES.
This is not a practical problem for "proper" applications because Russian text literals should always come from the equivalent of gettext and never be embedded in code.
+1
I find that a very narrow-minded view. Would you like to be the one telling Norwegian student Åshild Bjørnson that you favor the notion that she should waste hours or days installing Boost and some other *nix-oriented library, and use 'gettext', just to be able to display her name in her first C++ program? That text representation and output in C++ have been designed (with your not just willing but enthusiastic vote) to be so inherently complex that it takes hours or days of effort just to display one's own name?
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8. Preventing data-loss is more important for me.
I find it thoroughly disgusting to have to lie to one's tools, and to rely on the assumption that the tools will not wisen up in the future.

However, I concede the point: IF one is happy with output encoded so that most Windows command line tools fail (e.g. 'more' crashes), and IF one is happy with lying to the compiler about the source encoding, and IF one is happy assuming that the compiler won't wisen up about encodings in a future version, then -- the UTF-8 scheme allows literals with national language characters, not just A through Z. Those are pretty constricting conditions, though.

Cheers & hth.,

- Alf