
On 29.10.2011 14:14, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach< alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 15:00, Peter Dimov wrote:
Yakov Galka wrote:
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8.
You can configure the console. Select Consolas or Lucida Console as the font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though. :-)
It breaks a hell of a lot more than batch files. Try `more`.
So I tried to make YOUR approach work (i.e. use wchar_t):
I am afraid that you are misrepresenting me a bit here. But I am sure it is not intentional. Let's walk through this.
Created a file with:
#include <cstdio>
int main() { ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" ); }
saved as UTF-8 with BOM. Compiled with VS2005, windows XP.
Except that <cstdio> is not guaranteed to place wprintf in the global namespace (I commented on that before; better use <stdio.h>), that code works OK in the sense of doing what you have specified should happen. Which apparently is not what you think, heh.
You have specified a conversion to narrow characters using the C++ narrow execution character set, i.e. a conversion to Windows ANSI.
It surprises a lot of programmers that that's what 'wcout' does: a NARROWING CONVERSION. It did surprise me at one time in the 1990s. I was very disappointed. After that I have become more and more sure that there was no design of the C++ iostreams, but that's another story...
M:\bin> a.exe
Blσbµrsyltet°y!
Yes -- that's what Windows ANSI Western, which you asked for, looks like when it is presented with the original IBM PC character set, codepage 437. Switch the codepage to 1252, the codepage number for Windows ANSI Western, to get the Windows ANSI result that you asked for to display properly. Of course it will lack the Unicode-only characters:
<example>
P:\test> type jam.cpp
∩╗┐#include <cstdio>
int main() { ::wprintf( L"Bl├Ñb├ªrsyltet├╕y! µùѵ£¼σ¢╜ ╨║╨╛╤ê╨║╨░!\n" ); }

P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <cstdio>
int main() { ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" ); }

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Bl�b�rsyltet�y!

P:\test> chcp 437
Active code page: 437

P:\test> jam
Blσbµrsyltet°y!

P:\test> chcp 1252
Active code page: 1252

P:\test> jam
Blåbærsyltetøy!

P:\test> _
</example>
[snip]
M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!
Somewhat better. But how do I get to see the whole string?
Not with any single-byte-per-character encoding. ;-) You can use UTF-8 or UTF-16 for the output. UTF-8 is a bit problematic because the Windows support is really flaky. [snip effort with wide text]
Ah! it's IMPOSSIBLE with wprintf!
No no, you're jumping to conclusions. The Microsoft runtime has special support for this at the C library level, but unfortunately, as far as I know, not at the C++ level. Still, since you're using 'wprintf', that's at the C level, so it's no problem:
<example>
P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <stdio.h>
#include <io.h> // _setmode
#include <fcntl.h> // _O_U8TEXT

int main()
{
    _setmode( _fileno( stdout ), _O_U8TEXT );
    ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test> g++ jam.cpp
jam.cpp: In function 'int main()':
jam.cpp:7: error: '_O_U8TEXT' was not declared in this scope

P:\test> g++ jam.cpp -D __MSVCRT_VERSION__=0x0800

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> _
</example>
Let's try UTF-8 instead. [snip effort]
It works! MAGIC! More importantly: ***It's the only way to make it work!***
See above. Those statements are wrong in two important respects.
First wrongness, the Windows console window support for UTF-8 is really really flaky, so that you get more or less arbitrary "errors". They seem to be connected with timing or something. So UTF-8 is not good: I showed above how to generate UTF-8 from wide char literals just to be exactly comparable to your example code, and the big difference is that I did not have to lie to the compiler and hope for the best. Instead, the code I presented above is well-defined. The result, for my program and for yours (since both output UTF-8) isn't well defined though -- it depends somewhat on the phase of the moon in Seattle, or something.
Second wrongness, it's not the only way. I started very near the top of this thread by giving a concrete example that worked very neatly. It gets tiresome repeating myself. But as you could see in that example, the end programmer does not have to deal with the dirty platform-specific details any more than with all-UTF8.
---
And you absolutely don't want to work with codepage 65001 in the console: it causes batch files and 'more' and pipes etc. to fail.
But, you may ask, what about Alf's program, then, it's the same for heaven's sake? Well, let's check:
<example>
P:\test> chcp 1252
Active code page: 1252

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test>
</example>
He he. :-) It works also with the more practical codepage 1252 in the console. The reason is probably that it uses WriteConsole internally, but it doesn't matter much how the runtime library accomplishes this.
On the other hand, as with much else Microsoft there are probably hidden costs. It is possible that invoking this C level support may wreak havoc at the C++ iostreams level, so that a good solution may have to provide custom iostream buffers working around the Microsoft bugs.
[snip about reporting one of the myriad console bugs, to Microsoft]
Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only option from the above.
No that's incorrect. In my (limited) experience UTF-16 is more reliable for this. However, UTF-16 as an external encoding feels sort of wrong, even if it is very efficient for Japanese network traffic.
Furthermore, if people pester Microsoft we will get more benefit (no pun intended) than from rewriting our code to use some unknown encoding that is different on each platform.
I believe that could greatly ease the porting of *nix tools to Windows.
Cheers & hth.,
- Alf