
On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
> On 28.10.2011 15:00, Peter Dimov wrote:
>> Yakov Galka wrote:
>>> Personally I'm happy with
>>>     printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>> writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8.
>> You can configure the console. Select Consolas or Lucida Console as the font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though. :-)
> it break a hell of a lot more than batch files. try `more`.
>
> cheers & hth.,
> - Alf
So I tried to make YOUR approach work (i.e. use wchar_t). I created a file with:

    #include <cstdio>
    int main()
    {
        ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
    }

saved as UTF-8 with BOM. Compiled with VS2005 on Windows XP.

    M:\bin> a.exe
    Blσbµrsyltet°y!
    M:\bin> a.exe > a.txt

Contents of a.txt:

    42 6C E5 62 E6 72 73 79 6C 74 65 74 F8 79 21 20

What happened to the Japanese and Russian? What's with the mojibake? Maybe the compiler corrupted the string? Let's see, change to:

    wchar_t s[] = L"Blåbærsyltetøy! 日本国 кошка!\n";
    ::wprintf( s );

Recompile, step into the debugger. No. It's your favorite, correct UTF-16 that's passed to wprintf. Same result. Let's try a European codepage:

    M:\bin> chcp 1252
    M:\bin> a.exe
    Blåbærsyltetøy!

Somewhat better. But how do I get to see the whole string?

    M:\bin> chcp 65001
    M:\bin> a.exe
    Blbrsyltety!
    M:\bin> chcp 1200
    Invalid code page

OK, let's drop the requirement that the user sees the string at all. Let's restrict to a simpler case: a.exe writes Unicode to stdout, b.exe reads it from stdin and writes it verbatim to a file. Here is program b.exe:

    #include <cstdio>
    #include <cwchar>
    #include <fstream>
    int main()
    {
        wchar_t s[256];
        _getws(s);
        std::ofstream fout("out.txt", std::ios::binary);
        fout.write((const char*)s, 2*wcslen(s)); // I want to see what I really get
    }

Compile, run.

    M:\a> a.exe | b.exe

Independent of chcp I get:

    42 00 6C 00 E5 00 62 00 E6 00 72 00 73 00 79 00 6C 00 74 00 65 00 74 00 F8 00 79 00 21 00 20 00

Why the hell is this lossy‽ Where IS my lovely Japanese? What am I doing wrong⸘ Ah! It's IMPOSSIBLE with wprintf!

Let's try UTF-8 instead. Write the program as we've written it for 40 years, even before UTF-8 and the whole wide-char crap was introduced†. Open VS2005:

    #include <stdio.h>
    int main()
    {
        printf("Blåbærsyltetøy! 日本国 кошка!\n");
    }

† I mean the C functions used. Of course we couldn't mix Japanese and Russian back then.

Save in UTF-8 WITHOUT BOM. Compile to a-utf8.exe.

    #include <cstdio>
    #include <cstring>
    #include <fstream>
    int main()
    {
        char s[256];
        gets(s);
        std::ofstream fout("out.txt", std::ios::binary);
        fout.write((const char*)s, strlen(s));
    }

Compile to b-utf8.exe.

    M:\> a-utf8.exe
    BlÃ¥bærsyltetøy! 日本国 кошка!

Something is bad. [The user goes to the documentation/support. Alright, I need UTF-8. This software is Unicode aware! Good, they care about their customers!]:

    M:\> chcp 65001
    M:\> a-utf8.exe
    Blåbærsyltetøy! 日本国 кошка!

Correct! (OK, I see squares for the Japanese because I don't have a monospace font for it, but copy/paste works correctly.)

    M:\> a-utf8.exe > a.txt

a.txt:

    42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21 0D 0A

Correct!

    M:\> a-utf8.exe | b-utf8.exe
    M:\> type out.txt
    Blåbærsyltetøy! 日本国 кошка!

out.txt:

    42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21

It works! MAGIC! More importantly: ***It's the only way to make it work!***

⇒ What if it's automatic and the user cannot intervene to change the codepage?
‽ If it's automatic, then you don't care how it's displayed in the console. You will log it to a file anyway. The case of

    M:\> a-utf8.exe | b-utf8.exe

works correctly regardless of what the current codepage is set to.

⟹ more doesn't work.
‽ Report the bug to Microsoft. UTF-8 is a documented codepage. Microsoft itself encourages using either UTF-8 or UTF-16. Other 'ANSI' codepages are unofficially deprecated. http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%2...:

    Note  ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page.

Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only option of the two.

Furthermore, if people pester Microsoft, we will get more benefit (no pun intended) than from rewriting our code to use some unknown encoding that is different on each platform.

-- Yakov