"Best so far" for C level i/o of Unicode text with Windows console

This may be of interest to developers of Boost libraries that deal with text.

I found that the Visual C++ implementation of the C library i/o generally does not support console input of international characters. It can deal with narrow character input from the current codepage, if that codepage is not UTF-8. For example:

<code>
#include <stdio.h>

int main()
{
    printf( "? " );
    char buffer[80];
    scanf( "%s", buffer );
    printf( "%s\n", buffer );
}
</code>

<result>
d:\dev\test> chcp 1252
Active code page: 1252

d:\dev\test> utf8test
? abcæøå
abcæøå

d:\dev\test> chcp 65001
Active code page: 65001

d:\dev\test> utf8test
? abcæøå
��,

d:\dev\test> _
</result>

I placed a comment about this at the blog of Microsoft's Unicode guru Michael Kaplan, http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

In particular, with active codepage 65001 (UTF-8), functions such as wscanf just fail outright on non-ASCII characters.

The IMO best compromise that I have managed to come up with is a kind of hybrid, where the active input codepage is set to 1252, Windows ANSI Western, because that's a superset of Latin 1, which in turn is a subset of Unicode. Output is converted to UTF-8, and input from e.g. a file accepts UTF-8. This means that e.g. Norwegian text can be input interactively (restricted to the Latin 1 character set) or from a pipe or file (as UTF-8) without problems, and can be automatically translated (by the C i/o level) to UTF-16 encoding inside the program.

However, I'm guessing that the C level input will garble e.g. Russian text no matter what one does, if one desires some Unicode based encoding inside the program. The Windows and Visual C++ support is just full of bugs -- e.g. as the crash of `more` showed in an earlier thread.

Initialization of streams for the compromise scheme:

<code>
#include <stdio.h>      // FILE, stdin, stdout, stderr
#include <io.h>         // _fileno, _isatty, _setmode
#include <fcntl.h>      // _O_U8TEXT

// hopefully and throwX: error-handling helpers, not shown here.

static void msvcCompatibleInit()
{
    struct Fix
    {
        static void mode( FILE* f, char const errorText[] )
        {
            int const fileNo = _fileno( f );
            bool const isConsoleInput = (f == stdin && _isatty( fileNo ));

            if( isConsoleInput )
            {
                // Bytes are received as per the active codepage in the console.
                // Except if that active codepage is 65001, in which case non-ASCII
                // characters fail. Also, _setmode just causes non-ASCII to fail.
                //
                // Setting the console codepage to 1252 might be practically helpful,
                // since cp 1252, Windows ANSI Western, is a superset of Latin 1,
                // which is a subset of Unicode. However, non-Latin 1 characters will
                // then be incorrect, just as with web pages in the old days.
            }
            else
            {
                int const newMode = _setmode( fileNo, _O_U8TEXT );
                hopefully( newMode != -1 ) || throwX( errorText );
            }
        }
    };

    Fix::mode( stdin, "_setmode stdin failed" );
    Fix::mode( stdout, "_setmode stdout failed" );
    Fix::mode( stderr, "_setmode stderr failed" );
}
</code>

Cheers & hth.,

- Alf
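A minimal sketch of wiring this scheme together, assuming the msvcCompatibleInit above and using the documented Win32 SetConsoleCP function to select codepage 1252 for interactive input (the wiring itself is a suggestion, not something from the post):

<code>
#include <windows.h>    // SetConsoleCP
#include <io.h>         // _isatty, _fileno
#include <stdio.h>      // stdin, wprintf

int main()
{
    if( _isatty( _fileno( stdin ) ) )
    {
        SetConsoleCP( 1252 );   // typed input as Windows ANSI Western
    }
    msvcCompatibleInit();       // the initialization shown above

    // Program-internal text is wide; with stdout in _O_U8TEXT mode the
    // CRT converts it to UTF-8 on the way out.
    wprintf( L"bl\u00e5b\u00e6rsyltet\u00f8y\n" );
}
</code>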

[Alf P. Steinbach]
I found that the Visual C++ implementation of the C library i/o generally does not support console input of international characters. It can deal with narrow character input from the current codepage, if that codepage is not UTF-8.
Changing the console's codepage isn't the right magic. See http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

With _O_U16TEXT, VC8+ can write Unicode to the console perfectly. However, I believe that input was broken up to and including VC10, and that it's been fixed in VC11.

(I don't know about UTF-8. For reasons that are still mysterious to me, UTF-8 typically isn't handled as well as people expect it to be. Windows really, really likes UTF-16 for Unicode. In practice, this is not a big deal, because UTF-8 and UTF-16 are losslessly convertible.)

Stephan T. Lavavej
Visual C++ Libraries Developer
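(A minimal sketch of the _O_U16TEXT output side described here, assuming a VC8 or later CRT; not from the message itself:)

<code>
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode, _fileno
#include <stdio.h>   // wprintf

int main()
{
    // Put stdout in UTF-16 mode. Only wide-character output functions
    // may be used on the stream afterwards.
    _setmode( _fileno( stdout ), _O_U16TEXT );
    wprintf( L"abc\u00e6\u00f8\u00e5\n" );   // abcæøå
}
</code>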

On 04.11.2011 00:14, Stephan T. Lavavej wrote:
[Alf P. Steinbach]
I found that the Visual C++ implementation of the C library i/o generally does not support console input of international characters. It can deal with narrow character input from the current codepage, if that codepage is not UTF-8.
Changing the console's codepage isn't the right magic. See http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx
Thanks! :-) But did you notice that the article you replied to had a link to that exact page, and contained code using the technique Kaplan describes?
With _O_U16TEXT, VC8+ can write Unicode to the console perfectly. However, I believe that input was broken up to and including VC10, and that it's been fixed in VC11.
Nope, sorry.

<code>
#include <stdio.h>

int main()
{
    printf( "? " );
    char buffer[80];
    scanf( "%s", buffer );
    printf( "%s\n", buffer );
}
</code>

<result>
D:\dev\test> (cl 2>&1) | find /i "c++"
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 17.00.40825.2 for 80x86

D:\dev\test> set CL=/nologo /EHsc /GR /W4

D:\dev\test> cl utf8test.cpp
utf8test.cpp
utf8test.cpp(7) : warning C4996: 'scanf': This function or variable may be unsafe. Consider using scanf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
        C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\INCLUDE\stdio.h(290) : see declaration of 'scanf'

D:\dev\test> chcp 1252
Active code page: 1252

D:\dev\test> utf8test
? abcæøå
abcæøå

D:\dev\test> chcp 65001
Active code page: 65001

D:\dev\test> utf8test
? abcæøå
1?

D:\dev\test> _
</result>
(I don't know about UTF-8. For reasons that are still mysterious to me, UTF-8 typically isn't handled as well as people expect it to be. Windows really really likes UTF-16 for Unicode. In practice, this is not a big deal, because UTF-8 and UTF-16 are losslessly convertible.)
I think you're right that it's not a big deal for professional software development, because what professional developer depends on correct input of international characters from a Windows console window? Not me... (But then it's been some years since I was a professional developer.)

But I think it is important that students should be able to write the same kinds of program code as they will later write as professionals. And for that reason it would be Really Nice if the Visual C++ runtime library could deal with interactive UTF-8 input. Note that if standard input is redirected to come from a file, then it works OK.

Cheers, & thanks for helping,

- Alf
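As an aside not raised in the thread: one way to get correct interactive Unicode input, independent of the console codepage, is to drop below the CRT and call the Win32 ReadConsoleW function directly, which delivers UTF-16. A sketch (only valid when stdin really is a console):

<code>
#include <windows.h>   // GetStdHandle, ReadConsoleW
#include <stdio.h>

int main()
{
    HANDLE const input = GetStdHandle( STD_INPUT_HANDLE );
    wchar_t buffer[80];
    DWORD nRead = 0;

    // Reads the typed line as UTF-16, bypassing the CRT's codepage
    // translation entirely. nRead includes the trailing CR LF.
    if( ReadConsoleW( input, buffer, 79, &nRead, 0 ) )
    {
        buffer[nRead] = L'\0';
        // buffer now holds the line as UTF-16.
    }
}
</code>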

On Thu, Nov 3, 2011 at 4:14 PM, Stephan T. Lavavej <stl@exchange.microsoft.com> wrote:
[Alf P. Steinbach]
I found that the Visual C++ implementation of the C library i/o generally does not support console input of international characters. It can deal with narrow character input from the current codepage, if that codepage is not UTF-8.
Changing the console's codepage isn't the right magic. See http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx
With _O_U16TEXT, VC8+ can write Unicode to the console perfectly. However, I believe that input was broken up to and including VC10, and that it's been fixed in VC11.
(I don't know about UTF-8. For reasons that are still mysterious to me, UTF-8 typically isn't handled as well as people expect it to be. Windows really really likes UTF-16 for Unicode. In practice, this is not a big deal, because UTF-8 and UTF-16 are losslessly convertible.)
I've found that for a multi-platform library, the most straightforward strategy for handling Unicode is to use UTF-8, which when running on Windows gets converted to UTF-16 just before calling a SomethingSomethingW function. Why not the other way around (use UTF-16 and convert to UTF-8 before calling POSIX functions)? Because:

- most portable Unicode-aware libraries use UTF-8,
- many unaware libraries just work with UTF-8,
- even on Windows, last time I checked MinGW still doesn't support std::wstring, which makes it difficult to manage UTF-16 strings (assuming portability is important).

Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode
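A minimal sketch of the conversion step described here (the helper name is made up), using the Win32 MultiByteToWideChar function with CP_UTF8:

<code>
#include <windows.h>   // MultiByteToWideChar, CP_UTF8
#include <string>

// Convert UTF-8 to UTF-16 just before calling a SomethingSomethingW
// function. Error handling omitted for brevity.
std::wstring toUtf16( std::string const& utf8 )
{
    if( utf8.empty() ) { return std::wstring(); }
    int const n = MultiByteToWideChar(
        CP_UTF8, 0, utf8.data(), int( utf8.size() ), 0, 0 );
    std::wstring result( n, L'\0' );
    MultiByteToWideChar(
        CP_UTF8, 0, utf8.data(), int( utf8.size() ), &result[0], n );
    return result;
}

// Usage: CreateFileW( toUtf16( path ).c_str(), ... );
</code>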
participants (3)
- Alf P. Steinbach
- Emil Dotchevski
- Stephan T. Lavavej