
This may be of interest to developers of Boost libraries that deal with text.

I found that the Visual C++ implementation of the C library i/o generally does not support console input of international characters. It can deal with narrow character input from the current codepage, provided that codepage is not UTF-8. For example:

<code>
#include <stdio.h>

int main()
{
    printf( "? " );
    char buffer[80];
    scanf( "%79s", buffer );    // Width limit avoids overrunning the buffer.
    printf( "%s\n", buffer );
}
</code>

<result>
d:\dev\test> chcp 1252
Active code page: 1252

d:\dev\test> utf8test
? abcæøå
abcæøå

d:\dev\test> chcp 65001
Active code page: 65001

d:\dev\test> utf8test
? abcæøå
��,

d:\dev\test> _
</result>

I placed a comment about this at the blog of Microsoft's Unicode guru Michael Kaplan,

http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

In particular, with active codepage 65001 (UTF-8), functions such as wscanf just fail outright on non-ASCII characters.

The best compromise that I have managed to come up with, IMO, is a kind of hybrid:

* The active input codepage is set to 1252, Windows ANSI Western, because that's a superset of (the printable part of) Latin 1, which is a subset of Unicode.
* Output is converted to UTF-8.
* Input from e.g. a file or pipe is accepted as UTF-8.

This means that e.g. Norwegian text can be input interactively (restricted to the Latin 1 character set) or from a pipe or file (as UTF-8) without problems, and can be automatically translated (by the C i/o level) to UTF-16 encoding inside the program.

However, I'm guessing that the C level input will garble e.g. Russian text no matter what one does, if one desires some Unicode based encoding inside the program. The Windows and Visual C++ support is just full of bugs -- e.g. as the crash of `more` showed in an earlier thread.
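As a small illustration of the Latin 1 point above: every Latin 1 byte value is its own Unicode code point, so widening or re-encoding such input needs no lookup table. The following is only a sketch under stated assumptions, not part of the scheme's actual code. The name `latin1ToUtf8` is hypothetical, and the input is assumed to be strict ISO 8859-1. The cp 1252 bytes 0x80 through 0x9F map to other code points and would need a table, which is not handled here.

```cpp
#include <string>

// Hypothetical helper (not from any library): re-encodes ISO 8859-1 bytes as UTF-8.
// Assumption: the input is strict Latin 1, where byte value == Unicode code point.
// cp 1252 bytes in the range 0x80..0x9F are NOT handled correctly by this sketch.
std::string latin1ToUtf8( std::string const& latin1 )
{
    std::string utf8;
    utf8.reserve( latin1.size() );
    for( char const ch : latin1 )
    {
        unsigned const code = static_cast<unsigned char>( ch );
        if( code < 0x80 )
        {
            // ASCII passes through unchanged.
            utf8 += static_cast<char>( code );
        }
        else
        {
            // Code points 0x80..0xFF take two UTF-8 bytes: 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>( 0xC0 | (code >> 6) );
            utf8 += static_cast<char>( 0x80 | (code & 0x3F) );
        }
    }
    return utf8;
}
```

For instance, the Latin 1 bytes E6 F8 E5 ("æøå") come out as the UTF-8 bytes C3 A6, C3 B8, C3 A5.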
Initialization of streams for the compromise scheme:

<code>
#include <stdio.h>      // FILE, stdin, stdout, stderr, _fileno
#include <io.h>         // _isatty, _setmode
#include <fcntl.h>      // _O_U8TEXT

// hopefully() and throwX() are helpers defined elsewhere (not shown here).
static void msvcCompatibleInit()
{
    struct Fix
    {
        static void mode( FILE* f, char const errorText[] )
        {
            int const fileNo = _fileno( f );
            bool const isConsoleInput = (f == stdin && _isatty( fileNo ));
            if( isConsoleInput )
            {
                // Bytes are received as per the active codepage in the console.
                // Except if that active codepage is 65001, in which case non-ASCII
                // characters fail. Also, _setmode just makes non-ASCII input fail.
                //
                // Setting the console codepage to 1252 might be practically helpful,
                // since cp 1252, Windows ANSI Western, is a superset of Latin 1
                // which is a subset of Unicode. However, non-Latin 1 characters will
                // then be incorrect, just as with web pages in the old days.
            }
            else
            {
                // _setmode returns the previous mode, or -1 on failure.
                int const oldMode = _setmode( fileNo, _O_U8TEXT );
                hopefully( oldMode != -1 ) || throwX( errorText );
            }
        }
    };

    Fix::mode( stdin, "_setmode stdin failed" );
    Fix::mode( stdout, "_setmode stdout failed" );
    Fix::mode( stderr, "_setmode stderr failed" );
}
</code>

Cheers & hth.,

- Alf