[boost] "Best so far" for C level i/o of Unicode text with Windows console

3 Nov 2011

This may be of interest to developers of Boost libraries that deal with 
text.

I found that the Visual C++ implementation of the C library i/o 
generally does not support console input of international characters. It 
can deal with narrow character input from the current codepage, if that 
codepage is not UTF-8. For example:

<code>
#include <stdio.h>

int main()
{
     printf( "? " );
     char buffer[80];
     scanf( "%s", buffer );
     printf( "%s\n", buffer );
}
</code>

<result>
d:\dev\test> chcp 1252
Active code page: 1252

d:\dev\test> utf8test
? abcæøå
abcæøå

d:\dev\test> chcp 65001
Active code page: 65001

d:\dev\test> utf8test
? abcæøå
��,

d:\dev\test> _
</result>

I placed a comment about this at the blog of Microsoft's Unicode guru 
Michael Kaplan,

http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

In particular, with active codepage 65001 (UTF-8), functions such as 
wscanf just fail outright on non-ASCII characters.

The best compromise that I have managed to come up with is, IMO, a kind 
of hybrid, where the active input codepage is set to 1252, Windows ANSI 
Western, because that's a superset of Latin 1 which is a subset of 
Unicode. Output is converted to UTF-8. Input from e.g. a file accepts UTF-8.
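
For concreteness, here is a minimal sketch of how the console input codepage could be switched from within the program, instead of via `chcp`. The helper names `setConsoleInputCodepage` and `consoleInputCodepage` are made up for this illustration; only `SetConsoleCP` and `GetConsoleCP` are actual Windows API functions. The `#ifdef` guards are just there so that the sketch also compiles on other systems, where it does nothing.

```cpp
#ifdef _WIN32
#   include <windows.h>     // GetConsoleCP, SetConsoleCP
#endif

// Hypothetical helper (not from the scheme's original code): tries to set
// the console *input* codepage and returns true on success. On non-Windows
// systems there is nothing to set, so it just reports failure.
inline bool setConsoleInputCodepage( unsigned const codepage )
{
#ifdef _WIN32
    return SetConsoleCP( codepage ) != 0;
#else
    (void) codepage;
    return false;
#endif
}

// Hypothetical helper: the active console input codepage, or 0 if there
// is no console or the platform is not Windows.
inline unsigned consoleInputCodepage()
{
#ifdef _WIN32
    return GetConsoleCP();
#else
    return 0;
#endif
}
```

A program using the hybrid scheme would presumably call `setConsoleInputCodepage( 1252 )` once at startup, and treat a `false` result as "no console, or not Windows" rather than as a fatal error.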

This means that e.g. Norwegian text can be input interactively 
(restricted to Latin 1 character set) or from a pipe or file (as UTF-8) 
without problems, and can be automatically translated (by the C i/o 
level) to UTF-16 encoding inside the program. However, I'm guessing that 
the C level input will garble e.g. Russian text no matter what one does, 
if one desires some Unicode based encoding inside the program. The 
Windows and Visual C++ support is just full of bugs, e.g. as the 
crash of `more` showed in an earlier thread.
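
To illustrate why the Latin 1 restriction makes the interactive case work, here is a small sketch of converting Latin 1 bytes to UTF-8. It is not what the C runtime does internally, just a hand-written illustration, and `latin1ToUtf8` is a name invented for it. The sketch relies on each byte value in the ASCII range and in 0xA0 through 0xFF being equal to the character's Unicode code point. Bytes 0x80 through 0x9F have cp 1252 specific meanings (0x80 is the euro sign, for example) and are simply replaced with `?` here. There are no byte values left over for Cyrillic, which is why Russian text can't get through this way.

```cpp
#include <string>

// Hypothetical helper: converts text in the Latin 1 range to UTF-8.
// Assumes that a byte's value is its Unicode code point, which holds for
// ASCII and for 0xA0..0xFF in codepage 1252. Bytes 0x80..0x9F are cp 1252
// specific and are replaced with '?' in this sketch.
inline std::string latin1ToUtf8( std::string const& s )
{
    std::string result;
    for( char const ch : s )
    {
        unsigned const code = static_cast<unsigned char>( ch );
        if( code < 0x80 )
        {
            result += ch;                   // ASCII is unchanged in UTF-8.
        }
        else if( code < 0xA0 )
        {
            result += '?';                  // Not Latin 1; not handled here.
        }
        else
        {
            // Code points 0xA0..0xFF need two bytes in UTF-8.
            result += static_cast<char>( 0xC0 | (code >> 6) );
            result += static_cast<char>( 0x80 | (code & 0x3F) );
        }
    }
    return result;
}
```

With this, the cp 1252 bytes for `abcæøå`, i.e. `"abc\xE6\xF8\xE5"`, come out as the UTF-8 bytes `"abc\xC3\xA6\xC3\xB8\xC3\xA5"`.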

Initialization of streams for the compromise scheme:

<code>
#include <stdio.h>      // FILE, stdin, stdout, stderr, _fileno
#include <io.h>         // _setmode, _isatty
#include <fcntl.h>      // _O_U8TEXT

// hopefully and throwX are the author's helper functions; their
// definitions are not part of this listing.

static void msvcCompatibleInit()
{
     struct Fix
     {
         static void mode( FILE* f, char const errorText[] )
         {
             int const   fileNo          = _fileno( f );
             bool const  isConsoleInput  = (f == stdin && _isatty( fileNo ));

             if( isConsoleInput )
             {
                 // Bytes are received as per the active codepage in the console.
                 // Except if that active codepage is 65001, in which case non-ASCII
                 // characters fail. Also, _setmode just causes non-ASCII to fail.
                 //
                 // Setting the console codepage to 1252 might be practically helpful,
                 // since cp 1252, Windows ANSI Western, is a superset of Latin 1
                 // which is a subset of Unicode. However, non-Latin 1 characters will
                 // then be incorrect, just as with web pages in the old days.
             }
             else
             {
                 // Files, pipes and all output use UTF-8 as the external encoding.
                 int const newMode = _setmode( fileNo, _O_U8TEXT );
                 hopefully( newMode != -1 )
                     || throwX( errorText );
             }
         }
     };

     Fix::mode( stdin, "_setmode stdin failed" );
     Fix::mode( stdout, "_setmode stdout failed" );
     Fix::mode( stderr, "_setmode stderr failed" );
}
</code>
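
The code above uses `hopefully` and `throwX` without showing them. The following is only a guess at minimal definitions that make the `hopefully( ... ) || throwX( ... )` idiom compile and behave as the usage suggests; the exception type `std::runtime_error` in particular is an assumption.

```cpp
#include <stdexcept>    // std::runtime_error
#include <string>

// Assumed definition: just passes the condition through, so that it can
// be used as the left operand of ||.
inline bool hopefully( bool const condition ) { return condition; }

// Assumed definition: never returns normally. The bool result type only
// exists so that the call can be the right operand of ||.
inline bool throwX( std::string const& message )
{
    throw std::runtime_error( message );
}
```

With these definitions, `hopefully( newMode != -1 ) || throwX( errorText )` does nothing when `_setmode` succeeded, and throws an exception carrying `errorText` when it failed.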

Cheers & hth.,

- Alf

Alf P. Steinbach