
On 31.10.2011 18:18, bugpower wrote:
> Alf, All,
>
> What the replies seem to be missing here is that what you call the "least surprise" behavior of the code with argument of main() is simply incorrect from the software engineering point of view. Let me explain:
>
> 3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.
>
> Now, when accepting a filename from the user's command line on Windows, it is simply not possible to use the narrow-string version of main().
Well, there are three aspects of that claim:

  1. The limitations of `main` in Windows.
  2. What you're trying to communicate.
  3. The literal claim that "it is simply not possible to use narrow-string version of main()".

Regarding aspect (1), the C++ standard does describe the `main` arguments as "MBCS" strings, meaning that they can (and, as I read it, should) be encoded with possibly two or more bytes per character, as in for example UTF-8. However, the Windows convention for the C++ executable narrow character set predates the C++ standard by a long shot, and even predates the C standard, and is Windows ANSI. And that convention is /very/ deeply embedded, not only in the runtime library implementations but e.g. in how Visual C++ translates string literals.

Regarding aspect (2), more quoted concrete context could help make it clear to readers what you're trying to say. I'm not a telepath. But it does sound like you're arguing against a straw man of your own devising, as if someone had argued for using ANSI-encoded arguments in general, or as some solution to i18n. So, I will put my initial remark above more strongly: please always /quote/ what you're referring to. Especially when you are offering something that sounds like an argument against something, please /quote/ what you're referring to.

Regarding aspect (3), this claim is incorrect. Many people do think that one has to use a non-standard startup function like WinMain, that one has to ditch some parts of the C++ standard as soon as one does Windows. So, a basic technical fact: the GNU toolchain (the g++ compiler) happily accepts a standard `main` startup function without further ado, regardless of Windows subsystem. The Microsoft toolchain (the Visual C++ compiler and linker), however, is less adept at recognizing your startup function as such. So with the MS toolchain you have to specify the startup function explicitly if you're building a GUI subsystem program and want a standard `main`.
The relevant linker options: "/entry:mainCRTStartup /subsystem:windows".

---

Finally, note how I had to cover a lot of bases and use a lot of time responding to your single little sentence. That's because that sentence was very *unclear* and *misleading*, and, given the next sentence, quoted below, I hope it was not so by design.
> Your code cannot enforce your user to limit his input to characters representable in the current ANSI codepage.
Ignoring the misleading "your", and responding to the technical content only: the previous sentence talked about `main` arguments, and this following sentence talks about "input", so it seems that you are confusing two different things that have very different behaviors in Windows.

In Windows the program arguments are always passed to the process as a single UTF-16 encoded command line string, available via the API function GetCommandLine. For the command line it is therefore meaningless to talk about restricting the user.

Standard /input/, OTOH, is always passed via some narrow character encoding, which rules out UTF-16, and which by convention is neither ANSI nor UTF-16 but the extraordinarily impractical OEM codepage (on an English PC that codepage is the original IBM PC character set). Happily it is possible to change the narrow character encoding used for input, for example to UTF-8 as the external encoding. This is called the "active codepage" of a console window, and it can also be changed by the user, e.g. via the commands 'mode' and 'chcp'.

The dangers of selecting UTF-8 as the active codepage in a command interpreter console window have been discussed else-thread; in short, Microsoft has a large number of ridiculous bugs in their support. But that discussion also showed that it's (very probably) OK under program control.
> If the command line parameter is a filename as in the example you suggested, you cannot tell them "never double click on some files" (if a program is used in a file association).
What example? Please always quote what you refer to, and quote enough of the context: don't be ambiguous, and don't leave it to readers to infer a context based on your possibly wrong understanding of it. Anyway, text passed as `main` arguments can be e.g. the user's name, which is not necessarily a filename.
> Supporting is always better than white-listing, so the only acceptable way of using a command line parameter which is a filename on Windows is with the UTF-16 version - _tmain().
Oh dear.

`_tmain` is a Microsoft macro that helped support compatibility with Windows 9x before the Layer for Unicode was introduced in 2001. It maps to the narrow character standard `main` or to the wide character non-standard `wmain`, depending on the `_UNICODE` macro symbol.

Are you seriously suggesting using `_tmain` to stay compatible with Windows 9x? Note that for Windows 9x, `_tmain` maps to standard `main`. And note that in Windows, `main` has Windows ANSI-encoded arguments... So `_tmain` buys you nothing here. It was an abomination even in its day, and today there are no reasons whatsoever to obfuscate the code that way.
> Then, proceed as Artyom explained. The surprise is then justified - it prevented a hard-to-spot bug. My preference on Windows though would be different (and not due to religious reasons) - convert all API-returned strings to UTF-8 as soon as possible and forget about encoding issues for good.
No, it does not let you forget about encoding issues. Rather, it introduces extra bug attractors, since you have then overloaded the meaning of char-based text. By convention in Windows and in most existing Windows code, char is ANSI-encoded, so other code will expect ANSI encoding from the UTF-8 based code, which will tend to introduce bugs. And other code will produce ANSI-encoded text for the UTF-8 based code, which will also tend to introduce bugs. Thus adding another possible encoding is absolutely not a good idea wrt. bugs.

And you're adding inefficiency for all the myriad internal conversions.

And you're adding either an utterly silly restriction to English A-Z in literals, or, for the let's-lie-to-the-compiler scheme of UTF-8 source without BOM served to the Visual C++ compiler, a requirement that the wide character literals language feature is not used, plus hoping for the best with respect to how smart later versions of the compiler will be.

And most Windows libraries abide by Windows conventions, so it means extra work for supporting most library code. I.e., O(n) work for writing inefficient data-converting wrappers for an unbounded set of functions, instead of just O(1) work for writing efficient pointer-type-converting wrappers for a fixed set of functions.

Think about it. As far as I know there is not *one single technical aspect* that the all-UTF-8 scheme solves. I.e., AFAIK, from a purely technical POV it's dumb.
> See http://programmers.stackexchange.com/questions/102205/should-utf-16-be-consi...
Hm, was that an associative reference? Let me quote from the question:

<quote>
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

 * Opera has a problem with editing them (delete requires 2 presses on backspace).
 * Notepad can't deal with them correctly (delete requires 2 presses on backspace).
 * File name editing in Windows dialogs is broken (delete requires 2 presses on backspace).
 * All QT3 applications can't deal with them - they show two empty squares instead of one symbol.
 * Python encodes such characters incorrectly when used directly: u'X' != unicode('X','utf-16') on some platforms when X is a character outside of the BMP.
 * Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
 * StackOverflow seems to remove these characters from the text if they are edited directly as Unicode characters (these characters are shown using HTML Unicode escapes).
 * WinForms TextBox may generate an invalid string when limited with MaxLength.
</quote>

Here the poster lists concrete examples of how many common applications already have bugs in their Unicode handling, showing by example that Unicode is tricky to get right. Is it then a good idea to needlessly, and at great cost, add further confusion about whether narrow characters are encoded as ANSI or UTF-8?

Cheers & hth.,

- Alf