
On 31.10.2011 18:18, bugpower wrote:
> Alf, All,
>
> What the replies seem to be missing here is that what you call the "least surprise" behavior of the code with argument of main() is simply incorrect from the software engineering point of view. Let me explain:
>
> 3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.
>
> Now, when accepting a filename from the user's command line on Windows, it is simply not possible to use the narrow-string version of main().
Well, there are three aspects of that claim:

  1. The limitations of `main` in Windows.
  2. What you're trying to communicate.
  3. The literal claim that "it is simply not possible to use narrow-string version of main()".

Regarding aspect (1), the C++ standard does describe the `main` arguments as "MBCS" strings, meaning that they can (and, as I read it, should) be encoded with possibly two or more bytes per character, as in for example UTF-8. However, the Windows convention for the C++ executable narrow character set predates the C++ standard by a long shot, and even predates the C standard, and is Windows ANSI. And that convention is /very/ deeply embedded, not only in the runtime library implementations but e.g. in how Visual C++ translates string literals.

Regarding aspect (2), more quoted concrete context could help make it clear to readers what you're trying to say. I'm not a telepath. But it does sound like you're arguing against a straw man of your own devising, as if someone had argued for using ANSI-encoded arguments in general, or as some solution to i18n. So, I will put my initial remark above more strongly: please always /quote/ what you're referring to. Especially when you are offering something that sounds like an argument against something, please /quote/ what you're referring to.

Regarding aspect (3), this claim is incorrect. Many people do think that one has to use a non-standard startup function like WinMain, that one has to ditch some parts of the C++ standard as soon as one does Windows. So, a basic technical fact: the GNU toolchain (the g++ compiler) happily accepts a standard `main` startup function without further ado, regardless of Windows subsystem. The Microsoft toolchain (the Visual C++ compiler and linker), however, is less adept at recognizing your startup function as such. So with the MS toolchain you have to specify the startup function explicitly if you're building a GUI subsystem program and want a standard `main`.
The relevant linker options: "/entry:mainCRTStartup /subsystem:windows".

---

Finally, note how I had to cover a lot of bases and use a lot of time responding to your single little sentence. That's because that sentence was very *unclear* and *misleading*, and, given the next sentence, quoted below, I hope it was not so by design.
> Your code cannot enforce your user to limit his input to characters representable in the current ANSI codepage.
Ignoring the misleading "your", and responding to the technical content only: the previous sentence talked about `main` arguments, and this following sentence talks about "input", so it seems that you are confusing two different things that have very different behaviors in Windows.

In Windows the program arguments are always passed to the process as a single UTF-16 encoded command line string, available via the API function GetCommandLine. For the command line it is therefore meaningless to talk about restricting the user.

Standard /input/, OTOH, is always passed via some narrow character encoding, which rules out UTF-16, and which by convention is neither ANSI nor UTF-16 but the extraordinarily impractical OEM codepage (on an English PC that codepage is the original IBM PC character set). Happily it is possible to change the narrow character encoding used for input, for example to UTF-8 as the external encoding. This is called the "active codepage" of a console window, and it can also be changed by the user, e.g. via the commands 'mode' and 'chcp'.

The dangers of selecting UTF-8 as the active codepage in a command interpreter console window have been discussed else-thread; in short, Microsoft has a large number of ridiculous bugs in their support. But that discussion also showed that it's (very probably) OK under program control.
> If the command line parameter is a filename as in the example you suggested, you cannot tell them "never double click on some files" (if a program is used in a file association).
What example? Please always quote what you refer to, and quote enough of the context: don't be ambiguous, and don't leave it to readers to infer a context based on your possibly wrong understanding of it. Anyway, text passed as `main` arguments can be e.g. the user's name, which is not necessarily a filename.
> Supporting is always better than white-listing, so the only acceptable way of using a command line parameter which is a filename on Windows is with the UTF-16 version - _tmain().
Oh dear.

`_tmain` is a Microsoft macro that helped support compatibility with Windows 9x before the Layer for Unicode was introduced in 2001. It maps to the narrow character standard `main` or to the wide character non-standard `wmain`, depending on the `_UNICODE` macro symbol.

Are you seriously suggesting using `_tmain` to stay compatible with Windows 9x? Note that for Windows 9x, `_tmain` maps to standard `main`. And note that in Windows, `main` has Windows ANSI-encoded arguments... So `_tmain` buys you nothing here. It was an abomination even in its day, and today there are no reasons whatsoever to obfuscate the code that way.
> Then, proceed as Artyom explained. The surprise is then justified - it prevented a hard-to-spot bug. My preference on Windows though would be different (and not due to religious reasons) - convert all API-returned strings to UTF-8 as soon as possible and forget about encoding issues for good.
No, it does not let you forget about encoding issues. Rather, it introduces extra bug attractors, since you have then overloaded the meaning of char-based text. By convention in Windows and in most existing Windows code, char is ANSI-encoded, so other code will expect ANSI encoding from the UTF-8 based code, which will tend to introduce bugs. And other code will produce ANSI-encoded text for the UTF-8 based code, which will also tend to introduce bugs. Thus adding another possible encoding is absolutely not a good idea wrt. bugs.

And you're adding inefficiency for all the myriad internal conversions.

And you're adding either an utterly silly restriction to English A-Z in literals, or, for the let's-lie-to-the-compiler scheme of UTF-8 source without BOM served to the Visual C++ compiler, a requirement that the wide character literals language feature is not used, plus hoping for the best with respect to how smart later versions of the compiler will be.

And most Windows libraries abide by Windows conventions, so it means extra work for supporting most library code. I.e., O(n) work for writing inefficient data-converting wrappers for an unbounded set of functions, instead of just O(1) work for writing efficient pointer-type-converting wrappers for a fixed set of functions.

Think about it. As far as I know there is not *one single technical aspect* that the all-UTF-8 scheme solves. I.e., AFAIK, from a purely technical POV it's dumb.
> See http://programmers.stackexchange.com/questions/102205/should-utf-16-be-consi...
Hm, was that an associative reference? Let me quote from the question:

<quote>
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

 * Opera has a problem with editing them (delete requires 2 presses on backspace).
 * Notepad can't deal with them correctly (delete requires 2 presses on backspace).
 * File name editing in Windows dialogs is broken (delete requires 2 presses on backspace).
 * All QT3 applications can't deal with them - they show two empty squares instead of one symbol.
 * Python encodes such characters incorrectly when used directly: u'X' != unicode('X','utf-16') on some platforms when X is a character outside of the BMP.
 * Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
 * StackOverflow seems to remove these characters from the text if they are edited directly as Unicode characters (these characters are shown using HTML Unicode escapes).
 * WinForms TextBox may generate an invalid string when limited with MaxLength.
</quote>

Here the poster lists concrete examples of how many common applications already have bugs in their Unicode handling, showing by example that Unicode is tricky to get right. Is it then a good idea to needlessly, and at great cost, add further confusion about whether narrow characters are encoded as ANSI or UTF-8?

Cheers & hth.,

- Alf