
On 27.10.2011 21:07, Peter Dimov wrote:
Alf P. Steinbach wrote:
On 27.10.2011 20:01, Peter Dimov wrote: ...
File names on NTFS are not necessarily representable in the ANSI code page. A program that uses narrow strings in the ANSI code page to represent paths will not necessarily be able to open all files on the system.
Right, that's one reason why modern Windows programs should be wchar_t-based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is practical, not political or religious.
Thanks for that clarification of the current thinking at Boost. I suspected that people envisioned those two choices as an exhaustive set of alternatives to choose from, but I wasn't sure.

Anyway, happily, the apparent forced choice between two inefficient ungoods is not necessary -- i.e. it's a false dichotomy. For there are at least THREE options for representing paths and other strings internally in the program, in portable single-source code:

1. wide character based (UTF-16 in Windows, possibly UTF-32 in *nix), as you described above,

2. narrow character based (UTF-8), as you described above, and

3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.

Option 3 means -- it requires, as far as I can see -- some abstraction that hides the narrow/wide representation, so as to get source-code-level portability, which is all that matters for C++. It doesn't need to involve very much: some typedefs, traits, references. Prior art in this direction includes Microsoft's [tchar.h]. For example, write a portable string literal like this:

  PS( "This is a portable string literal" )

As compared to options 1 and 2, the benefits of option 3 include:

* no inefficient conversions except at the external boundary of the program (and then in practice only in Windows, where such conversions already have to be done),

* no problems with software and tools that don't understand a chosen "universal" (option 1 or 2) encoding, and

* no need to duplicate functions to adapt to the underlying OS: one has at hand exactly what the OS API wants.

The main drawback is IMO the need to use something like a PS macro for string and character literals, or a C++11 /user defined literal/. Windows programmers are used to that, writing _T("blah") all the time as if Windows 95 was still extant. So, considering that all that current labor is being done for no reward whatsoever, I think it should be no problem convincing programmers that writing a few characters more in order to get portable string literals is worth it; it just needs exposure to examples from some authoritative source... A sketch of what such a header might look like follows.
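To make option 3 concrete, here is a minimal sketch of such a header, in the spirit of [tchar.h]. The names `pchar`, `pstring` and `PS` are illustrative choices only, not an existing library:

    // portable_string.h -- a minimal sketch of option 3 (names illustrative).
    #include <string>

    #if defined( _WIN32 )
        typedef wchar_t pchar;              // UTF-16 code unit in Windows.
        #define PS_IMPL( lit ) L##lit
    #else
        typedef char pchar;                 // UTF-8 code unit in *nix.
        #define PS_IMPL( lit ) lit
    #endif

    // Extra macro level so that macro arguments expand first, as in [tchar.h].
    #define PS( lit ) PS_IMPL( lit )

    typedef std::basic_string<pchar> pstring;

With this in place, `pstring path = PS( "readme.txt" );` can be handed directly to whatever the OS API wants on each platform, e.g. CreateFileW in Windows versus open() in *nix, with no conversion in between.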
The example that I gave at the top of the thread was passing a `main` argument further on when using Boost.Locale. It causes trouble because in Windows `main` arguments are by convention encoded as ANSI, while Boost.Locale has UTF-8 as its default. Treating ANSI as UTF-8 generally yields gobbledygook, except for the pure ASCII common subset.
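To make the gobbledygook concrete: an ANSI-encoded argument has to be transcoded before anything that assumes UTF-8 sees it. Here is a minimal Windows-only sketch via the standard Win32 conversion functions; the helper name `ansi_to_utf8` is hypothetical, and error handling is mostly omitted for brevity:

    #include <windows.h>    // MultiByteToWideChar, WideCharToMultiByte
    #include <string>

    // Transcodes an ANSI-encoded narrow string, e.g. a `main` argument,
    // to UTF-8, going via UTF-16.
    std::string ansi_to_utf8( std::string const& ansi )
    {
        // ANSI -> UTF-16. Source length -1 includes the terminating zero.
        int const nWide = MultiByteToWideChar(
            CP_ACP, 0, ansi.c_str(), -1, 0, 0 );
        if( nWide == 0 ) { return std::string(); }
        std::wstring wide( nWide, L'\0' );
        MultiByteToWideChar( CP_ACP, 0, ansi.c_str(), -1, &wide[0], nWide );

        // UTF-16 -> UTF-8.
        int const nUtf8 = WideCharToMultiByte(
            CP_UTF8, 0, wide.c_str(), -1, 0, 0, 0, 0 );
        std::string utf8( nUtf8, '\0' );
        WideCharToMultiByte(
            CP_UTF8, 0, wide.c_str(), -1, &utf8[0], nUtf8, 0, 0 );
        utf8.resize( nUtf8 - 1 );       // Drop the stored terminating zero.
        return utf8;
    }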
Yes. If you (generic second person, not you specifically) want to take your paths from the narrow API, a UTF-8 default is not practical. But then again, you shouldn't take your paths from the narrow API, because it can't represent the names of all the files the user may have.
That's an unrelated issue, really, but I think Boost could use a "get undamaged program arguments in portable strings" thing, if it isn't there already?

Cheers & hth.,

- Alf
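P.S.: Roughly what I have in mind, on the Windows side: a minimal sketch using only the documented GetCommandLineW and CommandLineToArgvW functions. The wrapper name `wide_arguments` is just for illustration, not an existing Boost facility:

    #include <windows.h>
    #include <shellapi.h>       // CommandLineToArgvW; link with shell32.
    #include <string>
    #include <vector>

    // The program's arguments as undamaged UTF-16 strings, bypassing the
    // ANSI-encoded narrow `main` arguments entirely.
    std::vector<std::wstring> wide_arguments()
    {
        int count = 0;
        wchar_t** const argv = CommandLineToArgvW( GetCommandLineW(), &count );
        if( argv == 0 ) { return std::vector<std::wstring>(); }

        std::vector<std::wstring> result( argv, argv + count );
        LocalFree( argv );      // The OS allocates the array; the caller frees.
        return result;
    }

In *nix the same facility would presumably just copy the ordinary `argv`, and a portable wrapper could then hand out strings of the `pstring` kind sketched earlier.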