
On 27.10.2011 21:07, Peter Dimov wrote:
Alf P. Steinbach wrote:
On 27.10.2011 20:01, Peter Dimov wrote: ...
File names on NTFS are not necessarily representable in the ANSI code page. A program that uses narrow strings in the ANSI code page to represent paths will not necessarily be able to open all files on the system.
Right, that's one reason why modern Windows programs should be wchar_t-based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is practical, not political or religious.
Thanks for that clarification of the current thinking at Boost. I suspected that people envisioned those two choices as an exhaustive set of alternatives to choose from, but I wasn't sure.

Anyway, happily, the apparent forced choice between two inefficient ungoods is not necessary -- i.e. it's a false dichotomy. For there are at least THREE options for representing paths and other strings internally in the program, in portable single-source code:

1. wide character based (UTF-16 in Windows, possibly UTF-32 in *nix), as you described above,

2. narrow character based (UTF-8), as you described above, and

3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.

Option 3 means -- it requires, as far as I can see -- some abstraction that hides the narrow/wide representation, so as to get source-code-level portability, which is all that matters for C++. It doesn't need to involve very much: some typedefs, traits, references. Prior art in this direction includes Microsoft's [tchar.h]. For example, write a portable string literal like this:

  PS( "This is a portable string literal" )

As compared to options 1 and 2, the benefits of option 3 include:

* no inefficient conversions except at the external boundary of the program (and then in practice only in Windows, where such conversions already have to be done),

* no problems with software and tools that don't understand a chosen "universal" (option 1 or 2) encoding, and

* no need to duplicate functions to adapt to the underlying OS: one has at hand exactly what the OS API wants.

The main drawback is IMO the need to use something like a PS macro for string and character literals, or a C++11 /user defined literal/. Windows programmers are used to that, writing _T("blah") all the time as if Windows 95 was still extant. So, considering that all that current labor is being done for no reward whatsoever, I think it should be no problem convincing programmers that writing a few characters more in order to get portable string literals is worth it; it just needs exposure to examples from some authoritative source... A sketch of what such a header might look like follows.
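To make option 3 concrete, here is a minimal sketch of such a header, in the spirit of [tchar.h]. The names `pchar`, `pstring` and `PS` are illustrative choices only, not an existing library:

    // portable_string.h -- a minimal sketch of option 3 (names illustrative).
    #include <string>

    #if defined( _WIN32 )
        typedef wchar_t pchar;              // UTF-16 code unit in Windows.
        #define PS_IMPL( lit ) L##lit
    #else
        typedef char pchar;                 // UTF-8 code unit in *nix.
        #define PS_IMPL( lit ) lit
    #endif

    // Extra macro level so that macro arguments expand first, as in [tchar.h].
    #define PS( lit ) PS_IMPL( lit )

    typedef std::basic_string<pchar> pstring;

With this in place, `pstring path = PS( "readme.txt" );` can be handed directly to whatever the OS API wants on each platform, e.g. CreateFileW in Windows versus open() in *nix, with no conversion in between.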
The example that I gave at the top of the thread was passing a `main` argument further on when using Boost.Locale. It causes trouble because in Windows `main` arguments are by convention encoded as ANSI, while Boost.Locale has UTF-8 as its default. Treating ANSI as UTF-8 generally yields gobbledygook, except for the pure ASCII common subset.
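To make the gobbledygook concrete: an ANSI-encoded argument has to be transcoded before anything that assumes UTF-8 sees it. Here is a minimal Windows-only sketch via the standard Win32 conversion functions; the helper name `ansi_to_utf8` is hypothetical, and error handling is mostly omitted for brevity:

    #include <windows.h>    // MultiByteToWideChar, WideCharToMultiByte
    #include <string>

    // Transcodes an ANSI-encoded narrow string, e.g. a `main` argument,
    // to UTF-8, going via UTF-16.
    std::string ansi_to_utf8( std::string const& ansi )
    {
        // ANSI -> UTF-16. Source length -1 includes the terminating zero.
        int const nWide = MultiByteToWideChar(
            CP_ACP, 0, ansi.c_str(), -1, 0, 0 );
        if( nWide == 0 ) { return std::string(); }
        std::wstring wide( nWide, L'\0' );
        MultiByteToWideChar( CP_ACP, 0, ansi.c_str(), -1, &wide[0], nWide );

        // UTF-16 -> UTF-8.
        int const nUtf8 = WideCharToMultiByte(
            CP_UTF8, 0, wide.c_str(), -1, 0, 0, 0, 0 );
        std::string utf8( nUtf8, '\0' );
        WideCharToMultiByte(
            CP_UTF8, 0, wide.c_str(), -1, &utf8[0], nUtf8, 0, 0 );
        utf8.resize( nUtf8 - 1 );       // Drop the stored terminating zero.
        return utf8;
    }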
Yes. If you (generic second person, not you specifically) want to take your paths from the narrow API, a UTF-8 default is not practical. But then again, you shouldn't take your paths from the narrow API, because it can't represent the names of all the files the user may have.
That's an unrelated issue, really, but I think Boost could use a "get undamaged program arguments in portable strings" thing, if it isn't there already?

Cheers & hth.,

- Alf
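P.S.: Roughly what I have in mind, on the Windows side: a minimal sketch using only the documented GetCommandLineW and CommandLineToArgvW functions. The wrapper name `wide_arguments` is just for illustration, not an existing Boost facility:

    #include <windows.h>
    #include <shellapi.h>       // CommandLineToArgvW; link with shell32.
    #include <string>
    #include <vector>

    // The program's arguments as undamaged UTF-16 strings, bypassing the
    // ANSI-encoded narrow `main` arguments entirely.
    std::vector<std::wstring> wide_arguments()
    {
        int count = 0;
        wchar_t** const argv = CommandLineToArgvW( GetCommandLineW(), &count );
        if( argv == 0 ) { return std::vector<std::wstring>(); }

        std::vector<std::wstring> result( argv, argv + count );
        LocalFree( argv );      // The OS allocates the array; the caller frees.
        return result;
    }

In *nix the same facility would presumably just copy the ordinary `argv`, and a portable wrapper could then hand out strings of the `pstring` kind sketched earlier.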