
On 28.10.2011 13:31, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 12:36, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 23:56, Peter Dimov wrote:
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh.
I would be happy to learn this.
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>      // _O_U8TEXT
#include <io.h>         // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
How will you make this program portable?
Well, that was *my* question.
The claim that this minimal "Hello, world!" program puts to the test is that "the rest of the [UTF-8 based] code is happily char[]-based".
Apparently that is not so.
My point is that you cannot talk about things without comparison.
I think that means that I failed to communicate to you what I compared. There was a claim that UTF-8 based code should just work, but the minimal hello-world-like code in my example does /not/ work. Thus, it is a comparison between (1) reality, and (2) the claim, OK?
The out-commented code is from my random efforts to Make It Work(TM).
It refused.
This is because windows narrow-chars can't be UTF-8. You could make it portable by:
int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
Thanks, TIL boost::printf.
The idea of UTF-8 as a universal encoding seems now to be to use some workaround such as boost::printf for each and every case where it turns out that it doesn't work portably.
You pull things out of context. We should COMPARE the UTF-8 approach to the wide-char-on-Windows / narrow-char-on-non-Windows approach. Your approach involves using your own printf just as well:
<code>
#include "u/stdio_h.h"  // u::CodingValue, u::printf, U

printf( U("Blåbærsyltetøy! 日本国 кошка!\n") );     // ADL?
u::printf( U("Blåbærsyltetøy! 日本国 кошка!\n") );  // or not ADL? depends on what exactly U is.
</code>
The relevant difference is in my opinion between

* re-implementing e.g. the standard library to support UTF-8 (like boost::printf, and although I haven't tested the claim that it works for the program we discussed, it is enough for me that it /could/ work), or

* wrapping it with some constant-time data conversions (e.g. u::printf).

The hello world program demonstrated that one or the other is necessary. So, we can forget the earlier silly claim that UTF-8 just magically works, and now really compare, for a simplest relevant program.

And yes, with the functionality that I sketched and coded up a demo of, you get strong type checking and argument dependent lookup. It is however possible to design this in e.g. C level ways where it would be much less convenient. I think the opinions in the community may have been influenced by one particularly bad such design, the [tchar.h]... ;-)

For a UTF-16 platform a printf wrapper can simply be like this:

<code>
inline int printf( CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    return ::vwprintf( format->rawPtr(), args );
}
</code>

The sprintf wrapper that I used in my example is more interesting, though:

<code>
inline int sprintf( CodingValue* buffer, size_t count, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    return ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
}

inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    return ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
}
</code>

The problem that the above solves is that standard vswprintf is not a simple wchar_t version of standard vsprintf. As I recall Microsoft's [tchar.h] relies on a compiler-specific overload, but that approach does not cut it for platform independent code. For wchar_t/char independent code, one solution (as above) is to offer both signatures.

Note that these wrappers do not (and do not have to) do data conversion.
Whereas re-implementations for the UTF-8 scheme have to convert data.
but anyway you have to do O(N) work to wrap the N library functions you use.
Not quite. It is so for the UTF-8 scheme for platform independent things such as standard library i/o, and it is so also for the native string scheme for platform independent things such as standard library i/o.

But when you're talking about the OS API, then with the UTF-8 scheme you need inefficient string data conversions and N wrappers, while with the native string scheme no string data conversions and no wrappers are needed. Only simple "get raw pointer" calls are needed, as illustrated in my example. Those calls could even be made implicit, but I think it's best to have them explicit in order to avoid unexpected effects.

This difference in conversion & wrapping effort was the reason that I used both the standard library and the OS API in my original example. The standard library call used a thin wrapper, as shown above, while the OS API function (MessageBoxW) could be and was called directly.
Your approach is no way better.
I hope to convince you that the native string approach is objectively better for portable code, for any reasonable criteria, e.g.:

* Native encoded strings avoid the inefficient string data conversions of the UTF-8 scheme for OS API calls and for calls to functions that follow OS conventions.

* Native encoded strings avoid many bug traps such as passing a UTF-8 string to a function expecting ANSI, or vice versa.

* Native encoded strings work seamlessly with the largest amount of code (Windows code and *nix code), while the UTF-8 approach only works seamlessly with *nix-oriented code.

Conversely, points such as those above mean that the UTF-8 approach is objectively much worse for portable code. In particular, the UTF-8 approach violates the principle of not paying for what you don't (need to or want to) use, by adding inefficient conversions in all directions; it violates the principle of least surprise (where did that gobbledygook come from?); and it violates the KISS principle ("Keep It Simple, Stupid!"), forcing Windows programmers to deal with 3 internal string encodings instead of just 2.
You judge from a non-portable code point-of-view. How about:
<code>
#include <cstdio>
#include "gtkext/message_box.h"  // for gtkext::message_box

int main()
{
    char buffer[80];
    sprintf( buffer, "The answer is %d!", 6*7 );
    gtkext::message_box( buffer, "This is a title!", gtkext::icon_blah_blah, ... );
}
</code>
And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on windows)
Aha. When you use a library L that translates in platform-specific ways to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
However, try to pass a `main` argument over to gtkext::message_box.
See the argv explanation in http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036
I'm sorry, I don't see what's relevant there. You suggest there that boost::program_options can be used if it is fixed to support UTF-8; quote: "she can use boost::program_options (assuming it's also changed to follow the UTF-8 convention)".

I think that suggestion is probably misguided. For as far as I can see, boost::program_options does not provide any way to obtain the undamaged command line in Windows (and anyway that command line is UTF-16 encoded). Without a portable way to obtain undamaged program arguments, portable support for parsing them with this encoding or that encoding seems to me to be irrelevant.

Anyway, where does this introduction of special cases end? At every point where UTF-8 does not work, the suggested solution is to add an inefficient data conversion and support that on all platforms.

Cheers & hth.,

- Alf