
On 28.10.2011 13:31, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 12:36, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 23:56, Peter Dimov wrote:
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh.
I would be happy to learn this.
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>      // _O_U8TEXT
#include <io.h>         // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
How will you make this program portable?
Well, that was *my* question.
The claim that this minimal "Hello, world!" program puts to the test is that "the rest of the [UTF-8 based] code is happily char[]-based".
Apparently that is not so.
My point is that you cannot talk about things without comparison.
I think that means that I failed to communicate to you what I compared. There was a claim that UTF-8 based code should just work, but the minimal hello-world-like code in my example does /not/ work. Thus, it is a comparison between (1) reality, and (2) the claim, OK?
The out-commented code is from my random efforts to Make It Work(TM).
It refused.
This is because windows narrow-chars can't be UTF-8. You could make it portable by:
int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
Thanks, TIL boost::printf.
The idea of UTF-8 as a universal encoding seems now to be to use some workaround such as boost::printf for each and every case where it turns out that it doesn't work portably.
You pull things out of context. We should COMPARE the UTF-8 approach to the wide-char-on-Windows / narrow-char-on-non-Windows approach. Your approach involves using your own printf just as well:
<code>
#include "u/stdio_h.h"  // u::CodingValue, u::printf, U

printf( U("Blåbærsyltetøy! 日本国 кошка!\n") );     // ADL?
u::printf( U("Blåbærsyltetøy! 日本国 кошка!\n") );  // or not ADL? depends on what exactly U is.
</code>
The relevant difference is in my opinion between

* re-implementing e.g. the standard library to support UTF-8 (like boost::printf, and although I haven't tested the claim that it works for the program we discussed, it is enough for me that it /could/ work), or

* wrapping it with some constant-time data conversions (e.g. u::printf).

The hello world program demonstrated that one or the other is necessary. So, we can forget the earlier silly claim that UTF-8 just magically works, and now really compare, for a simplest relevant program.

And yes, with the functionality that I sketched and coded up a demo of, you get strong type checking and argument dependent lookup. It is however possible to design this in e.g. C level ways where it would be much less convenient. I think the opinions in the community may have been influenced by one particularly bad such design, the [tchar.h]... ;-)

For a UTF-16 platform a printf wrapper can simply be like this:

<code>
inline int printf( CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    return ::vwprintf( format->rawPtr(), args );
}
</code>

The sprintf wrapper that I used in my example is more interesting, though:

<code>
inline int sprintf( CodingValue* buffer, size_t count, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    return ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
}

inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    return ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
}
</code>

The problem that the above solves is that standard vswprintf is not a simple wchar_t version of standard vsprintf. As I recall Microsoft's [tchar.h] relies on a compiler-specific overload, but that approach does not cut it for platform independent code. For wchar_t/char independent code, one solution (as above) is to offer both signatures.

Note that these wrappers do not (and do not have to) do data conversion.
Whereas re-implementations for the UTF-8 scheme have to convert data.
but anyway you have to do O(N) work to wrap the N library functions you use.
Not quite. It is so for the UTF-8 scheme for platform independent things such as standard library i/o, and it is so also for the native string scheme for platform independent things such as standard library i/o.

But when you're talking about the OS API, then with the UTF-8 scheme you need inefficient string data conversions and N wrappers, while with the native string scheme no string data conversions and no wrappers are needed. Only simple "get raw pointer" calls are needed, as illustrated in my example. Those calls could even be made implicit, but I think it's best to have them explicit in order to avoid unexpected effects.

This difference in conversion & wrapping effort was the reason that I used both the standard library and the OS API in my original example. The standard library call used a thin wrapper, as shown above, while the OS API function (MessageBoxW) could be and was called directly.
Your approach is no way better.
I hope to convince you that the native string approach is objectively better for portable code, for any reasonable criteria, e.g.:

* Native encoded strings avoid the inefficient string data conversions of the UTF-8 scheme for OS API calls and for calls to functions that follow OS conventions.

* Native encoded strings avoid many bug traps such as passing a UTF-8 string to a function expecting ANSI, or vice versa.

* Native encoded strings work seamlessly with the largest amount of code (Windows code and *nix code), while the UTF-8 approach only works seamlessly with *nix-oriented code.

Conversely, points such as those above mean that the UTF-8 approach is objectively much worse for portable code. In particular, the UTF-8 approach violates the principle of not paying for what you don't (need to or want to) use, by adding inefficient conversions in all directions; it violates the principle of least surprise (where did that gobbledygook come from?); and it violates the KISS principle ("Keep It Simple, Stupid!"), forcing Windows programmers to deal with 3 internal string encodings instead of just 2.
You judge from a non-portable code point-of-view. How about:
<code>
#include <cstdio>
#include "gtkext/message_box.h"  // for gtkext::message_box

int main()
{
    char buffer[80];
    sprintf( buffer, "The answer is %d!", 6*7 );
    gtkext::message_box( buffer, "This is a title!", gtkext::icon_blah_blah, ... );
}
</code>
And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on windows)
Aha. When you use a library L that translates in platform-specific ways to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
However, try to pass a `main` argument over to gtkext::message_box.
See the argv explanation in http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036
I'm sorry, I don't see what's relevant there. You suggest there that boost::program_options can be used if it is fixed to support UTF-8; quote: "she can use boost::program_options (assuming it's also changed to follow the UTF-8 convention)".

I think that suggestion is probably misguided. For as far as I can see, boost::program_options does not provide any way to obtain the undamaged command line in Windows (and anyway that command line is UTF-16 encoded). Without a portable way to obtain undamaged program arguments, portable support for parsing them with this encoding or that encoding seems to me to be irrelevant.

Anyway, where does this introduction of special cases end? At every point where UTF-8 does not work, the suggested solution is to add an inefficient data conversion and support that on all platforms.

Cheers & hth.,

- Alf