
On Fri, Oct 28, 2011 at 15:34, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
> [...] There was a claim that the UTF-8 based code should just work,
I can't recall anyone saying this. What people were saying is that it's the most sane way to write portable code. And if the vendors hadn't been resisting UTF-8 adoption, it would just work.
> [...]
> * re-implementing e.g. the standard library to support UTF-8 (like boost::printf, and although I haven't tested the claim that it works for the program we discussed, it is enough for me that it /could/ work), or
> * wrapping it with some constant time data conversions (e.g. u::printf).
>
> The hello world program demonstrated that one or the other is necessary.
My last mail demonstrated that we need neither of them on Windows: narrow printf just works.
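For reference, a minimal sketch of the kind of program I mean, assuming the console code page is switched to UTF-8 and the console font can render the characters (the Cyrillic sample string is arbitrary):

    #include <windows.h>   // SetConsoleOutputCP, CP_UTF8
    #include <cstdio>      // std::printf

    int main()
    {
        SetConsoleOutputCP( CP_UTF8 );                        // tell the console to expect UTF-8 bytes
        std::printf( "Hello, \xD0\xBC\xD0\xB8\xD1\x80!\n" );  // "Hello, мир!" as raw UTF-8 bytes
    }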
> So, we can forget the earlier silly claim that UTF-8 just magically works, and now really compare, for a simplest relevant program.
Now we can recall that claim and keep applying it against your own silly claim that wrapping everything is easier. [...]
> For a UTF-16 platform a printf wrapper can simply be like this:
> inline int printf( CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result = ::vwprintf( format->rawPtr(), args );
>     va_end( args );   // every va_start needs a matching va_end
>     return result;
> }
Apparently we don't need it. In the Linux world, asking the user to use UTF-8 is legitimate: it's already the default almost everywhere. On some non-Linux systems UTF-8 is the default too (Mac OS X?). On Windows we can use narrow printf just fine.
> The sprintf wrapper that I used in my example is more interesting, though:
> inline int sprintf( CodingValue* buffer, size_t count, CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result = ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
>     va_end( args );
>     return result;
> }
>
> inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result = ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
>     va_end( args );
>     return result;
> }
Oh, thank you! So you suggest wrapping each function that comes in two kinds... You don't need to either wrap or re-implement sprintf for the UTF-8 approach. The whole point of UTF-8 is that it already works with most of the existing narrow library functions (strlen, strstr, str*, std::string, etc.). It's simpler, eh?

> The problem that the above solves is that standard vswprintf is not a simple wchar_t version of standard vsprintf. As I recall, Microsoft's [tchar.h] relies on a compiler-specific overload, but that approach does not cut it for platform independent code. For wchar_t/char independent code, one solution (as above) is to offer both signatures.
No such problems in the UTF-8 world.
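A minimal sketch of that point (the UTF-8 payload is arbitrary); the bytes flow through the existing narrow functions untouched, with no wrapper and no re-implementation:

    #include <cstring>   // std::strstr
    #include <string>

    int main()
    {
        std::string s = "Hello, \xD0\xBC\xD0\xB8\xD1\x80!";  // UTF-8 payload in a plain std::string
        s += " -- plus some ASCII.";                         // concatenation just works
        // Substring search works too: UTF-8 is self-synchronizing, so a
        // multi-byte sequence can never falsely match inside another one.
        return std::strstr( s.c_str(), "\xD0\xBC\xD0\xB8\xD1\x80" ) ? 0 : 1;
    }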
>> but anyway you have to do O(N) work to wrap the N library functions you use.
>
> Not quite. It is so for the UTF-8 scheme for platform independent things such as standard library i/o, and it is so also for the native string scheme for platform independent things such as standard library i/o.
As we see, it's the other way around...
> But when you're talking about the OS API, then with the UTF-8 scheme you need inefficient string data conversions
It's quite efficient; in fact, it has never been a bottleneck. Invoking the OS usually triggers complex operations anyway. Moreover, even in the non-English-speaking world, most of the text internal to a program is still ASCII, and UTF-8 saves space and cache usage, which compensates for the conversion penalty. To make definite statements you must measure; otherwise it's premature optimization, if it's an optimization at all. Also note that in a multi-threaded world with hierarchical memory, computation is becoming faster than memory access.
> and N wrappers, while with the native string scheme no string data conversions and no wrappers are needed.
The difference is what you wrap: the standard interface or the proprietary OS interface. We benefit more from wrapping the latter, as has been done hundreds of times in every portable library that tries to accomplish anything beyond primitive file I/O. This is because you get a portable library as a side product.
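A sketch of what such a wrapper could look like (widen and remove_file are names made up for the example); the UTF-8/UTF-16 conversion happens exactly once, at the OS boundary, and what you get back is a portable function:

    #include <windows.h>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: UTF-8 -> UTF-16, used only at the boundary.
    inline std::wstring widen( std::string const& utf8 )
    {
        if ( utf8.empty() ) return std::wstring();
        int const n = MultiByteToWideChar( CP_UTF8, 0, utf8.data(), int( utf8.size() ), 0, 0 );
        if ( n == 0 ) throw std::runtime_error( "invalid UTF-8" );
        std::wstring utf16( n, L'\0' );
        MultiByteToWideChar( CP_UTF8, 0, utf8.data(), int( utf8.size() ), &utf16[0], n );
        return utf16;
    }

    // The portable wrapper: callers pass UTF-8 and never see UTF-16 at all.
    inline bool remove_file( std::string const& utf8_path )
    {
        return DeleteFileW( widen( utf8_path ).c_str() ) != 0;
    }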
>> Your approach is no way better.
> I hope to convince you that the native string approach is objectively better for portable code, for any reasonable criteria, e.g.:
>
> * Native encoded strings avoid the inefficient string data conversions of the UTF-8 scheme for OS API calls and for calls to functions that follow OS conventions.
Stop calling it inefficient. If you store portable data on some storage, or receive it through the network, as any serious application does today, you can't avoid conversions; you just have to decide where you do them, closer to the OS or further from it. Anyway, see above.

> * Native encoded strings avoid many bug traps such as passing a UTF-8 string to a function expecting ANSI, or vice versa.
Yeah, and "multiple inheritance causes multiple abuse of multiple inheritance"[1] microsoft said? UTF-8 avoids many bug traps such as forgetting that UTF-16 is actually a variable length encoding. EVERYBODY knows that UTF-8 has vaaariiable-lng codepoints. * Native encoded strings work seamlessly with the largest amount of code
> * Native encoded strings work seamlessly with the largest amount of code (Windows code and nix code), while the UTF-8 approach only works seamlessly with nix-oriented code.
Hmmm... I prefer the latter, just to avoid all the boilerplate wrappers for what has been standard for years. And I'm a Windows programmer. Besides, how will you return Unicode from std::exception::what() if not as UTF-8?
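A sketch of what I mean (file_error is a made-up name): what() is pinned to char const*, so UTF-8 is the only way to get Unicode through the standard interface unchanged:

    #include <cstdio>
    #include <stdexcept>
    #include <string>

    struct file_error : std::runtime_error
    {
        explicit file_error( std::string const& utf8_path )
            : std::runtime_error( "cannot open: " + utf8_path )  // UTF-8 stored as-is
        {}
    };

    int main()
    {
        try { throw file_error( "/tmp/\xD0\xBC\xD0\xB8\xD1\x80.txt" ); }
        catch ( std::exception const& e )
        {
            std::puts( e.what() );  // char const* -- the UTF-8 path round-trips intact
        }
    }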
> Conversely, points such as those above mean that the UTF-8 approach is objectively much worse for portable code.
Since I'm tired of repeating the same things again and again, see "Using the native encoding" in http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

> In particular, the UTF-8 approach violates the principle of not paying for what you don't (need to or want to) use
UTF-16 violates the principle of "you don't pay for what you don't use": if most of your text is ASCII (which is true for internal text even in non-English-speaking countries), you don't want to waste twice as much memory.
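The arithmetic is trivial; a sketch assuming a 16-bit wchar_t (the sample string is arbitrary):

    #include <cstdio>
    #include <cstring>

    int main()
    {
        char const    utf8[]  =  "version=1.2; user=john";   // typical internal, ASCII-only text
        wchar_t const utf16[] = L"version=1.2; user=john";
        std::printf( "UTF-8: %u bytes, UTF-16: %u bytes\n",  // prints "UTF-8: 22 bytes, UTF-16: 44 bytes"
                     unsigned( std::strlen( utf8 ) ),
                     unsigned( sizeof utf16 - sizeof( wchar_t ) ) );
    }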
> , by adding inefficient conversions in all directions;
Again? seekg(0) and read(). You'll have to do conversions anyway, e.g. when you read from a file. You don't store the native encoding in a portable file, do you?
> [...] and it violates the KISS principle ("Keep It Simple, Stupid!"), forcing Windows programmers to deal with 3 internal string encodings instead of just 2.
If you're working with 2 encodings, you're doing something terribly wrong. Seriously, it looks like you're still living in the 20th century. You shall not use ANSI encodings (other than UTF-8) on Windows, because they don't work with Unicode; they are mostly deprecated. Microsoft encourages you to use either UTF-8 or UTF-16 (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%2...).

Now, assuming you have stopped using legacy 'ANSI' encodings, you are left with only UTF-16 (internal) and UTF-8 (external). Replace internal UTF-16 with UTF-8, and you are left with only ONE encoding used for EVERYTHING, internal and external. UTF-16 at OS calls doesn't count, as it's not stored anywhere (you're not 'dealing' with it).

[1] From some C# book by Microsoft that I glanced at a few years ago.

--
Yakov