Silly Boost.Locale default narrow string encoding in Windows

When I engage the compiler-in-my-mind on the example given at http://cppcms.sourceforge.net/boost_locale/html/ namely

<code>
#include <boost/locale.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/filesystem/fstream.hpp>

int main()
{
    // Create and install global locale
    std::locale::global( boost::locale::generator().generate( "" ) );
    // Make boost.filesystem use it
    boost::filesystem::path::imbue( std::locale() );
    // Now works perfectly fine with UTF-8!
    boost::filesystem::ofstream hello( "שלום.txt" );
}
</code>

then it fails to work when the literal string is replaced with a `main` argument. A conversion is then necessary and must be added. It breaks the principle of least surprise. It breaks the principle of not paying for what you don't (want to) use. I understand, from discussions elsewhere, that the author(s) have chosen a narrow string encoding that requires inefficient & awkward conversions in all directions, for political/religious reasons. Maybe my understanding of that is faulty, and it's no longer politics & religion but outright war (and maybe that war is even over, with even Luke Skywalker dead or mortally wounded). However, I still ask: why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do it right, using the platforms' native encodings?

Cheers,

- Alf
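For concreteness, the conversion Alf says must be added could look something like this -- a minimal sketch, not code from the thread, assuming the program's ANSI code page is windows-1252 (boost::locale::conv::between does the re-encoding):

<code>
#include <boost/locale.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/filesystem/fstream.hpp>
#include <string>

int main( int argc, char** argv )
{
    std::locale::global( boost::locale::generator().generate( "" ) );
    boost::filesystem::path::imbue( std::locale() );
    if( argc < 2 ) { return 1; }
    // argv[1] arrives in the ANSI code page (assumed here to be 1252),
    // but the imbued locale expects UTF-8, so it must be re-encoded:
    std::string const utf8_name =
        boost::locale::conv::between( argv[1], "UTF-8", "windows-1252" );
    boost::filesystem::ofstream hello( utf8_name );
}
</code>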

Alf P. Steinbach wrote:
However, I still ask:
why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do it right, using the platforms' native encodings.
Comment out the imbue line. (The platform's native encoding is UTF-16. The "ANSI" code page, which is not necessarily ANSI or ANSI-like at all, despite your assertion, is not "native"; the OS just converts from/to it as needed. Your program will work fine until it's given a file name that is not representable in the ANSI CP.)

On 27.10.2011 18:47, Peter Dimov wrote:
Alf P. Steinbach wrote:
However, I still ask:
why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do it right, using the platforms' native encodings.
Comment out the imbue line.
But that line is much of the point, isn't it?
(The platform's native encoding is UTF-16. The "ANSI" code page, which is not necessarily ANSI or ANSI-like at all, despite your assertion,
The article you responded to did not contain the word "ANSI". Thus, when you refer to an assertion about "ANSI", you have fantasized something. I hope you are not going to go on like that.
[ANSI] is not "native"; the OS just converts from/to it as needed.
OK, you need to learn quite a bit, but (1) you appear to be very sure that you're already knowledgeable, and (2) you attribute things to me that you have just fantasized. That makes it very difficult to teach you. For narrow character strings in Windows, "native" and "ANSI" are interchangeable terms. They mean the same, namely the codepage identified by the GetACP() function. This is not a particular codepage; it is configurable. On my machine, and most probably on yours, it is codepage 1252, Windows ANSI Western. "Native" means the encoding used and expected by the OS' API functions. For narrow character strings in Windows, that's Windows ANSI.
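As a concrete illustration of the point about GetACP() -- a minimal sketch, not code from the thread:

<code>
// The "ANSI" code page is whatever GetACP() reports, e.g. 1252 on a
// Western-European configuration; it is configurable, not fixed.
#include <windows.h>
#include <stdio.h>

int main()
{
    printf( "ANSI code page: %u\n", GetACP() );
}
</code>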
Your program
No, again you're wrong: it's the Boost.Locale documentation's program.
will work fine until it's given a file name that is not representable in the ANSI CP.)
Nope, sorry, for any /reasonable interpretation/ of what you're writing. I can imagine that maybe you're thinking about setting ANSI CP to 65001, which however is not reasonable. Cheers & hth., - Alf

Alf P. Steinbach wrote:
On 27.10.2011 18:47, Peter Dimov wrote:
Alf P. Steinbach wrote:
However, I still ask:
why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do it right, using the platforms' native encodings.
Comment out the imbue line.
But that line is much of the point, isn't it?
There wouldn't be much point in calling imbue if you didn't want a change in the boost::filesystem default behavior, which is to convert using the ANSI CP (or the OEM CP if AreFileApisANSI() returns false, if I'm not mistaken).
(The platform's native encoding is UTF-16. The "ANSI" code page, which is not necessarily ANSI or ANSI-like at all, despite your assertion,
The article you responded to did not contain the word "ANSI".
Thus, when you refer to an assertion about "ANSI", you have fantasized something.
http://boost.2283326.n4.nabble.com/Making-Boost-Filesystem-work-with-GENERAL...
I hope you are not going to go on like that.
[ANSI] is not "native"; the OS just converts from/to it as needed.
OK, you need to learn quite a bit, but
(1) you appear to be very sure that you're already knowledgeable, and
(2) you attribute things to me that you have just fantasized.
That makes it very difficult to teach you.
For narrow character strings in Windows, "native" and "ANSI" are interchangeable terms.
I will accept your definition for the time being and restate what I just said without using "native": Under Windows (NT+ and NTFS), the narrow character API is a wrapper over the wide character API. The system converts from/to the ANSI code page as needed. The narrowing conversion may lose data.
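A minimal sketch (not code from the thread) of the lossy narrowing Peter describes, using the Hebrew file name from the original example:

<code>
// Converting a wide file name to the ANSI code page can lose data;
// WideCharToMultiByte reports whether a default character was substituted.
#include <windows.h>
#include <stdio.h>

int main()
{
    wchar_t const wide[] = L"\u05E9\u05DC\u05D5\u05DD.txt";  // "שלום.txt"
    char narrow[64];
    BOOL lossy = FALSE;
    WideCharToMultiByte( CP_ACP, 0, wide, -1,
                         narrow, sizeof narrow, NULL, &lossy );
    printf( lossy ? "the narrowing conversion lost data\n"
                  : "the narrowing conversion was lossless\n" );
}
</code>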
Your program
No, again you're wrong: it's the Boost.Locale documentation's program.
will work fine until it's given a file name that is not representable in the ANSI CP.)
Nope, sorry, for any /reasonable interpretation/ of what you're writing.
File names on NTFS are not necessarily representable in the ANSI code page. A program that uses narrow strings in the ANSI code page to represent paths will not necessarily be able to open all files on the system.

On 27.10.2011 20:01, Peter Dimov wrote:
Alf P. Steinbach wrote:
On 27.10.2011 18:47, Peter Dimov wrote:
Alf P. Steinbach wrote:
However, I still ask:
why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do it right, using the platforms' native encodings.
Comment out the imbue line.
But that line is much of the point, isn't it?
There wouldn't be much point in calling imbue if you didn't want a change in the boost::filesystem default behavior, which is to convert using the ANSI CP (or the OEM CP if AreFileApisANSI() returns false, if I'm not mistaken).
Oh there is. It is a level of indirection. You want Boost.Filesystem to assume /the same/ narrow character encoding as Boost.Locale, whatever it is. And to quote the docs where I found that program, "Boost Locale fully supports both narrow and wide API. The default character encoding is assumed to be UTF-8 on Windows."
(The platform's native encoding is UTF-16. The "ANSI" code page, which is not necessarily ANSI or ANSI-like at all, despite your assertion,
The article you responded to did not contain the word "ANSI".
Thus, when you refer to an assertion about "ANSI", you have fantasized something.
http://boost.2283326.n4.nabble.com/Making-Boost-Filesystem-work-with-GENERAL...
That's a different context and a different discussion, where it was neither necessary nor natural to dot the i's and cross the t's to perfection. Talk about dragging in things from out of the blue. If you wanted to point out the possibility of e.g. a Japanese codepage as ANSI, then you should have done that over there, in that thread. I mean in the context where it could make sense and where it could help prevent readers getting a wrong impression. If it was that important. [snippety]
Under Windows (NT+ and NTFS), the narrow character API is a wrapper over the wide character API. The system converts from/to the ANSI code page as needed. The narrowing conversion may lose data.
OK, we're just talking about two different meanings of "native", for two different contexts: windows internals, and windows apps. The relevant context for discussing Boost.Locale's treatment of narrow strings, is the application level.
[the program] will work fine until it's given a file name that is not representable in the ANSI CP.)
Nope, sorry, for any /reasonable interpretation/ of what you're writing.
File names on NTFS are not necessarily representable in the ANSI code page. A program that uses narrow strings in the ANSI code page to represent paths will not necessarily be able to open all files on the system.
Right, that's one reason why modern Windows programs should best be wchar_t based. Other reasons include efficiency (avoiding conversions) and simple convenience. Some API functions do not have narrow wrappers. However, a default assumption of UTF-8 encoding for narrow strings, as in Boost.Locale, seems to me to clash with most uses of narrow strings. For example, if you output UTF-8 on standard output, and then try to pipe that through `more` in Windows' [cmd.exe], you get this:

<example>
d:\dave> chcp 65001
Active code page: 65001

d:\dave> echo "imagine this is utf8" | more
Not enough memory.

d:\dave> _
</example>

So UTF-8 is, to put it less than strongly, not very practical as a general narrow-character encoding in Windows. The example that I gave at the top of the thread was passing a `main` argument further on, when using Boost.Locale. It causes trouble because in Windows `main` arguments are by convention encoded as ANSI, while Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8 generally yields gobbledygook, except for the pure ASCII common subset. But with ANSI as the Boost.Locale default -- with that more reasonable choice of default -- the imbue call would not cause trouble, but would instead help to avoid trouble, which is surely the original intention.

Cheers & hth.,

- Alf
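The conversion Alf refers to would look roughly like this -- a sketch with a hypothetical helper name, fixed buffers and no error handling, not code from the thread:

<code>
// Re-encode an ANSI-encoded main argument as UTF-8 by going through
// UTF-16, the only conversion route the Win32 API offers.
#include <windows.h>
#include <string>

std::string ansi_to_utf8( char const* ansi )
{
    wchar_t wide[ MAX_PATH ];
    char    utf8[ MAX_PATH * 4 ];
    MultiByteToWideChar( CP_ACP, 0, ansi, -1, wide, MAX_PATH );
    WideCharToMultiByte( CP_UTF8, 0, wide, -1, utf8, sizeof utf8, NULL, NULL );
    return utf8;
}
</code>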

Alf P. Steinbach wrote:
On 27.10.2011 20:01, Peter Dimov wrote: ...
File names on NTFS are not necessarily representable in the ANSI code page. A program that uses narrow strings in the ANSI code page to represent paths will not necessarily be able to open all files on the system.
Right, that's one reason why modern Windows programs should best be wchar_t based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is practical, not political or religious.
The example that I gave at top of the thread was passing a `main` argument further on, when using Boost.Locale. It causes trouble because in Windows `main` arguments are by convention encoded as ANSI, while Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8 generally yields gobbledygook, except for the pure ASCII common subset.
Yes. If you (generic second person, not you specifically) want to take your paths from the narrow API, an UTF-8 default is not practical. But then again, you shouldn't take your paths from the narrow API, because it can't represent the names of all the files the user may have.

On 27.10.2011 21:07, Peter Dimov wrote:
Alf P. Steinbach wrote:
On 27.10.2011 20:01, Peter Dimov wrote: ...
File names on NTFS are not necessarily representable in the ANSI code page. A program that uses narrow strings in the ANSI code page to represent paths will not necessarily be able to open all files on the system.
Right, that's one reason why modern Windows programs should best be wchar_t based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is practical, not political or religious.
Thanks for that clarification of the current thinking at Boost. I suspected that people envisioned those two choices as an exhaustive set of alternatives to choose from, but I wasn't sure. Anyway, happily, the apparent forced choice between two inefficient ungoods is not necessary -- i.e. it's a false dichotomy. For there are at least THREE options for representing paths and other strings internally in the program, in portable single-source code:

1. wide character based (UTF-16 in Windows, possibly UTF-32 in *nix), as you described above,

2. narrow character based (UTF-8), as you described above, and

3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.

Option 3 means -- it requires, as far as I can see -- some abstraction that hides the narrow/wide representation so as to get source code level portability, which is all that matters for C++. It doesn't need to involve very much: some typedefs, traits, references. Prior art in this direction includes Microsoft's [tchar.h]. For example, write a portable string literal like this:

PS( "This is a portable string literal" )

As compared to options 1 and 2, the benefits of option 3 include:

* no inefficient conversions except at the external boundary of the program (and then in practice only in Windows, where such conversion happens already),

* no problems with software and tools that don't understand a chosen "universal" (option 1 or 2) encoding,

* no need to duplicate functions to adapt to the underlying OS: one has at hand exactly what the OS API wants.

The main drawback is IMO the need to use something like a PS macro for string and character literals, or a C++11 /user defined literal/. Windows programmers are used to that, writing _T("blah") all the time as if Windows 95 was still extant. So, considering that all that current labor is being done for no reward whatsoever, I think it should be no problem convincing programmers that writing a few characters more in order to get portable string literals is worth it; it just needs exposure to examples from some authoritative source...
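A minimal sketch of what option 3 could look like -- hypothetical names, not Boost code and not code from the thread:

<code>
// Option 3 in miniature, tchar.h-style but in plain C++: one typedef plus
// a literal macro. ps_char and PS are hypothetical names.
#if defined(_WIN32)
    typedef wchar_t ps_char;
    #define PS( s ) L ## s
#else
    typedef char ps_char;
    #define PS( s ) s
#endif

ps_char const* const greeting = PS( "This is a portable string literal" );
</code>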
The example that I gave at top of the thread was passing a `main` argument further on, when using Boost.Locale. It causes trouble because in Windows `main` arguments are by convention encoded as ANSI, while Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8 generally yields gobbledygook, except for the pure ASCII common subset.
Yes. If you (generic second person, not you specifically) want to take your paths from the narrow API, an UTF-8 default is not practical. But then again, you shouldn't take your paths from the narrow API, because it can't represent the names of all the files the user may have.
That's an unrelated issue, really, but I think Boost could use a "get undamaged program arguments in portable strings" thing, if it isn't there already? Cheers & hth., - Alf

Alf P. Steinbach wrote:
On 27.10.2011 21:07, Peter Dimov wrote:
Alf P. Steinbach wrote: ...
Right, that's one reason why modern Windows programs should best be wchar_t based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is practical, not political or religious.
Thanks for that clarification of the current thinking at Boost.
My opinion is not representative of all of Boost, although I've found that there is substantial agreement between people who write portable software that needs to deal with paths (#2, UTF-8, as the way to go).
3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.
Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type, 3c, providing both char and wchar_t overloads. They all have their problems; people don't move to UTF-8 merely out of spite.
Prior art in this direction includes Microsoft's [tchar.h].
This works, more or less, once you've accumulated the appropriate library of _T macros, _t functions and T/t typedefs. I've never heard of it actually being used for a portable code base, but I admit that it's possible to do things this way, even if it's somewhat alien to POSIX people. The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based. There's no need to be aware of the fact that literals need to be quoted or that strlen should be spelled _tcslen. There's no need to convert paths to an external representation when writing them into a portable config/project file.
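To illustrate the last point -- a minimal sketch, not code from the thread: with the UTF-8 convention a path is an ordinary std::string and goes into a config file byte for byte, with no conversion step.

<code>
#include <fstream>
#include <string>

int main()
{
    // "שלום.txt" spelled out as UTF-8 bytes, so no source-encoding issues:
    std::string const path = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D.txt";
    std::ofstream config( "app.cfg", std::ios::binary );
    config << "last_file=" << path << '\n';   // stored verbatim, portably
}
</code>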
That's an unrelated issue, really, but I think Boost could use a "get undamaged program arguments in portable strings" thing, if it isn't there already?
We'll be back to the question of what constitutes a portable string. I'd prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer TCHAR[].

On 27.10.2011 23:56, Peter Dimov wrote:
Alf P. Steinbach wrote:
Alf P. Steinbach wrote: ...
Right, that's one reason why modern Windows programs should best be wchar_t based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is
On 27.10.2011 21:07, Peter Dimov wrote: practical,
not political or religious.
Thanks for that clarification of the current thinking at Boost.
My opinion is not representative of all of Boost, although I've found that there is substantial agreement between people who write portable software that needs to deal with paths (#2, UTF-8, as the way to go).
3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.
Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type, 3c, providing both char and wchar_t overloads. They all have their problems; people don't move to UTF-8 merely out of spite.
Prior art in this direction includes Microsoft's [tchar.h].
This works, more or less, once you've accumulated the appropriate library of _T macros, _t functions and T/t typedefs. I've never heard of it actually being used for a portable code base,
[tchar.h], plus the similar support in <windows.h>, was heavily used for porting applications between Windows 9x ANSI and Windows NT Unicode, before Microsoft introduced the Layer for Unicode in 2001 or thereabouts (the layer allowed wchar_t-apps to run in Windows 9x). I'm not saying it's a good C++ approach for that porting -- it's not, since it was designed for the C language. I just gave it as an example of prior art, which includes a neat header where the names of the relevant functions to wrap (or whatever) can be extracted by a small Python script. ;-)
but I admit that it's possible to do things this way, even if it's somewhat alien to POSIX people.
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh. I would be happy to learn this. How do I make the following program work with Visual C++ in Windows, using narrow character strings?

<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>

The commented-out code is from my random efforts to Make It Work(TM). It refused. By the way, I'm hoping Boost isn't supporting old versions of g++, because old versions of g++ choked on a BOM at the start of UTF-8 encoded source code, while Visual C++ requires that BOM... So, UTF-8 source code is ungood with old versions of g++, if Visual C++ is also used.
There's no need to be aware of the fact that literals need to be quoted or that strlen should be spelled _tcslen. There's no need to convert paths to an external representation when writing them into a portable config/project file.
Hm, I'm not so sure. I'd like to see this magic in action before believing in it, e.g., the program above working with narrow chars and printf, with Visual C++.
That's an unrelated issue, really, but I think Boost could use a "get undamaged program arguments in portable strings" thing, if it isn't there already?
We'll be back to the question of what constitutes a portable string. I'd prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer TCHAR[].
No, not TCHAR, which was designed for the C language (and is an ugly uppercase name to boot). Instead, like this:

<code>
#include "u/stdio_h.h"   // u::CodingValue, u::sprintf, U

#undef UNICODE
#define UNICODE
#include <windows.h>     // MessageBox

int main()
{
    u::CodingValue buffer[80];

    sprintf( buffer, U( "The answer is %d!" ), 6*7 );   // Koenig lookup.
    MessageBox( 0, buffer->rawPtr(), U( "This is a title!" )->rawPtr(),
        MB_ICONINFORMATION | MB_SETFOREGROUND );
}
</code>

I coded up that support after reading the article I'm responding to now, because I felt that without coding it up I would be just spewing gut feelings and hunches. Well-informed such, but still. So I coded. :-)

Cheers & hth.,

- Alf

On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach < alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 23:56, Peter Dimov wrote:
Alf P. Steinbach wrote:
On 27.10.2011 21:07, Peter Dimov wrote:
Alf P. Steinbach wrote:
...
Right, that's one reason why modern Windows programs should best be wchar_t based.
This is one of the two options. The other is using UTF-8 for representing paths as narrow strings. The first option is more natural for Windows-only code, and the second is better, in practice, for portable code because it avoids the need to duplicate all path-related functions for char/wchar_t. The motivation for using UTF-8 is practical, not political or religious.
Thanks for that clarification of the current thinking at Boost.
My opinion is not representative of all of Boost, although I've found that there is substantial agreement between people who write portable software that needs to deal with paths (#2, UTF-8, as the way to go).
3. the most natural sufficiently general native encoding, 1 or 2
depending on the platform that the source is being built for.
Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type, 3c, providing both char and wchar_t overloads. They all have their problems; people don't move to UTF-8 merely out of spite.
Prior art in this direction includes Microsoft's [tchar.h].
This works, more or less, once you've accumulated the appropriate library of _T macros, _t functions and T/t typedefs. I've never heard of it actually being used for a portable code base,
[tchar.h], plus the similar support in <windows.h>, was heavily used for porting applications between Windows 9x ANSI and Windows NT Unicode, before Microsoft introduced the Layer for Unicode in 2001 or thereabouts (the layer allowed wchar_t-apps to run in Windows 9x).
I'm not saying it's a good C++ approach for that porting -- it's not, since it was designed for the C language.
I just gave it as an example of prior art, which includes a neat header where the names of the relevant functions to wrap (or whatever) can be extracted by a small Python script. ;-)
but I admit that it's
possible to do things this way, even if it's somewhat alien to POSIX people.
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh.
I would be happy to learn this.
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
How will you make this program portable?
The commented-out code is from my random efforts to Make It Work(TM).
It refused.
This is because Windows narrow-chars can't be UTF-8. You could make it portable by:

int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
By the way, I'm hoping Boost isn't supporting old versions of g++.
Because old versions of g++ choked on a BOM at the start of UTF-8 encoded source code, while Visual C++ requires that BOM... So, UTF-8 source code is ungood with old versions of g++, if Visual C++ is also used.
If you don't use widechars, you can cheat VC++ into using UTF-8 string literals. Just save the file as UTF-8 *without* BOM. It will just embed them verbatim into the executable.
There's no need to be aware of the fact that literals need to be quoted or that strlen should be spelled _tcslen. There's no need to convert paths to an external representation when writing them into a portable config/project file.
Hm, I'm not so sure.
I'd like to see this magic in action before believing in it, e.g., the program above working with narrow chars and printf, with Visual C++.
See above and see http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036
That's an unrelated issue, really, but I think Boost could use a "get
undamaged program arguments in portable strings" thing, if it isn't there already?
We'll be back to the question of what constitutes a portable string. I'd prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer TCHAR[].
No, not TCHAR, which was designed for the C language (and is an ugly uppercase name to boot).
Instead, like this:
<code> #include "u/stdio_h.h" // u::CodingValue, u::sprintf, U
#undef UNICODE #define UNICODE #include <windows.h> // MessageBox
int main() { u::CodingValue buffer[80];
sprintf( buffer, U( "The answer is %d!" ), 6*7 ); // Koenig lookup. MessageBox( 0, buffer->rawPtr(), U( "This is a title!" )->rawPtr(), MB_ICONINFORMATION | MB_SETFOREGROUND ); } </code>
You judge from a non-portable code point-of-view. How about:

#include <cstdio>
#include "gtkext/message_box.h"   // for gtkext::message_box

int main() {
    char buffer[80];
    sprintf(buffer, "The answer is %d!", 6*7);
    gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah, ...);
}

And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on windows)

Sincerely,

-- Yakov

On 28.10.2011 12:36, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach< alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 23:56, Peter Dimov wrote:
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh.
I would be happy to learn this.
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
How will you make this program portable?
Well, that was *my* question. The claim that this minimal "Hello, world!" program puts to the test is that "the rest of the [UTF-8 based] code is happily char[]-based". Apparently that is not so.
The commented-out code is from my random efforts to Make It Work(TM).
It refused.
This is because windows narrow-chars can't be UTF-8. You could make it portable by:
int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
Thanks, TIL boost::printf. The idea of UTF-8 as a universal encoding seems now to be to use some workaround such as boost::printf for each and every case where it turns out that it doesn't work portably. When every portability problem has been diagnosed and special cased to use functions that translate to/from UTF-8 translation, and ignoring the efficiency aspect of that, then UTF-8 just magically works, hurray. E.g., if 'fopen( "rød.txt", "r" )' fails in the universal UTF-8 code, then just replace with 'boost::fopen', or 'my_special_casing::fopen'. However, with these workaround details made manifest, it is /much less/ convincing than the original general vague claim that UTF-8 just works. [snip]
You judge from a non-portable code point-of-view. How about:
#include <cstdio> #include "gtkext/message_box.h" // for gtkext::message_box
int main() {
    char buffer[80];
    sprintf(buffer, "The answer is %d!", 6*7);
    gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah, ...);
}
And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on windows)
Aha. When you use a library L that translates in platform-specific ways to/from UTF-8 for you, then UTF-8 is magically portable. For use of L. However, try to pass a `main` argument over to gtkext::message_box. Then you have involved some /other code/ (namely the runtime library code that calls 'main') that may not necessarily translate for you, and in fact in Windows is extremely unlikely to translate for you. Such code is prevalent. Most code does not translate to/from UTF-8.

Cheers & hth., & thanks for the mention of boost::printf,

- Alf

PS: With C++11 there is no longer any reason to use <cstdio> instead of <stdio.h>, because <cstdio> no longer formally guarantees not to pollute the global namespace (and in practice it has never honored its C++98 guarantee). The code above is a good example of why <stdio.h> is preferable -- it is too easy to write non-portable code with <cstdio>, such as using unqualified sprintf (not to mention size_t!).
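The trap the PS describes, in miniature -- a sketch, not code from the thread:

<code>
// With <cstdio> only std::sprintf is guaranteed; whether the unqualified
// name is also visible at global scope is implementation-dependent.
#include <cstdio>

int main()
{
    char buf[ 16 ];
    std::sprintf( buf, "%d", 6*7 );   // guaranteed to compile
    // sprintf( buf, "%d", 6*7 );     // may or may not compile, per implementation
}
</code>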

On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach < alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 12:36, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 23:56, Peter Dimov wrote:
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh.
I would be happy to learn this.
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
How will you make this program portable?
Well, that was *my* question.
The claim that this minimal "Hello, world!" program puts to the test is that "the rest of the [UTF-8 based] code is happily char[]-based".
Apparently that is not so.
My point is that you cannot talk about things without comparison.
The commented-out code is from my random efforts to Make It Work(TM).
It refused.
This is because windows narrow-chars can't be UTF-8. You could make it portable by:
int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
Thanks, TIL boost::printf.
The idea of UTF-8 as a universal encoding seems now to be to use some workaround such as boost::printf for each and every case where it turns out that it doesn't work portably.
You pull things out of context. We should COMPARE the UTF-8 approach to the wide-char on Windows, narrow-char on non-Windows approach. Your approach involves using your own printf just as well:

#include "u/stdio_h.h"   // u::CodingValue, u::printf, U

printf(U("Blåbærsyltetøy! 日本国 кошка!\n"));      // ADL?
u::printf(U("Blåbærsyltetøy! 日本国 кошка!\n"));   // or not ADL? depends on what exactly U is.

but anyway you have to do O(N) work to wrap the N library functions you use. Your approach is no way better.
[...]
[snip]
You judge from a non-portable coed point-of-view. How about:
#include <cstdio>
#include "gtkext/message_box.h" // for gtkext::message_box
int main() {
    char buffer[80];
    sprintf(buffer, "The answer is %d!", 6*7);
    gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah, ...);
}
And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on windows)
Aha. When you use a library L that translates in platform-specific ways to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
However, try to pass a `main` argument over to gtkext::message_box.
See the argv explanation in http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036 -- Yakov

On 28.10.2011 13:31, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach< alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 12:36, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 23:56, Peter Dimov wrote:
The advantage of using UTF-8 is that, apart from the border layer that calls the OS (and that needs to be ported either way), the rest of the code is happily char[]-based.
Oh.
I would be happy to learn this.
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
How will you make this program portable?
Well, that was *my* question.
The claim that this minimal "Hello, world!" program puts to the test is that "the rest of the [UTF-8 based] code is happily char[]-based".
Apparently that is not so.
My point is that you cannot talk about things without comparison.
I think that means that I failed to communicate to you what I compared. There was a claim that the UTF-8 based code should just work, but the minimal hello world like code in my example does /not/ work. Thus, it is a comparison between (1) reality, and (2) the claim, OK?
The commented-out code is from my random efforts to Make It Work(TM).
It refused.
This is because windows narrow-chars can't be UTF-8. You could make it portable by:
int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
Thanks, TIL boost::printf.
The idea of UTF-8 as a universal encoding seems now to be to use some workaround such as boost::printf for each and every case where it turns out that it doesn't work portably.
You pull things out of context. We should COMPARE the UTF-8 approach to the wide-char on Windows, narrow-char on non-Windows approach. Your approach involves using your own printf just as well:

#include "u/stdio_h.h"   // u::CodingValue, u::printf, U

printf(U("Blåbærsyltetøy! 日本国 кошка!\n"));      // ADL?
u::printf(U("Blåbærsyltetøy! 日本国 кошка!\n"));   // or not ADL? depends on what exactly U is.
The relevant difference is in my opinion between

* re-implementing e.g. the standard library to support UTF-8 (like boost::printf, and although I haven't tested the claim that it works for the program we discussed, it is enough for me that it /could/ work), or

* wrapping it with some constant time data conversions (e.g. u::printf).

The hello world program demonstrated that one or the other is necessary. So, we can forget the earlier silly claim that UTF-8 just magically works, and now really compare, for a simplest relevant program. And yes, with the functionality that I sketched and coded up a demo of, you get strong type checking and argument dependent lookup. It is however possible to design this in e.g. C level ways where it would be much less convenient. I think the opinions in the community may have been influenced by one particularly bad such design, the [tchar.h]... ;-)

For an UTF-16 platform a printf wrapper can simply be like this:

inline int printf( CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    int const result = ::vwprintf( format->rawPtr(), args );
    va_end( args );
    return result;
}

The sprintf wrapper that I used in my example is more interesting, though:

inline int sprintf( CodingValue* buffer, size_t count, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    int const result = ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
    va_end( args );
    return result;
}

inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    int const result = ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
    va_end( args );
    return result;
}

The problem that the above solves is that standard vswprintf is not a simple wchar_t version of standard vsprintf. As I recall Microsoft's [tchar.h] relies on a compiler-specific overload, but that approach does not cut it for platform independent code. For wchar_t/char independent code, one solution (as above) is to offer both signatures. Note that these wrappers do not (and do not have to) do data conversion, whereas re-implementations for the UTF-8 scheme have to convert data.
but anyway you have to do O(N) work to wrap the N library functions you use.
Not quite. It is so for the UTF-8 scheme for platform independent things such as standard library i/o, and it is so also for the native string scheme for platform independent things such as standard library i/o. But when you're talking about the OS API, then with the UTF-8 scheme you need inefficient string data conversions and N wrappers, while with the native string scheme no string data conversions and no wrappers are needed. Only simple "get raw pointer" calls are needed, as illustrated in my example. Those calls could even be made implicit, but I think it's best to have them explicit in order to avoid unexpected effects. This difference in conversion & wrapping effort was the reason that I used both the standard library and the OS API in my original example. The standard library call used a thin wrapper, as shown above, while the OS API function (MessageBoxW) could be and was called directly.
Your approach is no way better.
I hope to convince you that the native string approach is objectively better for portable code, for any reasonable criteria, e.g.:

* Native encoded strings avoid the inefficient string data conversions of the UTF-8 scheme for OS API calls and for calls to functions that follow OS conventions.

* Native encoded strings avoid many bug traps such as passing a UTF-8 string to a function expecting ANSI, or vice versa.

* Native encoded strings work seamlessly with the largest amount of code (Windows code and nix code), while the UTF-8 approach only works seamlessly with nix-oriented code.

Conversely, points such as those above mean that the UTF-8 approach is objectively much worse for portable code. In particular, the UTF-8 approach violates the principle of not paying for what you don't (need to or want to) use, by adding inefficient conversions in all directions; it violates the principle of least surprise (where did that gobbledygook come from?); and it violates the KISS principle ("Keep It Simple, Stupid!"), forcing Windows programmers to deal with 3 internal string encodings instead of just 2.
You judge from a non-portable coed point-of-view. How about:
#include<cstdio>
#include "gtkext/message_box.h" // for gtkext::message_box
int main() {
    char buffer[80];
    sprintf(buffer, "The answer is %d!", 6*7);
    gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah, ...);
}
And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on windows)
Aha. When you use a library L that translates in platform-specific ways to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
However, try to pass a `main` argument over to gtkext::message_box.
See the argv explanation in http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036
I'm sorry, I don't see what's relevant there. You suggest there that boost::program_options can be used if it is fixed to support UTF-8; quote: "she can use boost::program_options (assuming it's also changed to follow the UTF-8 convention)". I think that suggestion is probably misguided, for as far as I can see boost::program_options does not provide any way to obtain the undamaged command line in Windows (and anyway that command line is UTF-16 encoded). Without a portable way to obtain undamaged program arguments, portable support for parsing them with this encoding or that encoding seems to me to be irrelevant. Anyway, where does this introduction of special cases end? At every point where UTF-8 does not work, the suggested solution is to add an inefficient data conversion and support that on all platforms.

Cheers & hth.,

- Alf

Alf P. Steinbach wrote:
* wrapping it with some constant time data conversions (e.g. u::printf).
In this particular example, wrapping doesn't work, because wprintf is broken. (At least I haven't been able to make it work.) You'll still need the hypothetical boost::wprintf here.

Alf P. Steinbach wrote:
This difference in conversion & wrapping effort was the reason that I used both the standard library and the OS API in my original example.
Using the OS API makes your program non-portable, so it's not clear what the example is supposed to demonstrate. You may as well stick to wchar_t; Windows 95 is ancient history and the whole wrapping effort is completely unnecessary. The portable version of your example would be something along the lines of:

#include "message_box.hpp"
#include <stdio.h>

int main()
{
    char buffer[ 80 ];
    sprintf( buffer, "The answer is %d!", 6*7 );
    message_box( buffer, "Title", mb_icon_information );
}

where message_box has implementations for the various OSes the program supports. On Windows, it will utf8_decode its arguments and call MessageBoxW. A localizable version would not embed readable texts:

#include "message_box.hpp"
#include "get_text.hpp"
#include <stdio.h>

int main()
{
    char buffer[ 80 ]; // ignore buffer overflow for now
    sprintf( buffer, get_text( "the_answer_is" ).c_str(), 6*7 );
    message_box( buffer, get_text( "title" ), mb_icon_information );
}

Now get_text may return something in Chinese (UTF-8 encoded) and it will all work. It's also possible to use wchar_t for human-readable text throughout the code base - this provides a layer of type safety. You'll have to replace sprintf with swprintf then. Paths, however, are better kept as char[].
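A sketch of the Windows side Peter describes -- message_box, mb_icon_information and utf8_decode are his hypothetical names; no error handling:

<code>
#include <windows.h>
#include <string>

// Decode UTF-8 to the UTF-16 that the wide Win32 API expects.
std::wstring utf8_decode( std::string const& s )
{
    std::wstring w( s.size(), L'\0' );
    int const n = MultiByteToWideChar( CP_UTF8, 0, s.data(), (int)s.size(),
                                       &w[0], (int)w.size() );
    w.resize( n );
    return w;
}

enum { mb_icon_information = MB_ICONINFORMATION };

void message_box( std::string const& text, std::string const& title, int icon )
{
    MessageBoxW( 0, utf8_decode( text ).c_str(),
                 utf8_decode( title ).c_str(), (UINT)icon );
}
</code>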

On 28.10.2011 16:10, Peter Dimov wrote:
Alf P. Steinbach wrote:
This difference in conversion & wrapping effort was the reason that I used both the standard library and the OS API in my original example.
Using the OS API makes your program non-portable, so it's not clear what the example is supposed to demonstrate.
It demonstrates what I said. That no data conversion is needed for calling API functions. Your argument would to some extent make sense if you are assuming that the two worlds of OS-specific and portable code will always be completely separate, never touching -- but I find that unrealistic.
You may as well stick to wchar_t; Windows 95 is ancient history and the whole wrapping effort is completely unnecessary.
Sorry, that's meaningless to me. It sounds like free association.
The portable version of your example would be something along the lines of:
#include "message_box.hpp" #include <stdio.h>
int main() { char buffer[ 80 ]; sprintf( buffer, "The answer is %d!", 6*7 );
message_box( buffer, "Title", mb_icon_information ); }
where message_box has implementations for the various OSes the program supports. On Windows, it will utf8_decode its arguments and call MessageBoxW.
There is no need to add inefficient translation to the mix.
A localizable version would not embed readable texts:
#include "message_box.hpp" #include "get_text.hpp" #include <stdio.h>
int main() { char buffer[ 80 ]; // ignore buffer overflow for now sprintf( buffer, get_text( "the_answer_is" ).c_str(), 6*7 );
message_box( buffer, get_text( "title" ), mb_icon_information ); }
Now get_text may return something in Chinese (UTF-8 encoded) and it will all work.
This may conceivably make sense at some enterprise level.
It's also possible to use wchar_t for human-readable text throughout the code base - this provides a layer of type safety. You'll have to replace sprintf with swprintf then. Paths, however, are better kept as char[].
Sorry, again I fail to discern the underlying thoughts. It sounds like free association. Cheers & sorry, no comprende, - Alf

On Fri, Oct 28, 2011 at 15:34, Alf P. Steinbach < alf.p.steinbach+usenet@gmail.com> wrote:
[...] There was a claim that the UTF-8 based code should just work,
I can't recall anyone saying this. What people were saying is that it's the most sane way to write portable code. And if the vendors hadn't been resisting UTF-8 adoption, it would just work.
[...] * re-implementing e.g. the standard library to support UTF-8 (like boost::printf, and although I haven't tested the claim that it works for the program we discussed, it is enough for me that it /could/ work), or
* wrapping it with some constant time data conversions (e.g. u::printf).
The hello world program demonstrated that one or the other is necessary.
My last mail demonstrated that we don't need either when on Windows. printf just works.
So, we can forget the earlier silly claim that UTF-8 just magically works, and now really compare, for a simplest relevant program.
Now we can recall this claim and continue to apply it to your silly claim that wrapping everything is easier. [...]
For an UTF-16 platform a printf wrapper can simply be like this:
inline int printf( CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    int const result = ::vwprintf( format->rawPtr(), args );
    va_end( args );
    return result;
}
Apparently we don't need it. In linux world requesting the user to use UTF-8 is legitimate. It's already almost everywhere the default. In some non-linux systems UTF-8 is the default too (Mac OS X?). In windows we can use narrow printf just fine.
The sprintf wrapper that I used in my example is more interesting, though:
inline int sprintf( CodingValue* buffer, size_t count, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    int const result = ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
    va_end( args );
    return result;
}

inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
{
    va_list args;
    va_start( args, format );
    int const result = ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
    va_end( args );
    return result;
}
Oh! Thank you! You suggest to wrap each function that comes in two kinds... You don't need to either wrap or re-implement sprintf for the UTF-8 approach. The whole point of UTF-8 is that it already works with most of the existing narrow library functions (strlen, strstr, str*, std::string, etc.). It's simpler, ah!?
The problem that the above solves is that standard vswprintf is not a simple wchar_t version of standard vsprintf. As I recall Microsoft's [tchar.h] relies on a compiler-specific overload, but that approach does not cut it for platform independent code. For wchar_t/char independent code, one solution (as above) is to offer both signatures.
No such problems in UTF-8 world.
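The claim is easy to check -- a minimal sketch, not code from the thread: the narrow str* functions operate bytewise, and UTF-8 is designed so that a multi-byte sequence never falsely matches inside another.

<code>
#include <string.h>
#include <stdio.h>

int main()
{
    // "Blåbærsyltetøy" spelled out as UTF-8 bytes, so no source-encoding issues:
    char const* s = "Bl\xC3\xA5" "b\xC3\xA6rsyltet\xC3\xB8y";
    printf( "%u bytes\n", (unsigned)strlen( s ) );   // byte count, not character count
    printf( "%s\n", strstr( s, "syltet" ) ? "found" : "not found" );
}
</code>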
but anyway you have to do O(N) work to wrap the N library functions you use.
Not quite.
It is so for the UTF-8 scheme for platform independent things such as standard library i/o, and it is so also for the native string scheme for platform independent things such as standard library i/o.
As we see it's the other way around...
But when you're talking about the OS API, then with the UTF-8 scheme you need inefficient string data conversions
It's quite efficient. In fact it was never a bottleneck. Invoking the OS usually involves complex operations anyway. Moreover, even in the non-English speaking world, most of the text internal to programs is still ASCII. UTF-8 saves space, saves cache usage. This compensates for the conversion penalty. To make definite statements you must measure; otherwise it's premature optimization, if it's an optimization at all. Also note that in a multi-threaded world with hierarchical memory, computation becomes faster than memory access.
and N wrappers, while with the native string scheme no string data conversions and no wrappers are needed.
The difference is what you wrap: the standard interface or the proprietary OS interface. We benefit more from wrapping the latter, as has been done hundreds of times in every portable library that tries to accomplish something beyond primitive file I/O. This is because you get a portable library as a side product.
Your approach is no way better.
I hope to convince you that the native string approach is objectively better for portable code, for any reasonable criteria, e.g.:
* Native encoded strings avoid the inefficient string data conversions of the UTF-8 scheme for OS API calls and for calls to functions that follow OS conventions.
Stop calling it inefficient. If you store portable data on some storage, or receive it through the network -- as any serious application does today -- you can't avoid conversions. You just have to decide where you do them, closer to the OS or further away. Anyway, see above.
* Native encoded strings avoid many bug traps such as passing a UTF-8 string to a function expecting ANSI, or vice versa.
Yeah, and "multiple inheritance causes multiple abuse of multiple inheritance"[1], as Microsoft said? UTF-8 avoids many bug traps such as forgetting that UTF-16 is actually a variable-length encoding. EVERYBODY knows that UTF-8 has vaaariiable-lng codepoints.
* Native encoded strings work seamlessly with the largest amount of code (Windows code and nix code), while the UTF-8 approach only works seamlessly with nix-oriented code.
Hmmm... I prefer the latter, just to avoid all the boilerplate wrappers for what has been standard for years. And I'm a Windows programmer. Besides, how will you return Unicode from std::exception::what() if not by UTF-8?
Conversely, points such as those above mean that the UTF-8 approach is objectively much worse for portable code.
Since I'm tired of repeating the same again and again, see "Using the native encoding" in http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036
In particular, the UTF-8 approach violates the principle of not paying for what you don't (need to or want to) use
UTF-16 violates the principle of not paying for what you don't use: if most of your text is ASCII (which is true for internal text even in non-English countries), you don't want to waste twice as much memory.
, by adding inefficient conversions in all directions;
Again? seekg(0) and read(). You'll have to do conversions anyway, e.g. when you read from a file. You don't store native encoding in portable file, do you?
[...] and it violates the KISS principle ("Keep It Simple, Stupid!", forcing Windows programmers to deal with 3 internal string encodings instead of just 2).
If you're working with 2 encodings, you're doing something terribly wrong. Seriously, it looks like you're still living in the 20th century. You shall not use ANSI encodings (other than UTF-8) on Windows because they don't work with Unicode. They are mostly deprecated. Microsoft encourages you to use either UTF-8 or UTF-16 ( http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%2... ). Now, assuming you stopped using legacy 'ANSI' encodings, you're left with only UTF-16 (internal) and UTF-8 (external). Replace internal UTF-16 with UTF-8, and you're left with only ONE encoding used for EVERYTHING, internal and external. UTF-16 at OS calls doesn't count, as it's not stored anywhere (you're not 'dealing' with it).

[1] from some C# book by Microsoft I glanced at a few years ago.

-- Yakov

Alf P. Steinbach wrote:
On 28.10.2011 12:36, Yakov Galka wrote:
This is because windows narrow-chars can't be UTF-8. You could make it portable by:
int main() { boost::printf("Blåbærsyltetøy! 日本国 кошка!\n"); }
Thanks, TIL boost::printf.
No, I don't think that this works. The problem here is not the printf call, it's the literal. When a char[] that does contain the proper UTF-8 text is passed, printf works under chcp 65001. In principle, you would still need to use the hypothetical boost::printf, though, if you want the program to properly support arbitrary code pages (not that the text above can be output in any code page other than 65001).
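Peter's distinction can be isolated like this -- a sketch, not code from the thread: spell the UTF-8 bytes out so the source code page cannot interfere.

<code>
#include <stdio.h>

int main()
{
    // "кошка" as explicit UTF-8 bytes; under `chcp 65001` this prints
    // correctly, since the literal no longer depends on the compiler's
    // source/execution character set.
    char const utf8_cat[] = "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0\n";
    printf( "%s", utf8_cat );
}
</code>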
When every portability problem has been diagnosed and special cased to use functions that translate to/from UTF-8 translation, and ignoring the efficiency aspect of that, then UTF-8 just magically works, hurray.
E.g., if 'fopen( "rød.txt", "r" )' fails in the universal UTF-8 code, then just replace with 'boost::fopen', or 'my_special_casing::fopen'.
Yes, exactly. It's not a silver bullet, but... try coming up with a better alternative.

Alf P. Steinbach wrote:
How do I make the following program work with Visual C++ in Windows, using narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
Output to a console wasn't our topic so far (and is not one of my strong points), but the specific problem with this program is that the embedded literal is not UTF-8, as the warning C4566 tells us, so there is no way for you to get UTF-8 in the output. (You should be able to set VC++'s code page to 65001, but I don't think you can.)

int main()
{
    printf( utf8_encode( L"кошка" ).c_str() );
}

This is not a practical problem for "proper" applications because Russian text literals should always come from the equivalent of gettext and never be embedded in code.

int main()
{
    printf( gettext( "cat" ).c_str() );
}

So, yes, I admit that you can't easily write a portable application (or a command-line utility) that has its Russian texts hardcoded, if that's your point. But you can write a command-line utility that can take кошка.txt as input and work properly, which is what I've been saying, and what sparked the original debate (argv[1]).

On Fri, Oct 28, 2011 at 13:58, Peter Dimov <pdimov@pdimov.com> wrote:
Alf P. Steinbach wrote:
How do I make the following program work with Visual C++ in Windows, using
narrow character string?
<code>
#include <stdio.h>
#include <fcntl.h>   // _O_U8TEXT
#include <io.h>      // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
Output to a console wasn't our topic so far (and is not one of my strong points), but the specific problem with this program is that the embedded literal is not UTF-8, as the warning C4566 tells us, so there is no way for you to get UTF-8 in the output. (You should be able to set VC++'s code page to 65001, but I don't think you can.)
int main() { printf( utf8_encode( L"кошка" ).c_str() ); }
You don't need to configure anything; in fact you cannot do it properly in VS. What you can do is:
1) don't use wide-char literals with non-ASCII characters,
2) use UTF-8 literals for narrow-char.
All you need is to save the source as UTF-8 WITHOUT BOM. Works like a charm on VS2005 and VS2010. Apparently it's portable. The IDE can detect UTF-8 even without BOM ("☑ Auto-detect UTF-8 encoding without signature").
This is not a practical problem for "proper" applications because Russian text literals should always come from the equivalent of gettext and never be embedded in code.
+1

Personally I'm happy with

printf( "Blåbærsyltetøy! 日本国 кошка!\n" );

writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8. Preventing data loss is more important for me.

-- Yakov

Yakov Galka wrote:
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8.
You can configure the console. Select Consolas or Lucida Console as the font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though. :-)

On 28.10.2011 15:00, Peter Dimov wrote:
Yakov Galka wrote:
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8.
You can configure the console. Select Consolas or Lucida Console as the font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though. :-)
it breaks a hell of a lot more than batch files. try `more`. cheers & hth., - Alf

On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach < alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 15:00, Peter Dimov wrote:
Yakov Galka wrote:
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8.
You can configure the console. Select Consolas or Lucida Console as the font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though. :-)
it breaks a hell of a lot more than batch files. try `more`.
cheers & hth.,
- Alf
So I tried to make YOUR approach work (i.e. use wchar_t). Created a file with:

#include <cstdio>

int main()
{
    ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

saved as UTF-8 with BOM. Compiled with VS2005, Windows XP.

M:\bin> a.exe
Blσbµrsyltet°y!

M:\bin> a.exe > a.txt

Contents of a.txt:
42 6C E5 62 E6 72 73 79 6C 74 65 74 F8 79 21 20

What happened to the Japanese and Russian? Why the mojibake? Maybe the compiler corrupted the string? Let's see, change to:

wchar_t s[] = L"Blåbærsyltetøy! 日本国 кошка!\n";
::wprintf( s );

Recompile, step into the debugger. No. It's your favorite, correct UTF-16 that's passed to wprintf. Same result. Let's try a European codepage:

M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!

Somewhat better. But how do I get to see the whole string?

M:\bin> chcp 65001
M:\bin> a.exe
Blbrsyltety!
M:\bin> chcp 1200
Invalid code page

OK, let's drop the requirement that the user sees the string at all. Let's restrict ourselves to a simpler case: a.exe writes Unicode to stdout, b.exe reads it from stdin and writes it verbatim to a file. Here is program b.exe:

int main()
{
    wchar_t s[256];
    _getws(s);
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write((const char*)s, 2*wcslen(s)); // I want to see what I really get
}

Compile, run.

M:\a> a.exe | b.exe

Independent of chcp I get:
42 00 6C 00 E5 00 62 00 E6 00 72 00 73 00 79 00 6C 00 74 00 65 00 74 00 F8 00 79 00 21 00 20 00

Why the hell is this lossy‽ Where IS my lovely Japanese? What am I doing wrong⸘ Ah! it's IMPOSSIBLE with wprintf! Let's try UTF-8 instead. Write the program as we've written it for 40 years, even before UTF-8 and the whole wide-char crap was introduced†. Open VS2005:

#include <stdio.h>

int main()
{
    printf("Blåbærsyltetøy! 日本国 кошка!\n");
}

† I mean the C functions used. Of course we couldn't mix Japanese and Russian back then.

Save in UTF-8 WITHOUT BOM. Compile to a-utf8.exe.

int main()
{
    char s[256];
    gets(s);
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write((const char*)s, strlen(s));
}

Compile b-utf8.exe.

M:\> a-utf8.exe
BlÃ¥bærsyltetøy! 日本国 кошка!

Something is bad. [The user goes to the documentation/support. Alright, I need UTF-8. This software is Unicode aware! Good, they care about their customers!]:

M:\> chcp 65001
M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Correct! (OK, I see squares for the Japanese because I don't have a monospace font for it, but copy/paste works correctly.)

M:\> a-utf8.exe > a.txt

a.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21 0D 0A

Correct!

M:\> a-utf8.exe | b-utf8.exe
M:\> type out.txt
Blåbærsyltetøy! 日本国 кошка!

out.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21

It works! MAGIC! More importantly: ***It's the only way to make it work!***

⇒ What if it's automatic and the user cannot intervene to change the codepage?
‽ If it's automatic, then you don't care how it's displayed in the console. You will log it to a file anyway. The case of

M:\> a-utf8.exe | b-utf8.exe

works correctly independently of what the current codepage was set to.

⟹ more doesn't work.
‽ Report the bug to Microsoft. UTF-8 is a documented codepage. Microsoft itself encourages the use of either UTF-8 or UTF-16. Other 'ANSI' codepages are unofficially deprecated. From http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%2...:

"Note: ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page."

Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only option, from the said above. Furthermore, if people pester Microsoft we will get more benefit (no pun intended) than from rewriting our code to use some unknown encoding that is different on each platform.

-- Yakov

On 29.10.2011 14:14, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach< alf.p.steinbach+usenet@gmail.com> wrote:
On 28.10.2011 15:00, Peter Dimov wrote:
Yakov Galka wrote:
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8.
You can configure the console. Select Consolas or Lucida Console as the font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though. :-)
it breaks a hell of a lot more than batch files. Try `more`.
So I tried to make YOUR approach work (i.e. use wchar_t):
I am afraid that you are misrepresenting me a bit here. But I am sure it is not intentional. Let's walk through this.
Created a file with:
#include <cstdio>
int main() { ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" ); }
saved as UTF-8 with BOM. Compiled with VS2005, Windows XP.
Except that <cstdio> is not guaranteed to place wprintf in the global namespace (I commented on that before; better to use <stdio.h>), that code works OK, in the sense of doing what you have specified should happen. Which apparently is not what you think, heh. You have specified a conversion to narrow characters using the C++ executable narrow character set, i.e. a conversion to Windows ANSI. It surprises a lot of programmers that that's what wide output ('wprintf' here, and 'wcout' at the C++ level) does: a NARROWING CONVERSION. It did surprise me at one time in the 1990's. I was very disappointed. After that I have become more and more sure that there was no design of the C++ iostreams, but that's another story...
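To see the narrowing in isolation, here is a minimal sketch -- my own illustration, not code from the thread -- assuming the Visual C++ runtime; the exact behavior of the default "C" locale is implementation-specific:

<code>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    // Under the MS runtime, wide output to a narrow (text mode) stream
    // is converted with the current C locale. In the default "C" locale
    // that conversion loses anything outside Latin-1; after installing
    // the user's locale it targets the ANSI code page, so Western
    // letters survive but e.g. Japanese is still unrepresentable.
    setlocale( LC_ALL, "" );
    wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>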
M:\bin> a.exe
Blσbµrsyltet°y!
Yes -- that's what Windows ANSI Western, which you asked for, looks like when it is presented with the original IBM PC character set, codepage 437. Switch the codepage to 1252, the codepage number for Windows ANSI, to get the Windows ANSI result that you asked for to display properly. Of course it will lack the Unicode-only characters:

<example>
P:\test> type jam.cpp
∩╗┐#include <cstdio>
int main() { ::wprintf( L"Bl├Ñb├ªrsyltet├╕y! µùѵ£¼σ¢╜ ╨║╨╛╤ê╨║╨░!\n" ); }

P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <cstdio>
int main() { ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" ); }

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Bl�b�rsyltet�y!

P:\test> chcp 437
Active code page: 437

P:\test> jam
Blσbµrsyltet°y!

P:\test> chcp 1252
Active code page: 1252

P:\test> jam
Blåbærsyltetøy!

P:\test> _
</example>

[snip]
M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!
Somewhat better. But how do I get to see the whole string?
Not with any single-byte-per-character encoding. ;-) You can use UTF-8 or UTF-16 for the output. UTF-8 is a bit problematic because the Windows support is really flaky. [snip effort with wide text]
Ah! It's IMPOSSIBLE with wprintf!
No no, you're jumping to conclusions. The Microsoft runtime has special support for this at the C library level, but unfortunately, as far as I know, not at the C++ level. Still, since you're using 'wprintf', that's at the C level, so it's no problem:

<example>
P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <stdio.h>
#include <io.h>      // _setmode
#include <fcntl.h>   // _O_U8TEXT

int main()
{
    _setmode( _fileno( stdout ), _O_U8TEXT );
    ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test> g++ jam.cpp
jam.cpp: In function 'int main()':
jam.cpp:7: error: '_O_U8TEXT' was not declared in this scope

P:\test> g++ jam.cpp -D __MSVCRT_VERSION__=0x0800

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> _
</example>
Let's try UTF-8 instead. [snip effort]
It works! MAGIC! More importantly: ***It's the only way to make it work!***
See above. Those statements are wrong in two important respects.

First wrongness: the Windows console window support for UTF-8 is really, really flaky, so that you get more or less arbitrary "errors". They seem to be connected with timing or something. So UTF-8 is not good: I showed above how to generate UTF-8 from wide char literals just to be exactly comparable to your example code, and the big difference is that I did not have to lie to the compiler and hope for the best. Instead, the code I presented above is well-defined. The result, for my program and for yours (since both output UTF-8), isn't well defined though -- it depends somewhat on the phase of the moon in Seattle, or something.

Second wrongness: it's not the only way. I started very near the top of this thread by giving a concrete example that worked very neatly. It gets tiresome repeating myself. But as you could see in that example, the end programmer does not have to deal with the dirty platform-specific details any more than with all-UTF8.

--- And you absolutely don't want to work with codepage 65001 in the console: it causes batch files and 'more' and pipes etc. to fail. But, you may ask, what about Alf's program, then, it's the same for heaven's sake? Well, let's check:

<example>
P:\test> chcp 1252
Active code page: 1252

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test>
</example>

He he. :-) It works also with the more practical codepage 1252 in the console. The reason is probably that it uses WriteConsole internally, but it doesn't matter much how the runtime library accomplishes this. On the other hand, as with much else Microsoft, there are probably hidden costs. It is possible that invoking this C level support may wreak havoc at the C++ iostreams level, so that a good solution may have to provide custom iostream buffers working around the Microsoft bugs.

[snip about reporting one of the myriad console bugs, to Microsoft]
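For concreteness, a minimal sketch of such a custom buffer -- a guess at the approach, not code from the thread: an unbuffered wide stream buffer handing text to the documented WriteConsoleW API, thereby bypassing the narrow conversion entirely (console output only; a real implementation would also handle redirection and buffering):

<code>
#include <iostream>
#include <streambuf>
#include <windows.h>

// Unbuffered wide streambuf that writes straight to the console.
class console_wbuf : public std::wstreambuf
{
protected:
    virtual int_type overflow( int_type c )
    {
        if ( c != traits_type::eof() )
        {
            wchar_t const wc = traits_type::to_char_type( c );
            DWORD n_written = 0;
            HANDLE const out = GetStdHandle( STD_OUTPUT_HANDLE );
            if ( !WriteConsoleW( out, &wc, 1, &n_written, 0 ) )
            {
                return traits_type::eof();
            }
        }
        return c;
    }
};

int main()
{
    console_wbuf buf;
    std::wostream console_out( &buf );
    console_out << L"Blåbærsyltetøy! 日本国 кошка!\n";
}
</code>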
Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only option from the above.
No, that's incorrect. In my (limited) experience UTF-16 is more reliable for this. However, UTF-16 as an external encoding feels sort of wrong, even if it is very efficient for Japanese network traffic.
Furthermore, if people pester Microsoft we will get more benefit (no pun intended) than from rewriting our code to use some unknown encoding that is different on each platform.
I believe that could greatly ease the porting of *nix tools to Windows. Cheers & hth., - Alf

Alf P. Steinbach wrote:
But, you may ask, what about Alf's program, then, it's the same for heaven's sake?
Well, let's check:
<example>
P:\test> chcp 1252
Active code page: 1252

P:\test> a
Blåbærsyltetøy! 日本国 кошка!
</example>
Neat trick. Apparently, _O_U8TEXT switches to Unicode mode when stdout is a console. Let me try...

| C:\Projects\testbed>chcp
| Active code page: 437
|
| C:\Projects\testbed>release\testbed.exe
| Blåbærsyltetøy! 日本国 кошка!

Yeah.

| C:\Projects\testbed>release\testbed.exe | more
| Bl├Ñb├ªrsyltet├╕y! µùѵ£¼σ¢╜ ╨║╨╛╤ê╨║╨░!

Well. You can't have everything. :-)

| C:\Projects\testbed>release\testbed.exe > testbed.txt
|
| C:\Projects\testbed>type testbed.txt
| Bl├Ñb├ªrsyltet├╕y! µùѵ£¼σ¢╜ ╨║╨╛╤ê╨║╨░!
|
| C:\Projects\testbed>chcp 65001
| Active code page: 65001
|
| C:\Projects\testbed>type testbed.txt
| Blåbærsyltetøy! 日本国 кошка!

Of course, chcp 65001 breaks everything and more. Not that more worked in the first place. :-)

On Sat, Oct 29, 2011 at 18:21, Alf P. Steinbach < alf.p.steinbach+usenet@gmail.com> wrote:
[...]
M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!
Somewhat better. But how do I get to see the whole string?
Not with any single-byte-per-character encoding. ;-)
That's why ANSI codepages other than UTF-8 are crap: they're not suitable for internationalization.
UTF-8 is a bit problematic because the Windows support is really flaky.
It's a problem of Windows, not of UTF-8. Report it to Microsoft, demand UTF-8 support, and meanwhile develop workarounds that let people use UTF-8 portably. In 20 years we may get working UTF-8 support. I understand that you don't give a damn about what will be in 20 years, but I do care.
Still, since you're using 'wprintf', that's at the C level, so it's no problem:
Congratulations! You found a WORKAROUND to properly support WIDE-CHAR, when UTF-8 support is ALREADY THERE. But you know what? There's a similar workaround to output UTF-8 when UTF-8 is not set for the console. Now explain, how is this:

int main() {
    _setmode( _fileno( stdout ), _O_U8TEXT );
    wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

M:\>chcp 1252
Active code page: 1252

M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n

better than this:

int main() {
    SetConsoleOutputCP(CP_UTF8);
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}

M:\>chcp 1252
Active code page: 1252

M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n

‽ How will you explain to Åshild Bjørnson why she can use plain old printf on the workstations at the university but needs all the w's and L's (or your proprietary Unicode wrappers) on her private computer at home? Does 'w' stand for Windows? Or perhaps you want to infect the non-Windows world with wchar_t too?
They seem to be connected with timing or something. So UTF-8 is not good: I showed above how to generate UTF-8 from wide char literals just to be exactly comparable to your example code,
I showed you how you can continue to use UTF-8, resulting in portable code (modulo a call to SetConsoleOutputCP) which behaves the same as yours.
and the big difference is that I did not have to lie to the compiler and hope for the best.
It's not lying. It's just not telling the truth. And in C++11 you won't need it either:

int main() {
    SetConsoleOutputCP(CP_UTF8);
    printf( u8"Blåbærsyltetøy! 日本国 кошка!\n" );
}
Instead, the code I presented above is well-defined. The result, for my program and for yours (since both output UTF-8), isn't well defined though -- it depends somewhat on the phase of the moon in Seattle, or something.
What? It's well defined: both will write UTF-8 bytes to stdout. If you redirect to a file, it's well defined. If you redirect to another program, it's well defined. What may not be well defined is how the receiver interprets this. It will break only when the receiver tries to convert the data to UTF-16 without knowing that it's UTF-8. But then again, that's not restricted to UTF-8; the problem is the same for any 'ANSI' encoding. This is why standardizing on UTF-8 is important.

Second wrongness, it's not the only way.
You don't have stdin and wstdin. stdin has a byte-oriented encoding and thus the only way to transfer Unicode data through it is with UTF-8. If you want to use wprintf—good, the library will do the conversion for you. But it still has to be translated to UTF-8. If you don't use UTF-8 you won't be Unicode-compatible. If you're not Unicode-compatible, that means you're stuck in the 20th century.

⚠ The importance of Unicode is not only in multilingual support; it's important even within one language such as English—“fiflffffiffl”… No 'ANSI' non-UTF-8 codepage can encode these.
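(Returning to the stdin point above: a minimal sketch of the receiving side, using the documented MultiByteToWideChar API, error handling omitted. The pipe carries plain bytes; as long as both ends agree that those bytes are UTF-8, the full Unicode text survives.)

<code>
#include <stdio.h>
#include <string.h>
#include <windows.h>

int main()
{
    // stdin delivers bytes, not wide characters. Read a UTF-8 line
    // and decode it to UTF-16 for use with wide APIs.
    char bytes[256];
    if ( fgets( bytes, sizeof bytes, stdin ) != 0 )
    {
        wchar_t wide[256];
        int const n = MultiByteToWideChar( CP_UTF8, 0,
                                           bytes, (int)strlen( bytes ),
                                           wide, 255 );
        wide[n] = L'\0';
        // 'wide' now holds the same text as UTF-16.
    }
}
</code>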
I started very near the top of this thread by giving a concrete example that worked very neatly. It gets tiresome repeating myself. But as you could see in that example, the end programmer does not have to deal with the dirty platform-specific details any more than with all-UTF8.
She does. She needs to use your redundant u::sprintf when the narrow-character STANDARD sprintf works just fine.

It works also with the more practical codepage 1252 in the console.
My default is not 1252. Stop being Euro-centric. UTF-8 works with 1252 too as shown above. [...]
In my (limited) experience UTF-16 is more reliable for this.
How is it more reliable?

-- Yakov

On Sun, Oct 30, 2011 at 19:28, Yakov Galka <ybungalobill@gmail.com> wrote:
M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n
And don't come at me saying it wasn't copied from the console. Here is a real copy-paste, for your peace of mind:

M:\censored>chcp 1252
Active code page: 1252

M:\censored>a.exe
Blåbærsyltetøy! 日本国 кошка!

-- Yakov

On 30.10.2011 18:38, Yakov Galka wrote:
On Sun, Oct 30, 2011 at 19:28, Yakov Galka<ybungalobill@gmail.com> wrote:
M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n
And don't come at me saying it wasn't copied from the console. Here is a real copy-paste, for your peace of mind:
M:\censored>chcp 1252
Active code page: 1252
M:\censored>a.exe
Blåbærsyltetøy! 日本国 кошка!
Hey, have I ever indicated that I don't believe you? He he. I think what that indicates is that console windows and some other tools apparently use statistical methods to infer the encoding. There was once an infamous bug in that functionality, which meant that when you wrote a particular sentence about George Bush in Notepad, saved the file as Unicode, and reloaded it, Notepad told you he's a liar... :-)

Cheers & hth.,

- Alf

On 30.10.2011 18:28, Yakov Galka wrote:
On Sat, Oct 29, 2011 at 18:21, Alf P. Steinbach<
Upthread, Yakov Galka wrote:
Somewhat better. But how do I get to see the whole string?
Not with any single-byte-per-character encoding. ;-)
That's why ANSI codepages other than UTF-8 are crap: they're not suitable for internationalization.
Nobody has suggested using Windows ANSI for internationalization. So your use of the four-letter word "crap" is, so to speak, wasted.
UTF-8 is a bit problematic because the Windows support is really flaky.
It's a problem of Windows, not of UTF-8. Report it to Microsoft, demand UTF-8 support, and meanwhile develop workarounds that let people use UTF-8 portably. In 20 years we may get working UTF-8 support. I understand that you don't give a damn about what will be in 20 years, but I do care.
Uh, a four-letter word again. I suggest reserving them for where they suitably describe reality. E.g., I used a four-letter word once in this discussion, namely in "hell of a lot more" about the Windows console bugs. By the way, I can assure you that telepathy does not work: the claimed insight into my motivations etc. is incorrect (making such a claim is also an invalid form of rhetoric, but that's less important).
Still, since you're using 'wprintf', that's at the C level, so it's no problem:
Congratulations! You found a WORKAROUND to properly support WIDE-CHAR, when UTF-8 support is ALREADY THERE.
Please reserve all uppercase for macro names. And no, I have so far not had the pleasure of learning anything technical from this thread, unless you count the lie-to-g++ trick applied to the visual c++ compiler, but that's more psychological. I think that this one-sided learning, that various aspects of the reality have apparently not been well known to Boosters, means that the review process at Boost for this case has probably not involved the right kind of critical people knowledgeable in the domain.
But you know what? There's a similar workaround to output UTF-8 when UTF-8 is not set for the console. Now explain, how is this:
int main() {
    _setmode( _fileno( stdout ), _O_U8TEXT );
    wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}
M:\>chcp 1252
Active code page: 1252
M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n
better than this:
int main() {
    SetConsoleOutputCP(CP_UTF8);
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
The first program, with the wide string literal, does not require you to lie to the Visual C++ compiler about the source code encoding. The second program does require you to lie to the compiler. Hence,

(1) wide string literals with non-ASCII characters will be mangled, cutting off use of an otherwise well defined language feature,

(2) a later version of the compiler may be able to infer the UTF-8 encoding in spite of the lacking BOM, then mangling the text,

(3) you have to invoke inefficient data conversions for any use of functions that adhere to Windows conventions, which includes most Windows libraries and of course the Windows API -- e.g., try MessageBox,

(4) you force Windows programmers to deal with 3 text encodings (ANSI, UTF-8 and UTF-16) instead of just 2 (ANSI and UTF-16), and

(5) by "overloading" narrow character strings with two main character encodings, you make it easy to introduce encoding-related bugs which can only be found by laborious run-time testing.

So, it is an inefficient hack that can stop working, that cuts off a well defined language feature, forces complexity and attracts bugs. In my opinion it is not a good idea to base a Boost library on an inefficient hack that can stop working, that cuts off a language feature that's much used in Windows, and that on top of that forces complexity and attracts bugs that can only be found by testing.

[snip]
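To make point (3) above concrete, here is a sketch of the conversion dance for a single API call; utf8_to_wide is a made-up helper name, built on the documented MultiByteToWideChar API:

<code>
#include <string>
#include <windows.h>

// Hypothetical helper: UTF-8 bytes in, UTF-16 out.
std::wstring utf8_to_wide( std::string const& s )
{
    if ( s.empty() ) { return std::wstring(); }
    int const n = MultiByteToWideChar(
        CP_UTF8, 0, s.data(), (int)s.size(), 0, 0 );
    std::wstring result( n, L'\0' );
    MultiByteToWideChar(
        CP_UTF8, 0, s.data(), (int)s.size(), &result[0], n );
    return result;
}

int main()
{
    std::string const msg = "Bl\xC3\xA5b\xC3\xA6rsyltet\xC3\xB8y!"; // UTF-8
    // Every such call site pays for a conversion and a temporary:
    MessageBoxW( 0, utf8_to_wide( msg ).c_str(), L"Demo", MB_OK );
}
</code>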
How will you explain to Åshild Bjørnson why she can use plain old printf on the workstations at the university but needs all the w's and L's (or your proprietary Unicode wrappers) on her private computer at home?
I would not, since that's not the case. Also, I do not have any proprietary wrappers; that's also incorrect.
Does 'w' stand for Windows?
AFAIK "w" has not appeared in this thread, unless you're thinking of the standard library's wprintf etc. I do not know what it otherwise stands for or is. Note that using wprintf or "L" literals directly is not portable, so if that's what you thinking of then it's a non-issue.
Or perhaps you want to infect the non-Windows world with wchar_t too?
I am baffled by your assumption that wchar_t is not used at all in the *nix world. And I am also baffled by your lack of understanding of the scheme I have described many times. So for the record, I have not been talking about using an unnatural representation for the platform at hand. Instead I have argued for the opposite, namely to use the natural encoding for the platform -- which to my mind is much of what C++ is all about: diversity, adaptation & raw efficiency, and, instead of the Java idea of binary-level portability, C++-style efficient (but less convenient) source-code-level portability. For that matter I'm also baffled by the four-letter-word attack on Windows ANSI for internationalization, which is impossible and so is not done; it is a non-existent scheme you attacked there.

[snip]
It's not lying. It's just not telling the truth.
To lie is to intentionally make someone believe something that one believes is not true. One can lie by stating the truth. And in this case, one lies by omitting a crucial fact (namely the BOM).
And in C++11 you won't need it either:
int main() {
    SetConsoleOutputCP(CP_UTF8);
    printf( u8"Blåbærsyltetøy! 日本国 кошка!\n" );
}
Yes, this is indeed a point in favor of the UTF-8 scheme: that C++11 partially supports it. Knowledge of the encoding is however discarded: the end result is just an array of 'char', which unfortunately, on the Windows platform, by convention is expected to be encoded as ANSI... -> bugs.
Instead, the code I presented above is well-defined. The result, for my program and for yours (since both output UTF-8), isn't well defined though -- it depends somewhat on the phase of the moon in Seattle, or something.
What? It's well defined: both will write UTF-8 bytes to stdout. If you redirect to a file, it's well defined. If you redirect to another program, it's well defined. What may not be well defined is how the receiver interprets this. It will break only when the receiver tries to convert the data to UTF-16 without knowing that it's UTF-8. But then again, that's not restricted to UTF-8; the problem is the same for any 'ANSI' encoding. This is why standardizing on UTF-8 is important.
No, I was talking about the console window support. A console window will itself often partially mangle UTF-8 output, in particular the first letter. At least it has done that when I have tested out the examples for this thread. However, supporting UTF-8 more directly with e.g. a SetConsoleOutputCP call appears to work for direct presentation.
Second wrongness, it's not the only way.
You don't have stdin and wstdin. stdin has a byte-oriented encoding and thus the only way to transfer Unicode data through it is with UTF-8. If you want to use wprintf—good, the library will do the conversion for you. But it still has to be translated to UTF-8. If you don't use UTF-8 you won't be Unicode-compatible. If you're not Unicode-compatible, that means you're stuck in the 20th century.
I am not sure what you're arguing here. The bit about "the only way" is technically wrong. However, I /think/ what you're trying to communicate is that UTF-8 is good as a kind of universal external encoding. And if so, then I wholeheartedly agree. However, we have been discussing internal text representation.
⚠ The importance of Unicode is not only in multilingual support; it's important even within one language such as English—“fiflffffiffl”… No 'ANSI' non-UTF-8 codepage can encode these.
Yes, you have words like "maneuver", which is properly spelled with an oe ligature that I once mistakenly thought was a Norwegian "æ"!
I started very near the top of this thread by giving a concrete example that worked very neatly. It gets tiresome repeating myself. But as you could see in that example, the end programmer does not have to deal with the dirty platform-specific details any more than with all-UTF8.
She does. She needs to use your redundant u::sprintf when the narrow-character STANDARD sprintf works just fine.
Oh, the standard sprintf starts yielding incorrect results as soon as some ANSI text has sneaked into the mix, or when Visual C++ 12 (say) has discovered that your BOM-less source code is UTF-8 encoded. With something like u::sprintf one is to some extent protected by having the encoding statically type-checked. You could say that, compared to C, more and stronger static type checking is a large part of what C++ is all about. ;-)
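To illustrate the static-checking idea (u::sprintf itself is Alf's hypothetical interface, not shown in the thread), a minimal sketch of a distinct type for UTF-8 text, so that plain narrow strings cannot be passed where UTF-8 is expected without a visible, explicit conversion:

<code>
#include <string>

// A distinct type for UTF-8 text. The explicit constructor is the
// whole point: ANSI text cannot sneak in via implicit conversion.
class utf8_string
{
public:
    explicit utf8_string( std::string const& bytes ) : bytes_( bytes ) {}
    std::string const& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

void write_utf8( utf8_string const& s ) { (void)s; /* ...emit s.bytes()... */ }

void f( char const* ansi_text )
{
    // write_utf8( ansi_text );              // error: wrong type
    write_utf8( utf8_string( ansi_text ) );  // compiles, but the
                                             // "conversion" is now
                                             // visible at the call site
}
</code>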
It works also with the more practical codepage 1252 in the console.
My default is not 1252. Stop being Euro-centric. UTF-8 works with 1252 too as shown above.
[...]
In my (limited) experience UTF-16 is more reliable for this.
How is it more reliable?
A console window will in some cases mangle the first character of UTF-8 output. I don't know why. And the [cmd.exe] "/u" option for supporting Unicode in pipes is reportedly UTF-16 (disclaimer: I haven't used it).

Cheers & hth.,

- Alf

PS: Sorry that I don't have time to answer all responses.

Alf P. Steinbach wrote, about chcp 65001:
it breaks a hell of a lot more than batch files. Try `more`.
Yes. Life isn't perfect. Incidentally, 'more' demonstrates once again the superiority of UTF-8 (if it worked):

C:\Projects\testbed\tmp>dir
 Volume in drive C has no label.
 Volume Serial Number is 34C7-A38D

 Directory of C:\Projects\testbed\tmp

29.10.2011  17:28    <DIR>          .
29.10.2011  17:28    <DIR>          ..
29.10.2011  17:25                 0 Blåbærsyltetøy! 日本国 кошка!.txt
               1 File(s)              0 bytes
               2 Dir(s)  856,726,167,552 bytes free

C:\Projects\testbed\tmp>dir | more
 Volume in drive C has no label.
 Volume Serial Number is 34C7-A38D

 Directory of C:\Projects\testbed\tmp

29.10.2011  17:28    <DIR>          .
29.10.2011  17:28    <DIR>          ..
29.10.2011  17:25                 0 Blåbærsyltetoy! ??? ?????!.txt
               1 File(s)              0 bytes
               2 Dir(s)  856,726,167,552 bytes free

The "dir" command has no problem displaying arbitrary file names directly to the console (presumably via WriteConsoleW), but once it has to write to a pipe or file, it needs to convert to narrow, and no code page other than 65001 can express the above file name. (My default console code page is 437, which doesn't even have ø. The Consolas font doesn't have glyphs for 日本国, but the characters are present, just not displayable, which is why I could copy and paste them here.)

It would've been nice for Microsoft to set all the narrow code pages to UTF-8 in Windows NT (or 64-bit Windows, the other transition point), but they didn't, so here we are.
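The split behavior Peter describes can be reproduced directly; a minimal sketch using the documented GetConsoleMode and WriteConsoleW APIs: when stdout is a real console any UTF-16 text can be shown, but once it is redirected the program itself must pick a byte encoding:

<code>
#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE const out = GetStdHandle( STD_OUTPUT_HANDLE );
    DWORD mode = 0;
    if ( GetConsoleMode( out, &mode ) )
    {
        // A real console: UTF-16 goes through unmangled.
        DWORD n_written = 0;
        WriteConsoleW( out, L"日本国\n", 4, &n_written, 0 );
    }
    else
    {
        // Redirected to a pipe or file: we must choose some byte
        // encoding ourselves; here, hand-written UTF-8.
        fputs( "\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD\n", stdout );
    }
}
</code>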

On Sat, Oct 29, 2011 at 16:41, Peter Dimov <pdimov@pdimov.com> wrote:
It would've been nice for Microsoft to set all the narrow code pages to UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but they didn't, so here we are.
They can do it anytime. It won't break anything. You already cannot rely on a specific narrow code page, and it can even be variable-length (e.g. Shift-JIS). They don't do it intentionally (http://bit.ly/2Pdaa).

-- Yakov

Yakov Galka wrote:
On Sat, Oct 29, 2011 at 16:41, Peter Dimov <pdimov@pdimov.com> wrote:
It would've been nice for Microsoft to set all the narrow code pages to UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but they didn't, so here we are.
They can do it anytime. It won't break anything. You already cannot rely on a specific narrow code page, and it can even be variable-length (e.g. Shift-JIS). They don't do it intentionally (http://bit.ly/2Pdaa).
They can, but it will be a lot of pain in the short term. It will break all programs that require a specific code page (such as Latin-1 or Shift-JIS) and can afford to do so because all Windows installations in the country are on the same code page and hardly anyone uses file names outside this code page.

On Sat, Oct 29, 2011 at 17:21, Peter Dimov <pdimov@pdimov.com> wrote:
Yakov Galka wrote:
On Sat, Oct 29, 2011 at 16:41, Peter Dimov <pdimov@pdimov.com> wrote:
It would've been nice for Microsoft to set all the narrow code pages to UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but they didn't, so here we are.
They can do it anytime. It won't break anything. You already cannot rely on a specific narrow code page, and it can even be variable-length (e.g. Shift-JIS). They don't do it intentionally (http://bit.ly/2Pdaa).
They can, but it will be a lot of pain in the short term. It will break all programs that require a specific code page (such as Latin-1 or Shift-JIS) and can afford to do so because all Windows installations in the country are on the same code page and hardly anyone uses file names outside this code page.
OK, the problem here is not that it's not the default (we have a long way to go for that) but that they don't even implement it as an option. I can't even imagine how hard they had to fail in order to implement 'more' so that it doesn't work with UTF-8. Maybe it's intentional? Who volunteers to RE 'more'?

-- Yakov

On 28.10.2011 14:41, Yakov Galka wrote:
On Fri, Oct 28, 2011 at 13:58, Peter Dimov<pdimov@pdimov.com> wrote:
Alf P. Steinbach wrote:
How do I make the following program work with Visual C++ in Windows, using narrow character strings?
<code>
#include <stdio.h>
#include <fcntl.h>      // _O_U8TEXT
#include <io.h>         // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>
Output to a console wasn't our topic so far (and is not one of my strong points), but the specific problem with this program is that the embedded literal is not UTF-8, as the warning C4566 tells us, so there is no way for you to get UTF-8 in the output. (You should be able to set VC++'s code page to 65001, but I don't think you can.)
int main() { printf( utf8_encode( L"кошка" ).c_str() ); }
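(utf8_encode is not defined anywhere in the thread; a plausible sketch of it, built on the documented WideCharToMultiByte API, might be:)

<code>
#include <string>
#include <windows.h>

// Plausible definition of the helper used above: UTF-16 in, UTF-8 out.
std::string utf8_encode( std::wstring const& s )
{
    if ( s.empty() ) { return std::string(); }
    int const n = WideCharToMultiByte(
        CP_UTF8, 0, s.data(), (int)s.size(), 0, 0, 0, 0 );
    std::string result( n, '\0' );
    WideCharToMultiByte(
        CP_UTF8, 0, s.data(), (int)s.size(), &result[0], n, 0, 0 );
    return result;
}
</code>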
You don't need to configure anything; in fact you cannot do it properly in VS. What you can do is:

1) don't use wide-char literals with non-ASCII characters
2) use UTF-8 literals for narrow-char.

All you need is to save the source as UTF-8 WITHOUT BOM. Works like a charm on VS2005 and VS2010. Apparently it's portable. The IDE can detect UTF-8 even without a BOM ("☑ Auto-detect UTF-8 encoding without signature").
This is interesting in a perverse sort of way. In order to make Visual C++ produce UTF-8 encoded compiled narrow strings, one must /lie/ to the compiler. The source code is UTF-8. And one lies and tells the Visual C++ compiler that it's ANSI. And in order to make g++ produce ANSI encoded compiled narrow strings, one must /lie/ to the compiler. The source code is ANSI. And one lies and tells the g++ compiler that it's UTF-8. As I see it, there's something wrong here. Notwithstanding the limitation that codepage 65001 is impractical in the Windows command interpreter -- e.g. the 'more' command CRASHES.
This is not a practical problem for "proper" applications because Russian text literals should always come from the equivalent of gettext and never be embedded in code.
+1
I find that a very narrow-minded view. Would you like to be the one telling Norwegian student Åshild Bjørnson that you favor the notion that she should waste hours or days installing Boost and some other *nix-oriented library and use 'gettext', in order to be able to display her name in her first C++ program? That text representation and output in C++ has been designed (with your not just willing but enthusiastic vote) to be so inherently complex that it requires hours and days of effort just to display your name?
Personally I'm happy with
printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
writing UTF-8. Even if I cannot configure the console, I still can redirect it to a file, and it will correctly save this as UTF-8. Preventing data-loss is more important for me.
I find it thoroughly disgusting to have to lie to your tools, and to rely on an assumption that the tools will not wisen up in the future. However, I concede the point: IF one is happy with output that's encoded so that most Windows command line tools fail (e.g. `more` crashes), and IF one is happy with lying to the compiler about the source encoding, and IF one is happy assuming that the compiler won't wisen up about encodings in a future version, then -- the UTF-8 scheme allows literals with national language characters, not just A through Z. However, those are pretty constricting conditions.

Cheers & hth.,

- Alf

Alf P. Steinbach wrote:
Would you like to be the one telling Norwegian student Åshild Bjørnson that you favor the notion that she should waste hours or days installing Boost and some other *nix-oriented library and use 'gettext', in order to be able to display her name in her first C++ program?
No, of course not. Our original topic was Boost.Locale, not the first program of a Norwegian student. But, consider the topic changed, and please do let me know what you suggest said student should do.

On 20:59, Alf P. Steinbach wrote:
Would you like to be the one telling Norwegian student Åshild Bjørnson that you favor the notion that she should waste hours or days installing Boost and some other *nix-oriented library and use 'gettext', in order to be able to display her name in her first C++ program?
Or you could use only ASCII in the source code, encoding the strings in UTF-8 manually using octal escape sequences.

#include <iostream>

int main()
{
    std::cout << "\303\205shild Bj\303\270rnson\n";
}

Not that nice to the eye, but anyway...

Regards, Anders Dalvander

-- WWFSMD?

Alf P. Steinbach wrote:
Option 3 means -- it requires, as far as I can see -- some abstraction that hides the narrow/wide representation so as to get source code level portability, which is all that matters for C++. It doesn't need to involve very much. Some typedefs, traits, references.
For example, write a portable string literal like this:
PS( "This is a portable string literal" )
[snip]
The main drawback is IMO the need to use something like a PS macro for string and character literals, or a C++11 /user defined literal/. Windows programmers are used to that, writing _T("blah") all the time as if Windows 95 was still extant. So, considering that all that current labor is being done for no reward whatsoever, I think it should be no problem convincing programmers that writing a few characters more in order to get portable string literals is worth it; it just needs exposure to examples from some authoritative source...
The problem with that approach is that existing, non-Windows code must be painstakingly altered to introduce such manual portability constructs. If code was already written using the Microsoft facilities for portability, it's a relatively easy transition to make (s/_T/PS/, for example). Regardless of authoritative examples, inertia is against your idea.

_____
Rob Stewart
robert.stewart@sig.com
Software Engineer using std::disclaimer;
Dev Tools & Components
Susquehanna International Group, LLP
http://www.sig.com

Alf, On Thu, Oct 27, 2011 at 5:12 PM, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
... Thanks for that clarification of the current thinking at Boost. ...
Please understand that Boost isn't a single library, but rather a collection of 100 or so individual libraries. So there isn't any single "current thinking at Boost" on any topic that has library or application dependent aspects. That said, Peter Dimov's replies do represent the thinking of many Boost developers and library maintainers, including me :-)

--Beman

Alf, All,

What the replies seem to be missing here is that what you call the "least surprise" behavior of the code with an argument of main() is simply incorrect from the software engineering point of view. Let me explain:
3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.
Now, when accepting a filename from the user's command line on Windows, it is simply not possible to use the narrow-string version of main(). Your code cannot force your users to limit their input to characters representable in the current ANSI codepage. If the command line parameter is a filename, as in the example you suggested, you cannot tell them "never double-click on some files" (if the program is used in a file association). Supporting is always better than white-listing, so the only acceptable way of using a command line parameter which is a filename on Windows is with the UTF-16 version - _tmain(). Then, proceed as Artyom explained. The surprise is then justified - it prevented a hard-to-spot bug.

My preference on Windows though would be different (and not due to religious reasons) - convert all API-returned strings to UTF-8 as soon as possible and forget about encoding issues for good. See http://programmers.stackexchange.com/questions/102205/should-utf-16-be-consi...

On 31.10.2011 18:18, bugpower wrote:
Alf, All,
What the replies seem to be missing here is that what you call the "least surprise" behavior of the code with an argument of main() is simply incorrect from the software engineering point of view. Let me explain:
3. the most natural sufficiently general native encoding, 1 or 2 depending on the platform that the source is being built for.
Now, when accepting a filename from the user's command line on Windows, it is simply not possible to use the narrow-string version of main().
Well, there are three aspects to that claim:

1 The limitations of `main` in Windows.

Regarding aspect (1), the C++ standard does describe the `main` arguments as "MBCS" strings, meaning they can (should) be encoded with possibly two or more bytes per character, as in for example UTF-8, which to me is strongly implied. However, the Windows convention for the C++ executable narrow character set predates the C++ standard by a long shot, and even predates the C standard, and is Windows ANSI. And that convention is /very/ deeply embedded, not only in the runtime library implementations but e.g. in how Visual C++ translates string literals.

2 What you're trying to communicate.

Regarding aspect (2), more quoted concrete context could help make it clear to readers what you're trying to say. I'm not a telepath. But it does sound like you're arguing against a straw man of your own devising, as if someone had argued for using ANSI-encoded arguments in general, or as some solution for i18n. So, I will put my initial remark above more strongly: please always /quote/ what you're referring to. Especially when you are offering something that sounds like an argument against something, please /quote/ what you're referring to.

3 The literal claim that "it is simply not possible to use the narrow-string version of main()".

Regarding aspect (3), this claim is incorrect. However, many people think that one has to use a non-standard startup function like WinMain, that one has to ditch some parts of the C++ standard as soon as one does Windows. So, some basic technical facts:

The GNU toolchain (the g++ compiler) happily accepts a standard `main` startup function without further ado, regardless of Windows subsystem. The Microsoft toolchain (the Visual C++ compiler and linker), however, is less adept at recognizing your startup function as such. So with the MS toolchain you have to specify the startup function explicitly if you're building a GUI subsystem program and want a standard `main`. The relevant linker options: "/entry:mainCRTStartup /subsystem:windows".

---

Finally, note how I had to cover a lot of bases and use a lot of time in responding to your single little sentence. That's because that sentence was very *unclear* and *misleading*, and, given the next sentence quoted below, I hope it was not so by design.
Your code cannot force your users to limit their input to characters representable in the current ANSI codepage.
Ignoring the misleading "your", and responding to the technical content only: the previous sentence talked about `main` arguments, and this following sentence talks about "input", so it seems that you are confusing two different aspects that have very different behaviors in Windows.

In Windows the program arguments are always passed to the process as a single UTF-16 encoded command line string, available via the API function GetCommandLine. For the command line it is therefore meaningless to talk about restricting the user.

Standard /input/, OTOH, is always passed via some narrow character encoding, which does not include UTF-16, and which by convention is neither ANSI nor UTF-16 but the extraordinarily impractical OEM codepage (on an English PC that codepage is the original IBM PC character set). Happily it is possible to change the narrow character encoding used for input. For example, it can be changed to UTF-8 as the external encoding. This is called the "active codepage" in a console window, and it can also be changed by the user, e.g. with the commands 'mode' and 'chcp'.

The dangers of selecting UTF-8 as the active codepage in a command interpreter console window have been discussed else-thread; in short, Microsoft has a large number of ridiculous bugs in their support. But that discussion also showed that it's (very probably) OK under program control.
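A minimal sketch of what this means in practice, using the documented GetCommandLineW and CommandLineToArgvW APIs (link with shell32): the UTF-16 arguments are available regardless of which main variant was linked:

<code>
#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW
#include <stdio.h>

int main()
{
    // The process command line is stored as UTF-16 no matter what;
    // a narrow main()'s argv is merely an ANSI-converted copy of it.
    int n_args = 0;
    wchar_t** const args =
        CommandLineToArgvW( GetCommandLineW(), &n_args );
    if ( args != 0 )
    {
        for ( int i = 0; i < n_args; ++i )
        {
            wprintf( L"arg %d: %ls\n", i, args[i] );
        }
        LocalFree( args );
    }
}
</code>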
If the command line parameter is a filename, as in the example you suggested, you cannot tell them "never double-click on some files" (if the program is used in a file association).
What example? Please always quote what you refer to, and quote enough of that context: don't be ambiguous, don't leave it to readers to infer a context based on your possibly wrong understanding of it. Anyway, text passed as `main` arguments can be e.g. the user's name, which is not necessarily a filename.
Supporting is always better than white-listing, so the only acceptable way of using a command line parameter which is a filename on Windows is with the UTF-16 version - _tmain().
Oh dear. Are you seriously suggesting using `_tmain` to stay compatible with Windows 9x? Note that for Windows 9x, `_tmain` maps to standard `main`. And note that in Windows, `main` has Windows ANSI-encoded arguments...

`_tmain` is a Microsoft macro that helped support compatibility with Windows 9x before the Layer for Unicode was introduced in 2001. `_tmain` maps to the narrow character standard `main` or the wide character non-standard `wmain` depending on the `_UNICODE` macro symbol. `_tmain` was an abomination even in its day, and today there are no reasons whatsoever to obfuscate the code that way.
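For readers who have not met it, the mapping amounts to roughly this -- a simplified sketch of what <tchar.h> effectively does, not its literal contents:

<code>
// Simplified sketch of the <tchar.h> machinery:
#ifdef _UNICODE
    #define _tmain wmain        // wide, non-standard entry point
    typedef wchar_t _TCHAR;
#else
    #define _tmain main         // narrow, standard entry point; in
    typedef char _TCHAR;        // Windows its argv is ANSI-encoded
#endif

// A program written in this style then reads:
// int _tmain( int argc, _TCHAR* argv[] ) { ... }
</code>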
Then, proceed as Artyom explained. The surprise is then justified - it prevented a hard-to-spot bug. My preference on Windows though would be different (and not due to religious reasons) - convert all API-returned strings to UTF-8 as soon as possible and forget about encoding issues for good.
No, it does not let you forget about encoding issues. Rather it introduces extra bug attractors, since you have then overloaded the meaning of char-based text. By convention in Windows and in most existing Windows code, char is ANSI-encoded, so other code will expect ANSI encoding from the UTF-8 based code, which will tend to introduce bugs. And other code will produce ANSI encoded text for the UTF-8 based code, which will also tend to introduce bugs. Thus adding another possible encoding is absolutely not a good idea wrt. bugs.

And you're adding inefficiency for all the myriad internal conversions.

And you're adding either an utterly silly restriction to English A-Z in literals, or -- for the let's-lie-to-the-compiler UTF-8 source without BOM served to the Visual C++ compiler -- a requirement that the wide character literals language feature is not used, plus hoping for the best with respect to how smart later versions of the compiler will be.

And most Windows libraries abide by Windows conventions, so it means extra work for supporting most library code. I.e. O(n) work writing inefficient data-converting wrappers for an unbounded set of functions, instead of just O(1) work writing efficient pointer-type-converting wrappers for a fixed set of functions.

Think about it. As far as I know there is not *one single technical problem* that the all-UTF-8 scheme solves. I.e., AFAIK from a purely technical POV it's dumb.
See http://programmers.stackexchange.com/questions/102205/should-utf-16-be-consi...
Hm, was that an associative reference? Let me quote from the question:

<quote>
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

- Opera has problems with editing them (deleting requires 2 presses on backspace)
- Notepad can't deal with them correctly (deleting requires 2 presses on backspace)
- File name editing in Windows dialogs is broken (deleting requires 2 presses on backspace)
- All QT3 applications can't deal with them - they show two empty squares instead of one symbol.
- Python encodes such characters incorrectly when used directly: u'X'!=unicode('X','utf-16') on some platforms when X is a character outside the BMP.
- Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
- StackOverflow seems to remove these characters from the text if edited directly as Unicode characters (these characters are shown using HTML Unicode escapes).
- WinForms TextBox may generate an invalid string when limited with MaxLength.
</quote>

Here the poster lists concrete examples of how many common applications already have bugs in their Unicode handling, showing by example that Unicode is tricky to get right. Is it then a good idea to needlessly, and at great cost, add further confusion about whether narrow characters are encoded as ANSI or UTF-8?

Cheers & hth.,

- Alf

From: Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com>
[...]
It is a level of indirection.
You want Boost.Filesystem to assume /the same/ narrow character encoding as Boost.Locale, whatever it is.
And to quote the docs where I found that program,
"Boost Locale fully supports both narrow and wide API. The default character encoding is assumed to be UTF-8 on Windows."
I will say it once again, and for the last time.

1. Boost.Locale is a **localization** library, and localization today is done using **Unicode**, not cp1252, cp936 or cp1255. And UTF-8 is the **Unicode** encoding for narrow strings. So _any_ localization library **must** use a Unicode encoding, otherwise it will be useless crap.

2. If you write software for Windows and want to use the ANSI encoding by default, all you need is to add a _single_ line to your code. I give you a choice to use whatever you want. But the default should be suitable for **localization** - the reason this library was written.

Now, you may not like the design of the Boost.Locale library, or you may not like its defaults. Legitimate. But using UTF-8 by default was one of the few points that had total agreement between all Boost.Locale reviewers.

Using UTF-8 by default is indeed a strategic decision. You may call it political, I may call it practical. You may not like it, but this is what will remain, because it is the way the library is designed and it is one of its central parts.

You don't like it? OK... I have given you an option to change it. I think you and other users will survive this one extra line that changes the default encoding to ANSI instead of cross-platform UTF-8.

Best Regards,
Artyom Beilis

--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

then it fails to work when the literal string is replaced with a `main` argument.
A conversion is then necessary and must be added.
It breaks the principle of least surprise.
It breaks the principle of not paying for what you don't (want to) use.
Did you read this? http://beta.boost.org/doc/libs/1_48_0_beta1/libs/locale/doc/html/default_enc...

You can **easily** switch to ANSI as the default... But you won't want to (you'd rather switch to UTF-16 or UTF-8), especially when you actually use localization... :-)
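The "single line" in question is presumably the generator's ANSI switch; a sketch based on the Boost.Locale documentation's use_ansi_encoding option:

<code>
#include <boost/locale.hpp>

int main()
{
    boost::locale::generator gen;
    // The single line: prefer the ANSI code page over UTF-8 as the
    // narrow-string encoding on Windows (no effect elsewhere).
    gen.use_ansi_encoding( true );
    std::locale::global( gen.generate( "" ) );
}
</code>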
I understand, from discussions elsewhere, that the author(s) have chosen a narrow string encoding that requires inefficient & awkward conversions in all directions, for political/religious reasons.
No, you haven't read the rationale correctly, and you didn't read what is written in the link I had given. If you write "Windows-only" software you should either set the ANSI option or use the native encoding, UTF-16. Otherwise, stick to cross-platform UTF-8.
Maybe my understanding of that is faulty, that it's no longer politics & religion but outright war (and maybe that war is even over, with even Luke Skywalker dead or deadly wounded). However, I still ask:
why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do it right, using the platforms' native encodings.
Windows' native encoding is not ANSI. It is the wide/UTF-16 encoding.

-----------------------------------------------------

If you're still not convinced: using UTF-8 by default was one of the important pluses this library brings, and it was noticed by many reviewers.

Artyom

On 27 October 2011 18:19, Alf P. Steinbach <alf.p.steinbach+usenet@gmail.com> wrote:
On 27.10.2011 19:06, Artyom Beilis wrote:
Windows' native encoding is not ANSI. It is the wide/UTF-16 encoding.
Try using UTF-16 with narrow strings.
You simply don't do that, do you, without conversion to a wide string type.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org
participants (9)
- Alf P. Steinbach
- Anders Dalvander
- Artyom Beilis
- Beman Dawes
- bugpower
- Mateusz Łoskot
- Peter Dimov
- Stewart, Robert
- Yakov Galka