
On Wed, Apr 20, 2011 at 3:54 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/19/2011 9:05 AM, Artyom wrote:
From: Edward Diener<eldiener@tropicsoft.com> On 4/19/2011 3:17 AM, Matus Chochlik wrote:
[snip/]
Take your pick :-)
My pick is to use what the language currently provides, which is wchar_t, which can represent UTF-16,
No, it cannot represent UTF-16; it represents either UTF-32 (on most platforms around) or UTF-16 (on one specific platform, Microsoft Windows).
Then clearly it can represent UTF-16.
a popular Unicode variant which also happens to be the standard for wide characters on Windows, which just happens to be the dominant operating system in the world (by a lot) in terms of end-users.
*Only* on a single platform (which you claim to be the most dominant, which is only partially true). On other platforms wchar_t does not represent UTF-16. Actually, the situation with wchar_t is only slightly better than with char: just as the standard does not specify what encoding char-based strings use, it also does not specify the encoding for wchar_t. And wchar_t using UTF-16 on Windows is not a standard, it is a custom. I still remember the times when wchar_t used to be UCS-2. [snip/]
I beg your pardon?
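To make the point about wchar_t concrete, here is a minimal sketch of how the number of wchar_t elements needed for one code point depends on the platform. The helper names `to_utf16` and `wchar_units_for` are hypothetical, and the sketch assumes that a 2-byte wchar_t holds UTF-16 code units and a wider wchar_t holds whole code points, which is a convention and not something the C++ standard guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units (hypothetical helper).
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Must be a valid scalar value: in range and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: a high/low surrogate pair.
    cp -= 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) };
}

// Number of wchar_t elements one code point occupies on this platform,
// ASSUMING 2-byte wchar_t means UTF-16 and anything wider means UTF-32.
std::size_t wchar_units_for(std::uint32_t cp) {
    return sizeof(wchar_t) == 2 ? to_utf16(cp).size() : 1;
}
```

Under these assumptions, U+1F600 needs two wchar_t elements where wchar_t is 16 bits wide and one where it is 32 bits wide, so code that treats one wchar_t as one character is only correct on some platforms.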
Apologies. I should not have said that you have a closed mind about this issue.
UTF-8 is a standard far beyond what Linux uses.
A standard for what? There are three Unicode character sets which are generally used: UTF-8, UTF-16, and UTF-32. I cannot understand why you think one of them is some sort of standard for something.
Then look at which of those encodings is "dominant" on the Web (HTML pages, PHP scripts, template files for various CMSs, CSS files, WSDL files, ...), in various database systems (most of those that adopted wchar_t and UTF-16 (UCS-2 really) had quite a lot of problems because of this, just as Windows had at some point), or in XML files in general, which are used basically everywhere. Just have a look at what encoding is used by the XML files that are zipped inside the *Microsoft* Office documents (docx, xlsx, pptx, etc.). Hint: no, it's not UTF-16. But the most important reason why I think that UTF-8 is superior to UTF-16/32 is that it is the only truly portable format. Yes, you can use UTF-16 or UTF-32 as your internal representation of characters on a certain platform, but if you expect to publish the data and move it to other computers, then the only rational thing to do is to use UTF-8, where you don't have to deal with *stupid* byte order marks or any other similar nonsense. I think that it is about time that we picked a single character set and a single encoding, because if we don't, we are back at the point where we started 30-40 years ago. We just won't have ISO-8859-X, CP-XYZ, ... but UTF-XY, UCS-N, ... instead. And actually I think that the usually highly overrated "invisible hand of the market" has done the right thing this time and already picked the "best" encoding for us (see above). Even Microsoft will wake up one of these days and accept this. The fact that they already use UTF-8 in their own document formats is IMO proof of that.
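The byte-order remark can be illustrated with a small sketch. The helper names `utf16_bytes` and `utf8_bytes` are made up for illustration and are not part of any library discussed here. The same UTF-16 code unit serializes to two different byte sequences depending on endianness, while the UTF-8 form of a code point is a single fixed byte sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Serialize one UTF-16 code unit in the requested byte order.
Bytes utf16_bytes(std::uint16_t unit, bool big_endian) {
    std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
    std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
    return big_endian ? Bytes{hi, lo} : Bytes{lo, hi};
}

// Serialize one code point as UTF-8; there is only one possible byte order.
// Surrogate code points are not rejected here, to keep the sketch short.
Bytes utf8_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    auto b = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };
    if (cp < 0x80) {
        return { b(cp) };
    }
    if (cp < 0x800) {
        return { b(0xC0 | (cp >> 6)), b(0x80 | (cp & 0x3F)) };
    }
    if (cp < 0x10000) {
        return { b(0xE0 | (cp >> 12)), b(0x80 | ((cp >> 6) & 0x3F)),
                 b(0x80 | (cp & 0x3F)) };
    }
    return { b(0xF0 | (cp >> 18)), b(0x80 | ((cp >> 12) & 0x3F)),
             b(0x80 | ((cp >> 6) & 0x3F)), b(0x80 | (cp & 0x3F)) };
}
```

For U+20AC (the euro sign), `utf16_bytes(0x20AC, true)` gives 20 AC and `utf16_bytes(0x20AC, false)` gives AC 20, so a reader needs a byte order mark or out-of-band information to tell them apart, whereas `utf8_bytes(0x20AC)` is always E2 82 AC.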
All systems use English as the base as it is the best practice.
That's a pretty bald statement, but I do not think it is true. But even if it were, why should everybody doing something one way be proof that that way is best? I would much rather pursue a technical solution that I felt was best even if no one else thought so.
I do not think that it is bald (nor bold ;-)). Using the basic character set and English has one big advantage: you won't have problems with Unicode normalization. But for people willing to take the risk of their code being unportable, etc., I don't see why we could not add another overload of translate which would accept wchar_t-based strings, somehow "detect" the encoding, convert it to UTF-8 for the backend (gettext in this case) if necessary, and return a wide string with the translation. I don't mind if other people want to risk shooting themselves in the foot if that is their own free decision :-) This will be temporary anyway because the UTF-8 literals are already coming. [snip/] Matus
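As a rough idea of what the "detect and convert" step of such a wide overload might look like, here is a sketch only, not the library's actual implementation. The names `append_utf8` and `wide_to_utf8` are hypothetical, and the "detection" is simply an assumption that a 2-byte wchar_t carries UTF-16 and a wider one carries UTF-32. The backend lookup and the conversion of the result back to a wide string are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point to a UTF-8 encoded std::string (hypothetical helper).
void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string to UTF-8, ASSUMING UTF-16 when wchar_t is 2 bytes
// and UTF-32 otherwise. Invalid input is replaced with U+FFFD.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF
                && i + 1 < in.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine a surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;  // unpaired surrogate or out-of-range value
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A wide overload of translate could pass the result of `wide_to_utf8` to the existing UTF-8 path and convert the returned translation back. Note that this does nothing about normalization, so two visually identical wide strings can still map to different message keys.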