
(This message made it to the announcements list, but the copy intended for the developers list didn't make it and I hadn't noticed. I apologize if this inconvenienced anyone.) There have been only a handful of reviews of the proposed Boost.Locale library to date, and several people have requested more time to provide theirs, so the formal review of the Boost.Locale library has been extended. The deadline for reviews is now 11:59pm UTC on Friday, April 22nd. For those in the Western hemisphere, remember that that's earlier than it might seem -- 7:59pm in New York, 4:59 in Los Angeles. I'll accept late reviews until I complete the report on Saturday morning, but please try to get yours in earlier. You can find the "boost_locale_for_review" version here: https://sourceforge.net/projects/cppcms/files/boost_locale/ The HTML documentation can also be seen at: http://cppcms.sourceforge.net/boost_locale/html/index.html Writing a review ================ If you feel this is an interesting library, then please submit your review to the developer list (preferably), or to the review manager (me). Here are some questions you might want to answer in your review: - What is your evaluation of the design? - What is your evaluation of the implementation? - What is your evaluation of the documentation? - What is your evaluation of the potential usefulness of the library? - Did you try to use the library? With what compiler? Did you have any problems? - How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? - Are you knowledgeable about the problem domain? And finally, every review should answer this question: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion. -- Chad Nelson Oak Circle Software, Inc. * * *

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Chad Nelson Sent: Sunday, April 17, 2011 2:06 PM To: boost@lists.boost.org Subject: [boost] [locale] Formal review of Boost.Locale library EXTENDED
I'll accept late reviews until I complete the report on Saturday morning, but please try to get yours in earlier.
If you feel this is an interesting library, then please submit your review to the developer list (preferably), or to the review manager (me).
OK, this is cheating, but I'm going to cut to the chase (though I have read the helpful docs) ;-)
- Do you think the library should be accepted as a Boost library?
If Steven has spent 20 hours studying it, and reckons it is OK, that's good enough for me! (And give that man a medal!) I doubt if Boost.Locale is perfect, but if it makes some paths across the morass that is 'Internationalization', then that's progress. Paul PS Without wishing to sound too British - "If the natives don't understand, just shout louder.", I have little sympathy with the view expressed that no knowledge of English should be required. (Is this the cost of Japanese failure in the 80's to succeed in their ambition to dominate hardware and software with a world-beating language in Japanese?) C++ just is a language in English, so you must surely expect to have some rudimentary knowledge of this useful language.

On 4/18/2011 9:53 AM, Paul A. Bristow wrote:
PS Without wishing to sound too British - "If the natives don't understand, just shout louder.",
Now I know why Gandhi fought so hard for an independent country. Yes, I understand, then above is British humour.
I have little sympathy with the view expressed that no knowledge of English should be required. (Is this the cost of Japanese failure in the 80's to succeed in their ambition to dominate hardware and software with a world-beating language in Japanese?)
What language was that ?
C++ just is a language in English, so you must surely expect to have some rudimentary knowledge of this useful language.
The amount of rudimentary knowledge of English necessay to use C++ keywords and C++ standard library classes is so incredibly little compared to an actual knowledge of English grammar, that I doubt whether a knowledge of English is really necessary in any way in order to program in C++. My personal objection to Gnu gettext and its English bias has nothing to do with any desire myself to use a language other than English in order to communicate, since English ( or perhaps Americanese ) is the language of the country in which I was born, but nearly everything to do with my sense of the problems of translating even computer program phraseology from one language to another without complicating things by having to put some other language, even a very popular one, in the middle. Was that a single sentence ? I wonder if it can be translated to Japanese ?

Insisting English knowledge is not practical. Most Japanese C++ programmers don't know English at all. Worse, it will cause serious problem. I bet Non-english people will use this library with non-english language. translate("日本語") They don't care it only expect ASCII. They write it just because it can be compiled. I'm worried about auther's lack of MBCS knowledge. No matter how many times I explained, he believes that if we use source file that use UTF-8 with BOM, MSVC use UTF-8 encoding for string literal with no encoding prefix. That is wrong. MSVC use shift-jis for string literal "日本語". MSVC accept all well-known Japanese encoding(shift-jis, JIS, EUC-JP, UTF-8, UTF-16(both)) It's just because UTF-8 is compatible with ASCII. As a Japanese, I think this library better be rejected. -- Ryou Ezoe

Hello, You know what, I'm not going to tell you what are best practices and what are the good ways to do things. (See appendix below) You'll probably disagree with them, But lets face the facts You telling
Ryou Ezoe Wrote:
Insisting English knowledge is not practical. Most Japanese C++ programmers don't know English at all. Worse, it will cause serious problem.
I bet Non-english people will use this library with non-english language.
translate("日本語")
[snip]
As a Japanese, I think this library better be rejected.
Lets appeal to facts. I have a simple Debian machine with various software installed. I've checked how many applications (gettext-domains) has translation files for different languages (calculated number of files under /usr/share/locale/LANG/LC_MESSAGES/*.mo) This is typical Debian system with some more tools installed for Hebrew. As I don't know Japanese I have never installed any Japanese specific software. Now lets see, I've taken few samples of languages: he - 140 ; Hebrew ar - 176 ; Arabic zh_TW - 200 ; Chinese Traditional zh_CN - 228 ; Chinese Simplified ja - 255 ; Japanese ru - 263 ; Russian es - 287 ; Spanish fr - 296 ; French de - 297 ; German So I can't say that gettext is not good for Japanese. In fact it is not bad at all Now then I sorted all locales by number of translated domains (applications) And this is the top 50 297 de 296 fr 287 es 285 sv 272 it 263 ru 263 pl 256 cs 255 ja 253 pt_BR 251 nl 237 da 233 tr 233 ca 228 zh_CN 223 hu 213 nb 212 sk 210 uk 208 fi 206 el 200 zh_TW 199 pt 194 gl 188 ko 187 vi 178 ro 176 ar 175 sr 173 et 169 lt 166 eu 166 bg 163 en_GB 158 sl 158 pa 150 th 140 rw 140 he 140 ga 133 sr@Latn 132 ms 131 mk 131 id 129 ta 127 nn 125 hr 120 dz As you can see, Japanese is in top 10 at 9th place. Given that Japanese speaking population is relatively small I can't buy it that gettext is not good for Japanese... Any more comments? -------------------------------------------------- Appendix: ========= Copied rationale from other message for English Strings: --------------------------------------------------------- I'll explain it the way I explained it before, there are many reasons to have English as core/source language rather then other native language of the developer. Living aside technical notes (I'll talk about them later) I'll explain why English source strings is the best pratice. a) Most of the software around is being only partially translated, it is frequent that the translations are out of sync with major development line, Beta versions come usually with only limited translation support. Now consider yourself beta test of a program developed in Japan by programmers who does not know English well and you try to open a file and you see a message: "File this is bad format" // Bad English Or see: "これは不正なファイル形式です。" // Good Japanese (actually // translated with google :-) Opps?! I hope now it is more clear. Even for most of us who do not speak English well are already familiar to see English as international language and would be able to handle partially translated software. But with this "natural method" it would be "all greek to us" b) In many cases, especially when it comes to GNU Gettext your customer or just a volonteer can take a dictionary template, sit for about an hour or two with "Poedit" and give acceptable quality translation of medium size program. Load it test it and send it back. This is actually happens 95% of time on open source world and it happens in closed source world as well. Reason - it is easy, accesable to everyone and you do not have to be a programmer to do this. You even do not have to be a professional tanslator with a degree in Lingustics to translate messages from English to your own language. That is why it is the best practice, that is why all translation systems around use same technique for this. It is not rediculas, it is not strange it is reality and actually is not so bad reality. Now technical reasons: 1. There is a total mess with encodings between different compilers. 
It would be quite hard to relate on charset of the localized strings in source. For windows it would likely be one of 12XX or 9xx codepages For Unix it would likely to be UTF-8 And it is actually impossible to make both MSVC and GCC to see same UTF-8 encoded string in same source L"שלום-سلام-pease" Because MSVC would want a stupid BOM from you and all other "sane" compilers would not accept BOM in sources at all. 2. You want to be able to conver the string to the target string very fast if it is missed in dictionary, so if you use wide strings and you can't use wide Unicode strings in sources because of the problem above you will have to do charset conversion. When for English and ASCII it would be just byte by byte casting. You don't want to do charset conversion for every string around in runtime. 3. All translation systems (not Gettext only) assume ASCII keys as input. And you do not want to create yet another new translation file format as you will need to: a) Port it to other languages as projects may want to use unified system b) Develop nice GUI tools like Lokalize or Poedit (that had been under development for years) c) Try to convinse all users around that your reinvented wheel is better then gettext (and it wouldn't be better) I hope not it is clear enough.

There are many Japanese translated softwares, because there are many Japanese translators who contributed the free software. The gettext works for translating English software to other languages such as Japanese. But you can't use gettext in Japanese software. Using English as a primary language means it's not a Japanese software anymore. Most Japanese cannot even write "File this is bad format". We don't even know what "file" means. Just like you don't know what ファイル means without using the machine translation. You can look it up the meaning from dictionary, but it still a completely unfamiliar symbols. Just like you feel from ファイル. ファイル means file. Now you know what ファイル means. Does that makes you write 「不正なファイル形式です」 ? I don't think so. How could we, the Japanese write a code? I think we don't recognize the meaning of function name like fopen by English. It's just a unique identifier for a function that does ファイルを開く(open the file) Even if all programmer use English(that is impossible), It's not only the programmer who write texts in the software. Many texts in the software are written by non-programmers. Do you force English to them too? By some miracle, if you could somehow achieved that, why do we need translation in that ideal world? Everybody use English. No translation is needed. Isn't it? This library can be used only by people who's language can be expressed by using basic source character set. Isn't it flawed considering this is a localization library? You're saying that, in order to localize the software, you need to abandon your language for the start. That will never works. On Tue, Apr 19, 2011 at 6:08 PM, Artyom <artyomtnk@yahoo.com> wrote:
Hello,
You know what, I'm not going to tell you what are best practices and what are the good ways to do things.
(See appendix below)
You'll probably disagree with them,
But lets face the facts
You telling
Ryou Ezoe Wrote:
Insisting English knowledge is not practical. Most Japanese C++ programmers don't know English at all. Worse, it will cause serious problem.
I bet Non-english people will use this library with non-english language.
translate("日本語")
[snip]
As a Japanese, I think this library better be rejected.
Lets appeal to facts.
I have a simple Debian machine with various software installed.
I've checked how many applications (gettext-domains) has translation files for different languages
(calculated number of files under /usr/share/locale/LANG/LC_MESSAGES/*.mo)
This is typical Debian system with some more tools installed for Hebrew. As I don't know Japanese I have never installed any Japanese specific software.
Now lets see, I've taken few samples of languages:
he - 140 ; Hebrew ar - 176 ; Arabic zh_TW - 200 ; Chinese Traditional zh_CN - 228 ; Chinese Simplified ja - 255 ; Japanese ru - 263 ; Russian es - 287 ; Spanish fr - 296 ; French de - 297 ; German
So I can't say that gettext is not good for Japanese.
In fact it is not bad at all
Now then I sorted all locales by number of translated domains (applications)
And this is the top 50
297 de 296 fr 287 es 285 sv 272 it 263 ru 263 pl 256 cs 255 ja 253 pt_BR 251 nl 237 da 233 tr 233 ca 228 zh_CN 223 hu 213 nb 212 sk 210 uk 208 fi 206 el 200 zh_TW 199 pt 194 gl 188 ko 187 vi 178 ro 176 ar 175 sr 173 et 169 lt 166 eu 166 bg 163 en_GB 158 sl 158 pa 150 th 140 rw 140 he 140 ga 133 sr@Latn 132 ms 131 mk 131 id 129 ta 127 nn 125 hr 120 dz
As you can see, Japanese is in top 10 at 9th place.
Given that Japanese speaking population is relatively small I can't buy it that gettext is not good for Japanese...
Any more comments?
--------------------------------------------------
Appendix: =========
Copied rationale from other message for English Strings: ---------------------------------------------------------
I'll explain it the way I explained it before,
there are many reasons to have English as core/source language rather then other native language of the developer.
Living aside technical notes (I'll talk about them later) I'll explain why English source strings is the best pratice.
a) Most of the software around is being only partially translated, it is frequent that the translations are out of sync with major development line,
Beta versions come usually with only limited translation support.
Now consider yourself beta test of a program developed in Japan by programmers who does not know English well and you try to open a file and you see a message:
"File this is bad format" // Bad English
Or see:
"これは不正なファイル形式です。" // Good Japanese (actually // translated with google :-)
Opps?!
I hope now it is more clear. Even for most of us who do not speak English well are already familiar to see English as international language and would be able to handle partially translated software.
But with this "natural method" it would be "all greek to us"
b) In many cases, especially when it comes to GNU Gettext your customer or just a volonteer can take a dictionary template, sit for about an hour or two with "Poedit" and give acceptable quality translation of medium size program.
Load it test it and send it back.
This is actually happens 95% of time on open source world and it happens in closed source world as well.
Reason - it is easy, accesable to everyone and you do not have to be a programmer to do this. You even do not have to be a professional tanslator with a degree in Lingustics to translate messages from English to your own language.
That is why it is the best practice, that is why all translation systems around use same technique for this.
It is not rediculas, it is not strange it is reality and actually is not so bad reality.
Now technical reasons:
1. There is a total mess with encodings between different compilers.
It would be quite hard to relate on charset of the localized strings in source.
For windows it would likely be one of 12XX or 9xx codepages For Unix it would likely to be UTF-8
And it is actually impossible to make both MSVC and GCC to see same UTF-8 encoded string in same source
L"שלום-سلام-pease"
Because MSVC would want a stupid BOM from you and all other "sane" compilers would not accept BOM in sources at all.
2. You want to be able to conver the string to the target string very fast if it is missed in dictionary, so if you use wide strings and you can't use wide Unicode strings in sources because of the problem above you will have to do charset conversion.
When for English and ASCII it would be just byte by byte casting.
You don't want to do charset conversion for every string around in runtime.
3. All translation systems (not Gettext only) assume ASCII keys as input.
And you do not want to create yet another new translation file format as you will need to:
a) Port it to other languages as projects may want to use unified system
b) Develop nice GUI tools like Lokalize or Poedit (that had been under development for years)
c) Try to convinse all users around that your reinvented wheel is better then gettext
(and it wouldn't be better)
I hope not it is clear enough. _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Ryou Ezoe

From: Ryou Ezoe <boostcpp@gmail.com>
There are many Japanese translated softwares, because there are many Japanese translators who contributed the free software. The gettext works for translating English software to other languages such as Japanese. But you can't use gettext in Japanese software.
Using English as a primary language means it's not a Japanese software anymore.
Most Japanese cannot even write "File this is bad format". We don't even know what "file" means. Just like you don't know what ファイル means without using the machine translation. You can look it up the meaning from dictionary, but it still a completely unfamiliar symbols. Just like you feel from ファイル. ファイル means file. Now you know what ファイル means. Does that makes you write 「不正なファイル形式です」 ? I don't think so.
How could we, the Japanese write a code? I think we don't recognize the meaning of function name like fopen by English. It's just a unique identifier for a function that does ファイルを開く(open the file)
Even if all programmer use English(that is impossible), It's not only the programmer who write texts in the software. Many texts in the software are written by non-programmers. Do you force English to them too? By some miracle, if you could somehow achieved that, why do we need translation in that ideal world? Everybody use English. No translation is needed. Isn't it?
This library can be used only by people who's language can be expressed by using basic source character set. Isn't it flawed considering this is a localization library?
You're saying that, in order to localize the software, you need to abandon your language for the start. That will never works.
I understand that we are going to phylosofical discussions, but yet there few things that I don't get:
We don't even know what "file" means. ... How could we, the Japanese write a code?
The problem is not only simple messages like 「不正なファイル形式です」 "illegal file format" but rather much more basic things: 1. How do you read man-pages? 2. How do you read MSDN docs? 3. How do you read the documentation of Boost Libraries? 4. How do you solve problems you hand't seen? Do you really can find all answers in Google in Japanese? 5. How do you talk to customers outside Japan? I'm not trying to tell you don't speak Japanese. Ohh not! I love being able to use software adopted to my culture I believe that culture diversity is very important and should be preserved and not "Globalized" by putting English anywhere... I hate that some "globalized" bad habits sometime kill-off some correct behavior and good language. But this has nothing to do with writing software. You know (read) that my English is far from being perfect - you read my manuals in my "Hebrew English", but still you mostly understand what you write and I still understand what Japanese people write in "their English dialect" And I still thing it is correct way to write the original messages in English even it may be hard sometimes. I can suggest a little experiment: Take these several messages: - "Bad file format! The program wasn't able to open it correctly" - "You have no more space on disk, please specify other location to save the file" - "The compilation had failed, no '}' found at the end of the file" Translate them to Japanese and ask several co-workers that are average C++ programmers capable of using Boost libraries to write them in English without using any dictionaries or google-translate. Record how much time had that took for them. I don't know the area you are working in, so I can't create specific messages for your domain, but consider if you were working in image I'd give a message like "Change the brightness of the picture" Of course take something that is valid for the domain of your work. If I can see that average employed C++ programmers can't write understandable messages in English I promise that I'll think about the solution for next version of the Boost.Locale regardless if it is going to be accepted to the Boost or not. But please, try to do it in objective way. Artyom ---------------------------------------------
You know what, I'm not going to tell you what are best practices and what are the good ways to do things.
(See appendix below)
You'll probably disagree with them,
But lets face the facts
You telling
Ryou Ezoe Wrote:
Insisting English knowledge is not practical. Most Japanese C++ programmers don't know English at all. Worse, it will cause serious problem.
I bet Non-english people will use this library with non-english language.
translate("日本語")
[snip]
As a Japanese, I think this library better be rejected.
Lets appeal to facts.
I have a simple Debian machine with various software installed.
I've checked how many applications (gettext-domains) has translation files for different languages
[snip]
As you can see, Japanese is in top 10 at 9th place.
Given that Japanese speaking population is relatively small I can't buy it that gettext is not good for Japanese...
Any more comments?

On Tue, Apr 19, 2011 at 9:25 PM, Artyom <artyomtnk@yahoo.com> wrote:
1. How do you read man-pages? We read translation. 2. How do you read MSDN docs? We read translation. 3. How do you read the documentation of Boost Libraries? We read translation. 4. How do you solve problems you hand't seen? Do you really can find all answers in Google in Japanese? There is high chance some Japanese programmer solved it or read English paper and explained it in Japanese. 5. How do you talk to customers outside Japan? Japanese programmer don't talk to foreign customers. That is a job for the translator.
Isn't it obvious from the fact that your debian has so many Japanese translation? We have enough Japanese translations to learn programming without knowing English. Why there are so many translation? Because we need it. If average Japanese programmer can read English, we don't need such amount of translations. -- Ryou Ezoe

From: Ryou Ezoe <boostcpp@gmail.com> To: boost@lists.boost.org Sent: Tue, April 19, 2011 3:58:00 PM Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
On Tue, Apr 19, 2011 at 9:25 PM, Artyom <artyomtnk@yahoo.com> wrote:
1. How do you read man-pages? We read translation. 2. How do you read MSDN docs? We read translation. 3. How do you read the documentation of Boost Libraries? We read translation. 4. How do you solve problems you hand't seen? Do you really can find all answers in Google in Japanese? There is high chance some Japanese programmer solved it or read English paper and explained it in Japanese. 5. How do you talk to customers outside Japan? Japanese programmer don't talk to foreign customers. That is a job for the translator.
Isn't it obvious from the fact that your debian has so many Japanese translation? We have enough Japanese translations to learn programming without knowing English. Why there are so many translation? Because we need it. If average Japanese programmer can read English, we don't need such amount of translations.
You know, I have a solution for you. This solution works because gettext accepts arbitrary char * as key, even when new versions warn about non-ASCII strings. So basically wgettext("日本語") would work when the string in the dictionary... So there is a "solution" that you can adopt: Solutuon A: -------------------------------------------------------------------------- template<typename CharType> std::basic_string<CharType> basic_xtranslate(char const *msg,std::locale const &l=std::locale()) { typedef boost::locale::message_format<CharType> facet_type; CharType const *translated = 0; if(std::has_facet<facet_type>(l) && (translated=std::use_facet<facet_type>(l).(0,0,msg))!=0) { return translated; } // Will be replaced in utf_to_utf return boost::locale::conv::to_utf<CharType>(msg,"UTF-8"); } inline std::wstring wxtranslate(char const *msg,std::locale const &l=std::locale()) { return basic_xtranslate<wchar_t>(msg,l); } Solution B ....................................... typedef std::pair<char const *,wchar_t const *> dual_message_type inline dual_message_type make_dual_message(char const *n,wchar_t const *w) { return dual_message_type(n,w); } #define WTR(m) (make_dual_mesage(m,L##m)) std::wstring wxtranslate(dual_message_type const &msg,std::locale const &l=std::locale()) { typedef boost::locale::message_format<wchar_t> facet_type; wchar_t const *translated = 0; if(std::has_facet<facet_type>(l) && (translated=std::use_facet<facet_type>(l).(0,0,msg.first))!=0) { return translated; } return msg.second } ---------------------------------------------------------------- And now you can freely write: a) wxtranslate("日本語") or b) wxtranslate(WTR("日本語")) However a) In first case you will have to make sure that the sources are UTF-8 (and as you had said they may be not) or other encoding but it should be constant at compilation time. And it has run time penalty on case were the string is not in the dictionary b) In second case you will have to make sure that MSVC handles L"日本語" correctly. This can be extended for plural forms, context support and domains. But this isn't going to be part of Boost.Locale as such code would bite you at some point very hard. Artyom

On Wed, Apr 20, 2011 at 1:54 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Ryou Ezoe <boostcpp@gmail.com> To: boost@lists.boost.org Sent: Tue, April 19, 2011 3:58:00 PM Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
On Tue, Apr 19, 2011 at 9:25 PM, Artyom <artyomtnk@yahoo.com> wrote:
1. How do you read man-pages? We read translation. 2. How do you read MSDN docs? We read translation. 3. How do you read the documentation of Boost Libraries? We read translation. 4. How do you solve problems you hand't seen? Do you really can find all answers in Google in Japanese? There is high chance some Japanese programmer solved it or read English paper and explained it in Japanese. 5. How do you talk to customers outside Japan? Japanese programmer don't talk to foreign customers. That is a job for the translator.
Isn't it obvious from the fact that your debian has so many Japanese translation? We have enough Japanese translations to learn programming without knowing English. Why there are so many translation? Because we need it. If average Japanese programmer can read English, we don't need such amount of translations.
You know, I have a solution for you.
This solution works because gettext accepts arbitrary char * as key, even when new versions warn about non-ASCII strings.
warn about non-ASCII strings? For what? As being a library that stick with non-portable ASCII when next C++ standard officially support UTF-8, UTF-16 and UTF-32? I lost all my faith on this library. Don't claim it support UTF-8. UTF-8 that reject non-ASCII characters is ASCII.
So basically wgettext("日本語") would work when the string in the dictionary...
So there is a "solution" that you can adopt:
Solutuon A: --------------------------------------------------------------------------
template<typename CharType> std::basic_string<CharType> basic_xtranslate(char const *msg,std::locale const &l=std::locale()) { typedef boost::locale::message_format<CharType> facet_type; CharType const *translated = 0; if(std::has_facet<facet_type>(l) && (translated=std::use_facet<facet_type>(l).(0,0,msg))!=0) { return translated; } // Will be replaced in utf_to_utf return boost::locale::conv::to_utf<CharType>(msg,"UTF-8"); }
inline std::wstring wxtranslate(char const *msg,std::locale const &l=std::locale()) { return basic_xtranslate<wchar_t>(msg,l); }
Solution B .......................................
typedef std::pair<char const *,wchar_t const *> dual_message_type
inline dual_message_type make_dual_message(char const *n,wchar_t const *w) { return dual_message_type(n,w); }
#define WTR(m) (make_dual_mesage(m,L##m))
std::wstring wxtranslate(dual_message_type const &msg,std::locale const &l=std::locale()) { typedef boost::locale::message_format<wchar_t> facet_type; wchar_t const *translated = 0; if(std::has_facet<facet_type>(l) && (translated=std::use_facet<facet_type>(l).(0,0,msg.first))!=0) { return translated; } return msg.second }
----------------------------------------------------------------
And now you can freely write:
a) wxtranslate("日本語")
or
b) wxtranslate(WTR("日本語"))
However
a) In first case you will have to make sure that the sources are UTF-8 (and as you had said they may be not) or other encoding but it should be constant at compilation time.
And it has run time penalty on case were the string is not in the dictionary
b) In second case you will have to make sure that MSVC handles L"日本語" correctly.
This can be extended for plural forms, context support and domains.
But this isn't going to be part of Boost.Locale as such code would bite you at some point very hard.
I don't understand what are you trying to solve by that so called solutions. Solution A does not work at all. There is no guarantee ordinary string literal is UTF-8 encoded.(and it always isn't in MSVC). Solution B... What are you doing? Isn't wxtranslate(WTR("日本語")) ended up pointer to const wchar_t that refers to L"日本語" ? It does nothing except it works as a macro which add L encoding prefix. If so, I'd rather write L"日本語" directly. Since translate() expect nothing but ASCII. I suggest you should clearly stated that in the document. You should also throw an exception when you detect a non-ASCII character in the argument of translate. That way, you are safe from the real world problem. -- Ryou Ezoe

I don't understand what are you trying to solve by that so called solutions.
Solution A does not work at all. There is no guarantee ordinary string literal is UTF-8 encoded.(and it always isn't in MSVC).
You are right, I missed it. After digging a little I've now surprised how broken MSVC's behavior is $%%$^%^#%^#$%^#$%^#$%^ Seriously... - Without BOM I can't get L"日本語" as UTF-16 to work when the sources are UTF-8 (however char * is valid utf-8), It works only when the sources are in the current locale's encoding that actually may vary from PC to PC so it is totally unreliable. - With BOM while I get L"日本語" as UTF-16... I can't get "日本語" as UTF-8 at all it still formats them in current locale, which is unreliable as well. But you know what? It just convinces me even more that ASCII strings should be used as keys with all this nonsense with MSVC compiler.
Solution B... What are you doing? Isn't wxtranslate(WTR("日本語")) ended up pointer to const wchar_t that refers to L"日本語" ? It does nothing except it works as a macro which add L encoding prefix. If so, I'd rather write L"日本語" directly.
I see why the second case does not work unless your "ids" are in Shift-JIS. Any way, the things you say even increases the total mess exists in current charset encoding. From all compilers I used (MSVC, GCC, Intel, SunCC, OpenVMS's HP) all but MSVC handle UTF-8/wide/narrow characters properly. I'm sorry I was thinking about it too good things. This way or other it convinces me that you can relay only on ASCII. Artyom

On Wed, Apr 20, 2011 at 11:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
I don't understand what are you trying to solve by that so called solutions.
Solution A does not work at all. There is no guarantee ordinary string literal is UTF-8 encoded.(and it always isn't in MSVC).
You are right, I missed it.
After digging a little I've now surprised how broken MSVC's behavior is $%%$^%^#%^#$%^#$%^#$%^
Seriously...
- Without BOM I can't get L"日本語" as UTF-16 to work when the sources are UTF-8 (however char * is valid utf-8),
It works only when the sources are in the current locale's encoding that actually may vary from PC to PC so it is totally unreliable.
It doesn't work because there is no way to detect what character encoding the source file use. It can be anything. See the list. http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx Assuming the character set according to system's default locale is not a good idea. But there is no good way to handle it anyway. Even the std::locale does the same thing, isn't it? Set a default global locale and assume it's the correct locale.
- With BOM while I get L"日本語" as UTF-16... I can't get "日本語" as UTF-8 at all it still formats them in current locale, which is unreliable as well.
Windows does not truly support UTF-8 locale.(Native locale is UTF-16) Because MBCS can be anything. You can't detect it no matter what clever hack you use. Maybe UTF-8, maybe one of this encoding in the list excluding UTF-16LE and BE. http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx So it needs a hint, a BOM. When it says MBCS and ANSI and Japanese, It's always CP932(Microsoft variant of Shift-JIS). So MSVC, the compiler for Windows, use that encoding for Japanese.
But you know what? It just convinces me even more that ASCII strings should be used as keys with all this nonsense with MSVC compiler.
No. You must use one of UCS encodings.
Solution B... What are you doing? Isn't wxtranslate(WTR("日本語")) ended up pointer to const wchar_t that refers to L"日本語" ? It does nothing except it works as a macro which add L encoding prefix. If so, I'd rather write L"日本語" directly.
I see why the second case does not work unless your "ids" are in Shift-JIS.
Again, Using CP932(Microsoft Variant) is not a solution. It's workaround. I think I need to explain a bit of history about Japanese character encoding. Shift-JIS was designed based on two JIS standard encodings when Unicode is not widely used and not a practical solution. They needed a encoding for Japanese and they needed it fast. There were two JIS standard encodings(kana as extended ASCII encoding and kanji code points) already. But it was hard to use. JIS encoding was a statefull encoding. It use escape sequences to change the behavior of how to interpret rest of the characters. blah blah blah [escape sequence to change the mode] blah blah blah [escape sequence to change it back] blah blah blah The meaning of the code point changes base from an escape sequence that appears prior to that code point. This is really hard to use. Designing a new encoding needs time so they had to slightly modify this encoding. In order to remove the escape sequence, they shifted some code points to squeeze characters to unused range. Since it just use binary shift, it's easy to get a original code point from shift-jis(if they know it is indeed a shift-jis). Thus, the name *shift* JIS. EUC-JP has another story. But I don't know EUC-JP well. Every character encoding has its own history. A long history. You can't ignore it. These encodings are still widely used. If you stick with the ASCII, expect any extended ASCII variants. Because that's what these extended ASCII were meant to be. Give it to the code which expect ASCII and, because of its compatibility with ASCII, it works most of the time. It's not perfect. But you can use it from today. That's what happened in the last century.
Any way, the things you say even increases the total mess exists in current charset encoding.
From all compilers I used (MSVC, GCC, Intel, SunCC, OpenVMS's HP) all but MSVC handle UTF-8/wide/narrow characters properly.
What do you mean properly? Windows does not support UTF-8 locale. Everything is handled as UTF-16. Wide and narrow characters can be anything. So any encodings are proper implementation of the standard.
I'm sorry I was thinking about it too good things.
This way or other it convinces me that you can relay only on ASCII.
I you think it that way. You probably shouldn't design a localization library. Don't you think it's odd? That is, in order to use UCS, we have to use ASCII. UCS is not perfect. If we were to design it from scratch(including drop compatibility with ASCII), we can design it better. Nobody use such standard though. UCS is the only standard an encoding(either UTF-8, UTF-16, UTF-32) can represent all well-known glyphs in the world today You prefer UTF-8 because, fundamentally, you're thinking in ASCII. UTF-8 is ASCII compatible. ASCII is NOT UTF-8 compatible.
Artyom
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Ryou Ezoe

From: Ryou Ezoe <boostcpp@gmail.com>
On Wed, Apr 20, 2011 at 11:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
I don't understand what are you trying to solve by that so called solutions.
Solution A does not work at all. There is no guarantee ordinary string literal is UTF-8 encoded.(and it always isn't in MSVC).
You are right, I missed it.
After digging a little I've now surprised how broken MSVC's behavior is $%%$^%^#%^#$%^#$%^#$%^
Seriously...
- Without BOM I can't get L"日本語" as UTF-16 to work when the sources are UTF-8 (however char * is valid utf-8),
It works only when the sources are in the current locale's encoding that actually may vary from PC to PC so it is totally unreliable.
It doesn't work because there is no way to detect what character encoding the source file use. It can be anything. See the list. http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx
Exactly, that means that same source code compiled on two different machines can lead to two different results.
Assuming the character set according to system's default locale is not a good idea. But there is no good way to handle it anyway.
Actually there is #pragma setlocale http://msdn.microsoft.com/en-us/library/3e22ty2t(v=VS.100).aspx But too few know it and it is MSVC specific.
Even the std::locale does the same thing, isn't it? Set a default global locale and assume it's the correct locale.
This is different, for example when you start text editor on my PC where the system/user locale is he_IL.UTF-8 that it is more then reasonable to start the user interface in Hebrew while on your PC (I assume Japanese_Japan.932) it should have Japanese interface. However unlike your (Windows PC) the narrow encoding is different, while on my (Linux PC) it is UTF-8 so no encoding change is required. It was one of the goals of this library as it by default choses UTF-8 as narrow encoding (unless it was told to select so called ANSI) So it allows you with help of std::locale aware tools like boost::filesystem to develop cross platform Unicode aware software. (And don't suggest using Wide strings as they quite useless for cross platform development, they useful on Windows as it has Wide API as major API but not more then that)
- With BOM while I get L"日本語" as UTF-16... I can't get "日本語" as UTF-8 at all it still formats them in current locale, which is unreliable as well.
Windows does not truly support UTF-8 locale.(Native locale is UTF-16) Because MBCS can be anything. You can't detect it no matter what clever hack you use.
See pragma setlocale above.
Maybe UTF-8, maybe one of this encoding in the list excluding UTF-16LE and BE. http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx
So it needs a hint, a BOM.
Yes, unfortunately you should put BOM that all compilers in the world complain about. I'm aware of the way Windows thinks of encodings. I just mistakenly assumed that with BOM "שלום" and L"שלום" would get me UTF-8 and UTF-16 string under MSVC, but I was wrong.
But you know what? It just convinces me even more that ASCII strings should be used as keys with all this nonsense with MSVC compiler.
No. You must use one of UCS encodings.
But, MSVC does not know to handle UCS encodings :-) I mean you can't get both UTF-8 string and UTF-16 string under MSVC in same sources.
Solution B... What are you doing? Isn't wxtranslate(WTR("日本語")) ended up pointer to const wchar_t that refers to L"日本語" ? It does nothing except it works as a macro which add L encoding prefix. If so, I'd rather write L"日本語" directly.
I see why the second case does not work unless your "ids" are in Shift-JIS.
Again, Using CP932(Microsoft Variant) is not a solution. It's workaround.
Agree but using Japanese as source string encoding is workaround of programmes lack of English knowledge, don't you really expect that French, German, Hebrew, Arabic or Greek developers would all use their own language in the sources?
I think I need to explain a bit of history about Japanese character encoding. Shift-JIS was designed based on two JIS standard encodings when Unicode is not widely used and not a practical solution. [snip]
I know this story, and I most telling Shift-JIS at it is more clear then to say cp932...
Every character encoding has its own history. A long history. You can't ignore it. These encodings are still widely used.
If you stick with the ASCII, expect any extended ASCII variants.
When I tell ASCII I should probably say US-ASCII not extended ones. Extended ASCII variants should be vanished, including ISO-8859-8, Windows-1255 (hebrew encodings) and other encodings like Latin1, JIS and others in favor of UTF-8.
Any way, the things you say even increases the total mess exists in current charset encoding.
From all compilers I used (MSVC, GCC, Intel, SunCC, OpenVMS's HP) all but MSVC handle UTF-8/wide/narrow characters properly.
What do you mean properly? Windows does not support UTF-8 locale. Everything is handled as UTF-16.
Wide and narrow characters can be anything. So any encodings are proper implementation of the standard.
I mean "שלום" and L"שלום" would be UTF-8 and UTF-16 or 32 as in all C and C++ compilers on the Earth.
I'm sorry I was thinking about it too good things.
This way or other it convinces me that you can relay only on ASCII.
I you think it that way. You probably shouldn't design a localization
library.
Don't you think it's odd?
I don't want to open philosofical discussions about what design desigions had Microsoft did, but some of them really bad onces, and that is why Boost.Locale uses UTF-8 encoding by default under MS Windows (unless you explicitly say it to select local ANSI encoding) This library tries to do the best supporting both Wide and Narrow characters but at some points desisions should be made to one direction or others, because otherwise we would stay behind. I understand that you are Windows developer that is familiar with Wide character API. However Boost.Locale is cross platform system, and it can't be based on Wide API because it is useless outside Microsoft Windows. So yes, it choses to stick to portable desigision like using char * as string id or selecting UTF-8 and default encoding, because otherwise it would not bring us forward. Also given a fact that "char *" is total mess in terms of encoding as well as "wchar_t *" is even more messy in terms of source files encodings then I think the design to stick to "char *" and to US-ASCII as ids is right one.
That is, in order to use UCS, we have to use ASCII. UCS is not perfect. If we were to design it from scratch(including drop compatibility with ASCII), we can design it better. Nobody use such standard though. UCS is the only standard an encoding(either UTF-8, UTF-16, UTF-32) can represent all well-known glyphs in the world today
You prefer UTF-8 because, fundamentally, you're thinking in ASCII. UTF-8 is ASCII compatible. ASCII is NOT UTF-8 compatible.
You probably confused between two things: a) Every US-ASCII string is UTF-8 string b) Only subset of UTF-8 strings is US-ASCII Also I don't think ASCII, I think portability, and UTF-8 is much more portable then UTF-16 or UTF-32. Best, Artyom

On 19/04/2011 13:25, Artyom wrote:
If I can see that average employed C++ programmers can't write understandable messages in English I promise that I'll think about the solution for next version of the Boost.Locale regardless if it is going to be accepted to the Boost or not.
Not only most of them cannot write messages in English, they actually even cannot read them. The view that the whole computer science industry uses English is really a western centric thing. I know it's hard to understand how it's possible to work in IT without using English when that's what we've been doing our whole life, but it's the norm in Japan and probably in other eastern Asian countries as well. They are some download statistics for boost available here : https://sourceforge.net/projects/boost/files/boost/stats/map China and Japan are in the top 5. How often do Chinese or Japanese ask questions on the boost users mailing list ? MAT.

Ryou Ezoe wrote:
Insisting English knowledge is not practical. Most Japanese C++ programmers don't know English at all. Worse, it will cause serious problem.
I bet Non-english people will use this library with non-english language.
translate("日本語")
They don't care it only expect ASCII. They write it just because it can be compiled.
Am I correct in understanding that you just want the ability to write translate( L"日本語" ) instead, with the wide literal automatically converted from UTF-16 to UTF-8 by the library? Or do you demand something more, such as the ability to write translate("日本語") and have the library correctly handle shift-JIS?

On Tue, Apr 19, 2011 at 9:04 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Ryou Ezoe wrote:
Insisting English knowledge is not practical. Most Japanese C++ programmers don't know English at all. Worse, it will cause serious problem.
I bet Non-english people will use this library with non-english language.
translate("日本語")
They don't care it only expect ASCII. They write it just because it can be compiled.
Am I correct in understanding that you just want the ability to write
translate( L"日本語" )
instead, with the wide literal automatically converted from UTF-16 to UTF-8 by the library? Or do you demand something more, such as the ability to write translate("日本語") and have the library correctly handle shift-JIS?
I don't care library internally convert between UTFs. At least, conversion between UTFs are trivial and more importantly consistent so I don't care. The user want to develop their software using their native language. Localization should be done by mapping their language to other languages. So the library should offer mapping to any unicode string to other unicode string. Ideally, it should be handled by C++0x's unicode support(new types and encoding prefix) But I think C++0x implementation is still too early to use. -- Ryou Ezoe

Ryou Ezoe wrote:
I don't care library internally convert between UTFs. At least, conversion between UTFs are trivial and more importantly consistent so I don't care.
The user want to develop their software using their native language. Localization should be done by mapping their language to other languages. So the library should offer mapping to any unicode string to other unicode string.
But you have to propose something specific. How should the library be changed? I asked a direct question: if the library accepts translate( L"日本語" ) encodes the wide literal as UTF-8 and uses it to look up a msgid in the .po file, will this be enough for you? Or do you want something else, and if so, what?

What I want is translate() accept wchar_t const * and std::wstring as a parameter. just like it accept char const * and std::string. Then, it return the corresponding translated text. Although the encoding of wchar_t is unspecified in the Standard. In the current MS-Windows environment, it should be treated as UTF-16. Converting it to UTF-8 is a implementation details. I don't care which UTF it internally use. As long as it support real UCS(all code points defined in UCS) But treating it as UCS rather than binary string is better. Assuming we have C++0x compiler and encoding of wchar_t is UTF-16, translate(u8"text"), translate(u"text"), translate(U"text") and translate(L"text") all returns the same mapped translated text according to the locale. This is a good. On Tue, Apr 19, 2011 at 10:38 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Ryou Ezoe wrote:
I don't care library internally convert between UTFs. At least, conversion between UTFs are trivial and more importantly consistent so I don't care.
The user want to develop their software using their native language. Localization should be done by mapping their language to other languages. So the library should offer mapping to any unicode string to other unicode string.
But you have to propose something specific. How should the library be changed? I asked a direct question: if the library accepts
translate( L"日本語" )
encodes the wide literal as UTF-8 and uses it to look up a msgid in the .po file, will this be enough for you? Or do you want something else, and if so, what? _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Ryou Ezoe

Ryou Ezoe wrote:
What I want is translate() accept wchar_t const * and std::wstring as a parameter. just like it accept char const * and std::string. Then, it return the corresponding translated text. Although the encoding of wchar_t is unspecified in the Standard. In the current MS-Windows environment, it should be treated as UTF-16.
Converting it to UTF-8 is a implementation details. I don't care which UTF it internally use. As long as it support real UCS(all code points defined in UCS)
But treating it as UCS rather than binary string is better.
Assuming we have C++0x compiler and encoding of wchar_t is UTF-16, translate(u8"text"), translate(u"text"), translate(U"text") and translate(L"text") all returns the same mapped translated text according to the locale. This is a good.
I suppose that you are probably fine with the requirement that the supplied text must be in one of the Unicode encodings, because otherwise translating from text in shift-JIS or arbitrary encodings is probably be a mess from a technical perspective. I think that what we really need is to enforce the character set used in Boost.Locale, not the language. It just happen that Artyom chose the ASCII character set which don't support most other languages. I don't see any technical reasons to enforce the language used for translating, but there are many technical reasons to enforce a particular encoding. We can just change the encoding used from ASCII to UCS, and that wouldn't technically make much difference. The only problem for using Unicode as the translation key is the normalization issues. Since normalization is too heavyweight, the translation system should probably operate at code point level, though translations of identical original text with different code points will then fail. I have one suggestion to overcome GNU Gettext's limitation. Perhaps we can automatically convert the text into Unicode escaped sequences before passing to GNU Gettext, so "日本語" in UTF-8 will become "\\u65E5\\u672C\\u8A9E" in ASCII.

On Tue, Apr 19, 2011 at 11:31 PM, Soares Chen Ruo Fei <crf@hypershell.org> wrote:
Ryou Ezoe wrote:
What I want is translate() accept wchar_t const * and std::wstring as a parameter. just like it accept char const * and std::string. Then, it return the corresponding translated text. Although the encoding of wchar_t is unspecified in the Standard. In the current MS-Windows environment, it should be treated as UTF-16.
Converting it to UTF-8 is a implementation details. I don't care which UTF it internally use. As long as it support real UCS(all code points defined in UCS)
But treating it as UCS rather than binary string is better.
Assuming we have C++0x compiler and encoding of wchar_t is UTF-16, translate(u8"text"), translate(u"text"), translate(U"text") and translate(L"text") all returns the same mapped translated text according to the locale. This is a good.
I suppose that you are probably fine with the requirement that the supplied text must be in one of the Unicode encodings, because otherwise translating from text in shift-JIS or arbitrary encodings is probably be a mess from a technical perspective.
I think that what we really need is to enforce the character set used in Boost.Locale, not the language. It just happen that Artyom chose the ASCII character set which don't support most other languages. I don't see any technical reasons to enforce the language used for translating, but there are many technical reasons to enforce a particular encoding. We can just change the encoding used from ASCII to UCS, and that wouldn't technically make much difference. The only problem for using Unicode as the translation key is the normalization issues. Since normalization is too heavyweight, the translation system should probably operate at code point level, though translations of identical original text with different code points will then fail.
I don't expect perfect normalization. I think it's not possible. I just want libraries to be UCS aware.
I have one suggestion to overcome GNU Gettext's limitation. Perhaps we can automatically convert the text into Unicode escaped sequences before passing to GNU Gettext, so "日本語" in UTF-8 will become "\\u65E5\\u672C\\u8A9E" in ASCII.
Why do you need to escaped it? Why do you want to stick with ASCII? UCS and its encoding UTF-8, UTF-16, UTF-32 will be specified in upcoming C++0x standard. On the other hand, standard still does not say ASCII. The basic source character set does not cover all ASCII characters. So using ASCII is not portable.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Ryou Ezoe

20.04.2011 0:54, Ryou Ezoe пишет:
UCS and its encoding UTF-8, UTF-16, UTF-32 will be specified in upcoming C++0x standard. On the other hand, standard still does not say ASCII. The basic source character set does not cover all ASCII characters. So using ASCII is not portable. You are right. translate("Peace") will not work if the source file is EBCDIC-encoded and if the translations were made from the ascii-encoded english. But with EBCDIC-encoded translations all probably will be OK.
You may even use CP932 for the translations (if the target language strings can be encoded into CP932, of course. I mean, you can use CP932 directly if you want to "translate" your software to some japanese slang/argot). Or, you may mix the different encodings in the translation file (it is not easy, of course, to work with such files). -- Sergey Cheban

On Wed, Apr 20, 2011 at 12:51 AM, Sergey Cheban <s.cheban@drweb.com> wrote:
19.04.2011 6:14, Ryou Ezoe пишет:
I bet Non-english people will use this library with non-english language.
translate("日本語")
Are there any problems with translate(ShiftJisToUtf8("日本語"))?
I'd rather wait for compilers support u8 encoding prefix. The problem of that code is we don't have the rule how to map shift-jis characters to UCS code points. Mapping rule slightly differs in every libraries. That is, some shift-jis characters are mapped to different UCS code point in different libraries. Actually, simply saying "shift-jis" is not right. There is no such encoding like "shift-jis". There are many shift-jis variants. Windows use CP932. Mac use MacJapanese. JIS(Japanese Industrial Standards) defined ISO-2022-JP standard. These are slightly different so mapping problem happens. And each libraries handles it in their own way. So it's like there is no THE consistent rule. This is worse than UCS normalize problem. We shouldn't use shift-jis anymore. Converting from shift-jis is not recommended. We should use one of UCS encoding directly. That's why I don't say Boost.locale should handle all shift-jis variants, JIS(this is yet another standard. not one of shift-jis variant including ISO-2022-JP), EUC-JP and other encodings that have been ever used at some point in the history.
-- Ryou Ezoe

20.04.2011 1:30, Ryou Ezoe wrote:
Are there any problems with translate(ShiftJisToUtf8("日本語"))?
I'd rather wait for compilers to support the u8 encoding prefix. The problem with that code is that there is no single rule for mapping Shift-JIS characters to UCS code points.
It's not a problem, provided that the ShiftJisToUtf8 implementation, the source file encoding and the translation file data are consistent.
The mapping differs slightly between libraries; that is, some Shift-JIS characters are mapped to different UCS code points by different libraries. Actually, simply saying "Shift-JIS" is not precise: there is no one encoding called "Shift-JIS", there are many Shift-JIS variants. Windows uses CP932, Mac uses MacJapanese, and JIS (Japanese Industrial Standards) defined the ISO-2022-JP standard. These are all slightly different, so mapping problems happen, and each library handles them in its own way; there is no single consistent rule.
All you need is to choose a library that is consistent with your source file encoding. For MSVC/Windows, it is probably OK to convert from CP_932 to CP_UTF8 using MultiByteToWideChar and WideCharToMultiByte.
-- Sergey Cheban
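On Windows, the conversion Sergey mentions might look roughly like this (a sketch only, reusing the hypothetical ShiftJisToUtf8 name from the example above; error handling is reduced to returning an empty string):

    // Sketch: convert a CP932 (Windows Shift-JIS) string to UTF-8 via UTF-16,
    // using the Win32 conversion APIs mentioned above.  Windows-only.
    #include <windows.h>
    #include <string>
    #include <vector>

    std::string ShiftJisToUtf8(const std::string &cp932)
    {
        if (cp932.empty()) return std::string();
        // CP932 -> UTF-16
        int wlen = MultiByteToWideChar(932, 0, cp932.c_str(), -1, NULL, 0);
        if (wlen <= 0) return std::string();   // conversion failed
        std::vector<wchar_t> wide(wlen);
        MultiByteToWideChar(932, 0, cp932.c_str(), -1, &wide[0], wlen);
        // UTF-16 -> UTF-8
        int ulen = WideCharToMultiByte(CP_UTF8, 0, &wide[0], -1, NULL, 0, NULL, NULL);
        if (ulen <= 0) return std::string();
        std::vector<char> utf8(ulen);
        WideCharToMultiByte(CP_UTF8, 0, &wide[0], -1, &utf8[0], ulen, NULL, NULL);
        return std::string(&utf8[0]);          // drops the trailing '\0'
    }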

On Tue, Apr 19, 2011 at 2:10 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/18/2011 9:53 AM, Paul A. Bristow wrote:
PS Without wishing to sound too British - "If the natives don't understand, just shout louder.",
Now I know why Gandhi fought so hard for an independent country. Yes, I understand, the above is British humour.
+1 Funny but a little arrogant.
I have little sympathy with the view expressed that no knowledge of English should be required. (Is this the cost of Japanese failure in the 80's to succeed in their ambition to dominate hardware and software with a world-beating language in Japanese?)
What language was that ?
C++ just is a language in English, so you must surely expect to have some rudimentary knowledge of this useful language.
The amount of rudimentary knowledge of English necessary to use C++ keywords and C++ standard library classes is so incredibly little compared to an actual knowledge of English grammar that I doubt whether a knowledge of English is really necessary in any way in order to program in C++.
I'm not a native English speaker, and it took me nearly a decade to learn English in such a way that I was able to communicate more elaborate "thoughts" than "Hello", "My name is X Y", "I'm N years old" :) and it still could be much better. But I accept English as the necessary and very useful tool for my work in IT/CS (just as most medical doctors accept the use of Latin, I guess). Most of the documentation and basically every mainstream programming language uses English, so knowing the language at least at some level is a "must" here.
My personal objection to Gnu gettext and its English bias has nothing to do with any desire myself to use a language other than English in order to communicate, since English ( or perhaps Americanese ) is the language of the country in which I was born, but nearly everything to do with my sense of the problems of translating even computer program phraseology from one language to another without complicating things by having to put some other language, even a very popular one, in the middle.
Was that a single sentence ? I wonder if it can be translated to Japanese ?
These are all valid points. Speaking in a particular language means thinking in a certain way, and many things can be lost in the translation. But I don't see any solutions to the actual problem above.
From how I see it there are several ways to handle this:
1) Stick to English phrases (-) Requires good knowledge of the English language (+) Easy to find someone to translate to language Y (+) Portable (+) Lots of mature l10n libraries work this way (+) Works (for English speakers) even if the translation fails
2) Use English identifier strings (as Peter Dimov suggested) (-) Still requires some English (-) Hard to keep unique in large applications (-) Doesn't look good if the translation fails for some reason (+) Requires "less" English
3) Use the u"" U"" literals (-) Current support by the compilers (-) Requires the u/U/... prefix (+) Will be portable in the future (+) Does not require English
4) Use wchar_t and the L"" prefix literals (-) Non-portable and platform dependent (-) Requires the L prefix (+) Works if you are limited to a single platform (+) Does not require English
5) use char with some GUID literals/hashes (-) Completely unusable if the translation fails (-) Takes a lot of getting used to (easier for GIT users :)) (-) Requires a GUID/hash generator (+) Portable (+) Does not require English
6) keep in original language but transliterate to Latin characters [a-z0-9] (-) Requires picking a good transliteration scheme (-) Hard to read in the code (-) Pretty unusable if the translation fails (+) Does not require the use of English (+) Portable
Take your pick :-)
Matus
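To make options 1) and 2) above concrete, here is a minimal sketch assuming Boost.Locale's translate() with a gettext-style catalog; the catalog path, domain and locale names are made up:

    // Sketch: option 1) uses the English phrase itself as the message key,
    // option 2) uses a short English identifier as the key; in the latter
    // case even the English text must come from the message catalog.
    #include <boost/locale.hpp>
    #include <iostream>

    int main()
    {
        boost::locale::generator gen;
        gen.add_messages_path(".");          // assumed catalog location
        gen.add_messages_domain("demo");     // assumed domain name
        std::locale loc = gen("ja_JP.UTF-8");
        std::cout.imbue(loc);

        // Option 1: the key is the English phrase; if no translation is
        // found, the English phrase itself is displayed.
        std::cout << boost::locale::translate("File not found") << "\n";

        // Option 2: the key is an identifier-like string; if the catalog
        // lookup fails, the raw identifier leaks into the UI.
        std::cout << boost::locale::translate("error.file_not_found") << "\n";
    }

The visible difference is only in what the user sees when the catalog lookup fails: a readable English phrase in the first case, a bare identifier in the second.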

On Tue, Apr 19, 2011 at 4:17 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Tue, Apr 19, 2011 at 2:10 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/18/2011 9:53 AM, Paul A. Bristow wrote:
[snip/]
From how I see it there are several ways to handle this:
1) Stick to English phrases (-) Requires good knowledge of the English language (+) Easy to find someone to translate to language Y (+) Portable (+) Lots of mature l10n libraries work this way (+) Works (for English speakers) even if the translation fails
2) Use English identifier strings (as Peter Dimov suggested) (-) Still requires some English (-) Hard to keep unique in large applications (-) Doesn't look good if the translation fails for some reason (+) Requires "less" English
3) Use the u"" U"" literals (-) Current support by the compilers (-) Requires the u/U/... prefix (+) Will be portable in the future (+) Does not require English
4) Use wchar_t and the L"" prefix literals (-) Non-portable and platform dependent (-) Requires the L prefix (+) Works if you are limited to a single platform (+) Does not require English
5) use char with some GUID literals/hashes (-) Completely unusable if the translation fails (-) Takes a lot of getting used to (easier for GIT users :)) (-) Requires a GUID/hash generator (+) Portable (+) Does not require English
6) keep in original language but transliterate to Latin characters [a-z0-9] (-) Requires picking a good transliteration scheme (-) Hard to read in the code (-) Pretty unusable if the translation fails (+) Does not require the use of English (+) Portable
Take your pick :-)
Matus
char is not portable either. It can be in any encoding. -- Ryou Ezoe

On Tue, Apr 19, 2011 at 9:41 AM, Ryou Ezoe <boostcpp@gmail.com> wrote:
On Tue, Apr 19, 2011 at 4:17 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Tue, Apr 19, 2011 at 2:10 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
[snip/]
char is not portable either. It can be in any encoding.
Yes, but the basic character set containing [a-z] and [0-9] (which is all you need for writing the English phrases, GUIDs, ...) is present, unless I'm terribly mistaken, in every encoding; otherwise you would not be able to write any C++. Matus

On 4/19/2011 3:17 AM, Matus Chochlik wrote:
On Tue, Apr 19, 2011 at 2:10 AM, Edward Diener<eldiener@tropicsoft.com> wrote:
On 4/18/2011 9:53 AM, Paul A. Bristow wrote:
My personal objection to Gnu gettext and its English bias has nothing to do with any desire myself to use a language other than English in order to communicate, since English ( or perhaps Americanese ) is the language of the country in which I was born, but nearly everything to do with my sense of the problems of translating even computer program phraseology from one language to another without complicating things by having to put some other language, even a very popular one, in the middle.
Was that a single sentence ? I wonder if it can be translated to Japanese ?
These are all valid points. Speaking in a particular language means thinking in a certain way, and many things can be lost in the translation. But I don't see any solutions to the actual problem above.
From how I see it there are several ways to handle this:
1) Stick to English phrases (-) Requires good knowledge of the English language (+) Easy to find someone to translate to language Y (+) Portable (+) Lots of mature l10n libraries work this way (+) Works (for English speakers) even if the translation fails
2) Use English identifier strings (as Peter Dimov suggested) (-) Still requires some English (-) Hard to keep unique in large applications (-) Doesn't look good if the translation fails for some reason (+) Requires "less" English
3) Use the u"" U"" literals (-) Current support by the compilers (-) Requires the u/U/... prefix (+) Will be portable in the future (+) Does not require English
4) Use wchar_t and the L"" prefix literals (-) Non-portable and platform dependent (-) Requires the L prefix (+) Works if you are limited to a single platform (+) Does not require English
5) use char with some GUID literals/hashes (-) Completely unusable if the translation fails (-) Takes a lot of getting used to (easier for GIT users :)) (-) Requires a GUID/hash generator (+) Portable (+) Does not require English
6) keep in original language but transliterate to Latin characters [a-z0-9] (-) Requires picking a good transliteration scheme (-) Hard to read in the code (-) Pretty unusable if the translation fails (+) Does not require the use of English (+) Portable
Take your pick :-)
My pick is to use what the language currently provides, which is wchar_t, which can represent UTF-16, a popular Unicode variant which also happens to be the standard for wide characters on Windows, which just happens to be the dominant operating system in the world ( by a lot ) in terms of end-users. I am not saying this has to be done by Locale immediately, because I realize that there is no Unicode in C++03 and creating a Unicode library is hardly easy ( Boost will soon have a Unicode library for submission ). I also realize that it is much easier to use what already exists, such as gnu gettext, than to create one's own system from scratch or modify another system of computer language translation. So I am much in sympathy with the current choice of the Locale author.

What I object to is not the way that Locale currently works but that the author of the library seems to have a closed mind about this issue. He thinks that UTF-8 must be the standard because it is what Linux uses, and he thinks that everyone must follow the way that gnu gettext does things because that also comes from the Linux world, about which he is knowledgeable. Even when the flaw in gnu gettext which forces other languages to go through English to be translated is pointed out to him, he feels that this is correct on the basis that English is the dominant language in computer programming, so every programmer must know it to write computer programs in C++.

I am an English speaker only and quite realize that English is as much of a common language in computer programming as one can have. But I also realize that translations that must go through English ( or any third language ) are not only a PITA but represent, linguistically, a fallacy: that it is relatively easy to translate from one language to another. The assumption that you can create a "correct" translation system in computer programming which dictates not only a common language to be used by everyone but also claims that going through that common language is "easier" or "better" or uses fewer "resources" I find absurd. Claiming that all programmers must know English to do programming, or that translating through English is a rote job, I also find absurd.

I really do not see how conceptually hard it would be to allow a translation system to use message catalogues which can also be Unicode ( wchar_t in C++ ) or some multi-byte encoding. This would obviously allow people whose language encoding is not a narrow character ( like Japanese ) to translate to another language without having to go through some intermediate third language ( English in the present case ). I know many decisions would have to be made about how to do this, and it would no doubt mean abandoning a popular model of doing translations ( gnu gettext ), and it would mean much programming work, but in the face of dictating to programmers of other countries that English must be used, I think it would be worthwhile. Clearly a locale translation system which forces all programmers using it not only to go through another language, however common that language is in the programming world, but also to deal with the linguistic translation issues involved, is a good way to alienate many, many programmers from using your software.
Whatever one thinks of Unicode, and I myself am critical of it for my own reasons, it is an attempt to not only bring in end-users of computer programs who do not know English but also computer programmers themselves who do not know English, and have them participate in the computer world. Creating a programming system, even if it is a translation system for a particular library, which insists that English must be used as the intermediate "glue" is clearly going against this idea IMO.

From: Edward Diener <eldiener@tropicsoft.com>
On 4/19/2011 3:17 AM, Matus Chochlik wrote:
On Tue, Apr 19, 2011 at 2:10 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/18/2011 9:53 AM, Paul A. Bristow wrote:
[snip/]
From how I see it there are several ways to handle this:
1) Stick to English phrases [snip]
Take your pick :-)
My pick is to use what the language currently provides, which is wchar_t, which can represent UTF-16,
No it can not represent UTF-16, it represents either UTF-32 (on most platforms around) or UTF-16 (on one specific platform Microsoft Windows).
a popular Unicode variant which also happens to be the standard for wide characters on Windows, which just happens to be the dominant operating system in the world ( by a lot ) in terms of end-users.
That is a questionable statement, especially as there are so many iPhones, Androids, servers and services not using Windows but directly communicating with end users. But your statement has nothing to do with support of Unicode in the C++ world.
What I object to is not the way that Locale currently works but that the author of the library seems to have a closed mind about this issue. He thinks that UTF-8 must be the standard because it is what Linux uses,
I beg your pardon? UTF-8 is a standard far beyond what Linux uses. I suggest studying the facts about it a little. Buzzwords: xml, web, Unicode and more and more and more. About UTF-16, I assume you are not familiar with this: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf... Despite this, Boost.Locale fully supports UTF-16/wchar_t on Windows.
and he thinks that everyone must follow the way that gnu gettext does things because that also comes from the Linux world, about which he is knowledgeable.
Even when the flaw in gnu gettext which forces other languages to go through English to be translated is pointed out to him, he feels that this is correct on the basis that English is the dominant language in computer programming, so every programmer must know it to write computer programs in C++.
- How many translation systems do you know?
- How many have you tested?
- How many have you really used?
- How many programs have you ever localized or translated?
All systems use English as the base, as it is the best practice.
[snip] Claiming that all programmers must know English to do programming, or that translating through English is a rote job I also find absurd. [snip]
Despite the fact that it is what is done all over the world very successfully. Artyom

UTF-8 is just one of the encoding schemes for UCS. UTF-8 is not THE standard; it's one of several standards. -- Ryou Ezoe

On 4/19/2011 9:05 AM, Artyom wrote:
From: Edward Diener <eldiener@tropicsoft.com>
On 4/19/2011 3:17 AM, Matus Chochlik wrote:
On Tue, Apr 19, 2011 at 2:10 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/18/2011 9:53 AM, Paul A. Bristow wrote:
[snip/]
From how I see it there are several ways to handle this:
1) Stick to English phrases [snip]
Take your pick :-)
My pick is to use what the language currently provides, which is wchar_t, which can represent UTF-16,
No it can not represent UTF-16, it represents either UTF-32 (on most platforms around) or UTF-16 (on one specific platform Microsoft Windows).
Then clearly it can represent UTF-16.
a popular Unicode variant which also happens to be the standard for wide characters on Windows, which just happens to be the dominant operating system in the world ( by a lot ) in terms of end-users.
That is a questionable statement, especially as there are so many iPhones, Androids, servers and services not using Windows but directly communicating with end users.
But your statement has nothing to do with support of Unicode in the C++ world.
What I object to is not the way that Locale currently works but that the author of the library seems to have a closed mind about this issue. He thinks that UTF-8 must be the standard because it is what Linux uses,
I beg your pardon?
Apologies. I should not have said that you have a closed mind about this issue.
UTF-8 is a standard far beyond what Linux uses.
A standard for what ? There are three Unicode character sets which are generally used, UTF-8, UTF-16, and UTF-32. I can not understand why you think one of them is some sort of standard for something.
I suggest studying the facts about it a little.
Buzzwords: xml, web, Unicode and more and more and more.
Buzzwords mean nothing. All three Unicode character sets can be used in xml, web pages, and Unicode. Each one is just another encoding.
About UTF-16, I assume you are not familiar with this:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
The criticism of UTF-16 in the above applies equally to UTF-8 and UTF-32 as far as I can see.
Despite this Boost.Locale fully supports UTF-16/wchar_t on windows.
I am truly glad it does.
and he thinks that everyone must follow the way that gnu gettext does things because that also comes from the Linux world, about which he is knowledgeable.
Even when the flaw in gnu gettext which forces other languages to go through English to be translated is pointed out to him, he feels that this is correct on the basis that English is the dominant language in computer programming, so every programmer must know it to write computer programs in C++.
- How many translation systems do you know?
- How many have you tested?
- How many have you really used?
- How many programs have you ever localized or translated?
All systems use English as the base as it is the best practice.
That's a pretty bald statement but I do not think it is true. But even if it were, why should everybody doing something one way be proof that that way is best ? I would much rather pursue a technical solution that I felt was best even if no one else thought so.
[snip] Claiming that all programmers must know English to do programming, or that translating through English is a rote job I also find absurd. [snip]
Despite the fact that it is what is done all over the world very successfully.
I am glad you feel it is always done very successfully. But that is hardly a relevant statement when looking at a technical issue. Do not misunderstand me. I can understand your making the choice you have regarding translation in Locale. It is certainly easier using what already exists to some extent than having to create your own solution completely from scratch. But I do think you should at least realize that the dependence on English is going to keep programmers from countries where English is not well known from using that aspect of your library.

On Wed, Apr 20, 2011 at 3:54 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/19/2011 9:05 AM, Artyom wrote:
From: Edward Diener<eldiener@tropicsoft.com> On 4/19/2011 3:17 AM, Matus Chochlik wrote:
[snip/]
Take your pick :-)
My pick is to use what the language currently provides, which is wchar_t, which can represent UTF-16,
No it can not represent UTF-16, it represents either UTF-32 (on most platforms around) or UTF-16 (on one specific platform Microsoft Windows).
Then clearly it can represent UTF-16.
a popular Unicode variant which also happens to be the standard for wide characters on Windows, which just happens to be the dominant operating system in the world ( by a lot ) in terms of end-users.
*Only* on a single platform (which you claim to be the most dominant, which is only partially true). On other platforms wchar_t does not represent UTF-16. Actually, the situation with wchar_t is only slightly better than with char, because just as the standard does not specify what encoding char-based strings use, it also does not specify the encoding for wchar_t. And wchar_t using UTF-16 on Windows is not a standard, it is a custom. I still remember times when wchar_t used to be UCS-2. [snip/]
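A small sketch of that point (the typical sizes in the comments are observations, not guarantees of the standard): both the size of wchar_t and whether its values are Unicode code points are implementation-defined.

    // Sketch: the standard fixes neither the size nor the encoding of wchar_t.
    #include <iostream>

    int main()
    {
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << "\n";
        // Typically 2 on Windows (UTF-16 code units) and 4 on most Unix-like
        // systems (UTF-32 code points), but other values are legal.
    #ifdef __STDC_ISO_10646__
        // If defined, wchar_t values are ISO 10646 (Unicode) code points.
        std::cout << "__STDC_ISO_10646__ = " << __STDC_ISO_10646__ << "\n";
    #else
        std::cout << "wchar_t is not specified to hold Unicode code points here\n";
    #endif
    }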
I beg your pardon?
Apologies. I should not have said that you have a closed mind about this issue.
UTF-8 is a standard far beyond what Linux uses.
A standard for what ? There are three Unicode character sets which are generally used, UTF-8, UTF-16, and UTF-32. I can not understand why you think one of them is some sort of standard for something.
Then look at which of those encodings is "dominant" on the Web (HTML pages, PHP scripts, template files for various CMSs, CSS files, WSDL files, ...), in various database systems (most of those adopting wchar_t and UTF-16 (UCS-2 really) had quite a lot of problems because of this, just as Windows had at some point), or in XML files in general, which are used basically everywhere. Just have a look at what encoding the XML files zipped inside *Microsoft* Office documents (docx, xlsx, pptx, etc.) use. Hint: no, it's not UTF-16.

But the most important reason why I think that UTF-8 is superior to UTF-16/32 is that it is the only truly portable format. Yes, you can use UTF-16 or UTF-32 as your internal representation of characters on a certain platform, but if you expect to publish the data and move it to other computers, then the only rational thing to do is to use UTF-8, where you don't have to deal with *stupid* byte order marks or any other similar nonsense.

I think it is about time that we pick a single character set and a single encoding, because if we don't, we are back at the point where we started 30-40 years ago. We just won't have ISO-8859-X, CP-XYZ, ... but UTF-XY, UCS-N, ... instead. And actually I think that the usually highly overrated "invisible hand of the market" has done the right thing this time and already picked the "best" encoding for us (see above). Even Microsoft will wake up one of these days and accept this. The fact that they already use UTF-8 in their own document formats is IMO a proof of that.
All systems use English as the base as it is the best practice.
That's a pretty bald statement but I do not think it is true. But even if it were, why should everybody doing something one way be proof that that way is best ? I would much rather pursue a technical solution that I felt was best even if no one else thought so.
I do not think that it is bald (nor bold ;-)). Using the basic character set and English has one big advantage: you won't have problems with Unicode normalization. But, for people willing to take the risks of their code being unportable, etc., I don't see why we could not add another overload of translate which would accept wchar-based strings, somehow "detect" the encoding, convert it to UTF-8 for the backend (gettext in this case) if necessary, and return a wide string with the translation. I don't mind if other people want to risk shooting themselves in the foot if that is their own free decision :-) This will be temporary anyway because the UTF-8 literals are already coming. [snip/] Matus
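A minimal sketch of such an overload as a free wrapper function (the name wtranslate_utf8 and the blanket assumption that wchar_t strings are UTF-16/UTF-32 are mine, not Boost.Locale's; boost::locale::conv::utf_to_utf and boost::locale::translate are taken from the library under review):

    // Sketch: accept a wide message id, hand a UTF-8 key to the gettext-style
    // backend, and give the translation back as a wide string.
    #include <boost/locale.hpp>
    #include <string>

    std::wstring wtranslate_utf8(const std::wstring &wide_id, const std::locale &loc)
    {
        using boost::locale::conv::utf_to_utf;
        // Assumes the wchar_t string is UTF-16 or UTF-32, depending on platform.
        std::string utf8_id = utf_to_utf<char>(wide_id);
        // Look the UTF-8 key up in the catalog, then widen the result again.
        std::string translated = boost::locale::translate(utf8_id).str(loc);
        return utf_to_utf<wchar_t>(translated);
    }

Whether something like this belongs in the library or in user code is, of course, exactly what is being debated in this thread.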

Why do some people think one encoding of UCS is better than others?
On Wed, Apr 20, 2011 at 4:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
[snip/]
-- Ryou Ezoe

On Wed, Apr 20, 2011 at 9:58 AM, Ryou Ezoe <boostcpp@gmail.com> wrote:
On Wed, Apr 20, 2011 at 4:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Apr 20, 2011 at 3:54 AM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 4/19/2011 9:05 AM, Artyom wrote:
Why do some people think one encoding of UCS is better than others?
I could ask why some people think that one character set (UCS) is better than others, but I think the answer is obvious, and to me it is also obvious why to use one character encoding.
1) Because some of those people don't do just Windows programming or just Mac programming or Linux programming, but they do multi-platform programming, and sometimes they do it for machines with different byte orderings.
2) If you pick one encoding and use it consistently everywhere, your application does not need to include transcoding-related code or data. You can, for example, save a data file on a Mac and open it on Windows just by reading it byte by byte. You don't have to solve Little Endian/Big Endian related problems, i.e. you don't need byte order marks. The only time you need to transcode is on platforms where the OS/API uses another encoding, and such APIs already provide means for transcoding from UTF-8.
3) You use the same algorithms everywhere; you don't need to write the same algorithm n times (for UTF-8, UTF-16, UTF-32), and I'm not even going to start talking about maintaining and debugging that; you use just one of them.
4) If someone on the other end of the globe uses the same approach and wishes to use the output of your applications, or wants to feed it some input data, he/she does not need to do the transcoding.
5) You don't need dumb macro-based character type switching.
6) Your library plays well with other libraries using char. Try to count the third-party libraries that use char-based strings in their APIs and count those using only wchar-based strings. Compare.
[snip/]
Matus
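As an illustration of point 2), a sketch of keeping UTF-8 everywhere and transcoding only where a platform API demands it (the wrapper below and its name are made up for the example):

    // Sketch: keep UTF-8 throughout the application and transcode to the
    // platform's wide encoding only at the OS API boundary (here CreateFileW).
    #include <string>
    #include <vector>

    #ifdef _WIN32
    #  include <windows.h>

    HANDLE open_for_reading(const std::string &utf8_path)
    {
        // UTF-8 -> UTF-16 just for this one call.
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8_path.c_str(), -1, NULL, 0);
        if (wlen <= 0)
            return INVALID_HANDLE_VALUE;
        std::vector<wchar_t> wide(wlen);
        MultiByteToWideChar(CP_UTF8, 0, utf8_path.c_str(), -1, &wide[0], wlen);
        return CreateFileW(&wide[0], GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    }
    #else
    #  include <fcntl.h>

    int open_for_reading(const std::string &utf8_path)
    {
        // Most POSIX systems take the UTF-8 bytes as they are.
        return open(utf8_path.c_str(), O_RDONLY);
    }
    #endif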

On Tuesday, April 19, 2011, Edward Diener wrote:
My pick is to use what the language currently provides, which is wchar_t, which can represent UTF-16, a popular Unicode variant which also happens to be the standard for wide characters on Windows, which just happens to be the dominant operating system in the world ( by a lot ) in terms of end-users.
I thought the standard was pretty loose about wchar_t, and that it could even legally be a single byte in size?
participants (11)
- Artyom
- Chad Nelson
- Edward Diener
- Frank Mori Hess
- Mathieu Champlon
- Matus Chochlik
- Paul A. Bristow
- Peter Dimov
- Ryou Ezoe
- Sergey Cheban
- Soares Chen Ruo Fei