Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

Hi Artyom,
Currently the win32 backend supports only UTF-8 encodings, so you need to do one of:
- compile with ICU
- select std backend (because the default without ICU on Windows is win32)
- disable win32 backend (in build options) so only std backend would be used.
But you should note that under Windows only MSVC has std backend support; for gcc and ANSI encodings you need ICU.
Is there a reason why the win32 and std backends behave differently here?
Using std locale, the get_system_locale() method works for code pages of the form 'windows-XXX'. Others, e.g. 'Shift_JIS', should fail.
And if you want the CURRENT locale on Microsoft Windows, I simply can't see how to get that.
But why not add generator.generate( void ) giving exactly that?
Best regards
Bjoern.

________________________________
From: "Thiel, Bjoern" <bjoern.thiel@mpibpc.mpg.de>
To: "boost@lists.boost.org" <boost@lists.boost.org>
Sent: Tuesday, November 27, 2012 1:52 PM
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding
Hi Artyom,
Currently the win32 backend supports only UTF-8 encodings, so you need to do one of:
- compile with ICU
- select std backend (because the default without ICU on Windows is win32)
- disable win32 backend (in build options) so only std backend would be used.
But you should note that under Windows only MSVC has std backend support; for gcc and ANSI encodings you need ICU.
Is there a reason why the win32 and std backends behave differently here?
Yes, because they are entirely different backends. Implementing ANSI encoding support in the win32 backend would have required significant effort, and because Windows has deprecated ANSI encodings in favor of Unicode, it was decided not to support them.
Using std locale, the get_system_locale() method works for code pages of the form 'windows-XXX'. Others, e.g. 'Shift_JIS', should fail.
In general, for locales with Shift_JIS it should return something like ja_JP.windows-932 (which is Shift_JIS). Is there a problem with that?
And if you want the CURRENT locale on Microsoft Windows, I simply can't see how to get that.
What do you mean? What kind of CURRENT locale do you need?
But why not add generator.generate( void ) giving exactly that?
Once again, what do you mean?
Best regards
Bjoern.
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

Hi Artyom,
On Wed, Nov 28, 2012 at 1:33 AM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
Using std locale, the get_system_locale() method works for code pages of the form 'windows-XXX'. Others, e.g. 'Shift_JIS', should fail.
In general, for locales with Shift_JIS it should return something like ja_JP.windows-932 (which is Shift_JIS).
Is there a problem with that?
Code page 932 is different from Shift_JIS. See the explanation in Wikipedia: https://en.wikipedia.org/wiki/Code_page_932
Code page 932 contains extension characters on top of Shift_JIS, so we cannot mix the two.
-- Ryo IGARASHI, Ph.D. rigarash@gmail.com
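The difference between the two character sets can be checked with a minimal sketch. It assumes that Python's bundled `shift_jis` and `cp932` codecs are acceptable stand-ins for strict Shift_JIS and for Microsoft's code page 932; the helper name `encodable` is made up for this illustration and is not part of Boost.Locale.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# HIRAGANA LETTER A is part of JIS X 0208, so both encodings have it.
hiragana_a = "\u3042"
# CIRCLED DIGIT ONE is one of the extension characters that CP932 adds.
circled_one = "\u2460"

for label, ch in (("hiragana a", hiragana_a), ("circled one", circled_one)):
    print(label,
          "shift_jis:", encodable(ch, "shift_jis"),
          "cp932:", encodable(ch, "cp932"))
```

With these codecs the hiragana letter is encodable in both, while the circled digit is encodable only in `cp932`, which is the "CP932 is wider" situation described above.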

If so, there is no locale under Windows that works with Shift_JIS...
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
________________________________
From: Ryo IGARASHI <rigarash@gmail.com>
To: boost <boost@lists.boost.org>
Sent: Thursday, November 29, 2012 7:16 AM
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding
Hi Artyom,
On Wed, Nov 28, 2012 at 1:33 AM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
Using std locale, the get_system_locale() method works for code pages of the form 'windows-XXX'. Others, e.g. 'Shift_JIS', should fail.
In general, for locales with Shift_JIS it should return something like ja_JP.windows-932 (which is Shift_JIS).
Is there a problem with that?
Code page 932 is different from shift_jis. See the explanation in Wikipedia: https://en.wikipedia.org/wiki/Code_page_932 Code page 932 contains extension characters to Shift-JIS, so we cannot mix the two.
-- Ryo IGARASHI, Ph.D. rigarash@gmail.com

Hi Artyom,
On Thu, Nov 29, 2012 at 10:45 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
If so, there is no locale under Windows that works with Shift_JIS...
Strictly speaking, you are right. You cannot use a Japanese locale with the strict Shift_JIS character set on Windows. However, all characters in Shift_JIS can be represented in CP932, since the CP932 character set is wider than Shift_JIS.
The text below may be off-topic for Boost.Locale, but it might explain why (I believe) Japanese Windows programmers are reluctant to convert text to UTF-8 (on Windows).
If you have a CP932-encoded string, convert it to UTF-8, and then convert it back to a CP932 string, the first and the third strings may be *different*. This means that the original information is (somewhat) lost. See the reference information from Microsoft:
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q170559
(Note that 'Shift JIS' in the above link means CP932.)
This means that in order to handle Japanese strings properly under Windows, programmers are encouraged not to convert at all.
Moreover, at least 2 major (and slightly different) Shift_JIS <-> UTF-8 mapping tables exist, i.e. the same Shift_JIS text will map to different UTF-8 strings. (The 2 are provided by the Unicode Consortium and Microsoft.)
Best regards,
-- Ryo IGARASHI, Ph.D. rigarash@gmail.com
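The point about differing mapping tables can be illustrated with a small sketch. It assumes that Python's `shift_jis` and `cp932` codecs are reasonable stand-ins for two differing tables; they are not guaranteed to be identical to the Unicode Consortium and Microsoft tables mentioned above. The byte pair 0x81 0x60 (the double-byte "wave dash") is chosen as the sample.

```python
# The same two bytes, decoded with two different mapping tables.
raw = b"\x81\x60"

via_shift_jis = raw.decode("shift_jis")
via_cp932 = raw.decode("cp932")

print(f"shift_jis -> U+{ord(via_shift_jis):04X}")
print(f"cp932     -> U+{ord(via_cp932):04X}")

# The resulting UTF-8 byte strings differ even though the input was identical.
print(via_shift_jis.encode("utf-8") == via_cp932.encode("utf-8"))
```

With these codecs the first decode yields U+301C (WAVE DASH) and the second yields U+FF5E (FULLWIDTH TILDE), so the same input bytes end up as different UTF-8 strings depending on the table.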

On 30/11/12 21:29, Ryo IGARASHI wrote:
If you have a CP932-encoded string, convert to UTF-8, and then convert back to CP932 string, the first and the third string may be *different*. This means that the original information is (somewhat) lost.
This means that in order to handle the Japanese string properly under Windows, the programmers are encouraged not to convert at all.
Moreover, at least 2 major (and slightly different) Shift_JIS <-> UTF-8 mapping tables exist, i.e. the same Shift_JIS text will map to different UTF-8 strings. (The 2 are provided by the Unicode Consortium and Microsoft.)
Sorry if this sounds silly, but what's the problem if we just stick to one mapping table consistently?

Hi, Jookia,
On Fri, Nov 30, 2012 at 7:35 PM, Jookia <166291@gmail.com> wrote:
Sorry if this sounds silly, but what's the problem if we just stick to one mapping table consistently?
The round trip conversion is the problem.
Suppose you write a program which communicates with legacy software that can only handle "NEC selection of IBM extension" characters (see [1] for what this means; see [2] for the complete table). If my new program converts the input to UTF-8 (1st) and converts it back as input to the legacy software (2nd), those characters are now "IBM extension" characters, which the legacy software fails to handle.
This problem is inevitable even if we stick to one mapping table, and avoidable only if I do not convert at all.
[1] https://en.wikipedia.org/wiki/Code_page_932
[2] http://www2d.biglobe.ne.jp/~msyk/charcode/cp932/uni2sjis.html (Japanese)
Best regards,
-- Ryo IGARASHI, Ph.D. rigarash@gmail.com
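The round trip described above can be sketched as follows. The sketch assumes Python's `cp932` codec as a stand-in for the Windows conversion, and uses the byte pairs 0xED 0x40 ("NEC selection of IBM extension") and 0xFA 0x5C ("IBM extension"), which the CP932 table lists for the same character. The helper name `round_trip` is made up for this illustration. Which of the two forms survives depends on the conversion table in use, so the sketch only counts the survivors instead of naming one.

```python
def round_trip(raw: bytes, encoding: str = "cp932") -> bytes:
    """Convert legacy bytes to UTF-8 and back to the legacy encoding."""
    utf8 = raw.decode(encoding).encode("utf-8")
    return utf8.decode("utf-8").encode(encoding)


nec_selected = b"\xed\x40"   # "NEC selection of IBM extension" form
ibm_extension = b"\xfa\x5c"  # "IBM extension" form

# Both byte pairs decode to the same Unicode character ...
same_char = nec_selected.decode("cp932") == ibm_extension.decode("cp932")

# ... so encoding back can reproduce at most one of the two originals.
survivors = [raw for raw in (nec_selected, ibm_extension)
             if round_trip(raw) == raw]

print("same character:", same_char)
print("forms surviving the round trip:", len(survivors), "of 2")
```

Because the encoder has to pick one byte sequence per character, one of the two inputs comes back changed no matter which single mapping table is used, which is the problem for legacy software that accepts only one of the forms.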

On Fri, Nov 30, 2012 at 12:29 PM, Ryo IGARASHI <rigarash@gmail.com> wrote:
Hi Artyom,
On Thu, Nov 29, 2012 at 10:45 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
If so, there is no locale under Windows that works with Shift_JIS...
[...] See the reference information from Microsoft: http://support.microsoft.com/default.aspx?scid=kb;en-us;Q170559 (Note that 'Shift JIS' in the above link means CP932)
This means that in order to handle the Japanese string properly under Windows, the programmers are encouraged not to convert at all. [...]
As I understand from the page, the problem with CP932 is that it has duplicate code points, so a CP932 → UTF-8 → CP932 round trip will result in text that is binary different but semantically identical. I do not see a problem with this.
Unicode itself has *many more* ways to encode the same thing, including, but not limited to, duplicate code points and combining characters. And we have been living with this just fine for years. The solution is to use normalization if this *really* matters. And where it matters (comparison, likely. What else?) you will be forced to normalize your CP932 too...
-- Yakov
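The normalization argument can be illustrated on the Unicode side with a minimal sketch using Python's standard `unicodedata` module. The helper name `same_text` is made up for this illustration; the choice of NFC is an assumption, and any normalization form applied consistently would serve the comparison.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == combining)           # False: different code points
print(same_text(precomposed, combining))  # True: same text after NFC
```

A comparison of CP932 byte strings would need an analogous canonical form, for example decoding both sides to Unicode before comparing.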
participants (5)
- Artyom Beilis
- Jookia
- Ryo IGARASHI
- Thiel, Bjoern
- Yakov Galka