[locale] [filesystem] Windows local 8 bit encoding

Hi,

developing platform independent code I really like the convenience functions conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale. Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf, following the rationale from filesystem (path encoding conversions):

    template< typename CharType >
    std::basic_string< CharType > local8bit_to_utf
        ( std::string const & text, method_type how = default_method )
    {
        char const * encoding = impl::local8bit_encoding() ;
        return to_utf< CharType >( text, encoding, how ) ;
    }

    template< typename CharType >
    std::string local8bit_from_utf
        ( std::basic_string< CharType > const & text, method_type how = default_method )
    {
        char const * encoding = impl::local8bit_encoding() ;
        return from_utf< CharType >( text, encoding, how ) ;
    }

with

    char const * local8bit_encoding()
    {
    #ifdef WIN32
        UINT codepage = AreFileApisANSI() ? GetACP() : GetOEMCP() ;
        return windows_codepage_to_encoding( codepage ) ;
    #else
        return "UTF-8" ;
    #endif
    }

and with (better using a map)

    char const * windows_codepage_to_encoding( int const codepage )
    {
        switch (codepage)
        {
        case   874: return "windows-874" ;
        case   932: return "Shift_JIS" ; // but should be "Windows-31J" ;
        case   936: return "GB2312" ;
        case   949: return "KS_C_5601-1987" ;
        case   950: return "Big5" ;
        case  1250: return "windows-1250" ;
        case  1251: return "windows-1251" ;
        case  1252: return "windows-1252" ;
        case  1253: return "windows-1253" ;
        case  1254: return "windows-1254" ;
        case  1255: return "windows-1255" ;
        case  1256: return "windows-1256" ;
        case  1257: return "windows-1257" ;
        case  1258: return "windows-1258" ;
        case 20127: return "US-ASCII" ;
        case 20866: return "KOI8-R" ;
        case 20932: return "EUC-JP" ;
        case 21866: return "KOI8-U" ;
        case 28591: return "ISO-8859-1" ;
        case 28592: return "ISO-8859-2" ;
        case 28593: return "ISO-8859-3" ;
        case 28594: return "ISO-8859-4" ;
        case 28595: return "ISO-8859-5" ;
        case 28596: return "ISO-8859-6" ;
        case 28597: return "ISO-8859-7" ;
        case 28598: return "ISO-8859-8" ;
        case 28599: return "ISO-8859-9" ;
        case 28603: return "ISO-8859-13" ;
        case 28605: return "ISO-8859-15" ;
        case 50220: return "ISO-2022-JP" ;
        case 50225: return "ISO-2022-KR" ;
        case 51949: return "EUC-KR" ;
        case 54936: return "GB18030" ;
        case 65001: return "UTF-8" ;
        default:
            {
                std::ostringstream message ;
                message << "Unknown codepage " << codepage ;
                throw std::invalid_argument( message.str() ) ;
            }
        }
    }

Best regards
Bjoern.
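The "better using a map" remark in the proposal could look roughly like the sketch below. It is only an illustration: the name find_encoding, the choice of std::map, and returning a null pointer instead of throwing are assumptions, not part of the proposal, and only a handful of rows from the switch are repeated.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical table-driven variant of windows_codepage_to_encoding.
// Returns the encoding name, or nullptr when the code page is not listed,
// so the caller decides whether an unknown code page is an error.
char const* find_encoding(unsigned codepage)
{
    // Only a few rows from the switch are repeated here.
    static std::map<unsigned, char const*> const table = {
        {   874, "windows-874"  },
        {   932, "Shift_JIS"    },
        {  1250, "windows-1250" },
        {  1251, "windows-1251" },
        {  1252, "windows-1252" },
        { 28591, "ISO-8859-1"   },
        { 65001, "UTF-8"        },
    };
    auto const it = table.find(codepage);
    return it == table.end() ? nullptr : it->second;
}

// Small self-check; call it from main() or a unit test.
void self_check()
{
    assert(std::string(find_encoding(1252)) == "windows-1252");
    assert(find_encoding(12345) == nullptr);
}
```

A table keeps the data separate from the lookup logic, which makes it easier to extend than a long switch.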

On Wed, Oct 31, 2012 at 4:07 PM, Thiel, Bjoern <bjoern.thiel@mpibpc.mpg.de> wrote:
> Hi,
> developing platform independent code I really like the convenience functions conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale. Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf following the rationale from filesystem (path encoding conversions):

I cannot speak for Artyom, but IMO there is little use for such functions. On Windows, 'ANSI' encodings exist solely for legacy reasons, and their use is limited to legacy code and to code that gives up Unicode support in the first place. Boost.Filesystem uses 'ANSI' for narrow strings because Beman decided that compatibility with the Dinkumware CRT implementation is more important than portability of Unicode-correct code. The same is true for all other parts of Boost (except Locale).

If you are really into platform independent code, take a look at Boost.Nowide (http://cppcms.com/files/nowide/html/), which is waiting for review. In principle, on Windows, you need only two conversions: UTF-8 into UTF-16 and vice versa.

Cheers,
--
Yakov
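As a rough illustration of the UTF-8 into UTF-16 direction mentioned above, here is a hand-written sketch in standard C++ only. The function name utf8_to_utf16 is an assumption made for this example. Real code would more likely call conv::utf_to_utf from Boost.Locale or the Win32 function MultiByteToWideChar with CP_UTF8. The result type is std::u16string so that the sketch is portable; on Windows the same code units would normally be stored in a std::wstring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decodes UTF-8 and re-encodes it as UTF-16; throws on malformed input.
std::u16string utf8_to_utf16(std::string const& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char const lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t len = 0;    // length of the UTF-8 sequence in bytes
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char const c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        assert(len >= 1 && len <= 4);
        static std::uint32_t const min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The reverse direction follows the same pattern: combine surrogate pairs into a code point, then emit one to four UTF-8 bytes.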

On 01/11/12 03:41, Yakov Galka wrote:
> On Wed, Oct 31, 2012 at 4:07 PM, Thiel, Bjoern <bjoern.thiel@mpibpc.mpg.de> wrote:
>> Hi,
>> developing platform independent code I really like the convenience functions conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale. Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf following the rationale from filesystem (path encoding conversions):
> Cannot talk for Artyom, but IMO there is little use to such functions. On Windows, 'ANSI' encodings exist solely for legacy reasons, and their use is limited to legacy code and code that gives up Unicode support in the first place. Boost.Filesystem uses 'ANSI' for narrow strings because Beman decided that compatibility with the dinkumware CRT implementation is more important than portability of Unicode correct code. The same is true for all other parts of boost (except Locale).
> If you are really into platform independent code, take a look at Boost.Nowide (http://cppcms.com/files/nowide/html/) waiting for review. In principle, on Windows, you need only two conversions: UTF-8 into UTF-16 and vice versa.
> Cheers,

Hello all.

Although this is right, I do think locales themselves should specify legacy encodings if they're using them. As I understand it, couldn't you use from_utf/to_utf and get the desired behaviour without a separate set of functions?

Jookia.

[Yakov Galka]
> Cannot talk for Artyom, but IMO there is little use to such functions. On Windows, 'ANSI' encodings exist solely for legacy reasons, and their use is limited to legacy code and code that gives up Unicode support in the first place. Boost.Filesystem uses 'ANSI' for narrow strings because Beman decided that compatibility with the dinkumware CRT implementation is more important than portability of Unicode correct code.

FYI, MSVC's C++ Standard Library implementation is licensed from Dinkumware, but MSVC's CRT is not.

Stephan T. Lavavej
Visual C++ Libraries Developer

________________________________
From: "Thiel, Bjoern" <bjoern.thiel@mpibpc.mpg.de>
To: "boost@lists.boost.org" <boost@lists.boost.org>
Sent: Wednesday, October 31, 2012 4:07 PM
Subject: [boost] [locale] [filesystem] Windows local 8 bit encoding

> Hi,
> developing platform independent code I really like the convenience functions conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale. Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf

First of all, the locale encoding is not constant. There are numerous ways to change the locale, for example by calling the C function

    setlocale(LC_ALL,"en_US.ISO-8859-1")

or

    setlocale(LC_ALL,"English_USA.1251")

You can also change it in C++:

    std::locale::global(std::locale("en_US.ISO-8859-1"));

or

    std::locale::global(std::locale("English_USA.1251"));

Of course, on POSIX platforms even something like

    setenv("LANG","en_US.ISO-8859-1",1)

called right at the start of main() would effectively change the process locale. Some functions are affected by such changes and others are not; it depends on the implementation and other things.

Thus the "concept" of the OS locale is quite uncertain and not well defined, especially under Microsoft Windows.

Using Boost.Locale you can convert to the locale encoding of a given std::locale() object generated with Boost.Locale. boost::locale::generator lets you select the legacy "ANSI" encoding instead of UTF-8 as the default when creating the locale object that corresponds to the system locale. You can then use this object with the to_utf and from_utf functions.

> following the rationale from filesystem (path encoding conversions):
> [snip]

I think Boost.Filesystem's approach is too simplistic: it uses the default Windows encoding as its default behavior on Windows, which makes cross-platform development harder. So if you want to write cross-platform software, stick to UTF-8 and, at the boundary of the Win32 API, convert to the wide API, which is the native Windows API and the correct one to use.

> Best regards
> Bjoern.

Best,
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
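The point that the process locale is mutable state rather than a constant can be seen with the standard C library alone. The sketch below is only an illustration; the names c_locale_name, locale_snapshot, and observe_locale_change are assumptions made for this example. What the second name turns out to be depends entirely on the environment the program runs in.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Name of the current C locale; setlocale with a null pointer only queries.
std::string c_locale_name()
{
    char const* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

struct locale_snapshot {
    std::string before;  // locale name when the function was entered
    std::string after;   // locale name after adopting the environment's locale
};

// Shows that a single call changes what "the process locale" means.
locale_snapshot observe_locale_change()
{
    locale_snapshot s;
    s.before = c_locale_name();
    // Adopt the user's environment; on failure the locale stays unchanged.
    std::setlocale(LC_ALL, "");
    s.after = c_locale_name();
    // Restore the "C" locale so the rest of the program is unaffected.
    char const* restored = std::setlocale(LC_ALL, "C");
    assert(restored != nullptr);
    (void)restored;
    return s;
}
```

A program that has not touched the locale starts in the "C" locale, so before is "C" when this is called first. Any library that caches an encoding derived from the locale can therefore go stale after such a call.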

________________________________________
From: boost-bounces@lists.boost.org [boost-bounces@lists.boost.org] on behalf of Artyom Beilis [artyomtnk@yahoo.com]
Sent: Thursday, November 01, 2012 09:57
To: boost@lists.boost.org
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

>> Hi,
>> developing platform independent code I really like the convenience functions conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale. Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf
> First of all locale encoding is not constant, for example there are numerous way to change locale
> [...]
> Thus the "concept" of the OS locale is quite uncertain and not well defined especially under Microsoft Windows.

Right

> Using Boost.Locale you can convert to locale encoding of a given std::locale() object generated with Boost.Locale.
> boost::locale::generator allows to select legacy "ANSI" encoding instead of UTF-8 to be default upon creation of the locale object that corresponds to the system locale.
> This object you can use with to_utf and from_utf functions.

Unfortunately that does not work under Microsoft Windows, as

    generator locale_generator ;
    locale_generator.use_ansi_encoding( true ) ;
    std::locale const current_locale = locale_generator.generate( name ) ;

needs a name. If I use the application locale name

    std::string const name = std::locale().name() ;

I get "C", which gives me "US-ASCII" encoding and not the "windows-1252" encoding I have. Even if I use the system locale name

    std::string const name = std::locale( "" ).name() ;

I get "English_United States.1252", which gives me the codepage "1252" as encoding and not "windows-1252" either (conv::to_utf and conv::from_utf just throw "Invalid or unsupported charset:1252" in this case).

> [...]
> So if you want to write cross platform software stick to UTF-8 and on the boundary of Win32 API convert it to Wide API which is the native Windows API and the correct one to use.

Actually I'm trying to make a shared object (a dll) platform independent that has to do some character conversions according to the current application locale.

Best regards
Bjoern.
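The mismatch described above, where the name "English_United States.1252" carries only the bare code page "1252", could be bridged by a small helper along these lines. This is a sketch under stated assumptions: the function encoding_from_locale_name is hypothetical and not part of Boost.Locale, and it only rewrites the code pages 874 and 1250 to 1258, for which the "windows-NNNN" names appear in the table from the initial posting. Anything else would need a full table.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: extracts the codeset part of a locale name and turns
// a bare Windows code page number into a "windows-NNNN" style encoding name.
// Returns an empty string when the name carries no codeset at all.
std::string encoding_from_locale_name(std::string const& name)
{
    std::string::size_type const dot = name.rfind('.');
    if (dot == std::string::npos)
        return "";                      // "C", "POSIX", "en_US": no codeset given

    std::string codeset = name.substr(dot + 1);
    std::string::size_type const at = codeset.find('@');
    if (at != std::string::npos)
        codeset.erase(at);              // drop a POSIX modifier such as "@euro"

    // Only 874 and 1250..1258 are rewritten; other values pass through as-is.
    bool const windows_page =
        codeset == "874" ||
        (codeset.size() == 4 && codeset.compare(0, 3, "125") == 0 &&
         codeset[3] >= '0' && codeset[3] <= '8');
    return windows_page ? "windows-" + codeset : codeset;
}

// Usage examples, written as checks.
void check_examples()
{
    assert(encoding_from_locale_name("English_United States.1252") == "windows-1252");
    assert(encoding_from_locale_name("en_US.ISO-8859-1") == "ISO-8859-1");
    assert(encoding_from_locale_name("C") == "");
}
```

The returned name could then be passed as the charset argument of conv::to_utf or conv::from_utf, with the empty result treated as "unknown" by the caller.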

>> Using Boost.Locale you can convert to locale encoding of a given std::locale() object generated with Boost.Locale.
>> boost::locale::generator allows to select legacy "ANSI" encoding instead of UTF-8 to be default upon creation of the locale object that corresponds to the system locale.
>> This object you can use with to_utf and from_utf functions.
> Unfortunately that does not work under Microsoft Windows as generator locale_generator ; locale_generator.use_ansi_encoding( true ) ; std::locale const current_locale = locale_generator.generate( name ) ; needs a name.

Just as with std::locale(""), calling locale_generator.generate("") gives the expected result, i.e. the system default locale.

See: http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/locale_gen.html

> Best regards
> Bjoern.

Regards
Artyom Beilis

________________________________________
From: boost-bounces@lists.boost.org [boost-bounces@lists.boost.org] on behalf of Artyom Beilis [artyomtnk@yahoo.com]
Sent: Thursday, November 01, 2012 15:59
To: boost@lists.boost.org
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

>>> Using Boost.Locale you can convert to locale encoding of a given std::locale() object generated with Boost.Locale.
>>> boost::locale::generator allows to select legacy "ANSI" encoding instead of UTF-8 to be default upon creation of the locale object that corresponds to the system locale.
>>> This object you can use with to_utf and from_utf functions.
>> Unfortunately that does not work under Microsoft Windows as generator locale_generator ; locale_generator.use_ansi_encoding( true ) ; std::locale const current_locale = locale_generator.generate( name ) ; needs a name.
> Similar to creating std::locale("") the generation locale_generator.generate("") gives the expected result, i.e. system default locale.

Unfortunately, under Microsoft Windows this only gives the 'hardwired' "UTF-8" encoding:

    void prepare_data()
    {
        ...
        if(locale_id_.empty()) {
            real_id_ = util::get_system_locale(true); // always UTF-8
            ...
        }
        ...
    }

where util::get_system_locale(false) would at least do part of the mapping from codepage to encoding (see my initial posting).

Best regards
Bjoern.

On 02/11/12 01:43, Thiel, Bjoern wrote:
> Unfortunately that does not work under Microsoft Windows as generator locale_generator ; locale_generator.use_ansi_encoding( true ) ; std::locale const current_locale = locale_generator.generate( name ) ; needs a name.
> If I use the application locale name std::string const name = std::locale().name() ; I get "C" which gives me "US-ASCII" encoding and not the "windows-1252" encoding I have.
> Even if I use the system locale name std::string const name = std::locale( "" ).name() ; I get "English_United States.1252" which gives me the codepage "1252" as encoding and not "windows-1252" either (conv::to_utf and conv::from_utf just throw "Invalid or unsupported charset:1252" in this case).
> Best regards
> Bjoern.

Hey! Sorry if this sounds silly, but have you tried util::get_system_locale?

Jookia.

Hi Artyom,
> Using Boost.Locale you can convert to locale encoding of a given std::locale() object generated with Boost.Locale.
> boost::locale::generator allows to select legacy "ANSI" encoding instead of UTF-8 to be default upon creation of the locale object that corresponds to the system locale.
> This object you can use with to_utf and from_utf functions.

Right - you can use them. But they are not very helpful. If you want the SYSTEM locale on Microsoft Windows:

    generator locale_generator ;
    locale_generator.use_ansi_encoding( true ) ;
    wstring = conv::to_utf< wchar_t >( string, locale_generator( "" ) ) ;

unfortunately gives UTF-8 encoding as well.

And if you want the CURRENT locale on Microsoft Windows, I simply can't see how to get that.

But why not add generator.generate( void ) giving exactly that. Together with 'really' use_ansi_encoding( true ) it would be perfect.

Best regards
Bjoern.

Which backend do you use? There are several applicable to Windows:

- icu - based on the ICU library
- win32 - based on the Win32 API
- std - based on the standard C++ library

They are selected in this order, from those that are compiled in. Currently the win32 backend supports only UTF-8 encodings, so you need to do one of the following:

- compile with ICU
- select the std backend (because the default without ICU on Windows is win32)
- disable the win32 backend (in the build options) so that only the std backend is used.

But you should note that under Windows only MSVC has std backend support; for gcc and ANSI encodings you need ICU.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

________________________________
From: "Thiel, Bjoern" <bjoern.thiel@mpibpc.mpg.de>
To: "boost@lists.boost.org" <boost@lists.boost.org>
Sent: Tuesday, November 6, 2012 12:47 PM
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

> Hi Artyom,
>> Using Boost.Locale you can convert to locale encoding of a given std::locale() object generated with Boost.Locale.
>> boost::locale::generator allows to select legacy "ANSI" encoding instead of UTF-8 to be default upon creation of the locale object that corresponds to the system locale.
>> This object you can use with to_utf and from_utf functions.
> Right - you can use them. But they are not very helpful. If you want the SYSTEM locale on Microsoft Windows: generator locale_generator ; locale_generator.use_ansi_encoding( true ) ; wstring = conv::to_utf< wchar_t >( string, locale_generator( "" ) ) ; unfortunately gives UTF-8 encoding as well.
> And if you want the CURRENT locale on Microsoft Windows, I simply can't see how to get that.
> But why not add generator.generate( void ) giving exactly that. Together with 'really' use_ansi_encoding( true ) it would be perfect.
> Best regards
> Bjoern.

participants (5)
- Artyom Beilis
- Jookia
- Stephan T. Lavavej
- Thiel, Bjoern
- Yakov Galka