[General] Treat narrow strings as UTF-8 (compilation flag)

Hello All,

About half a year ago there was a long discussion titled "Always treat std::strings as UTF-8". The only objection to the proposal was that an instant switch to assuming UTF-8 by default would give surprising results to those who are unaware of the convention (or who prefer legacy encodings to UTF-8). This applies almost exclusively to Windows developers. However, many projects and organizations have already switched to UTF-8, even for Windows programming. The company I work for is one of them.

Nowadays:
==========
All the libraries that accept narrow strings assume the system encoding by default.

* filesystem::path - Can be configured through the static imbue() function.
* system_error_category (Windows error descriptions), interprocess (object names)... more? - Don't support Unicode at all. They use the narrow API on Windows.
* program_options - Assumes UTF-8 for internal data (Good!), but uses the system encoding for paths (parse_config_file) and for environment variables (Bad...).

Note that path::imbue(), for example, is a painful solution for two reasons:

* Any global state initialization is problematic in dynamically linked, multi-threaded systems (like the one I'm maintaining now). In such cases a compile-time configuration is more attractive.
* I really don't want to have such a function in each boost library (although this can be solved by having a global boost::imbue).

Proposal:
========
Add a compile-time configuration flag that causes boost to treat all narrow strings as UTF-8. The flag will be off by default. For example, in filesystem it's a matter of setting `codepage` to CP_UTF8 in just two places.

Rationale:
==========
* Those who are ready to move to the UTF-8 future can do so by simply setting a compilation flag.
* Those who don't care about Unicode correctness are not affected by the addition.
* There won't be any complaints to boost like: "Hey! I use boost with these libraries and it doesn't work. Your encoding is wrong!".

--
Yakov Galka
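To make the proposal concrete, here is a minimal sketch of what such a compile-time switch could look like. The macro name BOOST_NARROW_STRINGS_ARE_UTF8 and the function narrow_codepage() are hypothetical and do not exist in Boost; only the numeric values of the Windows identifiers CP_ACP (0) and CP_UTF8 (65001) are real. The constants are repeated locally so that the sketch also compiles outside Windows.

```cpp
#include <cassert>

// Hypothetical flag, not an existing Boost macro. In a real build it would be
// passed on the compiler command line (e.g. -DBOOST_NARROW_STRINGS_ARE_UTF8)
// instead of being defined in the source.
#define BOOST_NARROW_STRINGS_ARE_UTF8

namespace sketch {

// Values of the Windows code page identifiers CP_ACP and CP_UTF8, repeated
// here so that the sketch does not depend on <windows.h>.
constexpr unsigned cp_acp  = 0;      // the system ("ANSI") code page
constexpr unsigned cp_utf8 = 65001;  // UTF-8

// The code page a library would hand to a narrow-to-wide conversion such as
// MultiByteToWideChar. The choice is fixed at compile time, so there is no
// global state to initialize at run time.
constexpr unsigned narrow_codepage()
{
#if defined(BOOST_NARROW_STRINGS_ARE_UTF8)
    return cp_utf8;
#else
    return cp_acp;
#endif
}

// With the flag defined above, the UTF-8 branch is selected.
static_assert(narrow_codepage() == cp_utf8, "flag selects UTF-8");

} // namespace sketch
```

Because the choice is a constant expression, every library built with the same flag agrees on the encoding without any imbue() call, which seems to be the property the proposal is after.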

Hello again,

My previous mail was ignored by the community, and I would like to know why. If it wasn't clear, I want to hear your opinion on the topic.

If there is a disagreement, I would like to know the reason for it. If there are problems in the proposal, perhaps we can fix them and come to a solution accepted by all.

If you agree in principle but just don't have the resources for this work, I'm going to do this work (or part of it). I just don't want to waste my time on something that is certain to be rejected.

Thank you in advance,
--
Yakov Galka

On Tue, Jul 5, 2011 at 19:25, Yakov Galka <ybungalobill@gmail.com> wrote:
> Add a compile-time configuration flag that causes boost to treat all narrow strings as UTF-8. The flag will be off by default.

Hello All,

I can suggest the following policy:

- Boost must deprecate use of the ANSI API on Windows everywhere.
- Boost must use only the Wide API, explicitly.
- Boost must treat all narrow strings as UTF-8, regardless of the fact that this is not compatible with _some_ other software that uses an ANSI encoding, and convert them to wide strings. To keep things simple, the conversion should be done only at the last stage, close to the OS system calls/C library calls such as CreateFileW, _wfopen or _wremove.
- Where possible, I think there should be an optional backward-compatibility build/compilation flag, like BOOST_WINDOWS_USE_ANSI_ENCODING, for those who want to stick with the old API.

I also want to explain why continuing to use the ANSI API is still not compatible, and will remain incompatible, even within existing software.

----------------------------------------------------
The ANSI/narrow API is not compatible with itself: there are several places where the encoding is defined, and it is used differently in different places, even within native Windows software like Visual Studio itself.
----------------------------------------------------

For example, this program does not do what is expected when it is compiled with Microsoft Visual Studio 2008/2010:

    #include <clocale>
    #include <cstdio>
    #include <fstream>

    int main() {
        std::setlocale(LC_ALL, "Russian_Russia.1251"); // (1) set Russian locale
        std::ofstream text("Мир.txt");                 // (2) name encoded as 1251
        text << "Hello" << std::endl;
        text.close();
        std::remove("Мир.txt");                        // (3) name encoded as 1251
    }

1. The global C locale is set to Russian, which sets the code page to 1251 (Cyrillic encoding).
2. The text stream is opened. "Мир.txt" is converted from CP1251 to UTF-16 and the file is created.
3. std::remove converts "Мир.txt" to UTF-16 according to the OS ANSI code page, which may not be the same code page as the one set in (1).

So the file remains on the system and is not removed, because two different parts of the same program use different narrow encodings. And this happens within the same runtime and the same compiler!

---------------------------------------------------------------
1. The ANSI API must be deprecated.
2. UTF-8 should be used by default.

Many libraries have adopted this policy on Windows, as ANSI encoding holds us back and makes cross-platform programming a nightmare.

Examples of libraries that adopted UTF-8 on Windows:

1. GTK/GTKmm
2. Sqlite3
3. Boost.Locale - the UTF-8 policy was welcomed by many reviewers

I'd put more libraries on this list, but none come to mind right now.

I'd suggest making this an official Boost policy and bringing it to formal review.

-----------------------------------------------------------
I would personally write patches for Boost libraries that still use the ANSI API and fix them if required.

Yakov - I am with you on this, because the current Windows/Unicode situation in Boost is very bad.

------------------------------------------------------------
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

----- Original Message -----
From: Yakov Galka <ybungalobill@gmail.com>
To: boost@lists.boost.org
Sent: Friday, July 22, 2011 9:49 AM
Subject: Re: [boost] [General] Treat narrow strings as UTF-8 (compilation flag)

> My previous mail was ignored by the community, and I would like to know why. If it wasn't clear, I want to hear your opinion on the topic.
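The policy suggested above (treat narrow strings as UTF-8 and convert to wide only next to the OS call) could be sketched roughly as follows. This is an illustration under stated assumptions, not code from any Boost library: the names utf8_to_utf16 and fopen_utf8 are hypothetical, and the decoder is hand-written so that the sketch stays portable; on Windows a real implementation would more likely call MultiByteToWideChar with CP_UTF8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

namespace sketch {

// Decode a UTF-8 string into UTF-16 code units, throwing on malformed input.
inline std::u16string utf8_to_utf16(const std::string& in)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Open a file whose name is UTF-8 encoded. The conversion to wide characters
// happens only here, immediately before the call into the C runtime.
inline std::FILE* fopen_utf8(const std::string& name, const char* mode)
{
#if defined(_WIN32)
    const std::u16string n = utf8_to_utf16(name);
    const std::u16string m = utf8_to_utf16(mode);
    const std::wstring wname(n.begin(), n.end());  // wchar_t is 16-bit on Windows
    const std::wstring wmode(m.begin(), m.end());
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // Assumption: on other systems the narrow file API takes the bytes as-is.
    return std::fopen(name.c_str(), mode);
#endif
}

} // namespace sketch
```

With a wrapper like this, the rest of a program can pass UTF-8 std::string values around unchanged, and the ANSI code page is not consulted on the Windows path.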

On Fri, Jul 22, 2011 at 2:49 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
> Hello again,
>
> My previous mail was ignored by the community, and I would like to know why. If it wasn't clear, I want to hear your opinion on the topic.
>
> If there is a disagreement, I would like to know what is the reason for the disagreement. If there are problems in the proposal, perhaps we can fix them and come to a solution accepted by all.
>
> If you agree in principle but just don't have the resources for this work, I'm going to do this work (or part of it). I just don't want to waste my time on something that is certainly going to be rejected.

The default encoding for narrow and wide characters char and wchar_t in Boost libraries mirrors the default encoding for these characters in C++ and the standard library, which in turn mirrors the default encoding supplied by the various operating systems. It is as simple as that.

So if you want a change, convince these folks to change their default encoding, and give users a chance to adjust to that change.

Until that happens, you might get a more positive response from the maintainers of Boost libraries if you figured out a way for users to globally change the assumed default narrow encoding from the system encoding to encoding X, where X may well be UTF-8 but also might be something else, such as some of the encodings widely used in Asia.

Just my personal opinion,

--Beman

Beman Dawes wrote:
> On Fri, Jul 22, 2011 at 2:49 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
>
> The default encoding for narrow and wide characters char and wchar_t in Boost libraries mirrors the default encoding for these characters in C++ and the standard library, which is in turn mirroring the default encoding supplied by various operating systems. It is as simple as that.
>
> So if you want a change, convince these folks to change their default encoding, and give users a chance to adjust to that change.

+1

The idea that one change should be propagated to all Boost libraries over the authority of the Boost library authors has created havoc before. I'm referring to the redefinition/repurposing of BOOST_THROW_EXCEPTION.

Robert Ramey

----- Original Message -----
From: Beman Dawes <bdawes@acm.org>
To: boost@lists.boost.org

> The default encoding for narrow and wide characters char and wchar_t in Boost libraries mirrors the default encoding for these characters in C++ and the standard library, which is in turn mirroring the default encoding supplied by various operating systems. It is as simple as that.

Unfortunately, under Windows there is no such thing as a consistent narrow encoding, as I showed above in a simple example. The standard library is not consistent with itself!

Keeping the ANSI encoding holds us back and makes software development a total nightmare.

The current standard for narrow encoding under Windows is broken, and it is deprecated by Microsoft itself.

This situation should just be fixed. Boost is too valuable a piece of software to ignore the problem and defer to a broken and deprecated standard.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
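The inconsistency described here can be illustrated without any Windows API. The sketch below decodes the same three bytes, which spell "Мир" in code page 1251, under two different code pages. The choice of 1252 as the second code page is only an assumption for the sake of the example (the original message says merely that the OS ANSI code page may differ), and the decoders cover just ASCII plus the byte ranges needed here; all names are hypothetical.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Partial Windows-1251 decoder: ASCII plus 0xC0..0xFF, which map linearly
// to U+0410..U+044F (the basic Cyrillic letters).
inline char32_t decode_cp1251(unsigned char b)
{
    if (b < 0x80) return b;
    if (b >= 0xC0) return 0x0410 + (b - 0xC0);
    return 0xFFFD;  // bytes outside the scope of this sketch
}

// Partial Windows-1252 decoder: ASCII plus 0xA0..0xFF, which coincide
// with Latin-1 (U+00A0..U+00FF).
inline char32_t decode_cp1252(unsigned char b)
{
    if (b < 0x80) return b;
    if (b >= 0xA0) return b;
    return 0xFFFD;  // bytes outside the scope of this sketch
}

// Widen a narrow string byte by byte with the given single-byte decoder.
template <class Decoder>
std::u32string widen(const std::string& bytes, Decoder decode)
{
    std::u32string out;
    for (char c : bytes)
        out.push_back(decode(static_cast<unsigned char>(c)));
    return out;
}

} // namespace sketch
```

Widening the bytes CC E8 F0 followed by ".txt" with decode_cp1251 yields the name the file was created under, while decode_cp1252 yields "Ìèð.txt", a different file name. That is the situation in which the remove call in the earlier example would fail to find the file.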
participants (4)
- Artyom Beilis
- Beman Dawes
- Robert Ramey
- Yakov Galka