
On Wed, Oct 26, 2011 at 22:13, Beman Dawes <bdawes@acm.org> wrote:
> (2) V3 may work OK with the Microsoft 65001 UTF-8 codepage, although I've never used it myself and you would have to pass in a UTF-8 encoded narrow character name.
If that were possible, it would simplify everything. Unfortunately, you cannot set the UTF-8 codepage for Windows API functions (you can for the console, though). Microsoft isn't interested in making portability easier, so I don't see them adding this support in the near future. But they could. UTF-8 should be considered the only default narrow encoding on Windows, because all the other ANSI encodings are not Unicode-aware.

On Wed, Oct 26, 2011 at 22:47, Beman Dawes <bdawes@acm.org> wrote:
> On Wed, Oct 26, 2011 at 6:24 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
> [...]
>> Even if you fix the Unicode problems,
> What Unicode problems are you running into? Although there are some locale-related tickets outstanding, I'm not aware of any Unicode issues.
1) The one that was brought up in the previous thread.

2) The complexity of writing portable Unicode-aware code. Currently you're forcing me to either (a) use wstring on Windows, or, if I prefer my favorite portable UTF-8 encoded strings, (b) write all the boilerplate code that passes a codecvt everywhere as a parameter (see below for why not imbue()). In both cases you're shifting the complexity to the higher-level code. That's not a kind thing for you, as a low-level library developer, to do. The library is expected to hide the platform differences by providing a uniform interface.

Myth: "Using the native encoding on each platform results in portable code."

By some definition of 'portable', definitely yes. But not when things are shared among different platforms. It starts with files transferred between different systems and ends with the source code itself (there is a difference between "" and L""). Uniformity == simplicity.

Consider a simple case: load a path from some project file, then open the referenced file. The project file is encoded in UTF-8, making it portable among all systems with CHAR_BIT == 8.

    // Option a)
    #include "codecvt_implementation.h"
    std::basic_ifstream<native_char> fin("project.file");
    std::basic_string<native_char> str;
    fin.imbue(std::locale(fin.getloc(), new utf8_to_native_codecvt()));
    getline(fin, str);
    fs::ifstream fin2(project_path/str);

    // Option b)
    #include "codecvt_implementation.h"
    std::ifstream fin("project.file");
    std::string str;
    getline(fin, str);
    fs::ifstream fin2(fs::path(project_path).append(str, utf8_to_native_codecvt()));

    // Option c) How it could be done
    std::ifstream fin("project.file");
    std::string str;
    getline(fin, str);
    fs::ifstream fin2(project_path/str);

"Use boost::filesystem::imbue to turn (b) into (c)," you say. But who is responsible for calling imbue()? I'm writing library code; I'm not allowed to change global state.
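For option (c) to work, the path class itself would have to perform the narrow-to-wide conversion on Windows. A minimal sketch of the conversion the library would hide (the function name utf8_to_utf32 is hypothetical, and real code would also reject overlong sequences and surrogates):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// A path class following the UTF-8 convention would do something like this
// internally on Windows, and pass the bytes through untouched on POSIX.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }  // ASCII fast path
        else if (c < 0xC0) throw std::runtime_error("unexpected continuation byte");
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t j = 1; j < len; ++j)  // fold in continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Note the ASCII fast path: for the common all-ASCII case the conversion is a straight copy, which is part of why the "UTF-8 is slow" objection below doesn't hold up.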
"This code will break," you say:

    int main(int argc, char* argv[]) {
        fs::ifstream fin(argv[1]);
    }

It works fine for ASCII characters on all sane platforms. For non-ASCII, I don't care: it's already not Unicode-aware if the native encoding is not UTF-8 (which it can't be on Windows). If the writer of this code really cares about internationalization, she can use boost::program_options (assuming it's also changed to follow the UTF-8 convention). Otherwise she's a hypocrite.

"UTF-8 is slow," you say. Compared to what? You haven't measured this. On Windows we must do complicated operations on paths before we pass them to the OS anyway (system_complete, prepending \\?\). Most strings contain mostly ASCII characters, so std::strings take less memory, which decreases cache thrashing in other parts of the program. On other OSes the encoding is already almost always UTF-8. Experience shows that the small overhead (if it's an overhead at all) is not the bottleneck.

Many cross-platform libraries have already switched to UTF-8 for narrow chars (see one of the previous discussions for a list), and I don't see a reason why Boost can't be next.

-- Yakov
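P.S. On the memory point above, a quick illustrative check of per-character storage (the helper names are mine, not from any library):

```cpp
#include <cstddef>
#include <string>

// Bytes occupied by the character data of an ASCII-only path stored as a
// narrow (UTF-8) string versus a wide string. sizeof(wchar_t) is 2 on
// Windows and typically 4 on Linux, so the wide copy is at least twice as big.
std::size_t narrow_bytes(const std::string& s)  { return s.size() * sizeof(char); }
std::size_t wide_bytes(const std::wstring& s)   { return s.size() * sizeof(wchar_t); }
```

For a typical mostly-ASCII path this is a 2x difference on Windows and usually 4x on Linux, before counting allocator overhead.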