[filesystem and beyond] Narrow strings be UTF-8

On Wed, Oct 26, 2011 at 22:13, Beman Dawes <bdawes@acm.org> wrote:
(2) V3 may work OK with the Microsoft 65001 UTF-8 codepage, although I've never used it myself and you would have to pass in a UTF-8 encoded narrow character name.
If it had been possible, it would have simplified everything. Unfortunately, you cannot set a UTF-8 codepage for the Windows API functions (you can for the console, though). Microsoft isn't interested in making portability easier, so I don't see them adding that support in the near future. But they could. UTF-8 should be considered the only default narrow encoding on Windows, because all other ANSI encodings are not Unicode-aware.

On Wed, Oct 26, 2011 at 22:47, Beman Dawes <bdawes@acm.org> wrote:
On Wed, Oct 26, 2011 at 6:24 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
[...]
Even if you fix the Unicode problems,
What Unicode problems are you running into? Although there are some locale related tickets outstanding, I'm not aware of any Unicode issues.
1) The one that was brought up in the previous thread.

2) The complexity of writing portable Unicode-aware code: currently you're forcing me to either a) use wstring on Windows, or, if I prefer to use my favorite portable UTF-8 encoded strings, b) write all the boilerplate code that passes a codecvt everywhere as a parameter (see below for why not imbue()). In both cases you're shifting the complexity to the higher-level code. It's not a kind thing for you as a low-level library developer to do; the library is expected to hide the platform differences by providing a uniform interface.

Myth: Using the native encoding on each platform results in portable code.
Rebuttal: In some definition of 'portable', definitely yes. But not when things are shared among different platforms. It starts with files transferred between different systems and ends with the source code itself (there is a difference between "" and L""). Uniformity == simplicity.

Consider a simple case: loading a path from some project file and then opening the referenced file. The project file is encoded in UTF-8, making it portable among all systems with CHAR_BIT == 8.

// Option a) read with a native-character stream, imbuing a UTF-8 codecvt
#include "codecvt_implementation.h"
std::basic_ifstream<native_char> fin;
fin.imbue(std::locale(fin.getloc(), new utf8_to_native_codecvt())); // before open()
fin.open("project.file");
std::basic_string<native_char> str;
std::getline(fin, str);
fs::ifstream fin2(project_path / str);

// Option b) read narrow UTF-8, convert explicitly at the path boundary
#include "codecvt_implementation.h"
std::ifstream fin("project.file");
std::string str;
std::getline(fin, str);
fs::ifstream fin2(fs::path(project_path).append(str, utf8_to_native_codecvt()));

// Option c) how it could be done if narrow strings were UTF-8 by default
std::ifstream fin("project.file");
std::string str;
std::getline(fin, str);
fs::ifstream fin2(project_path / str);

Objection: "Use boost::filesystem::imbue to convert b to c."
Rebuttal: Who is responsible for calling imbue()? I'm writing library code; I'm not allowed to change global state.

Objection: "This code will break: int main(int argc, char* argv[]) { fs::ifstream fin(argv[1]); }"
Rebuttal: It works fine for ASCII characters on all sane platforms. For non-ASCII, I don't care: it's already not Unicode-aware if the native encoding is not UTF-8 (which it cannot be on Windows). If the writer of this code really cares about internationalization, she can use boost::program_options (assuming it's also changed to follow the UTF-8 convention). Otherwise she's a hypocrite.

Objection: "UTF-8 is slow."
Rebuttal: Compared to what? You haven't measured this. On Windows we must do complicated operations on paths before we pass them to the OS anyway (system_complete, prepending \\?\). Most of the strings contain only ASCII characters, so std::strings take less memory, decreasing cache thrashing in other parts of the program. On other OSes the encoding is already almost always UTF-8.

Experience shows that the small overhead (if it's an overhead at all) is not the bottleneck. Many cross-platform libraries have already switched to UTF-8 for narrow chars (see one of the previous discussions for a list), and I don't see a reason why Boost can't be next.

-- Yakov
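(A side note on the Windows-specific point above: since the narrow Win32 entry points cannot be switched to the 65001 UTF-8 codepage, a program that keeps UTF-8 in std::string has to widen strings at the API boundary. Below is a minimal sketch of such a shim; the helper name widen_utf8 is illustrative and not something from the thread or from Boost.)

#include <windows.h>
#include <string>

// Interpret a std::string as UTF-8 and produce the UTF-16 string that the
// wide Win32 APIs expect. Error handling omitted for brevity.
std::wstring widen_utf8(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    static_cast<int>(utf8.size()), NULL, 0);
    std::wstring wide(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                          static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

// Usage: ::CreateFileW(widen_utf8(utf8_path).c_str(), GENERIC_READ, ...);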

On Thu, Oct 27, 2011 at 3:31 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Wed, Oct 26, 2011 at 22:13, Beman Dawes <bdawes@acm.org> wrote:
On Wed, Oct 26, 2011 at 6:24 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
[...]
Even if you fix the Unicode problems,
What Unicode problems are you running into? Although there are some locale related tickets outstanding, I'm not aware of any Unicode issues.
1) The one that was brought up in the previous thread.
That resulted in a ticket being opened, to fix a problem specific to MinGW. It will get fixed as time permits.
2) The complexity of writing portable Unicode-aware code: currently you're forcing me to either a) use wstring on Windows, or, if I prefer to use my favorite portable UTF-8 encoded strings, b) write all the boilerplate code that passes a codecvt everywhere as a parameter (see below for why not imbue()).
In both cases you're shifting the complexity to the higher-level code. It's not a kind thing for you as a low-level library developer to do,
I don't know of any other viable approaches. I'm sorry you find the boilerplate objectionable, but I'm not about to change to a default that would enforce UTF-8 or any other particular narrow string encoding. That's a much wider problem than Boost.Filesystem.
The library is expected to hide the platform differences by providing a uniform interface.
Initially the plan was to provide an interface that was uniform in both syntax and semantics. User reaction to the uniform syntax was very positive, and I've tried to provide that to the maximum extent possible as far as the API goes. Uniform semantics turned out to be much more complex; paths are one of the areas where acknowledging the difference between generic paths and native paths is something that users want and need.
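(To make the generic-versus-native distinction concrete, here is a small sketch, not from the original post, of the two views that Boost.Filesystem v3 exposes on the same path object; the output comments assume Windows.)

#include <boost/filesystem.hpp>
#include <iostream>

int main()
{
    // A path as it might arrive from the OS, using the Windows preferred separator.
    boost::filesystem::path p(L"project.dir\\project.file");

    std::cout << p.generic_string() << '\n'; // project.dir/project.file  (portable spelling)
    std::cout << p.string() << '\n';         // project.dir\project.file  (native spelling)

    // p.native() is std::wstring on Windows and std::string on POSIX, which is
    // the semantic difference that a uniform syntax alone cannot hide.
}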
Myth: Using the native encoding on each platform results in portable code.
Hum... I don't recall anyone ever claiming that "native encoding on each platform results in portable code". It is way more complex than that.
Objection: "Use boost::filesystem::imbue to convert b to c."
Rebuttal: Who is responsible for calling imbue()? I'm writing library code; I'm not allowed to change global state.
Right. Library code writers have to avoid changing global state if they want to keep users happy. Nothing unusual about Boost.Filesystem in that respect.
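(For readers following the thread: this is roughly what the imbue route looks like when the application, rather than a library, makes the decision. A sketch against the Boost.Filesystem v3 API; the exact header location of the UTF-8 facet varies between Boost versions.)

#include <boost/filesystem.hpp>
#include <boost/filesystem/detail/utf8_codecvt_facet.hpp>

int main()
{
    // Process-wide setting: from here on, narrow strings handed to fs::path
    // are interpreted as UTF-8. Because it is global state, library code
    // cannot reasonably make this call on the application's behalf.
    std::locale utf8_locale(std::locale(),
                            new boost::filesystem::detail::utf8_codecvt_facet);
    boost::filesystem::path::imbue(utf8_locale);

    // Option (c) from the earlier message now works as written. A library
    // that must not touch global state is left with option (b), passing a
    // codecvt explicitly, e.g. fs::path(str, cvt).
}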
Objection: "This code will break: int main(int argc, char* argv[]) { fs::ifstream fin(argv[1]); }"
Rebuttal: It works fine for ASCII characters on all sane platforms. For non-ASCII, I don't care: it's already not Unicode-aware if the native encoding is not UTF-8 (which it cannot be on Windows). If the writer of this code really cares about internationalization, she can use boost::program_options (assuming it's also changed to follow the UTF-8 convention). Otherwise she's a hypocrite.
I disagree with your assertion that the above code will break. It is in all essential aspects the same as: int main(int argc, char* argv[]) { std::ifstream fin(argv[1]); } While the results may not be what the coder expected, that's an issue well beyond the scope of the standard library or Boost.Filesystem.
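(Both snippets share the same underlying limitation on Windows: argv[] arrives in the ANSI codepage, so non-ASCII arguments may already be damaged before either library sees them. The usual workaround, sketched below and not taken from the thread, is to ask for the wide command line, regardless of which narrow-string convention the program then adopts.)

#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW; link with shell32

int main()
{
    int argc = 0;
    LPWSTR* wargv = ::CommandLineToArgvW(::GetCommandLineW(), &argc);
    if (wargv != NULL)
    {
        // wargv[0..argc-1] are UTF-16; convert them to the program's narrow
        // convention (for example with a UTF-8 narrowing helper) or use them
        // with the wide-character APIs directly.
        ::LocalFree(wargv);
    }
}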
Objection: "UTF-8 is slow."
Rebuttal: Compared to what? You haven't measured this.
Actually, I have measured it many times, and never found UTF-8 to be a bottleneck on European, North American, or South American data sets. I haven't measured UTF-8 with Asian data sets; they tended to use other encodings.
Experience shows that the small overhead (if it's an overhead at all) is not the bottleneck. Many cross-platform libraries have already switched to UTF-8 for narrow chars (see one of the previous discussions for a list), and I don't see a reason why Boost can't be next.
Boost libraries could use UTF-8 for narrow characters as a default, but only where they aren't interfacing with existing code and/or operating systems.

--Beman
participants (2)
- Beman Dawes
- Yakov Galka