
Hi, I am sorry for the length of this e-mail; I hope I make myself clearer than last time.
[mailto:boost-bounces@lists.boost.org] On Behalf Of Bennett, Patrick
On Behalf Of Ferdinand Prantl
I have nothing against usage of UTF-8 if it suits the scenario well. I just say that it is not an encoding for all purposes. It is a multibyte one and so extremely inefficient for getting size, searching, etc.
[Bennett, Patrick] I fail to see how this is the case. Right now, today, filesystem supports *only* ascii. If you
IMHO that is not true. boost::filesystem neither recodes nor modifies the strings passed into its methods. It calls the underlying OS functions: the XxxA methods on Windows and the POSIX functions on UN*Xes. These work in the current locale on UN*Xes and in the current ANSI codepage on Windows. They accept all characters valid in the current codepage. If you change the codepage, you may see the filenames crippled, because the characters can be mapped differently, but that's it; it is not a problem of boost::filesystem. The current boost::filesystem does not solve this and leaves it to the programmer to convert the filenames if needed. You can use UTF-8 even today on non-Windows platforms. Unfortunately, there is no setting on Windows to enforce a UTF-8 codepage, like on UN*Xes. You set the regional settings and the codepage is assigned automatically, so the XxxA functions continue to use the ANSI encoding. Well, the speed of searching in filenames is debatable... But I like to use only one string type for everything in my application model and convert only in the rare places where contact with the OS is necessary. I like to use wstring only.
If you continued to use ASCII, nothing would change.
Yes, if I used ASCII, but I use all characters from the current locale. I write localized applications, including localized names of files. Simply throwing UTF-8 in would break them.
There is zero speed penalty for calculating the # of *bytes* in an utf-8 string. If you want to determine the # of characters, then there is, but then only if you're actually working on a Unicode string. There is no getting around this for an international application, no matter what encoding is used.
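To make the distinction between bytes and characters concrete, here is a minimal sketch. The helper names byte_length and code_point_count are made up for this illustration, and the input is assumed to be well-formed UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Number of bytes: constant time, independent of the encoding.
std::size_t byte_length(const std::string& s) {
    return s.size();
}

// Number of code points in a UTF-8 string: linear time.
// Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
// starts a new code point, so counting those bytes is enough.
// Assumption: the input is well-formed UTF-8.
std::size_t code_point_count(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a pure ASCII name both functions return the same number. For a name such as "čaj" encoded in UTF-8 (bytes C4 8D 61 6A) the byte length is 4 while the code point count is 3, and the second number can only be found by scanning the string.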
It is not predictable whether a string contains some non-ASCII characters or not. Use a different, more suitable Unicode encoding for that. If you have to declare the complexity of your algorithms and the application is time-critical, the multibyte (UTF) encodings are not used, but the fixed-width (UCS) ones. UTF encodings are recommended for transport because they do not waste space, UCS for runtime because they guarantee constant complexity of operations. But it is not a dogma; UTF encodings are not "forbidden" at runtime, it depends on the requirements of an application. Windows (XxxW) and Java use UCS-2 for the Unicode character representation in their APIs, for example.
Why prescribe it for all boost::filesystem users and force them to put recoding into their sources,
[Bennett, Patrick] Absolutely no recoding would be necessary for current users of boost::filesystem. boost::filesystem has no support for unicode today, so why would they have to recode anything?
As I said, absolutely no recoding is necessary now if you use boost::filesystem with strings in a native codepage, for example an East European or Cyrillic one. Everything works well because you deliver your application localized in these countries. If boost::filesystem started accepting only UTF-8, such an application would be broken. An application which does not use Unicode (on Windows it does not have to, and it can still run well on Japanese Windows) would have to recode from the local codepage, in which it works, into UTF-8, which boost::filesystem would accept. And I did not mention that Windows 9x does not implement all XxxW methods (an additional library must be used to compile an application) and that the recoding methods on Windows 95 do not support UTF-8 as a codepage, as those on Windows 98/NT do. It would mean a different solution for Windows 9x within boost::filesystem.
UTF-8 is not identical with the complete iso-8859-1 (latin1) codepage. Some code could be broken by accepting UTF-8 in the new version.
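A small sketch of why the two encodings differ above 0x7F: every ISO-8859-1 byte maps to the Unicode code point with the same value, but UTF-8 needs two bytes for the range 0x80..0xFF. The function name latin1_to_utf8 is invented for this illustration and is not part of boost::filesystem.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8.
// Bytes below 0x80 are plain ASCII and are copied unchanged.
// Bytes 0x80..0xFF become a two-byte sequence 110000xx 10xxxxxx.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The latin1 name "café" is the bytes 63 61 66 E9; after conversion it is 63 61 66 C3 A9. Code that passes the original four bytes to an interface expecting UTF-8 hands over an ill-formed sequence, because E9 announces a three-byte character that never follows.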
[Bennett, Patrick] Hmmm, good point, but... would it break for any of the characters that are valid characters for a path or filename on an 8859-1 system? No, not that I can think of.
Again, all characters in the current codepage are valid for the XxxA methods, and all characters in the current locale for the POSIX methods. Except for wildcards and path delimiters, of course. So it is a change of interface breaking all usage scenarios.
[Bennett, Patrick] If you can think of a good way of handling this that doesn't involve a mess of codepages, locales, and facets, then I'm all for it. Frankly I think C++'s 'built-in' internationalization support is a nightmare, but that's probably just me. My (intentional) limited exposure to them probably hasn't helped. It's hard to beat having a 'single' encoding like UTF-8 that can handle all defined characters. Unicode is definitely the way to go IMO.
Unicode is the ideal way; however, not all systems support it or are configured to use it. If you force its support in boost::filesystem, you would force Unicode on all applications using it which do not use it now. Currently boost::filesystem does not force it, but allows it with the following setup:

A local encoding on UNIX: no problem
  OS: configure for example cs_CZ environment
  boost::filesystem: use cs_CZ strings

UTF-8 on UNIX: no problem
  OS: configure UTF-8 environment
  boost::filesystem: use UTF-8 strings

A local encoding on Windows: no problem
  System: configure for example Czech environment
  boost::filesystem: use czech strings

Windows has no predefined regional settings with UTF-8, but always uses an ANSI codepage. Unicode support is additional and preferred for NT. If you wanted UTF-8 in the boost::filesystem API, you would have to:

1) either use XxxW methods in boost::filesystem and convert from UTF-8 to UCS-2
2) or recode the input UTF-8 strings into the local ANSI codepage within boost::filesystem to be able to use XxxA methods

ad 1) it needs a recoding engine (built into Windows, for UTF-8 from 98/NT)
ad 2) it will work for names from the current codepage only (UTF-8 recoding is available from 98/NT)
My real issue with boost::filesystem is that as currently defined, it's unusable in an application that will be used around the world. My initial response to this whole thread was just to point out to David that there *are* issues preventing people from using the library. He didn't think there were any, so I was compelled to point out what one of the issues was for me at least.
Could you specify exactly what prevents you from using it? I can use boost::filesystem in czech and russian environments happily, and for japanese it is the same as long as you stay within the local codepage. boost::filesystem is perfectly usable in any localized application whose instance runs in a single language environment. The application itself can be installed in any language environment, however. The only thing you have to keep in mind is that the strings accepted by boost::filesystem are in the current locale/ANSI codepage. The problem is somewhere else and does not concern only boost::filesystem but the whole locale support by OSes.

Just to clarify: a localized application is an application which can run in a localized environment, has a localized interface and accepts all arguments (here: filenames) from the localized environment. boost::filesystem fulfills this for any existing locale, including japanese, without recompiling (single binary, localized resources).

Nowadays comes another requirement - sharing applications and their data across more locales (locale switching, data distribution) without losing information (here: crippled characters in filenames). To make myself clear - to use more locales together in a single application, for example czech, russian and japanese together. Unfortunately, this is not possible to achieve using POSIX methods (libc, libstdc++/STL), because they use the current locale, which need not be set to Unicode (UTF-8) but may be any ANSI codepage. AFAIK only Windows implements a C/C++ runtime with native Unicode support, using UCS-2.

It is a question whether boost::filesystem should help with this. Currently it behaves the same way as POSIX, libc and libstdc++/STL: it works in the current locale. I think that the ANSI-C committee should have updated the library specification long ago to contain wchar_t methods, btw... It could have a positive influence on std::streams (wchar_t* methods) and boost::filesystem as well.
In such an ideal world you would have no problems :-) You are right in these points:

1) you cannot write an application which has characters from more codepages (locales) in a filename
2) if characters in two codepages are mapped differently, filenames would be incompatible, although the characters are valid in both encodings

Currently you are not able to write applications for Windows which require usage of the Unicode API (XxxW functions) - the two points above. It is not because boost::filesystem does not use UTF-8. It is because: ANSI-C does not use Unicode -> ANSI-C++ (STL) does not use Unicode -> boost::filesystem does not use Unicode.

My opinion: the way to solve it is not to change the encoding of the std::string and char* parameters of the current boost::filesystem to UTF-8 but to follow the STL way - to introduce methods with std::wstring and wchar_t or, better, to templatize it with basic_string<CharType>. It is difficult because wchar_t is not in STL and libc, I know.
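To illustrate what templatizing on the character type could look like, here is a minimal sketch. The function leaf below is a hypothetical free function written for this e-mail, not an existing boost::filesystem interface, and it assumes '/' as the only path separator to stay short.

```cpp
#include <string>

// Hypothetical helper: returns the last element of a path.
// The same source serves std::string (current locale / ANSI codepage)
// and std::wstring (wide characters) without any recoding.
// Assumption: '/' is the only separator.
template <typename CharT>
std::basic_string<CharT> leaf(const std::basic_string<CharT>& path) {
    typedef std::basic_string<CharT> string_type;
    const typename string_type::size_type pos =
        path.find_last_of(CharT('/'));
    if (pos == string_type::npos) {
        return path;  // no separator: the whole string is the leaf
    }
    return path.substr(pos + 1);
}
```

Called as leaf(std::string("/home/ferda/file.txt")) it works in the narrow encoding as today; called as leaf(std::wstring(L"/home/ferda/file.txt")) it works on wide characters. Existing callers keep their strings untouched, and only the layer that finally talks to the OS would have to choose between the XxxA and XxxW functions.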
At the company where I work we're currently just pursuing our own wrappers for what filesystem provides. I originally tried using filesystem, but once I saw that its handling of internationalization was absent, I had no choice but to dump it. I certainly have an interest in it being improved, and I could see looking at it again, but someone will have to spearhead that initiative. Considering that this hasn't really been brought up before tells me that people either aren't using the library, or simply don't care about internationalization (probably the latter). I, unfortunately, don't have that luxury.
I am from an Eastern European country and you bet we are concerned with internationalization. However, no change comes without a discussion; you are not alone in using the library, and your suggested change would cause difficulties for others (at least for me ;-). That nobody has had the problem could also mean that there is a usage style which works. This I pointed out above. Besides, I don't have the luxury of supporting only platforms with full support of the XxxW methods and UTF-8 support in WideCharToMultiByte. That's why I would like to extend boost::filesystem so that it keeps the current behavior (work in the current locale) but performs recoding for Unicode applications, which does not need to happen if the OS supports Unicode natively. That's why I pointed out the imbue principle used in std::streams for a similar case - the application can work in Unicode (UCS-2 wchar_t) and the content of the stream can be in any encoding (including UCS-2 wchar_t for no recoding).

Ferda
Cheers... Patrick Bennett