[filesystem] Extracting path as string from wpath

How do I convert a boost::filesystem::wpath to a known encoding in the native filesystem's path representation? Specifically, I want to convert a boost::filesystem::wpath to a third-party string type (CFString from Carbon). To do that I must know the encoding of wpath.string() or wpath.external_file_string() on Mac, which is "implementation defined". Interfacing boost::filesystem::wpath with third-party APIs seems like a common problem.
Best Regards, Johan Torp www.johantorp.com
-- View this message in context: http://www.nabble.com/-filesystem-Extracting-path-as-string-from-wpath-tp200... Sent from the Boost - Dev mailing list archive at Nabble.com.

Johan Torp wrote:
How do I convert a boost::filesystem::wpath to a known encoding in the native filesystem's path representation?
Specifically I want to convert a boost::filesystem::wpath to a third party string (CFString from Carbon). In order to do that I must know the encoding of wpath.string() or wpath.external_file_string() on Mac, which is "implementation defined".
Interfacing boost::filesystem::wpath with 3rd party APIs seems like a common problem.
After reading CFString's reference documentation for the 11th time, I found CFStringCreateWithFileSystemRepresentation. It was hidden away at the bottom of the page, far away from all the other CFStringCreateWithXXX functions. *mumble*, *mumble*... I assume that wpath::external_file_string() returns a string containing the POSIX file system representation? It would be nice to get some guarantees on what wpath::external_file_string() returns for each supported platform. Thoughts?
Best Regards, Johan Torp
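For readers following along, the call mentioned above might be used roughly as follows on Mac OS X. This is only a sketch under stated assumptions: the helper name make_cfstring_from_fs_path is hypothetical, and it assumes the caller already holds the narrow file-system representation of the path as a std::string. The preprocessor guard keeps the snippet inert on other platforms.

```cpp
#ifdef __APPLE__
#include <CoreFoundation/CoreFoundation.h>
#include <string>

// Hypothetical helper (not part of Boost or CoreFoundation): wraps the
// file-system bytes of a path in a CFString.
// May return NULL on failure. The caller owns the result and must
// CFRelease() it, following CoreFoundation's "Create" ownership rule.
inline CFStringRef make_cfstring_from_fs_path(const std::string& fs_bytes) {
    return CFStringCreateWithFileSystemRepresentation(kCFAllocatorDefault,
                                                      fs_bytes.c_str());
}
#endif
```

A caller would pass the narrow path string, check the result for NULL, and release it with CFRelease() when done.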

On Fri, Oct 17, 2008 at 7:17 AM, Johan Torp <johan.torp@gmail.com> wrote:
How do I convert a boost::filesystem::wpath to a known encoding in the native filesystem's path representation?
Specifically I want to convert a boost::filesystem::wpath to a third party string (CFString from Carbon). In order to do that I must know the encoding of wpath.string() or wpath.external_file_string() on Mac, which is "implementation defined".
wpath.external_file_string() returns the type and encoding required by the platform. For Mac OS X, the type is std::string, and the Apple web site says the encoding is Unicode. I assume that means UTF-8, although I couldn't find that stated explicitly.
Interfacing boost::filesystem::wpath with 3rd party APIs seems like a common problem.
Sure. That's why external_file_string is exposed. Internationalization should be a bit easier with Boost.Filesystem Version 3 that I'm working on, because there is a single path type. --Beman
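To make concrete what "a std::string encoded as UTF-8" means for wide-character input, here is a minimal sketch. It is not Boost.Filesystem's conversion code; the function name to_utf8 is hypothetical, and the input is assumed to be a sequence of 32-bit Unicode code points.

```cpp
#include <string>

// Hypothetical helper for illustration: encodes Unicode code points as UTF-8.
// Surrogates and values above U+10FFFF are replaced with U+FFFD.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) c = 0xFFFD;
        if (c < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(c);
        } else if (c < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

For example, to_utf8(U"\u00E9") yields the two bytes C3 A9. Whether the platform conversion used by the library behaves exactly like this is not something the thread establishes.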

On Fri, Oct 17, 2008 at 7:10 AM, Beman Dawes <bdawes@acm.org> wrote:
wpath.external_file_string() returns the type and encoding required by the platform. For Mac OS X, the type is std::string, and the Apple web site says the encoding is Unicode. I assume that means UTF-8, although I couldn't find that stated explicitly.
"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8." This means that precomposed characters are forbidden and combining diacritics must be used to replace them. See http://developer.apple.com/qa/qa2001/qa1173.html. Emil Dotchevski Reverge Studios, Inc. http://www.revergestudios.com/reblog/index.php?n=ReCode

On Friday 17 October 2008 17:45:28 Emil Dotchevski wrote:
"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8."
This means that precomposed characters are forbidden and combining diacritics must be used to replace them.
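As a small illustration of the precomposed/decomposed distinction quoted above (a sketch; the function names are made up for this example), the same visible character "é" has two different UTF-8 byte sequences:

```cpp
#include <string>

// "é" as a single precomposed code point, U+00E9, in UTF-8 (2 bytes).
std::string precomposed_e_acute() { return "\xC3\xA9"; }

// "é" canonically decomposed: 'e' (U+0065) followed by the combining
// acute accent U+0301, in UTF-8 (3 bytes).
std::string decomposed_e_acute() { return "e\xCC\x81"; }

// A plain std::string comparison is byte-wise, so the two forms differ
// even though they render identically.
bool same_bytes(const std::string& a, const std::string& b) { return a == b; }
```

Here same_bytes(precomposed_e_acute(), decomposed_e_acute()) is false, which is why a name handed to a normalizing filesystem may come back with different bytes.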
Danger: read the whole document! The point is that nothing guarantees this encoding; it is by no means enforced by the OS. So, in order to be able to use non-compliant media (e.g. ones with codepage encodings, possibly even unknown codepage encodings), you have to treat the strings received from the filesystem as byte strings. The only things you can rely on are:
- Termination with a null byte.
- Segments separated with a path separator (i.e. '/').
Otherwise, converting it to a text string is a lossy conversion because of the unreliable encoding (though assuming UTF-8 as a default works). Similarly, encoding to a byte string isn't reliable, because the encoding of the filesystem isn't guaranteed.
BTW:
- A similar discussion took place on the Python developers' mailing list. The current state seems to be to implement both a Unicode API and one using byte strings in parallel, though I'm not advocating that approach.
- The same problem is present on all POSIX systems (BSDs, Linux, ...), though there you don't have the UTF-8 default but rather the encoding of the CTYPE locale.
- On modern MS Windows platforms, the system actually claims to guarantee UTF-16. Non-decodable media are supposedly simply rejected, but I can't say this works for sure.
Uli
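The "treat it as a byte string" advice can be sketched as follows. This is an illustrative helper, not code from Boost.Filesystem: the name split_path_bytes is hypothetical, and it relies only on the two guarantees listed above (null termination and '/' as the separator), never interpreting the bytes as text.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a null-terminated POSIX path into segments
// on the '/' byte. Segment bytes are copied verbatim, whatever their
// encoding. Empty segments (leading or repeated '/') are skipped, so the
// root directory itself is not represented in the result.
std::vector<std::string> split_path_bytes(const char* path) {
    std::vector<std::string> segments;
    std::string current;
    for (const char* p = path; *p != '\0'; ++p) {
        if (*p == '/') {
            if (!current.empty()) segments.push_back(current);
            current.clear();
        } else {
            current += *p;
        }
    }
    if (!current.empty()) segments.push_back(current);
    return segments;
}
```

A name such as "caf\xE9" (Latin-1, not valid UTF-8) passes through unchanged, which is exactly what a text-based API could not promise.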

On Sun, Oct 19, 2008 at 5:17 AM, Ulrich Eckhardt <doomster@knuut.de> wrote:
On Friday 17 October 2008 17:45:28 Emil Dotchevski wrote:
"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8."
This means that precomposed characters are forbidden and combining diacritics must be used to replace them.
Danger: read the whole document! The point is that nothing guarantees this encoding; it is by no means enforced by the OS. So, in order to be able to use non-compliant media (e.g. ones with codepage encodings, possibly even unknown codepage encodings), you have to treat the strings received from the filesystem as byte strings. The only things you can rely on are:
- Termination with a null byte.
- Segments separated with a path separator (i.e. '/').
Otherwise, converting it to a text string is a lossy conversion because of the unreliable encoding (though assuming UTF-8 as a default works). Similarly, encoding to a byte string isn't reliable, because the encoding of the filesystem isn't guaranteed.
BTW:
- A similar discussion took place on the Python developers' mailing list. The current state seems to be to implement both a Unicode API and one using byte strings in parallel, though I'm not advocating that approach.
- The same problem is present on all POSIX systems (BSDs, Linux, ...), though there you don't have the UTF-8 default but rather the encoding of the CTYPE locale.
Yes. The situation on POSIX systems is quite messy. I've been discussing it with the POSIX folks, and get conflicting answers depending on the example presented. Part of the problem is that documented behavior of the POSIX command line utilities is different from the program API behavior. Also, real-world behavior sometimes seems different from POSIX specifications. Sigh. I'd really like to be put in contact with someone who has access to and is familiar with POSIX variants used in Asia. --Beman

Beman Dawes:
Yes. The situation on POSIX systems is quite messy.
I'm not sure that it is as messy as usually cited. There are basically two cases:
1. The filesystem is "8 bit neutral", that is, it stores the NTBS (null-terminated byte string) that is passed exactly as-is (and returns it unmodified);
2. The filesystem uses UTF-16 (NTFS and HFS+). In this case, the OS translates the NTBS to UTF-16 for storage (using the system codepage on Windows, UTF-8 on Mac OS X, and the codepage specified at mount time on Linux) and translates the UTF-16 name from the FS back when returning it to the application. Note that the roundtrip on HFS+ may not produce the original NTBS even for valid UTF-8 inputs because of the Unicode normalization that occurs (but it does produce the original string, as read by the user).
Most of the perceived complexity comes from the fact that people living in the (1) world can't comprehend that non-neutral filesystems exist and expect to be able to (a) pass arbitrary byte strings to the OS and (b) get them back. This leads to other mistaken beliefs, such as that the user can choose the encoding of the input.
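A rough sketch of the translation step in case 2, assuming UTF-8 as the input encoding (as described for Mac OS X). This is not the operating system's code; utf8_to_utf16 is a hypothetical name, and Unicode normalization is deliberately left out. It shows why an arbitrary byte string cannot always be stored: the conversion simply fails for input that is not well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 bytes to UTF-16 code units.
// Returns false if `in` is not well-formed UTF-8; `out` may then hold a
// partial result and should be ignored.
bool utf8_to_utf16(const std::string& in, std::u16string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        if (lead < 0x80)                       { cp = lead;        extra = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; }
        else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte

        if (in.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;               // overlong
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;           // surrogate

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {  // encode as a surrogate pair
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 + (v >> 10));
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        }
        i += extra + 1;
    }
    return true;
}
```

With this sketch, the UTF-8 bytes for "café" convert cleanly, while the Latin-1 bytes "caf\xE9" are rejected, mirroring the point that a non-neutral filesystem cannot accept every NTBS.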

Beman Dawes wrote:
How do I convert a boost::filesystem::wpath to a known encoding in the native filesystem's path representation?
Specifically I want to convert a boost::filesystem::wpath to a third party string (CFString from Carbon). In order to do that I must know the encoding of wpath.string() or wpath.external_file_string() on Mac, which is "implementation defined".
wpath.external_file_string() returns the type and encoding required by the platform. For Mac OS X, the type is std::string, and the Apple web site says the encoding is Unicode. I assume that means UTF-8, although I couldn't find that stated explicitly.
Interfacing boost::filesystem::wpath with 3rd party APIs seems like a common problem.
Sure. That's why external_file_string is exposed.
Internationalization should be a bit easier with Boost.Filesystem Version 3 that I'm working on, because there is a single path type.
Sounds interesting, looking forward to it! For V3, I would appreciate comprehensive documentation on the format of external_file_string() for each supported platform. Thanks for an excellent library.
Johan

On Sat, Oct 18, 2008 at 6:51 AM, Johan Torp <johan.torp@gmail.com> wrote:
Beman Dawes wrote:
How do I convert a boost::filesystem::wpath to a known encoding in the native filesystem's path representation?
Specifically I want to convert a boost::filesystem::wpath to a third party string (CFString from Carbon). In order to do that I must know the encoding of wpath.string() or wpath.external_file_string() on Mac, which is "implementation defined".
wpath.external_file_string() returns the type and encoding required by the platform. For Mac OS X, the type is std::string, and the Apple web site says the encoding is Unicode. I assume that means UTF-8, although I couldn't find that stated explicitly.
Interfacing boost::filesystem::wpath with 3rd party APIs seems like a common problem.
Sure. That's why external_file_string is exposed.
Internationalization should be a bit easier with Boost.Filesystem Version 3 that I'm working on, because there is a single path type.
Sounds interesting, looking forward to it! For V3, I would appreciate comprehensive documentation on the format of external_file_string() for each supported platform.
The problem is that the exact format is determined not by the filesystem library, but by a conversion function provided by the operating system. These conversion functions (or the underlying locale facet) do not always provide detailed documentation.
Thanks for an excellent library.
Thanks for the feedback! --Beman

Beman Dawes wrote:
For V3, I would appreciate comprehensive documentation on the format of external_file_string() for each supported platform.
The problem is that the exact format is determined not by the filesystem library, but by a conversion function provided by the operating system. These conversion functions (or the underlying locale facet) do not always provide detailed documentation.
A documented promise that Boost.Filesystem will use a specific conversion function, along with pointers to the existing documentation of those functions (however poor it might be), would be valuable. Looking at the implementation is doable, but you don't get any promises of stability when upgrading to newer Boost versions. Also, many people download pre-built binaries from boost-pro, and those people would have to find the source somehow.
Cheers, Johan www.johantorp.com

participants (5)
- Beman Dawes
- Emil Dotchevski
- Johan Torp
- Peter Dimov
- Ulrich Eckhardt