[filesystem] Mac OS default codecvt facet

On Mac OS, Boost.Filesystem currently uses std::locale() as the default, since Apple doesn't see fit to support std::locale("").

There has been a request (ticket #3928) to change the codecvt facet to UTF-8 for the Boost.Filesystem default on Mac OS. This would at least potentially be a breaking change, so I wanted to ask others' opinions before making it.

Questions:

* Is UTF-8 OK with Mac OS users as the Boost.Filesystem default?
* Would Linux users also prefer UTF-8 as the Boost.Filesystem default?

Thanks,

--Beman

Beman Dawes wrote:
* Is UTF-8 OK with Mac OS users as the Boost.Filesystem default?
UTF-8 is not merely a default on Mac OS X. It's _the_ encoding used by the OS. You can't use any other encoding and expect things to work (if you leave ASCII land). At least as far as I know since I've never used a Mac. :-)

On Sun, Feb 14, 2010 at 3:53 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Beman Dawes wrote:
* Is UTF-8 OK with Mac OS users as the Boost.Filesystem default?
UTF-8 is not merely a default on Mac OS X. It's _the_ encoding used by the OS.
Do you have a link for that? I spent a while searching the Apple site, but didn't come up with a direct description of how a narrow character file name is translated into a file system path. The actual on-disk format is apparently UTF-16.
You can't use any other encoding and expect things to work (if you leave ASCII land).
I'm less than 100% sure of that. I see discussions indicating locale can have an impact.
At least as far as I know since I've never used a Mac. :-)
I've got one of those cute little Mac minis, but it's in Virginia and I'm in Florida at the moment.

Thanks,

--Beman

Beman Dawes wrote:
On Sun, Feb 14, 2010 at 3:53 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Beman Dawes wrote:
* Is UTF-8 OK with Mac OS users as the Boost.Filesystem default?
UTF-8 is not merely a default on Mac OS X. It's _the_ encoding used by the OS.
Do you have a link for that?
The most authoritative one is probably

http://developer.apple.com/mac/library/documentation/MacOSX/Conceptual/BPInt...

"All BSD system functions expect their string parameters to be in UTF-8 encoding and nothing else. Code that calls BSD system routines should ensure that the contents of all const *char parameters are in canonical UTF-8 encoding. In a canonical UTF-8 string, all decomposable characters are decomposed; for example, é (0x00E9) is represented as e (0x0065) + ´ (0x0301). To put things into a canonical UTF-8 encoding, use the "file-system representation" interfaces defined in Cocoa and Carbon (including Core Foundation)."

I think that in practice the OS will take any valid UTF-8 and normalize it internally, so it's not necessary to decompose it.

http://lists.apple.com/archives/unix-porting/2007/Sep/msg00023.html

"The kernel will reject any filename that is not a valid UTF-8 string, and it will even be normalized (to Unicode NFD) before stored on disk, at least when using HFS. The right way to deal with it would be to always convert the filename to UTF-8 before trying to open/create a file."

http://lists.apple.com/archives/applescript-users/2002/Sep/msg00319.html

"How a file name looks at the API level depends on the API. Current Carbon APIs handle file names as an array of UTF-16 characters; POSIX ones handle them as an array of UTF-8, which is why UTF-8 works well in Terminal. How it's stored on disk depends on the disk format; HFS+ uses UTF-16, but that's not important in most cases."

http://developer.apple.com/mac/library/qa/qa2001/qa1173.html

"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8. This raises a number of interesting issues."
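To make the "canonical UTF-8" point from the Apple quote concrete, here is a minimal sketch that encodes the two spellings of é by hand. The helper names encode_utf8 and compare_e_acute_spellings are made up for illustration (they are not Boost or Apple APIs), and the encoder assumes its input is a valid Unicode scalar value. It only shows that the precomposed and decomposed forms are different byte strings; it does not perform normalization.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// The same visible character, spelled two ways.
void compare_e_acute_spellings() {
    // Precomposed U+00E9: bytes C3 A9.
    const std::string precomposed = encode_utf8(0x00E9);
    // Decomposed U+0065 + U+0301 (the form the quote calls canonical): bytes 65 CC 81.
    const std::string decomposed = encode_utf8(0x0065) + encode_utf8(0x0301);
    // A byte-wise comparison treats them as different names.
    assert(precomposed != decomposed);
}
```

So a program that compares file names byte by byte may see two names where the file system, after normalizing, sees one.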

On Sun, Feb 14, 2010 at 8:22 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Beman Dawes wrote:
On Sun, Feb 14, 2010 at 3:53 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Beman Dawes wrote:
* Is UTF-8 OK with Mac OS users as the Boost.Filesystem default?
UTF-8 is not merely a default on Mac OS X. It's _the_ encoding used by the OS.
Do you have a link for that?
The most authoritative one is probably
http://developer.apple.com/mac/library/documentation/MacOSX/Conceptual/BPInt...
"All BSD system functions expect their string parameters to be in UTF-8 encoding and nothing else. Code that calls BSD system routines should ensure that the contents of all const *char parameters are in canonical UTF-8 encoding. In a canonical UTF-8 string, all decomposable characters are decomposed; for example, é (0x00E9) is represented as e (0x0065) + ´ (0x0301). To put things into a canonical UTF-8 encoding, use the "file-system representation" interfaces defined in Cocoa and Carbon (including Core Foundation)."
I think that in practice the OS will take any valid UTF-8 and normalize it internally, so it's not necessary to decompose it.
http://lists.apple.com/archives/unix-porting/2007/Sep/msg00023.html
"The kernel will reject any filename that is not a valid UTF-8 string, and it will even be normalized (to Unicode NFD) before stored on disk, at least when using HFS. The right way to deal with it would be to always convert the filename to UTF-8 before trying to open/create a file."
http://lists.apple.com/archives/applescript-users/2002/Sep/msg00319.html
"How a file name looks at the API level depends on the API. Current Carbon APIs handle file names as an array of UTF-16 characters; POSIX ones handle them as an array of UTF-8, which is why UTF-8 works well in Terminal. How it's stored on disk depends on the disk format; HFS+ uses UTF-16, but that's not important in most cases."
http://developer.apple.com/mac/library/qa/qa2001/qa1173.html
"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8. This raises a number of interesting issues."
Those are great references. Many thanks! I've updated trunk accordingly, and closed #3928.

--Beman

Beman Dawes wrote:
I'm less than 100% sure of that. I see discussions indicating locale can have an impact.
Locale can have an impact in that a non-UTF-8 locale breaks programs that use the locale for filename conversions (a behavior common for programs originating on old-school, encoding-agnostic Unix).

Linux is one example of an encoding-agnostic OS, and there is no practical way for FS to do the right thing there, because the encoding can vary depending on the path (one can mount NTFS and HFS+ filesystems, and the encoding is given at mount time). In principle, a Linux-specific implementation of the FS library might be able to query the kernel about the correct encoding for a specific path element, but I doubt that anyone will bother.

Either way, once the world switches to UTF-8 everywhere things will be much simpler.
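As a rough illustration of the locale dependence described above, a program could at least check whether a POSIX-style locale name (such as the value of LANG) names a UTF-8 codeset before trusting it for filename conversions. This is only a sketch: codeset_of and names_utf8_codeset are hypothetical helpers, the name is assumed to follow the language[_territory][.codeset][@modifier] pattern, and the check says nothing about the encoding a particular mounted filesystem actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: return the codeset part of a POSIX-style locale name,
// or an empty string if the name has no ".codeset" part (e.g. "C").
std::string codeset_of(const std::string& locale_name) {
    const std::string::size_type dot = locale_name.find('.');
    if (dot == std::string::npos) return std::string();
    const std::string::size_type at = locale_name.find('@', dot);
    const std::string::size_type len =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return locale_name.substr(dot + 1, len);
}

// Hypothetical helper: true if the locale name spells a UTF-8 codeset.
// Both "UTF-8" and "utf8" spellings are accepted, case-insensitively.
bool names_utf8_codeset(const std::string& locale_name) {
    std::string cs = codeset_of(locale_name);
    std::transform(cs.begin(), cs.end(), cs.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return cs == "utf-8" || cs == "utf8";
}

void example_usage() {
    assert(names_utf8_codeset("en_US.UTF-8"));
    assert(!names_utf8_codeset("C"));
}
```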

Beman Dawes wrote:
On Mac OS, Boost.Filesystem currently uses std::locale() as the default, since Apple doesn't see fit to support std::locale("").
BTW, I'm not sure whether std::locale("") is an Apple or a libstdc++ problem. As explained here:

http://stackoverflow.com/questions/1745045/stdlocale-breakage-on-macos-10-6-...

the default LANG is en_US.UTF-8 (good) but libstdc++ doesn't support this locale. locale("") is reported to work when LANG is not set or set to C.
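For readers who haven't hit this: std::locale("") throws std::runtime_error when the library can't construct the locale named by the environment. A portable program can guard against that along the following lines. This is a sketch of the general fallback pattern using only the standard library, not a description of what Boost.Filesystem does; environment_locale_or_global is a made-up name.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>

// Hypothetical helper: prefer the user's environment locale, but fall back
// to a copy of the current global locale if the library rejects it.
std::locale environment_locale_or_global() {
    try {
        return std::locale("");   // may throw std::runtime_error
    } catch (const std::runtime_error&) {
        return std::locale();     // copy of the current global locale
    }
}

void example_usage() {
    const std::locale loc = environment_locale_or_global();
    // name() is never empty; unnamed locales report "*".
    assert(!loc.name().empty());
}
```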
participants (2)
- Beman Dawes
- Peter Dimov