
On Friday 17 October 2008 17:45:28 Emil Dotchevski wrote:
"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8."
This means that precomposed characters are forbidden and combining diacritics must be used to replace them.
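
To illustrate the difference between the two forms, here is a small Python sketch. It uses standard NFD from the unicodedata module, which is an assumption on my part: the decomposition the Mac OS X filesystem applies is a variant of NFD and may differ for some characters. The helper name to_decomposed_utf8 is made up for this example.

```python
import unicodedata

# 'é' as one precomposed code point (U+00E9).
precomposed = "\u00e9"

# Canonical decomposition: 'e' followed by COMBINING ACUTE ACCENT (U+0301).
decomposed = unicodedata.normalize("NFD", precomposed)


def to_decomposed_utf8(name):
    """Hypothetical helper: return the name as standard-NFD UTF-8 bytes.

    Assumption: plain NFD only approximates what the filesystem does.
    """
    return unicodedata.normalize("NFD", name).encode("utf-8")


if __name__ == "__main__":
    print(len(precomposed), len(decomposed))
    print(to_decomposed_utf8(precomposed))
```

The two strings render identically but compare unequal, which is why a name typed in precomposed form may not match byte-for-byte what the filesystem hands back.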
Danger: read the whole document! The point is that nothing guarantees this encoding; it is by no means enforced by the OS. So, in order to be able to use non-compliant media (e.g. ones with codepage encodings, possibly even unknown codepage encodings), you have to treat the strings received from the filesystem as byte strings. The only things you can rely on are:

- Termination with a null byte.
- Segments separated by a path separator (i.e. '/').

Beyond that, converting such a byte string to a text string is a lossy conversion because of the unreliable encoding (though assuming UTF-8 as a default works). Similarly, encoding a text string to a byte string isn't reliable, because the encoding of the filesystem isn't guaranteed.

BTW:

- A similar discussion took place on the Python developers' mailing list. The current state seems to be to implement both a Unicode API and one using byte strings in parallel, though I'm not advocating that approach.
- The same problem is present on all POSIX systems (BSDs, Linux, ...), though there you don't have the UTF-8 default but rather the encoding of the CTYPE locale.
- On modern MS Windows platforms, the system actually claims to guarantee UTF-16. Non-decodable media are supposedly simply rejected, but I can't say for sure that this works.

Uli
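
P.S. A minimal sketch of the byte-string handling described above, in Python. It relies only on the two guarantees (null termination and '/' as separator) and decodes solely for display, assuming UTF-8 and accepting that the result is lossy. The function names split_segments and display_name are hypothetical, not part of any library.

```python
from typing import List


def split_segments(raw_path: bytes) -> List[bytes]:
    """Split a raw path into segments without assuming any text encoding."""
    # Guarantee 1: the string ends at the first null byte.
    raw_path = raw_path.split(b"\x00", 1)[0]
    # Guarantee 2: segments are separated by '/'. Empty segments
    # (leading or doubled separators) are dropped.
    return [seg for seg in raw_path.split(b"/") if seg]


def display_name(segment: bytes) -> str:
    """Lossy conversion for display only.

    Assumption: UTF-8 is used as the default guess; bytes that do not
    decode are replaced with U+FFFD, so the result cannot be turned
    back into the original bytes.
    """
    return segment.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The second segment is Latin-1 encoded, i.e. not valid UTF-8.
    for seg in split_segments(b"/home/caf\xe9/notes.txt\x00"):
        print(seg, display_name(seg))
```

The byte strings stay the authoritative form for any further filesystem calls; the decoded text is only there for showing to a user.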