
On Sun, Oct 19, 2008 at 5:17 AM, Ulrich Eckhardt <doomster@knuut.de> wrote:
On Friday 17 October 2008 17:45:28 Emil Dotchevski wrote:
"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8."
This means that precomposed characters are forbidden and combining diacritics must be used to replace them.
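To make the decomposed form concrete, here is a minimal sketch in Python (the choice of language is an assumption; the thread itself contains no code). It uses the standard `unicodedata` module to show a precomposed character being replaced by a base letter plus a combining diacritic, and what the UTF-8 bytes then look like. The helper name `decompose_name` is hypothetical, and plain NFD is used only as an approximation of what the VFS layer expects.

```python
import unicodedata


def decompose_name(name: str) -> bytes:
    """Return the UTF-8 bytes of `name` in canonically decomposed form (NFD).

    Hypothetical helper for illustration; NFD is assumed to be close enough
    to the decomposition the quoted documentation refers to.
    """
    return unicodedata.normalize("NFD", name).encode("utf-8")


# 'é' written as the single precomposed code point U+00E9.
precomposed = "caf\u00e9"

# After decomposition the name is 'e' followed by U+0301 (combining acute).
decomposed = decompose_name(precomposed)

# The two byte sequences differ even though they display identically.
same_bytes = precomposed.encode("utf-8") == decomposed
```

Comparing the two byte strings directly shows why a byte-wise comparison of file names can fail even when the names look the same on screen.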
Danger: read the whole document! The point is that nothing guarantees this encoding; it is by no means enforced by the OS. So, in order to be able to use non-compliant media (e.g. ones with codepage encodings, possibly even unknown codepage encodings), you have to treat the strings received from the filesystem as byte strings. The only things you can rely on are:
- Termination with a null byte.
- Segments separated with a path separator (i.e. '/').
Otherwise, converting it to a text string is a lossy conversion because of the unreliable encoding (though assuming UTF-8 as a default works). Similarly, encoding to a byte string isn't reliable, because the encoding of the filesystem isn't guaranteed.
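The point about lossy conversion can be sketched as follows, again in Python and under stated assumptions: the path and the helper names `split_segments` and `to_display_text` are made up for illustration, and the file name is assumed to come from a medium that used Latin-1 rather than UTF-8. The path is handled purely as bytes, relying only on the '/' separator, and is decoded only for display.

```python
from typing import List


def split_segments(raw_path: bytes) -> List[bytes]:
    """Split a raw path on the '/' byte, the one separator we can rely on."""
    return [segment for segment in raw_path.split(b"/") if segment]


def to_display_text(raw: bytes) -> str:
    """Decode for display only, assuming UTF-8 as the default encoding.

    Bytes that are not valid UTF-8 become U+FFFD, so this is lossy.
    """
    return raw.decode("utf-8", errors="replace")


# Hypothetical path whose last segment is Latin-1 encoded (0xE9 is 'é').
raw = b"/media/usb/caf\xe9.txt"

segments = split_segments(raw)
shown = to_display_text(segments[-1])

# Encoding the display text again does not give back the original bytes.
round_trip = shown.encode("utf-8")
lossless = round_trip == segments[-1]
```

Because the round trip fails, the original byte string has to be kept around for any later call that opens or renames the file; the decoded text is only good for showing to the user.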
BTW:
- A similar discussion took place on the Python developers' mailing list. The current state seems to be to implement both a Unicode API and one using byte strings in parallel, though I'm not advocating that approach.
- The same problem is present on all POSIX systems (BSDs, Linux, ...), though there you don't have the UTF-8 default but rather the encoding of the CTYPE locale.
Yes. The situation on POSIX systems is quite messy. I've been discussing it with the POSIX folks, and I get conflicting answers depending on the example presented. Part of the problem is that the documented behavior of the POSIX command line utilities is different from the program API behavior. Also, real-world behavior sometimes seems different from the POSIX specifications. Sigh. I'd really like to be put in contact with someone who has access to and is familiar with POSIX variants used in Asia.

--Beman