
On Mon, Jul 18, 2005 at 09:48:11AM -0700, Robert Ramey wrote:
Jonathan Wakely wrote:
On Mon, Jul 18, 2005 at 08:11:31AM -0700, Robert Ramey wrote:
Hmm, I've twiddled with the set of allowable characters from time to time on sort of an ad hoc basis. For some reason it never occurred to me to actually try and find the definitive source for this. So I suppose there are a couple
Assuming you're referring to XML, it's here: http://www.w3.org/TR/REC-xml
of pending fine points here:
a) the exact rules for what characters are legal in which part of a tag name. This might not be all that obvious given that the xml can be coded in wide characters and then converted to utf-8. Also, the narrow character version is coded with the current locale, so that's another story.
A character is a character, how it is encoded is irrelevant.
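As a rough sketch of what the spec's Name production allows (this only covers the ASCII subset; the full rule also permits a large set of non-ASCII letters, combining characters and extenders, so treat it as an approximation rather than the actual rule, and the function name is just something I made up):

    #include <cctype>
    #include <string>

    // ASCII-only approximation of the Name production from the spec linked
    // above: the first character must be a letter, '_' or ':', and later
    // characters may also be digits, '.' or '-'.
    bool is_plausible_ascii_name(const std::string & name)
    {
        if (name.empty())
            return false;
        const unsigned char first = static_cast<unsigned char>(name[0]);
        if (!(std::isalpha(first) || first == '_' || first == ':'))
            return false;
        for (std::string::size_type i = 1; i < name.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(name[i]);
            if (!(std::isalnum(c) || c == '_' || c == ':'
                  || c == '.' || c == '-'))
                return false;
        }
        return true;
    }

For anything outside ASCII you really need to consult the character tables in the spec itself.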
Thanks for the link.
That's not obvious to me - especially when one is using a locale-specific character set. Maybe XML requires that all characters be utf-16 (or 32) or some such thing, but as a practical matter lots of people are still using locale-specific types for strings. So it's not obvious what the implications are of including a '\0' as part of a text string in an XML archive. This is one of those things that seemed simple when I started but turned up a lot of small "gotchas" as time went on.
I agree that's a harder problem than just "can character X be used in an element name" :-) The '\0' character is not valid anywhere in XML, in any encoding. I don't know the reasoning, but it means you have to use some kind of alternative representation for data that could contain NULs. If you're talking about text strings with embedded NULs, then you might define an entity that can stand in for the NUL, so you can expand it back to NUL when you recreate the string from the XML archive, or put all strings that might contain NULs in an element like <hex> and hex-encode the bytes. There might be other solutions too, but I've not used them.

jon
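A minimal sketch of the hex-encoding idea (the <hex> wrapper and the helper names are just illustrative, not anything an existing archive defines; a real archive would of course have to fit this into its own escaping and loading machinery):

    #include <cassert>
    #include <string>

    // Encode each byte of a raw string as two lowercase hex digits, so the
    // result contains only characters that are safe inside an XML element.
    std::string to_hex(const std::string & raw)
    {
        static const char digits[] = "0123456789abcdef";
        std::string out;
        out.reserve(raw.size() * 2);
        for (std::string::size_type i = 0; i < raw.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(raw[i]);
            out += digits[c >> 4];
            out += digits[c & 0x0f];
        }
        return out;
    }

    // Reverse the encoding when reading the archive back in.
    std::string from_hex(const std::string & hex)
    {
        std::string out;
        out.reserve(hex.size() / 2);
        for (std::string::size_type i = 0; i + 1 < hex.size(); i += 2) {
            unsigned value = 0;
            for (int j = 0; j < 2; ++j) {
                const char c = hex[i + j];
                value = value * 16 + (c >= 'a' ? c - 'a' + 10 : c - '0');
            }
            out += static_cast<char>(value);
        }
        return out;
    }

    int main()
    {
        std::string s("ab\0cd", 5);   // a string with an embedded NUL
        // This is what would end up in the archive instead of the raw bytes:
        std::string element = "<hex>" + to_hex(s) + "</hex>";
        (void)element;
        assert(from_hex(to_hex(s)) == s);   // round-trips the NUL intact
        return 0;
    }

It doubles the size of the data, but it sidesteps the question of which characters are legal entirely, which is why I'd lean that way for anything that isn't known to be plain text.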