
On Wed, Jul 20, 2005 at 12:28:40PM -0700, Robert Ramey wrote:
FWIW I personally would like option 1 - use anyway - basically because it would preserve the idea that an xml_archive can do anything any other archive can do and doesn't ripple XML-ness back into the library or user programs. But even this is not so trivial. It's not clear to me whether it should apply to all non-printable characters. This then raises the issue of what is non-printable in a UTF context. Then it makes me wonder what the "encoding" attribute in XML is for in a UTF file. This is a perfect example of how something that seems simple at first glance turns into a really time-consuming issue.
You're confusing "characters that are allowed in XML" with "encoding used to represent characters". &copy; is a character entity which has NOTHING to do with encoding. It represents the same character whatever encoding your document is stored in. Similarly, numeric character references to disallowed characters (such as the one for '\0') are not allowed in an XML document irrespective of the encoding. Character encoding is to do with how an XML file is stored on disk. Whether you can have '\0' in an XML file is to do with the semantic content of the XML document. These issues are unrelated.
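To make the "allowed in XML" half of that distinction concrete, here is a minimal sketch of the Char production from the XML 1.0 specification, written as a check on a Unicode code point. The function name is_xml10_char is made up for illustration and is not part of the serialization library. The check never looks at how the document is encoded, which is the point.

```cpp
#include <cstdint>

// Hypothetical helper: true if the code point matches the Char production
// of XML 1.0 (tab, LF, CR, and everything from U+0020 up except surrogates,
// U+FFFE and U+FFFF). The document's encoding plays no part in the answer.
constexpr bool is_xml10_char(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this, is_xml10_char(0xA9) (the copyright sign) is true and is_xml10_char(0) is false, whether the file is stored as UTF-8, ISO-8859-1 or anything else.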
I've never warmed up to XML myself. I learned enough of the details to implement xml_?archive but I still never learned to like it. The only thing I've found it useful for is checking that load/save functions match. The xml_archive classes check that the end tag is found in the right place and in fact matches the start tag so any difference in the save / load functions throws an exception. So if I have an obscure problem I test using xml_archive.
Other than the above, the only utility I can see for the xml_?archive is as some sort of bridge to the "outside world". That's why I set aside the original string representation - as a sequence of numbers - in favor of the current one - a text string. The mismatch between what std::string does and xml text data does is the source of the problem.
So stop using XML. If you're not going to write well-formed XML (which means no references to characters such as '\0' that XML does not allow) then why bother writing XML? XML is verbose, inefficient and has a number of complicated details. Its main advantages are interoperability and the availability of compatible tools. If you produce non-well-formed XML then you can't use any existing tools, so you've invented your own markup language with most of the drawbacks of XML and none of the advantages! IMHO what you should do is produce well-formed XML.
I would hope that some smart person can find the sentence, in the paragraph, on the page, in the chapter of the relevant document which can deal with this in some sort of conforming way.
Either:

1) Store all strings in a hexadecimal or base64 representation. This allows any arbitrary sequence of bytes to be mapped to a portable subset of ASCII characters.

2) Store strings normally, unless they contain invalid characters, in which case put the string in a <hex> or <base64> element and use hex/base64 to store the string.

The advantage of 1) is consistency. The advantage of 2) is human readability for most strings - only unrepresentable ones are not human readable.

I am completely unfamiliar with the serialization library and its XML format. Do you turn all strings to UTF-8? That seems wrong to me. If I give you a std::string with the bytes that map to an ISO-8859-1 string, do you re-encode that as UTF-8 using e.g. iconv? What if I give you a std::string containing bytes that map to a UTF-8 string? Do you re-encode that?

I think there is a strong argument for not doing anything encoding-related to strings: just store the bytes exactly as they are, unless that would produce an invalid XML doc, in which case use hex or base64. Otherwise you impose a semantic meaning on the bytes in a std::string that may not be present, namely "this string contains text data that can be stored in an XML text node". C++ allows ANY bytes in a std::string and does not require those bytes to form a valid UTF-8 string, or a valid ASCII string, or any other restriction.

jon

-- 
"What I tell you three times is true" - The Hunting of the Snark
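Option 2) could be sketched roughly as below. This is not how the serialization library works; the names is_forbidden_byte, to_hex and encode_string are invented for illustration, and the <hex> element is the one suggested above. The sketch assumes an ASCII-compatible document encoding and only treats C0 control bytes other than tab, LF and CR as unrepresentable. It does not check whether bytes above 0x7F are valid in the declared encoding, and it ignores the line-ending normalization an XML parser applies to CR.

```cpp
#include <string>

// Hypothetical helper: true for bytes that can never appear in an XML 1.0
// document in an ASCII-compatible encoding (C0 controls except tab, LF, CR).
inline bool is_forbidden_byte(unsigned char b) {
    return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
}

// Hypothetical helper: write every byte as two lowercase hexadecimal digits.
inline std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(s.size() * 2);
    for (unsigned char b : s) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Option 2: keep the bytes as they are (escaping only markup characters)
// when that is possible, and fall back to a <hex> element otherwise.
inline std::string encode_string(const std::string& s) {
    for (unsigned char b : s) {
        if (is_forbidden_byte(b)) {
            return "<hex>" + to_hex(s) + "</hex>";
        }
    }
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}
```

A reader of such a document would have to reverse the process: decode the content of a <hex> element back to bytes, and otherwise unescape the text node and use its bytes unchanged.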