[serialization] Why can't hyphens appear in NVP's?

Hi, I'm using the serialization library for the first time. Everything works pretty nicely, but I'm a little surprised that I can't use hyphens in the first component of a name-value pair. What's the purpose of this restriction? As long as the hyphen is not the first character of a name, it should correspond to perfectly legal XML. Jonathan

Hmm, I've twiddled with the set of allowable characters from time to time on sort of an ad hoc basis. For some reason it never occurred to me to actually try and find the definitive source for this. So I suppose there are a couple of pending fine points here: a) the exact rules for what characters are legal in which part of tag names. This might not be all that obvious given that the XML can be coded in wide characters and then converted to UTF-8. Also, the narrow-character version is coded with the current locale, so that's another story. b) It has recently been brought to my attention that wstrings with a '\0' in them can't be part of a string variable. Actually this raises the issue that the current XML coding escapes XML characters like '>' but doesn't do anything special for non-printable characters. This will require another level of escapes - the XML &# syntax. Again, the resolution probably requires considering the locale - or not. A long time ago, strings (not tags) were coded as arrays of integers. Thus the problem b) above didn't occur. But it seemed inconvenient, inefficient, and incompatible with the XML idea that stuff should be sort of readable. Sorry I can't give a better answer - but there it is. Robert Ramey Jonathan Turkanis wrote:

On Mon, Jul 18, 2005 at 08:11:31AM -0700, Robert Ramey wrote:
Assuming you're referring to XML, it's here: http://www.w3.org/TR/REC-xml
A character is a character; how it is encoded is irrelevant. Re-encoding an XML file doesn't change whether it is well-formed or not (assuming you update any encoding specifiers in the document itself). So if 'a' is allowed in an element name then the representation of 'a' in the document's encoding is allowed in an element name, whatever that encoding is. jon
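For reference, the Name production in that spec does allow '-' anywhere except as the first character. Below is a minimal, ASCII-only sketch of the check; the function name is illustrative and the long list of Unicode ranges admitted by the real production is omitted.

    #include <cctype>
    #include <string>

    // Simplified, ASCII-only sketch of the XML 1.0 Name production:
    // a NameStartChar is a letter, '_' or ':'; subsequent NameChars may
    // also be digits, '-' or '.'.  Unicode ranges are omitted.
    bool is_valid_xml_name(const std::string & name)
    {
        if (name.empty())
            return false;
        unsigned char first = name[0];
        if (!(std::isalpha(first) || first == '_' || first == ':'))
            return false;
        for (std::string::size_type i = 1; i < name.size(); ++i) {
            unsigned char c = name[i];
            if (!(std::isalnum(c) || c == '_' || c == ':'
                  || c == '-' || c == '.'))
                return false;
        }
        return true;
    }

    // is_valid_xml_name("my-value") == true
    // is_valid_xml_name("-value")   == false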

Jonathan Wakely wrote:
Thanks for the link. That's not obvious to me - especially when one is using a locale-specific character set. Maybe XML requires that all characters be UTF-16 (or 32) or some such thing, but as a practical matter lots of people are still using locale-specific types for strings. So it's not obvious what the implications are of including a '\0' as part of a text string in an XML archive. This is one of those things that seemed simple when I started but ran into a lot of small "gotchas" as time went on.

On Mon, Jul 18, 2005 at 09:48:11AM -0700, Robert Ramey wrote:
I agree that's a harder problem than just "can character X be used in an element name" :-) The '\0' character is not valid anywhere in XML, in any encoding. I don't know the reasoning but it means you have to use some kind of alternative representation for data that could contain NULs. If you're talking text strings with embedded NULs then you might need to define an entity that can stand in for the NUL, so you can expand it back to NUL when you recreate the string from the XML archive, or put all strings that might contain NULs in an element like <hex> and hex-encode the bytes. There might be other solutions too, but I've not used them. jon
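To make the <hex> idea concrete, here is a minimal sketch of the encoding side; the element name and the helper's name are just illustrations, not anything the library defines.

    #include <string>

    // Hex-encode arbitrary bytes (including NULs) so they can be placed
    // inside a <hex>...</hex> element.
    std::string hex_encode(const std::string & bytes)
    {
        static const char digits[] = "0123456789ABCDEF";
        std::string out;
        out.reserve(bytes.size() * 2);
        for (std::string::size_type i = 0; i < bytes.size(); ++i) {
            unsigned char c = bytes[i];
            out += digits[c >> 4];
            out += digits[c & 0x0F];
        }
        return out;
    }

    // hex_encode(std::string("a\0b", 3)) == "610062"

Hex encoding doubles the size of the stored bytes; base64, which comes up later in the thread, is more compact at the cost of being harder to read.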

Basically we want to map anything that might be contained in a std::string or std::wstring to an XML value string. A little investigation makes me think that the appropriate mechanism is to escape all "non-printable" (uh-oh?) characters, or some subset of "problem characters", using the % escape syntax. It looks to me like a non-obvious problem but I have yet to delve into it. Robert Ramey
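A minimal sketch of that % escape idea, assuming the '%' character itself is also escaped so the mapping stays reversible; the function name is illustrative.

    #include <cctype>
    #include <cstdio>
    #include <string>

    // Escape non-printable bytes (and '%') as %XX; printable characters
    // pass through unchanged.
    std::string percent_escape(const std::string & s)
    {
        std::string out;
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            unsigned char c = s[i];
            if (std::isprint(c) && c != '%') {
                out += static_cast<char>(c);
            } else {
                char buf[4];
                std::sprintf(buf, "%%%02X", c);
                out += buf;
            }
        }
        return out;
    }

    // percent_escape(std::string("a\0b", 3)) == "a%00b"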

I think if you used UTF-8 as the output character set you'd avoid all these problems. The conversion from UCS-2, etc. to UTF-8 is fairly straightforward, and should be at least as quick as using %. Most importantly, it is more compact (e.g. for Japanese characters 2-3 bytes instead of 6 bytes for %XX%XX). Darren
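A minimal sketch of that conversion, assuming the wide characters already hold Unicode code points (surrogate-pair handling and validation omitted; the function name is illustrative):

    #include <string>

    // Convert wide characters (assumed to be Unicode code points) to UTF-8.
    std::string to_utf8(const std::wstring & in)
    {
        std::string out;
        for (std::wstring::size_type i = 0; i < in.size(); ++i) {
            unsigned long cp = static_cast<unsigned long>(in[i]);
            if (cp < 0x80) {                      // 1 byte
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {              // 2 bytes
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {            // 3 bytes
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                              // 4 bytes
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }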

I am still not sure that you cannot put a representation of the character \0 inside an XML document, but if it were possible, the natural representation would be "&#0;" (without the quotes). Why not just use it? \TM "Jonathan Wakely" <cow@compsoc.man.ac.uk> wrote in message news:20050719102904.GC92286@compsoc.man.ac.uk...

Have you tried it? It's not a valid entity; using it means your XML is not well-formed. It doesn't matter whether you say &#0; or &#x0; (the decimal and hexadecimal forms are exactly equivalent - but 0 is still not a valid numerical entity.)
As long as you can read the same data back and restore the same sequence of bytes it doesn't really matter. jon

Of course that is the question - what does it matter? Does it matter whether or not the XML in the archive respects some standard? What is the xml_archive output going to be used for besides serialization - if anything? My preference would be that an XML archive be as "conforming" as possible. In this way, I don't have to mess with it when someone comes up with some use for it that I didn't foresee. So it's not so much a technical question - any solution can be implemented. It's just that if I have to go through the work to address the problem, I would prefer to do it in such a way that it minimizes the need to re-visit it. That's all. Jonathan Wakely wrote:

Jonathan Wakely wrote:
Yes, in XML 1.1, the null character is a special case by itself; ordinary nonprintable characters can be embedded as numerical character references, but the null character cannot (see the "Legal Character" well-formedness constraint for production 66).
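For reference, the XML 1.1 Char production (to which the "Legal Character" constraint refers) reduces to a small range check; the function name here is just for illustration.

    // XML 1.1 legal characters: everything except #x0, the surrogate
    // block #xD800-#xDFFF, and #xFFFE/#xFFFF.
    bool is_legal_xml11_char(unsigned long cp)
    {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // is_legal_xml11_char(0x0) == false
    // is_legal_xml11_char(0x7) == true  (representable as &#7; in XML 1.1)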
As long as you can read the same data back and restore the same sequence of bytes it doesn't really matter.
I strongly agree with Robert that further processing of generated XML archives by external tools is one of the main strengths of XML archives and should be the main concern when evaluating our options when it comes to dealing with this problem. That said, I see the following options:
1. Use &#0; anyway. I've googled around a bit and found that &#0;'s being generated by one tool in a toolchain and rejected by the next is a reasonably common problem, so I don't really like this option.
2. Encode it using some escape sequence: <foo>bar\0bas</foo> This would introduce an extra grammar layer that software used for further processing must parse.
3. Encode it using a dedicated element: <foo>bar<serialization:null/>bas</foo> This seems like a reasonable way to encode null characters, but wouldn't work in attribute values.
4. Encode strings containing null characters using binary encodings such as those defined by XML Schema's data types: http://www.w3.org/TR/xmlschema-2/#base64Binary http://www.w3.org/TR/xmlschema-2/#hexBinary This would require some additional flag that indicates whether a string is encoded textually or binary (unless of course all strings are encoded this way, but then we'd lose the human-readability of strings in XML archives).
5. Disallow serialization of std::(w)strings that contain null characters to XML archives. This is my personal favorite. XML's normal character data is simply inherently textual and not suited to storing binary data containing null characters. We shouldn't try to hack around this. Doing so would only make things complicated in further external processing. If users insist on storing binary fragments in their XML archives they can always resort to vector<char> (by the way, the binary encodings I mentioned above might be very nice for storing things like vector<char> efficiently).
Eelis
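For option 4, a minimal sketch of the encoding direction of base64Binary (decoding omitted; the function name is illustrative):

    #include <string>
    #include <vector>

    // base64-encode arbitrary bytes, as in XML Schema's base64Binary type.
    std::string base64_encode(const std::vector<char> & data)
    {
        static const char tbl[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        std::string out;
        std::size_t i = 0;
        while (i + 2 < data.size()) {           // full 3-byte groups
            unsigned long n = (static_cast<unsigned char>(data[i]) << 16)
                            | (static_cast<unsigned char>(data[i + 1]) << 8)
                            |  static_cast<unsigned char>(data[i + 2]);
            out += tbl[(n >> 18) & 0x3F];
            out += tbl[(n >> 12) & 0x3F];
            out += tbl[(n >> 6) & 0x3F];
            out += tbl[n & 0x3F];
            i += 3;
        }
        if (i + 1 == data.size()) {             // one trailing byte
            unsigned long n = static_cast<unsigned char>(data[i]) << 16;
            out += tbl[(n >> 18) & 0x3F];
            out += tbl[(n >> 12) & 0x3F];
            out += "==";
        } else if (i + 2 == data.size()) {      // two trailing bytes
            unsigned long n = (static_cast<unsigned char>(data[i]) << 16)
                            | (static_cast<unsigned char>(data[i + 1]) << 8);
            out += tbl[(n >> 18) & 0x3F];
            out += tbl[(n >> 12) & 0x3F];
            out += tbl[(n >> 6) & 0x3F];
            out += '=';
        }
        return out;
    }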

Eelis van der Weegen wrote:
I like this.
The problem with this is that it's hard to remember the restriction. One of the main advantages of basic_string over C-style strings is that they can store arbitrary sequences, so it's natural for users to take this feature for granted. Errors resulting from accidental embedded nulls can be very hard to track down.
Eelis
Jonathan

This is a great email. It illustrates why I tend to drag my feet on things like this. This is not going to be addressed right away, so feel free to investigate and discuss it. FWIW, I personally would like option 1 - use &#0; anyway - basically because it would preserve the idea that an xml_archive can do anything any other archive can do and doesn't ripple XML-ness back into the library or user programs. But even this is not so trivial. It's not clear to me whether it should apply to all non-printable characters. This then raises the issue of what is non-printable in a UTF context. Then it makes me wonder what the "encoding" attribute in XML is for in a UTF file. This is a perfect example of how something that seems simple at first glance turns into a really time-consuming issue. I've never warmed up to XML myself. I learned enough of the details to implement xml_?archive but I still never learned to like it. The only thing I've found it useful for is checking that load/save functions match. The xml_archive classes check that the end tag is found in the right place and in fact matches the start tag, so any difference in the save/load functions throws an exception. So if I have an obscure problem I test using xml_archive. Other than the above, the only utility I can see for the xml_?archive is as some sort of bridge to the "outside world". That's why I set aside the original string representation - as a sequence of numbers - in favor of the current one - a text string. The mismatch between what std::string does and what XML text data does is the source of the problem. I would hope that some smart person can find the sentence, in the paragraph, on the page, in the chapter of the relevant document which can deal with this in some sort of conforming way. Good Luck Robert Ramey Eelis van der Weegen wrote:

On Wed, Jul 20, 2005 at 12:28:40PM -0700, Robert Ramey wrote:
You're confusing "characters that are allowed in XML" with "encoding used to represent characters". &copy; is a character entity which has NOTHING to do with encoding. It represents the same character whatever encoding your document is stored in. Similarly, &#0; and &#x0; and other such numerical entities are not allowed in an XML document irrespective of the encoding. Character encoding is to do with how an XML file is stored on disk. Whether you can have '\0' in an XML file is to do with the semantic content of the XML document. These issues are unrelated.
So stop using XML. If you're not going to write well-formed XML (which means no &#0; or &#x0; etc.) then why bother writing XML? XML is verbose, inefficient and has a number of complicated details. Its main advantage is interoperability and the availability of compatible tools. If you produce non-well-formed XML then you can't use any existing tools, so you've invented your own markup language with most of the drawbacks of XML and none of the advantages! IMHO what you should do is produce well-formed XML.
Either:
1) Store all strings in a hexadecimal or base64 representation. This allows any arbitrary sequence of bytes to be mapped to a portable subset of ASCII characters.
2) Store strings normally, unless they contain invalid characters, in which case put the string in a <hex> or <base64> element and use hex/base64 to store the string.
The advantage of 1) is consistency. The advantage of 2) is human readability for most strings - only unrepresentable ones are not human readable. I am completely unfamiliar with the serialization library and its XML format. Do you turn all strings to UTF-8? That seems wrong to me; if I give you a std::string with bytes that map to an ISO-8859-1 string, do you re-encode that as UTF-8 using e.g. iconv? What if I give you a std::string containing bytes that map to a UTF-8 string? Do you re-encode that? I think there is a strong argument for not doing anything encoding-related to strings: just store the bytes exactly as they are, unless that would produce an invalid XML doc, in which case use hex or base64. Otherwise you impose a semantic meaning on the bytes in a std::string that may not be present, namely "this string contains text data that can be stored in an XML text node". C++ allows ANY bytes in a std::string and does not require those bytes to form a valid UTF-8 string, or a valid ASCII string, or satisfy any other restriction. jon -- "What I tell you three times is true" - The Hunting of the Snark
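A sketch of the decision behind suggestion 2), assuming XML 1.0 rules with an ASCII-only simplification (bytes above 0x7F pass through); write_text_element and write_hex_element are hypothetical helpers, not library functions.

    #include <string>

    // Can this string be stored verbatim as XML character data?
    // Control characters other than tab, LF and CR force a
    // <hex>/<base64> representation.
    bool storable_as_xml_text(const std::string & s)
    {
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            unsigned char c = s[i];
            if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
                return false;
        }
        return true;
    }

    // Usage sketch, choosing a representation per string:
    // if (storable_as_xml_text(s))
    //     write_text_element(s);             // hypothetical helper
    // else
    //     write_hex_element(hex_encode(s));  // hypothetical helper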

Jonathan Wakely wrote:
I agree - that's why I don't use it.
IMHO what you should do is produce well-formed XML.
That's what we're trying to do.
That's the way the first version worked - a lot of people were unhappy with it.
A worthy suggestion.
Agreed. The fundamental problem is that a std::basic_string can hold data that cannot be represented in an XML string.
Do you turn all strings to UTF-8?
Currently it works like this: a) std::strings are written to the XML file using the current stream locale. Actually I use a "null" codecvt facet to work around the fact that the standard facet molests the input/output string. b) std::wstrings are converted to UTF-8 using a stream codecvt facet. The library would permit any codecvt facet to be used. (Hmm - this might be the place to permit the user to insert his own decision about how to deal with this problem. The more I think about this, the more I like it.)
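As an illustration of that last idea - letting the caller choose the facet - this sketch imbues a wide output stream with the standard codecvt_utf8 facet (a post-C++11 facility, shown only to demonstrate the mechanism; the file name is arbitrary):

    #include <codecvt>   // std::codecvt_utf8
    #include <fstream>
    #include <locale>

    int main()
    {
        std::wofstream os("archive.xml");
        // Replace the wide-to-narrow conversion facet so that wide
        // characters are written to disk as UTF-8 bytes.
        os.imbue(std::locale(os.getloc(), new std::codecvt_utf8<wchar_t>));
        os << L"<value>caf\u00E9</value>\n";
        return 0;
    }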
We're in agreement here as well. I very much want to maintain the independence of the archive from the serialization. This means that the serialization of data is not in any way dependent on the type of archive to be used. Robert Ramey
participants (6)
-
Darren Cook
-
Eelis van der Weegen
-
Jonathan Turkanis
-
Jonathan Wakely
-
Lucas Galfaso
-
Robert Ramey