[serialization] Why can't hyphens appear in NVP's?

Hi, I'm using the serialization library for the first time. Everything works pretty nicely, but I'm a little surprised that I can't use hyphens in the first component of a name-value pair. What's the purpose of this restriction? As long as the hyphen is not the first character of a name, it should correspond to perfectly legal XML. Jonathan

Hmm, I've twiddled with the set of allowable characters from time to time on a sort of ad hoc basis. For some reason it never occurred to me to actually try to find the definitive source for this. So I suppose there are a couple of pending fine points here:
a) The exact rules for which characters are legal in which part of tag names. This might not be all that obvious given that the html can be coded in wide characters and then converted to utf-8. Also, the narrow character version is coded with the current locale, so that's another story.
b) It has recently been brought to my attention that wstrings with a '\0' in them can't be part of a string variable. Actually this raises the issue that the current html coding escapes html characters like '>' but doesn't do anything special for non-printable characters. This will require another level of escapes - the html # syntax. Again, the resolution probably requires considering the locale - or not.
A long time ago, strings (not tags) were coded as arrays of integers, so problem b) above didn't occur. But that seemed inconvenient, inefficient, and incompatible with the xml idea that stuff should be sort of readable.
Sorry I can't give a better answer - but there it is.
Robert Ramey
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On Mon, Jul 18, 2005 at 08:11:31AM -0700, Robert Ramey wrote:
Hmm, I've twiddled with the set of allowable characters from time to time on a sort of ad hoc basis. For some reason it never occurred to me to actually try to find the definitive source for this. So I suppose there are a couple
Assuming you're referring to XML, it's here: http://www.w3.org/TR/REC-xml
of pending fine points here:
a) The exact rules for which characters are legal in which part of tag names. This might not be all that obvious given that the html can be coded in wide characters and then converted to utf-8. Also, the narrow character version is coded with the current locale, so that's another story.
A character is a character; how it is encoded is irrelevant. Re-encoding an XML file doesn't change whether it is well-formed or not (assuming you update any encoding specifiers in the document itself). So if 'a' is allowed in an element name then the representation of 'a' in the document's encoding is allowed in an element name, whatever that encoding is.
jon
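
As a rough illustration of the name rule being discussed, here is a minimal sketch that checks only the ASCII subset of the Name production in the XML 1.0 recommendation linked above. The helper names `is_ascii_letter` and `is_valid_xml_name_ascii` are hypothetical and are not part of the serialization library; the full production also admits many non-ASCII characters, which this sketch deliberately ignores.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers, not part of Boost.Serialization.
inline bool is_ascii_letter(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// ASCII subset of the XML 1.0 Name production:
// the first character must be a letter, '_' or ':';
// later characters may also be digits, '-' or '.'.
inline bool is_valid_xml_name_ascii(const std::string& name) {
    if (name.empty()) return false;
    const char first = name[0];
    if (!(is_ascii_letter(first) || first == '_' || first == ':')) return false;
    for (std::string::size_type i = 1; i < name.size(); ++i) {
        const char c = name[i];
        const bool ok = is_ascii_letter(c) || (c >= '0' && c <= '9')
                     || c == '_' || c == ':' || c == '-' || c == '.';
        if (!ok) return false;
    }
    return true;
}

// A hyphen is acceptable anywhere except the first position.
inline void xml_name_examples() {
    assert(is_valid_xml_name_ascii("my-value"));
    assert(!is_valid_xml_name_ascii("-value"));
}
```

Under this rule the original question holds up: a name such as "my-value" is legal XML, and only a leading hyphen is rejected.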

Jonathan Wakely wrote:
A character is a character; how it is encoded is irrelevant.
Thanks for the link. That's not obvious to me - especially when one is using a locale-specific character set. Maybe XML requires that all characters be ucs-16 (or 32) or some such thing, but as a practical matter lots of people are still using locale-specific types for strings. So it's not obvious what the implications are of including a '\0' as part of a text string in an xml archive. This is one of those things that seemed simple when I started, but I ran into a lot of small "gotchas" as time went on.

On Mon, Jul 18, 2005 at 09:48:11AM -0700, Robert Ramey wrote:
Thanks for the link.
That's not obvious to me - especially when one is using a locale-specific character set. Maybe XML requires that all characters be ucs-16 (or 32) or some such thing, but as a practical matter lots of people are still using locale-specific types for strings. So it's not obvious what the implications are of including a '\0' as part of a text string in an xml archive. This is one of those things that seemed simple when I started, but I ran into a lot of small "gotchas" as time went on.
I agree that's a harder problem than just "can character X be used in an element name" :-)
The '\0' character is not valid anywhere in XML, in any encoding. I don't know the reasoning, but it means you have to use some kind of alternative representation for data that could contain NULs.
If you're talking about text strings with embedded NULs then you might need to define an entity that can stand in for the NUL, so you can expand it back to NUL when you recreate the string from the XML archive, or put all strings that might contain NULs in an element like <hex> and hex-encode the bytes. There might be other solutions too, but I've not used them.
jon
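
The claim that NUL is never legal can be checked against the Char production of XML 1.0. The following is a minimal sketch of that check; `is_xml10_char` and `is_xml10_text` are hypothetical names, and the wide-string helper assumes each wchar_t holds one whole code point (no surrogate pairs).

```cpp
#include <cassert>
#include <string>

// XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
inline bool is_xml10_char(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: true if every element of the wide string is a legal
// XML 1.0 character. Assumption: one wchar_t is one code point.
inline bool is_xml10_text(const std::wstring& s) {
    for (wchar_t wc : s) {
        if (!is_xml10_char(static_cast<char32_t>(wc))) return false;
    }
    return true;
}

// A std::wstring is free to hold an embedded NUL; XML text is not.
inline void xml10_char_examples() {
    const std::wstring with_nul(L"bar\0bas", 7);
    assert(with_nul.size() == 7);
    assert(!is_xml10_text(with_nul));
}
```

Note that the same production also excludes most other control characters, such as '\01', so NUL is not the only value a string can hold that XML 1.0 text cannot.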

Basically we want to map anything that might be contained in a std::string or std::wstring to an XML value string. A little investigation makes me think that the appropriate mechanism is to escape all "non-printable" (uh-oh?) characters, or some subset of "problem characters", using the % escape syntax. It looks to me like a non-obvious problem, but I have yet to delve into it.
Robert Ramey

I think if you used UTF-8 as the output character set you would avoid all these problems. The conversion from UCS-2, etc. to UTF-8 is fairly straightforward, and should be at least as quick as using %. Most importantly it is more compact (e.g. for Japanese characters, 2-3 bytes instead of 6 bytes for %XX%XX).
Darren
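
For comparison, a percent-escape along the lines mentioned earlier in the thread might look like the sketch below. This is only an illustration under stated assumptions: `percent_escape` is a hypothetical helper, it treats "non-printable" as any byte outside 0x20-0x7E, and it ignores XML markup characters such as '<' and '&', which would still need their usual escapes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: percent-escape every byte that is not printable ASCII,
// plus '%' itself so that decoding stays unambiguous.
inline std::string percent_escape(const std::string& in) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        const bool printable = c >= 0x20 && c <= 0x7E;
        if (printable && c != '%') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += digits[c >> 4];
            out += digits[c & 0x0F];
        }
    }
    return out;
}

// An embedded NUL survives as three printable characters.
inline void percent_escape_examples() {
    assert(percent_escape(std::string("a\0b", 3)) == "a%00b");
}
```

The size cost is visible here: each escaped byte becomes three characters, so a character stored as three UTF-8 bytes grows to nine characters if those bytes are escaped one by one. An XML parser would also not undo this escaping by itself, so any consumer of the file would need to know about the extra layer.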

The xml_wiarchive and xml_woarchive do use UTF-8. The specific case reported by the user is a std::wstring with a '\0' in the middle of it. Using UTF-8 doesn't address the issue.
Robert Ramey

Sorry, I thought you meant the zero byte was part of a multi-byte character. Converting to UTF-8 solves that (?). But you just mean a standalone \0 character? Doesn't that mean the problem applies to serializing std::string as well? Darren

On Tue, Jul 19, 2005 at 05:12:57PM +0900, Darren Cook wrote:
Sorry, I thought you meant the zero byte was part of a multi-byte character. Converting to UTF-8 solves that (?).
But you just mean a standalone \0 character? Doesn't that mean the problem applies to serializing std::string as well?
Yes, it will. The root of the problem is that std::string and std::wstring can contain any arbitrary sequence of characters, including NULs. jon
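
Both points in this exchange can be seen in a small UTF-8 encoder. The sketch below is a minimal, unvalidated encoder (the name `utf8_encode` is hypothetical, and it does not reject surrogates or values above 0x10FFFF): every byte of a multi-byte sequence has its high bit set, so a zero byte never appears inside a multi-byte character, but the code point 0 itself still encodes as a single zero byte.

```cpp
#include <cassert>
#include <string>

// Minimal UTF-8 encoder for one code point. Sketch only: no validation of
// surrogates or of values above 0x10FFFF.
inline std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// U+0000 stays a lone zero byte; U+3042 becomes three non-zero bytes.
inline void utf8_examples() {
    assert(utf8_encode(0) == std::string(1, '\0'));
    assert(utf8_encode(0x3042) == "\xE3\x81\x82");
}
```

So converting to UTF-8 removes zero bytes that were only an artifact of a wide encoding, but it cannot remove a NUL that is genuinely part of the string, whether the string is a std::string or a std::wstring.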

I am still not sure that you cannot put a representation of the character \0 inside an XML document, but if it were possible, the natural representation would be "&#0" (without the quotes). Why not just use it?
\TM

That's the solution for '\0'. It's just not obvious to me that that is the whole problem. What about '\01'? Or non-printable characters in general? Does the encoding come into play? How? These are questions I don't currently have an answer to.
Robert Ramey

Robert Ramey wrote: Lucas Galfaso wrote:
I am still not sure that you cannot put a representation of the character \0 inside an XML document, but if it were possible, the natural representation would be "&#0" (without the quotes). Why not just use it?
Have you tried it? It's not a valid entity; using it means your XML is not well-formed. It doesn't matter whether you say &#0; or &#x0; (the decimal and hexadecimal forms are exactly equivalent), 0 is still not a valid numerical entity.
That's the solution for '\0'. It's just not obvious to me that that is the whole problem. What about '\01'? Or non-printable characters in general? Does the encoding come into play? How? These are questions I don't currently have an answer to.
As long as you can read the same data back and restore the same sequence of bytes it doesn't really matter.
jon

Of course that is the question - what does it matter? Does it matter whether or not the XML in the archive respects some standard? What is the xml_archive output going to be used for besides serialization - if anything? My preference would be that an XML archive be as "conforming" as possible. That way, I don't have to mess with it when someone comes up with some use for it that I didn't foresee. So it's not so much a technical question - any solution can be implemented. It's just that if I have to go through the work to address the problem, I would prefer to do it in such a way that it minimizes the need to revisit it. That's all.
Jonathan Wakely wrote:
As long as you can read the same data back and restore the same sequence of bytes it doesn't really matter.

Jonathan Wakely wrote:
It's not a valid entity; using it means your XML is not well-formed. It doesn't matter whether you say &#0; or &#x0; (the decimal and hexadecimal forms are exactly equivalent), 0 is still not a valid numerical entity.
Yes, in XML 1.1, the null character is a special case by itself; ordinary nonprintable characters can be embedded as numerical character references, but the null character cannot (see the "Legal Character" well-formedness constraint for production 66).
As long as you can read the same data back and restore the same sequence of bytes it doesn't really matter.
I strongly agree with Robert that further processing of generated XML archives by external tools is one of the main strengths of XML archives and should be the main concern when evaluating our options for dealing with this problem. That said, I see the following options:
1. Use &#0; anyway.
I've googled around a bit and found that &#0;'s being generated by one tool in a toolchain and rejected by the next is a reasonably common problem, so I don't really like this option.
2. Encode it using some escape sequence: <foo>bar\0bas</foo>
This would introduce an extra grammar layer that software used for further processing must parse.
3. Encode it using a dedicated element: <foo>bar<serialization:null/>bas</foo>
This seems like a reasonable way to encode null characters, but wouldn't work in attribute values.
4. Encode strings containing null characters using binary encodings such as those defined by XML Schema's data types:
http://www.w3.org/TR/xmlschema-2/#base64Binary
http://www.w3.org/TR/xmlschema-2/#hexBinary
This would require some additional flag that indicates whether a string is encoded textually or as binary (unless of course all strings are encoded this way, but then we'd lose the human-readability of strings in XML archives).
5. Disallow serialization of std::(w)strings that contain null characters to XML archives.
This is my personal favorite. XML's normal character data is simply inherently textual and not suited to storing binary data containing null characters. We shouldn't try to hack around this. Doing so would only make things complicated in further external processing. If users insist on storing binary fragments in their XML archives they can always resort to vector<char> (by the way, the binary encodings I mentioned above might be very nice for storing things like vector<char> efficiently).
Eelis
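
To make the XML 1.1 point and option 5 concrete, here is a minimal sketch that writes ASCII control characters as numeric character references and refuses strings containing NUL. The name `escape_xml11_text` is hypothetical, the output would only be acceptable in a document declared as version 1.1, and bytes at or above 0x80 are passed through untouched on the assumption that the input is already in the document's encoding.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: escape a byte string for an XML 1.1 text node.
// ASCII control characters other than tab, LF and CR become numeric
// character references; NUL has no legal representation and throws.
inline std::string escape_xml11_text(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c == 0) throw std::invalid_argument("NUL cannot appear in XML");
        switch (c) {
            case '&': out += "&amp;"; continue;
            case '<': out += "&lt;";  continue;
            case '>': out += "&gt;";  continue;
        }
        const bool control =
            (c < 0x20 && c != '\t' && c != '\n' && c != '\r') || c == 0x7F;
        if (control) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);  // assumption: already document-encoded
        }
    }
    return out;
}

// '\01' can be written as a reference in XML 1.1; NUL cannot be written at all.
inline void xml11_escape_examples() {
    assert(escape_xml11_text("a\x01" "b") == "a&#x1;b");
}
```

This covers the '\01' question raised earlier for XML 1.1 only; in XML 1.0 that reference would be rejected as well, which is one reason the binary encodings of option 4 remain attractive.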

Eelis van der Weegen wrote:
4. Encode strings containing null characters using binary encodings such as those defined by XML Schema's data types:
http://www.w3.org/TR/xmlschema-2/#base64Binary http://www.w3.org/TR/xmlschema-2/#hexBinary
This would require some additional flag that indicates whether a string is encoded textually or binary (unless of course all strings are encoded this way, but then we'd lose the human-readability of strings in XML archives).
I like this.
5. Disallow serialization of std::(w)strings that contain null characters to XML archives.
This is my personal favorite. XML's normal character data is simply inherently textual and not suited to storing binary data containing null characters. We shouldn't try to hack around this. Doing so would only make things complicated in further external processing. If users insist on storing binary fragments in their XML archives they can always resort to vector<char> (by the way, the binary encodings I mentioned above might be very nice for storing things like vector<char> efficiently).
The problem with this is that it's hard to remember the restriction. One of the main advantages of basic_string over C-style strings is that they can store arbitrary sequences, so it's natural for users to take this feature for granted. Errors resulting from accidental embedded nulls can be very hard to track down.
Eelis
Jonathan
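
For option 4, the hexBinary form is simple enough to sketch. The helpers below (`to_hex`, `hex_value`, `from_hex`, all hypothetical names) round-trip an arbitrary byte string, embedded NULs included, through upper-case hex digits. How the archive would flag a hex-encoded string is left open here, as it is in the thread.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Encode arbitrary bytes as upper-case hex digits, two per byte.
inline std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char c : bytes) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// Value of one hex digit; throws on anything else.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Decode a hex string back to the original bytes.
inline std::string from_hex(const std::string& hex) {
    if (hex.size() % 2 != 0) throw std::invalid_argument("odd length");
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::string::size_type i = 0; i < hex.size(); i += 2) {
        out += static_cast<char>(hex_value(hex[i]) * 16 + hex_value(hex[i + 1]));
    }
    return out;
}

// The embedded NUL survives the round trip.
inline void hex_examples() {
    const std::string s("bar\0bas", 7);
    assert(from_hex(to_hex(s)) == s);
}
```

The cost is the one already noted: the encoded form doubles in size and is no longer readable as text.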

This is a great email. It illustrates why I tend to drag my feet on things like this. This is not going to be addressed right away, so feel free to investigate and discuss it.
FWIW I personally would like option 1 - use &#0; anyway - basically because it would preserve the idea that an xml_archive can do anything any other archive can do and doesn't ripple XML-ness back into the library or user programs. But even this is not so trivial. It's not clear to me whether it should apply to all non-printable characters. This then raises the issue of what is non-printable in a UTF context. Then it makes me wonder what the "encoding" attribute in XML is for in a UTF file. This is a perfect example of how something that seems simple at first glance turns into a really time-consuming issue.
I've never warmed up to XML myself. I learned enough of the details to implement xml_?archive but I still never learned to like it. The only thing I've found it useful for is checking that load/save functions match. The xml_archive classes check that the end tag is found in the right place and in fact matches the start tag, so any difference in the save / load functions throws an exception. So if I have an obscure problem I test using xml_archive.
Other than the above, the only utility I can see for the xml_?archive is as some sort of bridge to the "outside world". That's why I set aside the original string representation - as a sequence of numbers - in favor of the current one - a text string. The mismatch between what std::string does and what xml text data does is the source of the problem.
I would hope that some smart person can find the sentence, in the paragraph, on the page, in the chapter of the relevant document which can deal with this in some sort of conforming way.
Good luck.
Robert Ramey

Just a small correction: the representation would be "&#0;" (without the quotes but _with_ the semicolon).
\LG

On Wed, Jul 20, 2005 at 12:28:40PM -0700, Robert Ramey wrote:
FWIW I personally would like option 1 - use &#0; anyway - basically because it would preserve the idea that an xml_archive can do anything any other archive can do and doesn't ripple XML-ness back into the library or user programs. But even this is not so trivial. It's not clear to me whether it should apply to all non-printable characters. This then raises the issue of what is non-printable in a UTF context. Then it makes me wonder what the "encoding" attribute in XML is for in a UTF file. This is a perfect example of how something that seems simple at first glance turns into a really time-consuming issue.
You're confusing "characters that are allowed in XML" with "encoding used to represent characters". &copy; is a character entity which has NOTHING to do with encoding. It represents the same character whatever encoding your document is stored in. Similarly, &#0; and similar numerical entities are not allowed in an XML document irrespective of the encoding. Character encoding is to do with how an XML file is stored on disk. Whether you can have '\0' in an XML file is to do with the semantic content of the XML document. These issues are unrelated.
I've never warmed up to XML myself. I learned enough of the details to implement xml_?archive but I still never learned to like it. The only thing I've found it useful for is checking that load/save functions match. The xml_archive classes check that the end tag is found in the right place and in fact matches the start tag so any difference in the save / load functions throws an exception. So if I have an obscure problem I test using xml_archive.
Other than the above, the only utility I can see for the xml_?archive is as some sort of bridge to the "outside world". That's why I set aside the original string representation - as a sequence of numbers - in favor of the current one - a text string. The mismatch between what std::string does and xml text data does is the source of the problem.
So stop using XML. If you're not going to write well-formed XML (which means no &#0; or similar entities) then why bother writing XML? XML is verbose, inefficient and has a number of complicated details. Its main advantage is interoperability and the availability of compatible tools. If you produce non-well-formed XML then you can't use any existing tools, so you've invented your own markup language with most of the drawbacks of XML and none of the advantages! IMHO what you should do is produce well-formed XML.
I would hope that some smart person can find the sentence, in the paragraph, on the page, in the chapter of the relevant document which can deal with this in some sort of conforming way.
Either:
1) Store all strings in a hexadecimal or base64 representation. This allows any arbitrary sequence of bytes to be mapped to a portable subset of ASCII characters.
2) Store strings normally, unless they contain invalid characters, in which case put the string in a <hex> or <base64> element and use hex/base64 to store the string.
The advantage of 1) is consistency. The advantage of 2) is human readability for most strings - only unrepresentable ones are not human readable.
I am completely unfamiliar with the serialization library and its XML format. Do you turn all strings to UTF-8? That seems wrong to me. If I give you a std::string with the bytes that map to an ISO-8859-1 string, do you re-encode that as UTF-8 using e.g. iconv? What if I give you a std::string containing bytes that map to a UTF-8 string? Do you re-encode that?
I think there is a strong argument for not doing anything encoding-related to strings: just store the bytes exactly as they are, unless that would produce an invalid XML doc, in which case use hex or base64. Otherwise you impose a semantic meaning on the bytes in a std::string that may not be present, namely "this string contains text data that can be stored in an XML text node". C++ allows ANY bytes in a std::string and does not require those bytes to form a valid UTF-8 string, or a valid ASCII string, or any other restriction.
jon
-- "What I tell you three times is true" - The Hunting of the Snark
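
Approach 2) can be sketched in a few lines. Everything named here is hypothetical: `encode_string_element`, its helpers, and the `<text>` wrapper are made up for illustration, with only the `<hex>` element name taken from the suggestion above. The sketch assumes XML 1.0 and, following the "store the bytes exactly as they are" argument, passes bytes at or above 0x80 through unchanged.

```cpp
#include <cassert>
#include <string>

// True if the byte may appear literally in XML 1.0 text.
// Assumption: bytes >= 0x80 are passed through without any encoding check.
inline bool byte_ok_in_xml10(unsigned char c) {
    return c >= 0x20 || c == '\t' || c == '\n' || c == '\r';
}

// Upper-case hex, two digits per byte.
inline std::string hex_encode_bytes(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// Escape the characters that are markup in XML text.
inline std::string escape_markup(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '&') out += "&amp;";
        else if (c == '<') out += "&lt;";
        else if (c == '>') out += "&gt;";
        else out += c;
    }
    return out;
}

// Readable text when possible, hex only when the string cannot be
// represented as XML 1.0 character data.
inline std::string encode_string_element(const std::string& s) {
    for (unsigned char c : s) {
        if (!byte_ok_in_xml10(c)) {
            return "<hex>" + hex_encode_bytes(s) + "</hex>";
        }
    }
    return "<text>" + escape_markup(s) + "</text>";
}

inline void encode_examples() {
    assert(encode_string_element("a<b") == "<text>a&lt;b</text>");
    assert(encode_string_element(std::string("a\0b", 3)) == "<hex>610062</hex>");
}
```

The element name doubles as the flag that tells the loader which decoding to apply, so ordinary strings stay readable and only the unrepresentable ones pay the hex cost.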

Jonathan Wakely wrote:
So stop using XML. If you're not going to write well-formed XML (which means no or or etc.) then why bother writing XML? XML is verbose, inefficient and has a number of complicated details. Its main advantage is interoperability and the availability of compatible tools. If you produce non-well-formed XML then you can't use any existing tools, so you've invented your own markup language with most of the drawbacks of XML and none of the advantages!
I agree - that's why I don't use it.
IMHO what you should do is produce well-formed XML.
That's what we're trying to do.
I would hope that some smart person can find the sentence, in the paragraph, on the page, in the chapter of the relevant document which can deal with this in some sort of conforming way.
Either:
1) Store all strings in a hexadecimal or base64 representation. This allows any arbitrary sequence of bytes to be mapped to a portable subset of ASCII characters.
That's the way the first version worked - a lot of people were unhappy with it.
2) Store strings normally, unless they contain invalid characters, in which case put the string in a <hex> or <base64> element and use hex/base64 to store the string.
A worthy suggestion.
The advantage of 1) is consistency. The advantage of 2) is human readability for most strings - only unrepresentable ones are not human readable.
Agreed. The fundamental problem is that a std::basic_string can hold data that cannot be represented in an XML string.
Do you turn all strings to UTF-8 ?
Currently it works like this:
a) std::string values are written to the XML file using the current stream locale. Actually I use a "null" codecvt facet to work around the fact that the standard facet molests the input/output string.
b) std::wstring values are converted to UTF-8 using a stream codecvt facet. The library would permit any codecvt facet to be used. (Hmm - this might be the place to permit the user to insert his own decision about how to deal with this problem. The more I think about this - the more I like it)
I think there is a strong argument for not doing anything encoding-related to strings, just store the bytes exactly as they are, unless that would produce an invalid XML doc, in which case use hex or base64. Otherwise you impose a semantic meaning on the bytes in a std::string that may not be present, namely "this string contains text data that can be stored in an XML text node". C++ allows ANY bytes in a std::string and does not require those bytes to form a valid UTF-8 string, or a valid ASCII string, or any other restriction.
We're in agreement here as well. I very much want to maintain the independence of the archive from the serialization. This means that the serialization of data is not in any way dependent on the type of archive to be used.
Robert Ramey

Sorry, in the previous post, I meant to write "" as the representation of null, and not "" \TM
"Jonathan Wakely" <cow@compsoc.man.ac.uk> wrote in message news:20050719102904.GC92286@compsoc.man.ac.uk...
On Tue, Jul 19, 2005 at 05:12:57PM +0900, Darren Cook wrote:
The xml_wiarchive and xml_woarchive do use UTF-8. The specific case reported by the user is a std::wstring with a '\0' in the middle of it. Using UTF-8 doesn't address the issue.
Sorry, I thought you meant the zero byte was part of a multi-byte character. Converting to UTF-8 solves that (?).
But you just mean a standalone \0 character? Doesn't that mean the problem applies to serializing std::string as well?
Yes, it will. The root of the problem is that std::string and std::wstring can contain any arbitrary sequence of characters, including NULs.
jon
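To make the point about UTF-8 concrete, here is a small hand-written encoder. It is only a stand-in for whatever codecvt facet the archive actually uses, and the function name is invented for the example. It assumes each wchar_t holds a whole code point (so no UTF-16 surrogate pairs) and does no validation. The thing to notice is that U+0000 encodes to the single byte 0x00, so an embedded NUL comes out the other side unchanged.

```cpp
#include <string>

// Hypothetical helper: encode each wchar_t as one code point in UTF-8.
// Assumes wchar_t values are whole code points; surrogates and
// out-of-range values are not checked.
std::string to_utf8(const std::wstring& ws) {
    std::string out;
    for (wchar_t wc : ws) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            // One byte; this branch also covers U+0000.
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Calling to_utf8(std::wstring(L"a\0b", 3)) gives a three-byte std::string whose middle byte is still zero, which is why the same question comes up for std::string and std::wstring alike.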
participants (6)
- Darren Cook
- Eelis van der Weegen
- Jonathan Turkanis
- Jonathan Wakely
- Lucas Galfaso
- Robert Ramey