
Daniel Walker wrote:
On 4/24/06, Marcin Kalicinski <kalita@poczta.onet.pl> wrote:
My knowledge of XML is limited, but I think Dan Nuffer's parser will parse any valid XML. read_xml however discards all that goes beyond nodes, attributes, data and comments.
Isn't the property_tree XML parser originally based on Dan Nuffer's? Couldn't the productions/tokens from the Nuffer parser be added back to read_xml() so that it could at least accept the syntax for all XML files even if it doesn't implement the semantics? I think the runtime overhead of the additional productions in the grammar would be negligible for simple XML files that don't use the features and necessary for XML files that do. It seems to me this could clarify the scope of the parser. The documentation could read something like:
"read_xml() performs non-validating parsing of the W3C recommendation XML 1.1. In addition, as of version 1.3x, read_xml() parses but ignores the following W3C specifications: XML Names, XInclude, XLink/XPointer, XML Schema, XSLT, ..."
... changing version numbers as appropriate. Also, it may simplify maintenance as far as pulling bug-fixes/enhancements from the Nuffer parser code-base to property_tree.
The property tree's parser is, I believe, either a very slightly modified Dan Nuffer parser (only semantic actions were added, compared to the file I've seen), or built on the same principle: a direct translation of the grammar in the XML specification.
As far as I can see it is, apart from the missing entities, a complete non-validating parser of the XML spec, with one important exception: character set compatibility. The parser parses only files in the character set specified by the current global locale, and completely ignores the character set specification in the header. Another missing part may be the parsing of the internal DTD subset, which might be (I'm not sure yet) required of non-validating parsers. In addition, it is an XML 1.0 parser.
The Namespaces in XML, XInclude, XLink, XPointer, ... specifications are all built on top of XML; documents that use them are all well-formed XML. "Parsing but ignoring" them means nothing and can only lead to misunderstandings.
Sebastian Redl
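To illustrate the point about specifications layered on top of XML, here is a minimal sketch. It uses Python's standard-library expat bindings rather than property_tree, and the sample document and the helper name collect_elements are made up for the illustration. With namespace processing left off, a document that uses Namespaces in XML and XInclude is reported as ordinary elements and attributes, so there is no extra syntax for a plain XML parser to accept.

```python
import xml.parsers.expat

# Hypothetical sample document: it uses Namespaces in XML and XInclude,
# yet it is nothing more than well-formed XML.
DOC = (
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="other.xml"/>'
    '</doc>'
)

def collect_elements(text):
    """Return (name, attributes) pairs as reported by a parser that does no namespace processing."""
    seen = []
    # No namespace_separator is passed, so expat leaves namespace processing off.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(text, True)
    return seen

elements = collect_elements(DOC)
for name, attrs in elements:
    print(name, attrs)
```

The xmlns:xi declaration arrives as a regular attribute and xi:include as a regular element name. Giving them meaning (resolving prefixes, performing the inclusion) is a separate layer applied to the parsed result.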