
Stefan Seefeld wrote:
I would appreciate it if anybody interested in a future boost.xml submission would have a look, provide feedback, or even get involved in the (ongoing) development.
Hi,

I haven't looked at the reader component yet, so this will be about the node tree part only.

I'm not at all happy with the node tree. It seems to me like it is taking the worst parts of the W3C DOM and leaving out the few advantages it has.

The proposed API shares these problems with the DOM:

1) Very verbose.

2) Indirect node construction. I can't create an element by instantiating the class element - it has a protected constructor. Instances are created through some sort of factory, typically by calling methods of document and element that create children and return them. This is not a very natural syntax.

The API has some additional disadvantages:

1) No real namespace support. To create an element in a given namespace, I have to register a prefix and then use the prefix in the element name. Worse, to find out the namespace of an element, I have to parse the string for the prefix (it's one find and one substr operation, but still) and then look it up to find the full namespace URI. (Depending on the semantics of element::lookup_namespace, I might have to walk the tree for that.) Given that some documents, especially generated ones, sometimes have multiple bindings for the same namespace, this is overly tedious. Namespace URI and local name of an element should be first-class properties. The prefix:local convention is really just a hack - in the Infoset view of the information, it doesn't even exist. (The prefix does, the combined name doesn't. See 2.2 of the XML Infoset spec.)

2) Not an existing standard. Whatever else you can say for the DOM, it is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++ with Xerces - the DOM is, minor variations in capitalization aside, a constant. By providing essentially the same functionality, but through a slightly different interface, you lose the recognition value of the DOM without gaining much. (You're avoiding the full complexity of the DOM, which is a good thing.)

3) Not as extensive.
I'm not talking about the annoying multiple redundancy of the DOM here, but about low-level functionality such as preserving entity references in the node tree. For some low-level tasks, this is important stuff. Not that the DOM is really extensive: it provides no way, for example, to modify the document schema. (It allows introspection, at least.)

The API has one clear advantage over the DOM: the use of iterators. The DOM also has many shortcomings that the API, due to its restricted scope, doesn't have. All in all, though, I think the chosen balance between closeness to the DOM and doing something different and interesting is not good.

There are some things I simply consider mistakes:

1) cdata should derive from text. It's basically a special case that only differs in its serialization from the general form.

2) You have a class dtd, but to access it you use document::internal_subset. This dtd class doesn't provide access to the internal subset, however - only to the document type declaration, after which it is named. (Yes, the document type /definition/ has the same abbreviation. Very unfortunate, that.)

Some more issues:

1) The whole node/node_ptr mess. From reading your earlier posts, I thought that node and friends were value-like classes, that they directly represent the nodes, whereas node_ptr was a special smart pointer that provided the memory management and the shallow copying semantics. Only, upon reading the code, I find that node_ptr contains an instance of its element type, not a pointer to it, which means that node and derived are the smart pointers with shallow copying and memory management. Except that they don't really manage memory: the pointers are never freed until the entire owner document is destroyed or the node is explicitly removed from its parent. Oh, and document is an exception to this convention, because it actually is a value-style class.

That's not to say that this isn't a sensible overall strategy. It just is extremely confusing given your naming conventions.
As far as I can see, the only thing node_ptr actually does is make access less convenient by requiring indirection - and thus double indirection for node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator written using the Boost.Iterator library?

2) write_to_file as a member of document. This is asymmetric to parse_file being a free factory function. It's also unnecessary and, in my opinion, not a good idea for various reasons. One is the aforementioned asymmetry. Another is the public interface size, as mentioned in one of the Effective C++ books: write_to_file doesn't actually need access to document's internals, because it just serializes the node tree, right? (That it needs access to the contained libxml pointer is a detail that shouldn't affect the interface. Make it a friend if you have to, but implementations should have the option of not making it one.)

There's more inconsistency here. The Efficient XML Interchange working group is overdue again with their first working draft, but a binary, compact serialization for XML _is_ in the works. Once they publish a recommendation, there will be two official serializations of the XML Infoset. And there are several unofficial ones already. Each one of these needs a pair of parse/serialize functions. (Not necessarily provided by the library, of course.) With write_to_file being a member, it enjoys a unique status that it doesn't really deserve. It also enjoys a very ambiguous name, as does parse_file.

Even leaving that aside, there's also the option of multiple parse/serialize pairs just for a single format. They could take alternative input sources: a boost::path instead of a std::string for identification, for example. Or a std::istream as a data source/std::ostream as a data sink. Or a boost::url, when such a class is written, together with a pluggable communication framework for transparently fetching network URLs. Or whatever.
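
To make the symmetry argument concrete, here is a minimal sketch of parse/serialize pairs written as free functions. Everything in it is hypothetical: the namespace sketch, the stand-in document struct, and the names parse_xml, serialize_xml, parse_xml_file and serialize_xml_file are invented for illustration and are not part of the proposed API. The "parser" is only a stub that stores the raw text, so that the example stays self-contained; the point is the shape of the interface, not the parsing.

```cpp
#include <fstream>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical stand-in for the document class (assumption, not the real
// API). It only holds the raw text; a real one would hold the node tree.
struct document {
    std::string text;
};

// Stream-based pair: any std::istream is a source, any std::ostream a sink.
// Stub "parser": it just reads everything the stream has to offer.
inline document parse_xml(std::istream& in) {
    document doc;
    doc.text.assign(std::istreambuf_iterator<char>(in),
                    std::istreambuf_iterator<char>());
    return doc;
}

// Stub "serializer": writes the stored text back out unchanged.
inline void serialize_xml(const document& doc, std::ostream& out) {
    out << doc.text;
}

// File-name pair, layered on the stream pair. Neither function needs
// access to anything but document's public interface.
inline document parse_xml_file(const std::string& filename) {
    std::ifstream in(filename.c_str(), std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + filename);
    return parse_xml(in);
}

inline void serialize_xml_file(const document& doc,
                               const std::string& filename) {
    std::ofstream out(filename.c_str(), std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + filename);
    serialize_xml(doc, out);
}

} // namespace sketch
```

With this layout, a pair for another source type or another serialization format is just one more pair of free functions next to these, and none of them has a special status in document's interface.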
Point is, all these are simple extensions to the system, but there is inconsistency if one function is a member when no other can be.

Sorry for not being very constructive. I'll take a look at the reader next time I find some time, and make more general comments the time after that. I hope to get there within a week.

Sebastian Redl