
Sebastian Redl wrote:
The proposed API shares these problems with the DOM: 1) Very verbose. 2) Indirect node construction. I can't create an element by instantiating the class element - it has a protected constructor. Instances are created through some sort of factory, typically by calling methods of document and element that create children and return them. This is not a very natural syntax.
The reason to delegate to a factory is to let it do a lot of resource management that can thus be hidden from the user. There is a lot to consider, as each node lives in a particular context given by the document as well as by its position in it (think of namespaces, for example). It may of course be possible to hide that by providing stack variables that are merely proxies, so the actual instantiation is done lazily, once the (proxy) node is inserted into the document. I haven't thought too hard about that, since to me using a factory is a natural means to allow encapsulation.
The API has some additional disadvantages: 1) No real namespace support. To create an element in a given namespace, I have to register a prefix and then use the prefix in the element name. Worse, to find out the namespace of an element, I have to parse the string for the prefix (it's one find and one substr operation, but still) and then look it up to find the full namespace URI. (Depending on the semantics of element::lookup_namespace, I might have to walk the tree for that.) Given that some documents, especially generated ones, sometimes have multiple bindings for the same namespace, this is overly tedious. Namespace URI and local name of an element should be first-class properties. The prefix:local convention is really just a hack - in the Infoset view of the information, it doesn't even exist. (The prefix does, the combined name doesn't. See 2.2 of the XML Infoset spec.)
OK, I agree. This can be addressed independently from all the rest, however.
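To make the point concrete, here is a minimal sketch of the find/substr/lookup procedure described above, next to a name type that carries namespace URI and local name as first-class properties. All names in it (qname, prefix_map, resolve) are hypothetical and chosen only for illustration; none of them is part of the proposed API. The default namespace is assumed to be stored under the empty prefix.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical type for illustration only: a name as the Infoset sees it.
struct qname
{
  std::string namespace_uri; // empty if the name is in no namespace
  std::string local_name;
};

// Prefix-to-URI bindings in scope for one element (assumed flat here;
// a real tree would require walking up the ancestors).
typedef std::map<std::string, std::string> prefix_map;

// Resolve a "prefix:local" string: one find, one substr, one lookup.
// An unbound prefix is mapped to "no namespace" for brevity; a real
// implementation would have to report an error instead.
qname resolve(std::string const &qualified, prefix_map const &bindings)
{
  assert(!qualified.empty());
  std::string::size_type colon = qualified.find(':');
  std::string prefix;
  std::string local = qualified;
  if (colon != std::string::npos)
  {
    prefix = qualified.substr(0, colon);
    local = qualified.substr(colon + 1);
  }
  prefix_map::const_iterator i = bindings.find(prefix);
  qname result;
  result.namespace_uri = i == bindings.end() ? std::string() : i->second;
  result.local_name = local;
  return result;
}

// Names compare by URI and local name; the prefix plays no role.
bool operator==(qname const &a, qname const &b)
{
  return a.namespace_uri == b.namespace_uri && a.local_name == b.local_name;
}
```

With two prefixes bound to the same URI, resolve("a:item", m) and resolve("b:item", m) compare equal, which is what a user comparing element names would presumably expect.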
2) Not an existing standard. Whatever else you can say for the DOM, it is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++ with Xerces - the DOM is, minor variations in capitalization aside, a constant. By providing essentially the same functionality, but through a slightly different interface, you lose the recognition value of the DOM without gaining much. (You're avoiding the full complexity of the DOM, which is a good thing.)
Sorry, that argument I don't accept. Yes, I deliberately chose not to use the API as obtained from using the CORBA C++ bindings of the OMG IDL DOM. The hope is to get something better, much more naturally tied to modern C++ idioms. Whether or not I achieve that is to be discussed, and can be criticized, but the lack of conformance to existing DOM APIs in itself is hardly an argument worth debating.
3) Not as extensive. I'm not talking about the annoying multiple redundancy of the DOM here, but of low-level functionality such as preserving entity references in the node tree. For some low-level tasks, this is important stuff. Not that the DOM is really extensive: it provides no way, for example, to modify the document schema. (It allows introspection, at least.)
OK, the API represents the Infoset, and thus has no idea of what an entity is. I'm not sure whether that would be worth adding. And if so, it might take the form of a hook into the XML writer (the XML parser already has one). I don't understand what you are aiming at in your comment about the 'document schema'.
The API has one clear advantage over the DOM: the use of iterators. The DOM also has many shortcomings that the API, due to its restricted scope, doesn't have. All in all, though, I think the chosen balance between closeness to the DOM and doing something different and interesting is not good.
There are some things I simply consider mistakes: 1) cdata should derive from text. It's basically a special case that only differs in its serialization from the general form.
That's an implementation detail (IMO). Semantically, a text node and a cdata node are distinct, and so visitors shouldn't give users access to a cdata node as a text node. (And what else would the is-a relationship be good for?)
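As a rough sketch of what keeping the two node types distinct means for a visitor: the class and member names below are hypothetical and only stand in for the real ones. text and cdata are siblings here, so a visitor that handles only text is never handed a cdata node.

```cpp
#include <cassert>
#include <string>

// Hypothetical node classes for illustration; not the proposed API.
class text;
class cdata;

// Visitor with one overload per concrete node type; defaults do nothing.
struct visitor
{
  virtual ~visitor() {}
  virtual void visit(text const &) {}
  virtual void visit(cdata const &) {}
};

class node
{
public:
  virtual ~node() {}
  virtual void accept(visitor &) const = 0;
};

class text : public node
{
public:
  explicit text(std::string const &c) : content(c) {}
  virtual void accept(visitor &v) const { v.visit(*this); }
  std::string content;
};

// cdata is a sibling of text, not derived from it.
class cdata : public node
{
public:
  explicit cdata(std::string const &c) : content(c) {}
  virtual void accept(visitor &v) const { v.visit(*this); }
  std::string content;
};

// A visitor that only cares about ordinary text nodes.
struct text_counter : visitor
{
  text_counter() : count(0) {}
  using visitor::visit; // keep the cdata overload visible
  virtual void visit(text const &) { ++count; }
  int count;
};
```

If cdata derived from text instead, a visitor offering only visit(text const &) could end up receiving cdata nodes through the base-class conversion, which is the situation I want to avoid.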
2) You have a class dtd, but to access it you use document::internal_subset. This dtd class doesn't provide access to the internal subset however - only the document type declaration, after which it is named. (Yes, the document type /definition/ has the same abbreviation. Very unfortunate, that.)
I'm sure this can be refined. (In fact, I don't think DTDs will play any significant role in the future, as other schema languages, such as RELAX NG, become more popular.)
Some more issues: 1) The whole node/node_ptr mess. From reading your earlier posts, I thought that node and friends were value-like classes, that they directly represent the nodes, whereas node_ptr was a special smart pointer that provided the memory management and the shallow copying semantics. Only, upon reading the code, I find that node_ptr contains an instance of its element type, not a pointer to it, which means that node and derived are the smart pointers with shallow copying and memory management. Except that they don't: the pointers are never freed until the entire owner document is destroyed. Or the node is explicitly removed from its parent. Oh, and document is an exception to this convention, because it actually is a value-style class. That's not to say that this isn't a sensible overall strategy. It just is extremely confusing given your naming conventions. As far as I can see, the only thing node_ptr actually does is make access less convenient by requiring indirection - and thus double indirection for node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator written using the Boost.Iterator library?
OK, I understand that I need to rethink how to represent things. To me it is clear, however, what I want: to encapsulate nodes and their management such that the user doesn't have to care about allocation / deallocation, but instead accesses (dereferences) nodes via node_ptr proxies.
2) write_to_file as a member of document. This is asymmetric to parse_file being a free factory function. It's also unnecessary and, in my opinion, not a good idea for various reasons. One is the aforementioned asymmetry. Another is the public interface size, as mentioned in one of the Effective C++ books: write_to_file doesn't actually need access to document's internals, because it just serializes the node tree, right? (That it needs access to the contained libxml pointer is a detail that shouldn't affect the interface. Make it a friend if you have to, but implementations should have the option of not making it one.)
That's a good point. I will make write_to_file a free-standing function.
There's more inconsistency here. The Efficient XML Interchange working group is overdue again with their first working draft, but a binary, compact serialization for XML _is_ in the works. Once they publish a recommendation, there will be two official serializations of the XML Infoset. And there are several unofficial ones already. Each one of these needs a pair of parse/serialize functions. (Not necessarily provided by the library, of course.) With write_to_file being a member, it enjoys a unique status that it doesn't really deserve. It also enjoys a very ambiguous name, as does parse_file. Even leaving that aside, there's also the option of multiple parse/serialize pairs just for a single format. They could take alternative input sources: a boost::path instead of a std::string for identification, for example. Or a std::istream as a data source/std::ostream as a data sink. Or a boost::url, when such a class is written, together with a pluggable communication framework for transparently fetching network URLs. Or whatever. Point is, all these are simple extensions to the system, but there is inconsistency if one function is a member when no other can be.
Right.
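A minimal sketch of how such free-standing serializers might look, assuming a toy document class with a purely public interface: document, write_xml and escape below are hypothetical stand-ins, not the names I would necessarily end up with. The file-name overload merely forwards to the std::ostream one, so further sinks or formats can be added as additional free functions without touching the class.

```cpp
#include <cassert>
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// Toy stand-in for the document class; hypothetical, for illustration only.
class document
{
public:
  document(std::string const &root, std::string const &text)
    : root_(root), text_(text) {}
  std::string const &root_name() const { return root_; }
  std::string const &text() const { return text_; }
private:
  std::string root_;
  std::string text_;
};

// Escape the characters that must not appear literally in character data.
std::string escape(std::string const &s)
{
  std::string out;
  for (std::string::size_type i = 0; i != s.size(); ++i)
    switch (s[i])
    {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default: out += s[i];
    }
  return out;
}

// Textual XML serialization to any output stream, written against the
// public interface only, so it needs neither membership nor friendship.
void write_xml(document const &d, std::ostream &os)
{
  assert(!d.root_name().empty());
  os << '<' << d.root_name() << '>' << escape(d.text())
     << "</" << d.root_name() << '>';
}

// Convenience overload taking a file name; it forwards to the stream
// version and reports whether the write succeeded.
bool write_xml(document const &d, std::string const &filename)
{
  std::ofstream ofs(filename.c_str());
  if (!ofs) return false;
  write_xml(d, ofs);
  return ofs.good();
}
```

A matching parse function taking a std::istream would complete the pair; it is left out here because a parser does not fit in a short sketch.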
Sorry for not being very constructive. I'll take a look at the reader next time I find some time, and make more general comments the time after that. I hope to get there within a week.
Thanks for your comments. I will try to address them, if only by working on documentation that gives a rationale for the various choices I have made. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...