Re: [boost] RFC: Boost.XML API prototype in the sandbox

12 Jul 2007

Stefan Seefeld wrote:
> ...
> I would appreciate if anybody interested into a future boost.xml
> submission would have a look, provide feedback, or even get
> involved into the (ongoing) development.

Hi,

Haven't looked at the reader component yet, so this will be about the
node tree part only.

I'm not at all happy with the node tree. It seems to me like it is
taking the worst parts of the W3C DOM and leaving out the few advantages
it has.

The proposed API shares these problems with the DOM:
1) Very verbose.
2) Indirect node construction. I can't create an element by
instantiating the class element - it has a protected constructor.
Instances are created through some sort of factory, typically by calling
methods of document and element that create children and return them.
This is not a very natural syntax.
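
To illustrate what I mean by direct construction, here is a minimal
sketch. The element class below is hypothetical and is not the sandbox
API; it only assumes a public constructor and children owned by value.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical value-style element (NOT the sandbox class): it has a public
// constructor, so a node can exist before it is attached to any parent.
class element {
public:
    explicit element(std::string name) : name_(std::move(name)) {}

    // Take ownership of an independently constructed child.
    void append(element child) { children_.push_back(std::move(child)); }

    const std::string& name() const { return name_; }
    const std::vector<element>& children() const { return children_; }

private:
    std::string name_;
    std::vector<element> children_;
};

// Builds <book><title/><author/></book> without going through a factory.
inline element make_book() {
    element book("book");
    book.append(element("title"));
    book.append(element("author"));
    return book;
}
```

With a factory-only interface, each of those children would instead have
to be requested from the parent or the document.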

The API has some additional disadvantages:
1) No real namespace support. To create an element in a given namespace,
I have to register a prefix and then use the prefix in the element name.
Worse, to find out the namespace of an element, I have to parse the
string for the prefix (it's one find and one substr operation, but still)
and then look it up to find the full namespace URI. (Depending on the
semantics of element::lookup_namespace, I might have to walk the tree
for that.) Given that some documents, especially generated ones,
sometimes have multiple bindings for the same namespace, this is overly
tedious. Namespace URI and local name of an element should be
first-class properties. The prefix:local convention is really just a
hack - in the Infoset view of the information, it doesn't even exist.
(The prefix does, the combined name doesn't. See 2.2 of the XML Infoset
spec.)
2) Not an existing standard. Whatever else you can say for the DOM, it
is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++
with Xerces - the DOM is, minor variations in capitalization aside, a
constant. By providing essentially the same functionality, but through a
slightly different interface, you lose the recognition value of the DOM
without gaining much. (You're avoiding the full complexity of the DOM,
which is a good thing.)
3) Not as extensive. I'm not talking about the annoying multiple
redundancy of the DOM here, but about low-level functionality such as
preserving entity references in the node tree. For some low-level tasks,
this is important stuff. Not that the DOM is really extensive: it
provides no way, for example, to modify the document schema. (It allows
introspection, at least.)
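
On the namespace point (1 above), this is roughly the work the user has
to do by hand today, next to what I would like to get directly. It is
only a sketch: qname, resolve and the bindings map are names I made up
for illustration, and the map stands in for whatever in-scope lookup the
tree provides.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical first-class name: namespace URI and local name stored directly.
struct qname {
    std::string namespace_uri;
    std::string local_name;
};

inline bool operator==(const qname& a, const qname& b) {
    return a.namespace_uri == b.namespace_uri && a.local_name == b.local_name;
}

// The prefix-based route: split "prefix:local", then resolve the prefix
// through the in-scope bindings. An unprefixed name is looked up under "".
inline qname resolve(const std::string& prefixed,
                     const std::map<std::string, std::string>& bindings) {
    std::string prefix;
    std::string local = prefixed;
    const std::string::size_type colon = prefixed.find(':');
    if (colon != std::string::npos) {
        prefix = prefixed.substr(0, colon);
        local = prefixed.substr(colon + 1);
    }
    qname result;
    const std::map<std::string, std::string>::const_iterator it =
        bindings.find(prefix);
    if (it != bindings.end()) result.namespace_uri = it->second;
    result.local_name = local;
    return result;
}
```

Two elements written as a:item and b:item compare equal as qnames when
both prefixes are bound to the same URI, which is the comparison that
actually matters.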

The API has one clear advantage over the DOM: the use of iterators. The
DOM also has many shortcomings that the API, due to its restricted scope,
doesn't have. All in all, though, I think the chosen balance between
closeness to the DOM and doing something different and interesting is
not good.

There are some things I simply consider mistakes:
1) cdata should derive from text. It's basically a special case that
only differs in its serialization from the general form.
2) You have a class dtd, but to access it you use
document::internal_subset. This dtd class doesn't provide access to the
internal subset however - only the document type declaration, after
which it is named. (Yes, the document type /definition/ has the same
abbreviation. Very unfortunate, that.)
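
For the cdata point (1 above), the relationship I have in mind looks
roughly like this. The classes are hypothetical stand-ins, not the
sandbox ones, and the escaping is deliberately minimal.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical text node (NOT the sandbox class).
class text {
public:
    explicit text(std::string content) : content_(std::move(content)) {}
    virtual ~text() = default;

    const std::string& content() const { return content_; }

    // Serialize as ordinary character data, escaping markup characters.
    virtual std::string serialize() const {
        std::string out;
        for (char c : content_) {
            switch (c) {
                case '&': out += "&amp;"; break;
                case '<': out += "&lt;"; break;
                case '>': out += "&gt;"; break;
                default:  out += c;
            }
        }
        return out;
    }

private:
    std::string content_;
};

// A CDATA section is a text node that only serializes differently.
// Content containing "]]>" would need to be split; that is omitted here.
class cdata : public text {
public:
    using text::text;

    std::string serialize() const override {
        return "<![CDATA[" + content() + "]]>";
    }
};
```

Code that only cares about the character content can then take a text
and never needs to know whether it came from a CDATA section.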

Some more issues:
1) The whole node/node_ptr mess. From reading your earlier posts, I
thought that node and friends were value-like classes, that they
directly represent the nodes, whereas node_ptr was a special smart
pointer that provided the memory management and the shallow copying
semantics. Only, upon reading the code, I find that node_ptr contains an
instance of its element type, not a pointer to it, which means that node
and its derived classes are the smart pointers with shallow copying and
memory management. Except that they don't manage memory: the pointers are
never freed until the entire owner document is destroyed or the node is
explicitly removed from its parent. Oh, and document is an exception to this
convention, because it actually is a value-style class.
That's not to say that this isn't a sensible overall strategy. It just
is extremely confusing given your naming conventions. As far as I can
see, the only thing node_ptr actually does is make access less
convenient by requiring indirection - and thus double indirection for
node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator
written using the Boost.Iterator library?
2) write_to_file as a member of document. This is asymmetric to
parse_file being a free factory function. It's also unnecessary and, in
my opinion, not a good idea for various reasons. One is the
aforementioned asymmetry. Another is the public interface size, as
mentioned in one of the Effective C++ books: write_to_file doesn't
actually need access to document's internals, because it just serializes
the node tree, right? (That it needs access to the contained libxml
pointer is a detail that shouldn't affect the interface. Make it a
friend if you have to, but implementations should have the option of not
making it one.)
There's more inconsistency here. The Efficient XML Interchange working
group is overdue again with their first working draft, but a binary,
compact serialization for XML _is_ in the works. Once they publish a
recommendation, there will be two official serializations of the XML
Infoset. And there are several unofficial ones already. Each one of
these needs a pair of parse/serialize functions. (Not necessarily
provided by the library, of course.) With write_to_file being a member,
it enjoys a unique status that it doesn't really deserve. It also enjoys
a very ambiguous name, as does parse_file.
Even leaving that aside, there's also the option of multiple
parse/serialize pairs just for a single format. They could take
alternative input sources: a boost::filesystem::path instead of a
std::string for identification, for example. Or a std::istream as a data
source/std::ostream as a data sink. Or a boost::url, when such a class
is written, together with a pluggable communication framework for
transparently fetching network URLs. Or whatever.
Point is, all these are simple extensions to the system, but there is
inconsistency if one function is a member when no other can be.
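
To show what I mean by symmetric pairs (issue 2 above), here is a rough
sketch. Everything in it is hypothetical: the document struct is a toy
stand-in, the function names are made up, and the "parser" only accepts a
single empty element.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Toy stand-in for a document; a real one would hold a node tree.
struct document {
    std::string root_name;
};

// Stream-based pair: the primitives the other overloads build on.
inline void write_xml(const document& doc, std::ostream& out) {
    out << '<' << doc.root_name << "/>";
}

inline document parse_xml(std::istream& in) {
    const std::string s((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
    // Toy parser: accepts only "<name/>".
    if (s.size() < 4 || s[0] != '<' || s.compare(s.size() - 2, 2, "/>") != 0)
        throw std::runtime_error("unsupported input");
    document doc;
    doc.root_name = s.substr(1, s.size() - 3);
    return doc;
}

// File-name pair, added as plain free functions on top of the stream pair.
inline void write_xml_file(const document& doc, const std::string& filename) {
    std::ofstream out(filename.c_str());
    if (!out) throw std::runtime_error("cannot open file for writing");
    write_xml(doc, out);
}

inline document parse_xml_file(const std::string& filename) {
    std::ifstream in(filename.c_str());
    if (!in) throw std::runtime_error("cannot open file for reading");
    return parse_xml(in);
}
```

A path-based or URL-based pair, or a pair for a binary serialization,
would slot in the same way without touching document at all.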
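
Going back to the node/node_ptr point (issue 1 above): if node is already
a cheap handle, the iterator can hand it out directly and still support
i->foo. A minimal sketch, using only the standard library and entirely
hypothetical classes (Boost.Iterator's iterator_facade would generate
most of this boilerplate):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handle: cheap to copy, refers to storage owned elsewhere.
class node {
public:
    explicit node(const std::string* name) : name_(name) {}
    const std::string& name() const { return *name_; }

private:
    const std::string* name_;
};

// Iterator that returns node handles by value. operator-> returns a small
// proxy holding the handle, so i->name() works without (*i)->name().
class node_iterator {
public:
    struct arrow_proxy {
        node value;
        const node* operator->() const { return &value; }
    };

    node_iterator(const std::vector<std::string>* names, std::size_t pos)
        : names_(names), pos_(pos) {}

    node operator*() const { return node(&(*names_)[pos_]); }
    arrow_proxy operator->() const { return arrow_proxy{**this}; }
    node_iterator& operator++() { ++pos_; return *this; }

    bool operator==(const node_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const node_iterator& o) const { return pos_ != o.pos_; }

private:
    const std::vector<std::string>* names_;
    std::size_t pos_;
};
```

The vector of names is just a stand-in for the underlying tree storage.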

Sorry for not being very constructive. I'll take a look at the reader
next time I find some time, and make more general comments the time
after that. I hope to get there within a week.

Sebastian Redl