Re: [boost] RFC: Boost.XML API prototype in the sandbox

14 Jul 2007

      Sebastian Redl wrote:
...
The proposed API shares these problems with the DOM:
1) Very verbose.
2) Indirect node construction. I can't create an element by
instantiating the class element - it has a protected constructor.
Instances are created through some sort of factory, typically by calling
methods of document and element that create children and return them.
This is not a very natural syntax.
The reason to delegate to a factory is to let it do a lot of resource
management that thus can be hidden from the user. There is a lot to be
considered, as each node lives in a particular context given by the document
as well as its position in it (think of namespaces, for example).

It may of course be possible to hide that by providing stack variables
that are merely proxies, so the actual instantiation will be done lazily,
once the (proxy) node is inserted into the document. I haven't thought too
hard about that, since to me using a factory is a natural means to allow
encapsulation.
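
To make the factory argument concrete, here is a minimal sketch of
factory-based construction. All names (node_data, create_root,
append_element, node_count) are hypothetical and are not the sandbox API;
the only point is that the document owns every node, so a handle can
never outlive or escape its context.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not the sandbox API.
class document;

// Internal storage for one node, owned by the document.
struct node_data {
    std::string name;
    std::vector<node_data*> children;
};

// Lightweight handle; users cannot construct one directly.
class element {
public:
    const std::string& name() const { return data_->name; }
    std::size_t child_count() const { return data_->children.size(); }

    // Factory method: the new child is allocated by the owning document.
    element append_element(const std::string& name);

private:
    friend class document;
    element(document* doc, node_data* data) : doc_(doc), data_(data) {}

    document* doc_;
    node_data* data_;
};

class document {
public:
    // Factory method for the root element.
    element create_root(const std::string& name) {
        root_ = allocate(name);
        return element(this, root_);
    }
    std::size_t node_count() const { return storage_.size(); }

private:
    friend class element;

    // All nodes live here and are released when the document is destroyed.
    node_data* allocate(const std::string& name) {
        storage_.push_back(std::make_unique<node_data>());
        storage_.back()->name = name;
        return storage_.back().get();
    }

    std::vector<std::unique_ptr<node_data>> storage_;
    node_data* root_ = nullptr;
};

inline element element::append_element(const std::string& name) {
    node_data* child = doc_->allocate(name);
    data_->children.push_back(child);
    return element(doc_, child);
}
```

With this shape, user code reads doc.create_root("config") followed by
root.append_element("entry"); no allocation or deallocation is visible.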
...
The API has some additional disadvantages:
1) No real namespace support. To create an element in a given namespace,
I have to register a prefix and then use the prefix in the element name.
Worse, to find out the namespace of an element, I have to parse the
string for the prefix (it's one find and one substr operation, but still)
and then look it up to find the full namespace URI. (Depending on the
semantics of element::lookup_namespace, I might have to walk the tree
for that.) Given that some documents, especially generated ones,
sometimes have multiple bindings for the same namespace, this is overly
tedious. Namespace URI and local name of an element should be
first-class properties. The prefix:local convention is really just a
hack - in the Infoset view of the information, it doesn't even exist.
(The prefix does, the combined name doesn't. See 2.2 of the XML Infoset
spec.)
OK, I agree. This can be addressed independently from all the rest, however.
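
As an illustration of the difference, the sketch below resolves a
"prefix:local" string against a set of in-scope bindings and produces a
name whose namespace URI and local name are first-class members. The
types qname and prefix_map and the function resolve are assumptions made
for this example, not part of the proposed API; the empty prefix stands
for the default namespace.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical type: namespace URI and local name as first-class properties.
struct qname {
    std::string namespace_uri;
    std::string local_name;
};

inline bool operator==(const qname& a, const qname& b) {
    return a.namespace_uri == b.namespace_uri && a.local_name == b.local_name;
}

// Prefix -> namespace URI bindings in scope for some element (assumed given).
using prefix_map = std::map<std::string, std::string>;

// Split "prefix:local" at the first colon, then look the prefix up.
inline qname resolve(const std::string& name, const prefix_map& in_scope) {
    const std::string::size_type colon = name.find(':');
    const bool has_prefix = (colon != std::string::npos);
    const std::string prefix = has_prefix ? name.substr(0, colon) : std::string();
    const std::string local = has_prefix ? name.substr(colon + 1) : name;

    const auto it = in_scope.find(prefix);
    if (it == in_scope.end()) {
        // An unprefixed name with no default namespace is in no namespace.
        if (!has_prefix) return qname{std::string(), local};
        throw std::runtime_error("unbound namespace prefix: " + prefix);
    }
    return qname{it->second, local};
}
```

Two names written with different prefixes compare equal here as long as
both prefixes are bound to the same URI, which is the case the
prefix-based interface makes tedious.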
...
2) Not an existing standard. Whatever else you can say for the DOM, it
is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++
with Xerces - the DOM is, minor variations in capitalization aside, a
constant. By providing essentially the same functionality, but through a
slightly different interface, you lose the recognition value of the DOM
without gaining much. (You're avoiding the full complexity of the DOM,
which is a good thing.)
Sorry, that argument I don't accept. Yes, I deliberately chose not to
use the API as obtained from using the CORBA C++ bindings of the OMG IDL DOM.
The hope is to get something better, much more naturally tied to modern
C++ idioms. Whether or not I achieve that is to be discussed, and can be
criticized, but the lack of conformance to existing DOM APIs in itself
is hardly an argument worth debating.
...
3) Not as extensive. I'm not talking about the annoying multiple
redundancy of the DOM here, but of low-level functionality such as
preserving entity references in the node tree. For some low-level tasks,
this is important stuff. Not that the DOM is really extensive: it
provides no way, for example, to modify the document schema. (It allows
introspection, at least.)
OK, the API represents the Infoset, and thus has no idea of what an entity
is. I'm not sure whether that would be worth adding. And if so, it might be
a hook into the XML writer (the XML parser already has one).

I don't understand what you are aiming at in your comment about the
'document schema'.
...
The API has one clear advantage over the DOM: the use of iterators. The
DOM also has many shortcomings that the API, due to its restricted scope,
doesn't have. All in all, though, I think the chosen balance between
closeness to the DOM and doing something different and interesting is
not good.
There are some things I simply consider mistakes:
1) cdata should derive from text. It's basically a special case that
only differs in its serialization from the general form.
That's an implementation detail (IMO). Semantically, a text node and
a cdata node are distinct, and so visitors shouldn't give users access
to a cdata node as a text node. (And what else would the ISA relationship
be good for ?)
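
A small sketch of that position, using hypothetical classes rather than
the sandbox ones: text and cdata are siblings under node, and a visitor
receives each through its own overload, so a cdata node is never handed
to code that expects a text node. Escaping of text content is omitted to
keep the example short.

```cpp
#include <string>
#include <utility>

// Hypothetical node classes; cdata deliberately does not derive from text.
class text;
class cdata;

struct visitor {
    virtual ~visitor() = default;
    virtual void visit(const text&) = 0;
    virtual void visit(const cdata&) = 0;
};

struct node {
    virtual ~node() = default;
    virtual void accept(visitor& v) const = 0;
};

class text : public node {
public:
    explicit text(std::string content) : content_(std::move(content)) {}
    const std::string& content() const { return content_; }
    void accept(visitor& v) const override { v.visit(*this); }
private:
    std::string content_;
};

class cdata : public node {
public:
    explicit cdata(std::string content) : content_(std::move(content)) {}
    const std::string& content() const { return content_; }
    void accept(visitor& v) const override { v.visit(*this); }
private:
    std::string content_;
};

// Example visitor: the two node kinds differ only in how they are written.
struct serializer : visitor {
    std::string out;
    void visit(const text& t) override { out += t.content(); }
    void visit(const cdata& c) override {
        out += "<![CDATA[" + c.content() + "]]>";
    }
};
```

If cdata derived from text instead, a visitor that only overrides the
text overload could silently receive cdata nodes as well; whether that is
desirable is exactly the point under discussion.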
...
2) You have a class dtd, but to access it you use
document::internal_subset. This dtd class doesn't provide access to the
internal subset however - only the document type declaration, after
which it is named. (Yes, the document type /definition/ has the same
abbreviation. Very unfortunate, that.)
I'm sure this can be refined. (In fact, I don't think DTDs will play any
significant role in the future, as other document type definition
languages, such as RELAX NG, become more popular.)
...
Some more issues:
1) The whole node/node_ptr mess. From reading your earlier posts, I
thought that node and friends were value-like classes, that they
directly represent the nodes, whereas node_ptr was a special smart
pointer that provided the memory management and the shallow copying
semantics. Only, upon reading the code, I find that node_ptr contains an
instance of its element type, not a pointer to it, which means that node
and derived are the smart pointers with shallow copying and memory
management. Except that they don't: the pointers are never freed until
the entire owner document is destroyed. Or the node is explicitly
removed from its parent. Oh, and document is an exception to this
convention, because it actually is a value-style class.
That's not to say that this isn't a sensible overall strategy. It just
is extremely confusing given your naming conventions. As far as I can
see, the only thing node_ptr actually does is make access less
convenient by requiring indirection - and thus double indirection for
node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator
written using the Boost.Iterator library?
OK, I understand that I need to rethink how to represent things. To me
it is clear, however, what I want: encapsulate nodes and their management
such that the user doesn't have to care for allocation / deallocation,
but instead accesses (dereferences) nodes via node_ptr proxies.
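
One possible way to keep that encapsulation without the (*i)->foo double
indirection is sketched below. It is only an assumption about how the
design could go, with hypothetical names: node is a cheap, non-owning
handle, and the iterator yields handles by value, using a small proxy so
that i->name() works directly.

```cpp
#include <string>
#include <vector>

// Hypothetical internal record, owned elsewhere (e.g. by the document).
struct node_data {
    std::string name;
};

// Value-like handle: cheap to copy, does not own the record.
class node {
public:
    explicit node(const node_data* d) : data_(d) {}
    const std::string& name() const { return data_->name; }
private:
    const node_data* data_;
};

// Iterator over a child list that yields handles by value.
class node_iterator {
public:
    using base = std::vector<node_data*>::const_iterator;
    explicit node_iterator(base pos) : pos_(pos) {}

    node operator*() const { return node(*pos_); }

    // operator-> returns a proxy holding a temporary handle; the compiler
    // chains into the proxy's own operator->, so i->name() needs no (*i).
    struct arrow_proxy {
        node value;
        const node* operator->() const { return &value; }
    };
    arrow_proxy operator->() const { return arrow_proxy{node(*pos_)}; }

    node_iterator& operator++() { ++pos_; return *this; }
    bool operator==(const node_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const node_iterator& o) const { return pos_ != o.pos_; }

private:
    base pos_;
};
```

The user still never allocates or frees a node; the handle simply plays
the role that node_ptr plays in the prototype.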
...
2) write_to_file as a member of document. This is asymmetric to
parse_file being a free factory function. It's also unnecessary and, in
my opinion, not a good idea for various reasons. One is the
aforementioned asymmetry. Another is the public interface size, as
mentioned in one of the Effective C++ books: write_to_file doesn't
actually need access to document's internals, because it just serializes
the node tree, right? (That it needs access to the contained libxml
pointer is a detail that shouldn't affect the interface. Make it a
friend if you have to, but implementations should have the option of not
making it one.)
That's a good point. I will make write_to_file a free-standing function.
...
There's more inconsistency here. The Efficient XML Interchange working
group is overdue again with their first working draft, but a binary,
compact serialization for XML _is_ in the works. Once they publish a
recommendation, there will be two official serializations of the XML
Infoset. And there are several unofficial ones already. Each one of
these needs a pair of parse/serialize functions. (Not necessarily
provided by the library, of course.) With write_to_file being a member,
it enjoys a unique status that it doesn't really deserve. It also enjoys
a very ambiguous name, as does parse_file.
Even leaving that aside, there's also the option of multiple
parse/serialize pairs just for a single format. They could take
alternative input sources: a boost::filesystem::path instead of a std::string for
identification, for example. Or a std::istream as a data
source/std::ostream as a data sink. Or a boost::url, when such a class
is written, together with a pluggable communication framework for
transparently fetching network URLs. Or whatever.
Point is, all these are simple extensions to the system, but there is
inconsistency if one function is a member when no other can be.
Right.
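
A sketch of what such a symmetric set of free functions could look like
follows. The document struct is a toy stand-in holding only a root
element name, and write_xml, parse_xml, write_xml_file and parse_xml_file
are hypothetical names chosen for this example; the file-name overloads
are written purely in terms of the stream overloads, so further sources
and sinks can be added the same way without touching the document class.

```cpp
#include <fstream>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Toy stand-in for the real document class: just a root element name.
struct document {
    std::string root_name;
};

// Serialize to any output stream (textual XML, heavily simplified).
inline void write_xml(const document& doc, std::ostream& out) {
    out << '<' << doc.root_name << "/>";
}

// Parse from any input stream; this toy parser accepts only "<name/>".
inline document parse_xml(std::istream& in) {
    const std::string s((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
    if (s.size() < 4 || s.front() != '<' ||
        s.compare(s.size() - 2, 2, "/>") != 0) {
        throw std::runtime_error("toy parser: expected <name/>");
    }
    return document{s.substr(1, s.size() - 3)};
}

// File-name overloads, implemented via the stream overloads above.
inline void write_xml_file(const document& doc, const std::string& filename) {
    std::ofstream out(filename);
    if (!out) throw std::runtime_error("cannot open " + filename);
    write_xml(doc, out);
}

inline document parse_xml_file(const std::string& filename) {
    std::ifstream in(filename);
    if (!in) throw std::runtime_error("cannot open " + filename);
    return parse_xml(in);
}
```

A second serialization format would add its own parse/write pair next to
these, and none of the functions needs to be a member or a friend.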
...
Sorry for not being very constructive. I'll take a look at the reader
next time I find some time, and make more general comments the time
after that. I hope to get there within a week.
Thanks for your comments. I will try to address them, if only by working
on documentation that gives a rationale for the various choices I have made.

Regards,
		Stefan

-- 

      ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld