Re: [boost] [xml] XML Reader Interface

29 Oct 2006

Sebastian Redl wrote:
...
> The interface in question is the reader interface, also known as pull
> interface. Like SAX, the pull interface is an event-based interface.
This confused me.  I've always heard event-driven or callback-based 
interfaces described as "push", since the user's code gets invoked by an 
external event source.  Do I correctly understand that you're talking 
about a SAX-like interface (in that it processes the document in-order, 
and limits visibility to one node at a time) that's "pull" (i.e. user 
code calls the parser) instead of "push" (i.e. parser calls 
user-provided methods)?
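
A rough sketch of the distinction as described above, in case it helps pin down terminology. Every name here (Event, Handler, pushParse, PullReader, ...) is made up for illustration, and a pre-tokenized vector stands in for real parsing; this is not the proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event model: a pre-tokenized document stands in for real parsing.
enum class EventKind { StartElement, Characters, EndElement };

struct Event {
    EventKind kind;
    std::string text;  // tag name or character data
};

// Push (SAX-like): the parser owns the loop and calls user-provided methods.
struct Handler {
    virtual ~Handler() = default;
    virtual void onEvent(const Event& e) = 0;
};

inline void pushParse(const std::vector<Event>& doc, Handler& handler) {
    for (const Event& e : doc) handler.onEvent(e);
}

// Pull (reader): user code owns the loop and asks the parser for each event.
class PullReader {
public:
    explicit PullReader(const std::vector<Event>& doc) : doc_(doc) {}

    // Advances to the next event; returns false at end of document.
    bool next() {
        if (index_ >= doc_.size()) return false;
        current_ = &doc_[index_++];
        return true;
    }

    // Only one event is visible at a time, as with SAX.
    const Event& current() const {
        assert(current_ != nullptr && "call next() first");
        return *current_;
    }

private:
    const std::vector<Event>& doc_;
    std::size_t index_ = 0;
    const Event* current_ = nullptr;
};

// The same task (counting start tags) written both ways.
struct CountingHandler : Handler {
    std::size_t elements = 0;
    void onEvent(const Event& e) override {
        if (e.kind == EventKind::StartElement) ++elements;
    }
};

inline std::size_t countElementsPull(const std::vector<Event>& doc) {
    PullReader reader(doc);
    std::size_t elements = 0;
    while (reader.next()) {
        if (reader.current().kind == EventKind::StartElement) ++elements;
    }
    return elements;
}
```

In both cases the document is processed in order with one node visible at a time; the only difference is which side drives the loop.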
...
> There are two types of reader interfaces currently in use that I've
> found.
You mean two types of "pull" interfaces, right?
...
> 1) The Monolithic Interface
> All methods are
> always available on the object; calling one that is not appropriate for
> the current event (e.g. getTagName() for a Characters event) returns a
> null value or signals an error.
> Contra: You cannot store raw events: calling next() overwrites the
> current data. The parser contains a lot of state. This interface does
> not protect you in any way from calling inappropriate methods.
I don't see any fundamental reason why such an interface can't support 
inheritance.

If I were using an interface that let me iterate over the document, I'd 
at least want to be able to decide whether to get the next sibling node 
or the next child node.  A pull interface, like this, could not only 
support copying the current node, but could even copy entire subtrees 
(e.g. copyUntilNextSibling()) - though this operation would require 
dynamic memory allocation.
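
A minimal sketch of what "next child vs. next sibling" could look like on top of a flat event stream, assuming the reader tracks element depth. The names (Ev, DepthReader, nextSibling) are hypothetical, and the optional copyTo argument plays the role of the copyUntilNextSibling() idea mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat event; the text field holds a tag name or character data.
struct Ev {
    enum Kind { Start, End, Text } kind;
    std::string text;
};

class DepthReader {
public:
    explicit DepthReader(const std::vector<Ev>& doc) : doc_(doc) {}

    // Document-order step: from a Start event this descends into the children.
    bool next() {
        std::size_t n = started_ ? index_ + 1 : 0;
        if (n >= doc_.size()) return false;
        started_ = true;
        index_ = n;
        return true;
    }

    const Ev& current() const {
        assert(started_ && "call next() first");
        return doc_[index_];
    }

    // Sibling step: if positioned on a Start event, consume the whole subtree
    // (optionally copying it, which needs dynamic allocation), then advance.
    // The caller may land on the parent's End event if no sibling remains.
    bool nextSibling(std::vector<Ev>* copyTo = nullptr) {
        assert(started_ && "call next() first");
        if (current().kind == Ev::Start) {
            if (copyTo) copyTo->push_back(current());
            int depth = 1;
            while (depth > 0) {
                if (!next()) return false;  // unbalanced input
                if (copyTo) copyTo->push_back(current());
                if (current().kind == Ev::Start) ++depth;
                else if (current().kind == Ev::End) --depth;
            }
        }
        return next();
    }

private:
    const std::vector<Ev>& doc_;
    std::size_t index_ = 0;
    bool started_ = false;
};
```

The copy is only paid for when the caller asks for it; plain skipping touches no heap memory in this sketch.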
...
> 2) The Inheritance Interface
> Contra: Event objects need to be allocated on the heap.
Why does an inheritance-based parser need to store objects on the heap?  
If the memory is owned by the parser, it can pre-allocate a temporary 
object for each type of node.  Based on the type of node, it fills the 
temporary object of the appropriate type, and returns a const ref to 
that object.

If the caller wants a copy, the copy would only have to be 
heap-allocated if copied via some virtual function in the node 
base-class.  For the concrete classes (obtained via dynamic-casting), 
copy constructors and assignment operators would work just fine.  
Another option (as you point out) is returning a shared_ptr, though this 
would slightly complicate the parser's management of its temporary objects.
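
Roughly, the pre-allocation idea could look like the sketch below. The class names and the readElement()/readText() entry points are invented for illustration (they stand in for real tokenizing), and the std::string members may of course still allocate internally; the point is only that the node objects themselves live inside the parser.

```cpp
#include <cassert>
#include <string>

// Hypothetical node hierarchy for an inheritance-based reader.
struct Node {
    virtual ~Node() = default;
};
struct ElementNode : Node {
    std::string name;
};
struct TextNode : Node {
    std::string text;
};

class PreallocReader {
public:
    // Each call refills the parser-owned temporary of the matching type and
    // returns a const reference to it; no node object is created on the heap.
    const Node& readElement(const std::string& name) {
        element_.name = name;
        return element_;
    }
    const Node& readText(const std::string& text) {
        text_.text = text;
        return text_;
    }

private:
    ElementNode element_;  // one pre-allocated temporary per node type
    TextNode text_;
};

// Once the concrete type is known, an ordinary copy constructor is enough;
// only a virtual clone() on the base class would force a heap allocation.
inline const ElementNode& asElement(const Node& n) {
    const ElementNode* e = dynamic_cast<const ElementNode*>(&n);
    assert(e != nullptr && "caller must check the node type first");
    return *e;
}
```

A caller that wants to keep a node writes `ElementNode copy = asElement(node);` before the next read, since the reference is overwritten by the following event of the same type.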
...
> It does not
> return a reference to the event object, though, but instead a
> boost::variant of all possible events.
That could be big, depending on how much text you buffer.  Not only 
would it waste memory, but memcpy'ing around all of that could waste 
some of the performance savings gained by avoiding heap allocation.  
Maybe RVO eliminates some or all of the performance penalty, but it's 
probably unwise to depend so much on RVO.

Of course, passing in the result might be the solution - at least to the 
performance issues.  A parser that allows the user to pass in the result 
would also facilitate copying subtrees, if your node type has addChild() 
and addSibling() methods.
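
A small sketch of the pass-in-the-result variation, under the assumption that the caller owns one event object and the reader refills it. PulledEvent is a plain struct standing in for the boost::variant under discussion, and OutParamReader with its pre-split chunks is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the variant of all possible events.
struct PulledEvent {
    enum Kind { Start, Text, End } kind = Start;
    std::string text;
};

class OutParamReader {
public:
    explicit OutParamReader(std::vector<std::string> chunks)
        : chunks_(std::move(chunks)) {}

    // Fills the caller's object instead of returning one by value.
    // Returns false at end of input and leaves 'out' untouched.
    bool next(PulledEvent& out) {
        if (index_ >= chunks_.size()) return false;
        out.kind = PulledEvent::Text;
        // assign() can reuse out.text's existing buffer when it is large enough.
        out.text.assign(chunks_[index_++]);
        return true;
    }

private:
    std::vector<std::string> chunks_;
    std::size_t index_ = 0;
};

// Usage: one event object is reused for the whole document.
inline std::size_t totalTextLength(OutParamReader& reader) {
    PulledEvent ev;  // created once, refilled on every call
    std::size_t total = 0;
    while (reader.next(ev)) {
        assert(ev.kind == PulledEvent::Text);  // this toy reader only yields text
        total += ev.text.size();
    }
    return total;
}
```

This does not depend on RVO at all, at the cost of a slightly less convenient call site.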
...
> Independently of the type of interface chosen, another issue is
> important: the scope of the interface. Should it report all XML events,
> including those coming from DTD parsing?
Why re-invent more than necessary?  Use DOM and/or pick some other 
existing object model (unless you have specific issues which they don't 
address).
...
> Should errors be reported as error events, or as
> exceptions? Should this, too, be a user choice?
I think the biggest reason to avoid exceptions would be the 
performance impact.  I don't know whether the difference would be 
significant in the case of XML parsing.  However, to get the full 
performance benefit, I think you'd need to use empty exception 
specifications (i.e. throw()), in which case the choice would have to 
be made at compile-time (at the latest).

Perhaps there are some other benefits to using an iostreams-style 
error-handling model, where the parser is treated like a stream.
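
One way the stream-like model might look, as a sketch only: the reader carries a sticky state, converts to bool like a std::istream, and the caller inspects eof()/fail() after the loop. StreamLikeReader and its toy rule (an empty token counts as malformed input) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class StreamLikeReader {
public:
    explicit StreamLikeReader(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {}

    // Advances like operator>>: on end of input or error the state becomes
    // sticky and every later call is a no-op. Nothing is thrown.
    StreamLikeReader& next() {
        if (state_ != Good) return *this;
        if (index_ >= tokens_.size()) { state_ = Eof; return *this; }
        const std::string& token = tokens_[index_++];
        if (token.empty()) { state_ = Fail; return *this; }  // toy "malformed" rule
        current_ = token;
        return *this;
    }

    explicit operator bool() const { return state_ == Good; }
    bool eof() const { return state_ == Eof; }
    bool fail() const { return state_ == Fail; }

    // Last successfully read token.
    const std::string& current() const { return current_; }

private:
    enum State { Good, Eof, Fail };
    std::vector<std::string> tokens_;
    std::size_t index_ = 0;
    std::string current_;
    State state_ = Good;
};

// Usage mirrors `while (stream >> value)`: loop first, ask why it stopped after.
inline std::size_t countTokens(StreamLikeReader& reader) {
    std::size_t count = 0;
    while (reader.next()) {
        assert(!reader.current().empty());  // guaranteed by the toy rule above
        ++count;
    }
    return count;
}
```

After the loop, `reader.fail()` distinguishes a malformed document from a clean end of input, so the error check happens once rather than around every call.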
...
> How about warnings:
> exceptions are inappropriate for them. Should it be possible to disable
> them completely?
What's a warning?  A document is either well-formed, or it's not.  The 
only possible distinction that comes to mind is perhaps treating bad 
syntax as errors and validation failures as warnings.  However, you 
could basically get the same effect by providing a switch to disable 
validation.  That way, rather than just ignore warnings, users who don't 
care about validation failures could disable validation and maybe also 
save some runtime overhead.

Matt

Matt Gruenke