
Sebastian Redl wrote:
The interface in question is the reader interface, also known as pull interface. Like SAX, the pull interface is an event-based interface.
This confused me. I've always heard event-driven or callback-based interfaces described as "push", since the user's code gets invoked by an external event source. Do I understand correctly that you're talking about a SAX-like interface (in that it processes the document in order and limits visibility to one node at a time) that's "pull" (i.e. user code calls the parser) instead of "push" (i.e. the parser calls user-provided methods)?
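To make sure we mean the same thing by the two terms, here is a minimal sketch of the distinction as I understand it. Every name in it (Event, Handler, pushParse, PullReader) is made up for illustration, and a pre-built vector of events stands in for an actual parser.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event type shared by both styles.
struct Event {
    enum Kind { StartTag, Characters, EndTag } kind;
    std::string text;  // tag name or character data
};

// Push style (SAX-like): the parser drives and calls user-provided methods.
struct Handler {
    virtual ~Handler() = default;
    virtual void onEvent(const Event& e) = 0;
};

inline void pushParse(const std::vector<Event>& doc, Handler& h) {
    for (const Event& e : doc) h.onEvent(e);  // parser calls user code
}

// Pull style: user code drives and asks the parser for the next event.
class PullReader {
public:
    explicit PullReader(const std::vector<Event>& doc) : doc_(doc) {}

    // Advances to the next event; returns false at the end of the document.
    bool next() {
        if (pos_ >= doc_.size()) return false;
        current_ = &doc_[pos_++];
        return true;
    }
    const Event& current() const { return *current_; }

private:
    const std::vector<Event>& doc_;
    std::size_t pos_ = 0;
    const Event* current_ = nullptr;
};

// Push-style user code: a handler that counts start tags.
struct Counter : Handler {
    int tags = 0;
    void onEvent(const Event& e) override {
        if (e.kind == Event::StartTag) ++tags;
    }
};

// Pull-style user code: the same count, but the loop belongs to the caller.
inline int countStartTagsPull(const std::vector<Event>& doc) {
    PullReader r(doc);
    int tags = 0;
    while (r.next())
        if (r.current().kind == Event::StartTag) ++tags;
    return tags;
}
```

In both cases the events arrive in document order, one at a time; the only difference is who owns the loop.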
There are two types of reader interfaces currently in use that I've found.
You mean two types of "pull" interfaces, right?
1) The Monolithic Interface
All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.
Contra: You cannot store raw events: calling next() overwrites the current data. The parser contains a lot of state. This interface does not protect you in any way from calling inappropriate methods.
I don't see any fundamental reason why such an interface can't support inheritance. If I were using an interface that let me iterate over the document, I'd at least want to be able to decide whether to get the next sibling node or the next child node. A pull interface like this could not only support copying the current node, but could even copy entire subtrees (e.g. copyUntilNextSibling()) - though that operation would require dynamic memory allocation.
2) The Inheritance Interface
Contra: Event objects need to be allocated on the heap.
Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object. If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine. Another option (as you point out) is returning a shared_ptr, though this would slightly complicate the parser's management of its temporary objects.
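A rough sketch of what I mean by parser-owned temporaries follows. All class names here (Node, Element, Text, Reader) are hypothetical, and a pre-tokenized list stands in for real parsing. The node objects themselves are never allocated per event, although the std::string members may of course still allocate internally.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node hierarchy.
struct Node {
    virtual ~Node() = default;
};
struct Element : Node { std::string name; };
struct Text : Node { std::string data; };

class Reader {
public:
    // Each token is (isElement, value); this stands in for actual parsing.
    explicit Reader(std::vector<std::pair<bool, std::string>> tokens)
        : tokens_(std::move(tokens)) {}

    bool hasNext() const { return pos_ < tokens_.size(); }

    // Fills the parser-owned temporary of the matching type and returns a
    // reference to it. The reference is only valid until the next call.
    const Node& next() {
        const auto& t = tokens_[pos_++];
        if (t.first) {
            element_.name = t.second;
            return element_;
        }
        text_.data = t.second;
        return text_;
    }

private:
    std::vector<std::pair<bool, std::string>> tokens_;
    std::size_t pos_ = 0;
    Element element_;  // one pre-allocated temporary per node type
    Text text_;
};

// Caller side: dynamic_cast to the concrete class, then copy by value.
inline std::string firstElementName(Reader& r) {
    while (r.hasNext()) {
        const Node& n = r.next();
        if (const Element* e = dynamic_cast<const Element*>(&n)) {
            Element copy = *e;  // ordinary copy constructor, caller-owned
            return copy.name;
        }
    }
    return std::string();
}
```

The cost is the usual one for this pattern: the caller has to copy anything it wants to keep past the next call to next().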
It does not return a reference to the event object, though, but instead a boost::variant of all possible events.
That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO. Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.
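Here is a sketch of the "pass in the result" idea. I'm using C++17 std::variant as a stand-in for boost::variant, and the event structs and the Reader class are invented for the example, so treat it as an illustration of the calling convention rather than of your actual design.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical event types; std::variant stands in for boost::variant.
struct StartTag   { std::string name; };
struct Characters { std::string text; };
struct EndTag     { std::string name; };
using Event = std::variant<StartTag, Characters, EndTag>;

class Reader {
public:
    // A pre-built event list stands in for actual parsing.
    explicit Reader(std::vector<Event> doc) : doc_(std::move(doc)) {}

    // Writes the next event into caller-owned storage instead of returning
    // a variant by value. Returns false at the end of the input.
    bool next(Event& out) {
        if (pos_ >= doc_.size()) return false;
        // When 'out' already holds the same alternative, variant assignment
        // assigns to the contained object, so its string can reuse its buffer.
        out = doc_[pos_++];
        return true;
    }

private:
    std::vector<Event> doc_;
    std::size_t pos_ = 0;
};

// Caller side: one Event object is reused for the whole loop.
inline std::size_t totalTextLength(Reader& r) {
    Event ev;
    std::size_t n = 0;
    while (r.next(ev))
        if (const Characters* c = std::get_if<Characters>(&ev))
            n += c->text.size();
    return n;
}
```

With this shape, nothing depends on RVO, and a caller that wants to keep an event simply passes in a fresh object.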
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Why re-invent more than necessary? Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).
Should errors be reported as error events, or as exceptions? Should this, too, be a user choice?
I think the biggest reason to avoid exceptions would be the performance impact. I don't know whether the difference would be significant in the case of XML parsing. However, to get the full performance benefit, I think you'd need to use empty exception specifications - in which case the choice would have to be made at compile time (at the latest). Perhaps there are some other benefits to using an iostreams-style error-handling model, where the parser is treated like a stream.
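For what it's worth, this is roughly what I picture for the iostreams-style model. The Reader class, its toy "<name>" grammar, and throwOnError() (loosely modelled on std::ios::exceptions()) are all assumptions of mine. Note that this version makes the choice at run time, so it would not get the benefit of empty exception specifications.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class Reader {
public:
    explicit Reader(std::string input) : input_(std::move(input)) {}

    // Opt in to exceptions, in the spirit of std::ios::exceptions().
    void throwOnError(bool enable) { throw_ = enable; }

    // Toy grammar: reads one "<name>" token into 'out'. On malformed input
    // the reader enters a sticky fail state (or throws, if enabled).
    bool next(std::string& out) {
        if (failed_ || pos_ >= input_.size()) return false;
        if (input_[pos_] != '<') return fail("expected '<'");
        std::size_t end = input_.find('>', pos_);
        if (end == std::string::npos) return fail("unterminated tag");
        out.assign(input_, pos_ + 1, end - pos_ - 1);
        pos_ = end + 1;
        return true;
    }

    bool failed() const { return failed_; }
    const std::string& error() const { return error_; }
    explicit operator bool() const { return !failed_; }

private:
    bool fail(const char* msg) {
        failed_ = true;
        error_ = msg;
        if (throw_) throw std::runtime_error(error_);
        return false;
    }

    std::string input_;
    std::size_t pos_ = 0;
    bool failed_ = false;
    bool throw_ = false;
    std::string error_;
};
```

Calling code would then loop with while (reader.next(tag)) and check reader.failed() afterwards, the same way one checks a stream after a read loop.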
How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
What's a warning? A document is either well-formed, or it's not. The only possible distinction that comes to mind is perhaps treating bad syntax as errors and validation failures as warnings. However, you could basically get the same effect by providing a switch to disable validation. That way, rather than just ignore warnings, users who don't care about validation failures could disable validation and maybe also save some runtime overhead.

Matt