[xml] XML Reader Interface

Hi,

Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.

The interface in question is the reader interface, also known as the pull interface. Like SAX, the pull interface is an event-based interface. There are a few event types (roughly, StartElement, EndElement, Characters, and a few more for other XML features), all of which come with some additional data: the element name, the character data, etc.

There are two types of reader interfaces currently in use that I've found. I've come up with a third. I wonder which the people on this list would prefer, and where they see their weaknesses and strengths. The names that I've given them are my own creation.

1) The Monolithic Interface

Examples: .NET XmlReader, libxml2 XmlReader (modeled after the .NET one), Java Common API for XML Pull Parsing (XmlPull) (not to be confused with JSR 173 "StAX")

In the monolithic interface, the XML parser acts as a cursor over the event stream. You call next() and it points to the next event in the stream. From there, you can query its type (usually some integral constants) and call some methods to retrieve the data. All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.

Pro: Event objects do not need to be allocated. The parser itself contains the entire state and can, for example, be passed down a recursive function group.

Contra: You cannot store raw events: calling next() overwrites the current data. The parser contains a lot of state. This interface does not protect you in any way from calling inappropriate methods.

2) The Inheritance Interface

Examples: JSR 173 "StAX"

In the inheritance interface, the event types are modeled as a group of classes that all inherit from an Event base class.
The parser acts as an iterator, Java-style; calling next() returns a reference/pointer to the event object for this event. You use RTTI or a similar mechanism to find the type of the event, then cast the reference to the appropriate subclass. The subclasses then provide access to the data that is actually available for this event type.

Pro: You cannot call methods that are inappropriate for the event. Event objects are independent of the parser and can be stored as they are. This is especially interesting if you have an event-based output system that uses the same event types: in this case, you can store the events, shuffle them, edit them, then pass them on to the writer. (A proper analog to this with the stateful parser is harder to design.) The parser contains less state, as it does not need to store the data that is currently queried.

Contra: Event objects need to be allocated on the heap. In a non-GC language like C++, this is even more of a problem than in Java, as you either have to use a smart pointer or make the user responsible for deleting the object. The scenario of a group of functions mentioned above is limited insofar as, if several functions want to process the same event, they need to be passed the current event along with the parser.

3) The Variant Interface

Examples: None. I believe I came up with this entirely on my own.

The variant interface seeks to combine the strengths of the other two interfaces. It uses a non-monolithic interface; that is, the parser acts like an iterator and the data is not stored within it. It does not return a reference to the event object, though, but instead a boost::variant of all possible events. This way, heap allocation of the event object is avoided, together with all the trouble that comes with it.
The event type can be determined either by calling variant::which, or with a variant visitor (type-safe!), or with a special get_base() function that works like get() but can retrieve a reference to a common base of all the variant types. (This is possible, although an implementation does not exist in Boost.)

Pro: You cannot call methods that are inappropriate. The visitor system allows type-safe usage. (Of course, it also loses you part of the advantage of a pull interface over a push interface.) It does not need heap allocation if the event classes are properly designed (i.e. they do not trigger the case where the variant allocates heap memory). Events can be stored, copied (another advantage over the inheritance interface, which would require a clone() method for that), and manipulated at will. They can even be pre-allocated and a reference passed to next(), to save the stack allocation as well.

Contra: The issue about a group of functions still applies.

Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing? Should this be a user choice, or should there perhaps be two interfaces, one "high-level" and one "low-level"? Should errors be reported as error events, or as exceptions? Should this, too, be a user choice? How about warnings? Exceptions are inappropriate for them. Should it be possible to disable them completely?

All comments are welcome.

Sebastian Redl

On 10/28/06, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Hi,
Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.
The interface in question is the reader interface, also known as the pull interface. Like SAX, the pull interface is an event-based interface. There are a few event types (roughly, StartElement, EndElement, Characters, and a few more for other XML features), all of which come with some additional data: the element name, the character data, etc.
The criteria I would apply for choosing the interface are:
- performance
- custom-parsing simplicity
- extensibility

Based on these, the best interface is 1). I think the best approach is to learn from the C++ XmlPullParser that you mention in your September 7 thread: http://www.extreme.indiana.edu/xgws/xsoap/xpp/

What is missing is the ability to handle partial files, good Unicode support, and optimizations. If you agree that performance is a key requirement, at which XPP already excels, then you might want to focus on the recent ideas for creating an XML tree without memory allocations: http://www.nathanm.com/2006/09/15/techniques-for-parsing-xml-documents.html

Once you have the XML stored in a tree, it is viable to implement path and query operators, which are really necessary.

I'm glad you're tackling the XML library, as this is an area where I find the C++ community is behind the Java/.NET camps.

regards

Sebastian Redl wrote:
The interface in question is the reader interface, also known as pull interface. Like SAX, the pull interface is an event-based interface.
This confused me. I've always heard event-driven or callback-based interfaces described as "push", since the user's code gets invoked by an external event source. Do I correctly understand that you're talking about a SAX-like interface (in that it processes the document in-order, and limits visibility to one node at a time) that's "pull" (i.e. user code calls the parser) instead of "push" (i.e. parser calls user provided methods)?
There are two types of reader interfaces currently in use that I've found.
You mean two types of "pull" interfaces, right?
1) The Monolithic Interface
All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.
Contra: You cannot store raw events: calling next() overwrites the current data. The parser contains a lot of state. This interface does not protect you in any way from calling inappropriate methods.
I don't see any fundamental reason why such an interface can't support inheritance. If I were using an interface that let me iterate over the document, I'd at least want to be able to decide whether to get the next sibling node or the next child node. A pull interface, like this, could not only support copying the current node, but could even copy entire subtrees (e.g. copyUntilNextSibling()) - though this operation would require dynamic memory allocation.
2) The Inheritance Interface
Contra: Event objects need to be allocated on the heap.
Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object. If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine. Another option (as you point out) is returning a shared_ptr, though this would slightly complicate the parser's management of its temporary objects.
It does not return a reference to the event object, though, but instead a boost::variant of all possible events.
That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO. Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Why re-invent more than necessary? Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).
Should errors be reported as error events, or as exceptions? Should this, too, be a user choice?
I think the biggest reason to avoid exceptions would be for the performance impact. I don't know whether the difference would be significant, in the case of XML parsing. However, to get the full performance benefit, I think you'd need to use empty exception specifications - in which case the choice would have to be made at compile-time (at the latest). Perhaps there are some other benefits to using an iostreams style error-handling model, where the parser is treated like a stream.
How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
What's a warning? A document is either well-formed, or it's not. The only possible distinction that comes to mind is perhaps treating bad syntax as errors and validation failures as warnings. However, you could basically get the same effect by providing a switch to disable validation. That way, rather than just ignore warnings, users who don't care about validation failures could disable validation and maybe also save some runtime overhead.

Matt

Matt Gruenke wrote:
> This confused me. I've always heard event-driven or callback-based interfaces described as "push", since the user's code gets invoked by an external event source. Do I correctly understand that you're talking about a SAX-like interface (in that it processes the document in-order, and limits visibility to one node at a time) that's "pull" (i.e. user code calls the parser) instead of "push" (i.e. parser calls user provided methods)?

Correct. I said "event-based" because that's what's "pulled" from the parser each time: an event.

> You mean two types of "pull" interfaces, right?

Yes. "Reader" and "pull" are often used synonymously.

>> 1) The Monolithic Interface

> I don't see any fundamental reason why such an interface can't support inheritance.

Because the parser object *is* the event object of the later models. If it were to use inheritance to hide inappropriate methods, it would have to change type dynamically. That's not possible.

> If I were using an interface that let me iterate over the document, I'd at least want to be able to decide whether to get the next sibling node or the next child node.

You could realize something like that by filtering events. The option to insert event filters is definitely part of the plan.

> A pull interface, like this, could not only support copying the current node, but could even copy entire subtrees (e.g. copyUntilNextSibling()) - though this operation would require dynamic memory allocation.

Copying in what way? To a writer? To a tree of objects? I don't understand.

>> 2) The Inheritance Interface
>>
>> Contra: Event objects need to be allocated on the heap.

> Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object.
True, that's possible. It voids the advantage of directly storing the returned objects, though.

> If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine.

Certainly an option.

>> It does not return a reference to the event object, though, but instead a boost::variant of all possible events.

> That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO.

It also depends on the type of string used. If it's something like the proposed const_string, the copy overhead would be small.

> Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.

I have no node type. I'm not building a DOM tree here - this interface is more low-level.

>> Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?

> Why re-invent more than necessary?

I don't understand. A generic XML library needs to cater to all users. This means that it must be able to report the structure of DTDs to those clients that want it, like graphical XML construction kits, while still being easy to use in the case of high-level demands such as an application reading an XML configuration file. This means the library either needs one interface that can do both, or one interface per use case. And that's pull parsing alone.
The same decision has to be made for push parsing (although the SAX specification pretty much takes care of the decisions there) and again for any in-memory object models that the library wants to support.

> Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).

I'm not picking any object models at the moment. I'm working strictly on the basis of a linear stream of events right now.

>> Should errors be reported as error events, or as exceptions? Should this, too, be a user choice?

> I think the biggest reason to avoid exceptions would be for the performance impact.

That is one. The other is a sort of consistency: if we're already handling events, we might as well add errors to those events.

> I don't know whether the difference would be significant, in the case of XML parsing. However, to get the full performance benefit, I think you'd need to use empty exception specifications - in which case the choice would have to be made at compile-time (at the latest).

Not necessarily. Because of the way the system works, an empty exception specification does not require that all called methods also have empty specifications. If only the outermost layer of the parser implementation uses exceptions (i.e. it translates error events to exceptions if that's the user's choice), then only that layer cannot have the specification. I think under these circumstances, the overhead would indeed be completely negligible. The other issue here is that no-overhead-until-thrown exception handling is possible, so the interface shouldn't really be compromised just so that an XML error can still be handled at high performance.

> Perhaps there are some other benefits to using an iostreams style error-handling model, where the parser is treated like a stream.

I don't know what you mean by that.

>> How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
> What's a warning?

Here are some warnings:

1) A non-validating parser encounters, in the internal document type subset, entity declarations or attribute-list declarations after a reference to an external entity. According to XML 1.0, section 5.1, it MUST NOT process these, but neither does the spec say that it must signal an error. This is an excellent place to generate a warning.

2) A parser that is not validating, or has validation disabled, encounters a reference to an undeclared entity. The requirement that all entities be declared is a validity constraint, not a well-formedness constraint, but issuing a warning for this case still makes sense.

3) A validating parser with validation enabled encounters any of the predefined entities "amp", "lt", "gt", etc. without them being explicitly declared. Sections 4.1 and 4.6 both mention that valid documents, for interoperability, SHOULD declare these entities explicitly, but don't require it. This would be a good place to issue a warning.

Other warnings about bad style can be thought up. If the interface doesn't have a way of reporting such warnings, an application that wants to lint XML files would have to use its own parser interface. Having the facility included in the interface of Boost.XML increases flexibility.

There is another thing here: most modern parser libraries operate on the philosophy of the XML Infoset, not the XML text serialization. This is very important for a new library, now that a binary serialization of XML is being formulated. Any parser library thus must provide a way of plugging a different parser behind the same interface. And once you have this option, why stop at XML? You could allow any format that represents the annotated content tree that XML uses (a tree of nodes, each node either a named element with an unordered set of key-value pairs and an arbitrary sequence of child nodes, or a text (leaf) node without name or attributes). Like HTML.
And HTML, with all the quirks that are not valid but still need to be accepted if there is to be any hope of interoperability, is a great source of warnings of all kinds. For example, mixing alphabetic and numeric characters in an unquoted attribute value. Not terminating an entity. That sort of stuff.

Sebastian Redl

Matt Gruenke wrote:
Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object.
Or better, a boost::variant of those, and we get the reference with the get_base() function. However, in that case the objects are owned by the parser, not the handler.
If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine. Another option (as you point out) is returning a shared_ptr, though this would slightly complicate the parser's management of its temporary objects.
A clone_ptr (a smart pointer with deep copying, using some template and virtuality tricks to call the appropriate copy constructor) or poly_obj (the same, but with an object interface instead of a pointer one) could also be considered instead of shared_ptr. Those tools were being developed a while ago in Boost; I wonder what happened to them. Those are the two main ways to wrap polymorphism (where the value is actually owned): keep value semantics, or use pointers to avoid copies and give entity semantics, which complicates destruction.
That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO.
Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.
I think you should rely on NRVO existing when returning a local variable by value. Writing void foo(T&); instead of T foo(); is just annoying, and possibly suboptimal. From what I have tested, NRVO in that case is performed by all modern compilers (MSVC6 only performs RVO). Upcoming move semantics will also make it possible to prevent copies for sure.
Why re-invent more than necessary? Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).
- DOM has a lot of problems, like the ones related to namespaces, which were added after some time. It's not perfect.
- DOM looks like something made for Java and doesn't make smart use of what C++ can offer.
- DOM requires a lot of memory and preprocessing. This is often not needed; therefore, building DOM on top of a lower-level interface is a better choice.
- There is nothing wrong with trying to invent something new if it gives some interesting benefits over existing solutions. This is what Boost can be seen as: a laboratory where new C++ techniques are experimented with.
I think the biggest reason to avoid exceptions would be for the performance impact.
Exceptions should only be thrown when the situation is exceptional, not when an error is expected. Therefore they should be thrown very rarely, so the performance impact is not really relevant (almost -- this is implementation dependent). In our case, though, since we're building a low-level API and expecting errors, we should not throw exceptions.

Sebastian Redl wrote:
Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.
Have you thought about asynchronous parsing? How could that be available?
The interface in question is the reader interface, also known as the pull interface. Like SAX, the pull interface is an event-based interface. There are a few event types (roughly, StartElement, EndElement, Characters, and a few more for other XML features), all of which come with some additional data: the element name, the character data, etc.
There are two types of reader interfaces currently in use that I've found. I've come up with a third. I wonder which the people on this list would prefer, where they see their weaknesses and strengths. The names that I've given them are my own creation.
There are of course variations, like the one Matt Gruenke described. You could provide the inheritance interface but with the objects actually owned by the parser (making it kind of like the monolithic interface), and use variant to store those objects on the stack. This idea doesn't look so bad, actually, since you get the second solution without its drawbacks while also gaining the advantages of the first (if you provide the appropriate tools to allow copy construction of the referenced objects, that is).

I don't understand, though, whether you mean that the parser containing its state is a good thing or not.

Anyway, whatever is chosen, I think using variant with the ability to get a base will be a good idea somewhere. This provides both a type-safe `which' and visitors, and RTTI for those who want it.

Examples of how some basic operations could be done with those interfaces would come in handy, to compare them, for those like me who don't have much experience with parsing XML.
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Validation is quite costly: a way to prevent it would be nice. And it's not just DTD, there are other validation means. However, without validation you don't know what the `id' attribute is, which is quite annoying. It seems that's why they introduced xml:id. Browser engines like Gecko don't validate but they know what the id attributes are for each namespace that they handle. Maybe something similar could be done, be it with static data or user input.
Should this be a user choice,
Don't validate by default, and do it if the user asks for it. It seems like the better choice to me.
Should errors be reported as error events, or as exceptions?
We expect errors to happen, so we shouldn't use exceptions. We could allow them to be toggled on, though, for users who don't want to check for such things and are not looking for maximum efficiency. Then again, maybe those users should be using a higher-level API anyway.
How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
In exception mode, it should be possible to ignore warnings, and that should maybe even be the default.

loufoque wrote:
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Validation is quite costly: a way to prevent it would be nice. And it's not just DTD, there are other validation means.
I don't think the question is about validation at all. The fact that XML documents can contain an 'internal subset' means that the parser needs to understand it, and any XML reader has to handle the related events. Whether you use that to validate the document or not is an entirely different question. (And I agree, validation should be handled independently, so users don't have to pay for things they don't use.)

Regards, Stefan

-- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
(And I agree, validation should be handled independently, so users don't have to pay for things they don't use.)
Little detail: the XML specification actually requires that validation be at the user's discretion. I plan to implement it as a filter. But even then it must always be present: the spec requires quite a few things from it.

Sebastian Redl

On 10/30/06, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Stefan Seefeld wrote:
(And I agree, validation should be handled independently, so users don't have to pay for things they don't use.)
Little detail: the XML specification actually requires that validation is at user discretion. I plan to implement it as a filter. But even then it must always be present: the spec requires quite a few things from it.
As it is right now, I'd prefer using #1. I am interested in seeing a basic example (in code) of how using your #3 would work. If it is #1, it would be trivial to make a validating_reader that inherits from the basic reader and only wraps next().
Sebastian Redl
-- Cory Nelson http://www.int64.org

loufoque wrote:
Sebastian Redl wrote:
Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.
Have you thought about asynchronous parsing? How could that be available?
I have thought about it. A pull interface is not very suited for asynchronous parsing, but it will provide non-blocking parsing (returning a "would-block" event if not enough input is available). The push interface will provide asynchronous parsing by somehow registering with an ASIO io_service. I'll have to take a closer look at ASIO to find out how exactly to realize this, though.
There are of course variations, like the one Matt Gruenke described. You could provide the inheritance interface but with the objects actually owned by the parser (making it kind of like the monolithic interface), and use variant to store those objects on the stack.
This idea doesn't look so bad, actually, since you get the second solution without its drawbacks while also gaining the advantages of the first (if you provide the appropriate tools to allow copy construction of the referenced objects, that is).
Yes, that sounds like a good solution indeed.
I don't understand, though, if you mean that the parser containing its state is a good thing or not.
Neither. Both modes have advantages and disadvantages.
Examples of how some basic operations could be done with those interfaces would come in handy to compare them for the ones, like me, that don't have much experience with parsing XML.
Yes, good idea. I'll work something up.
Validation is quite costly: a way to prevent it would be nice. And it's not just DTD, there are other validation means.
Like Relax NG and Schema. I know. But as Stefan Seefeld correctly posted, this is not about validation. This is about what to do with errors that come up during validation and/or well-formedness checking.
However, without validation you don't know what the `id' attribute is, which is quite annoying. It seems that's why they introduced xml:id. Browser engines like Gecko don't validate but they know what the id attributes are for each namespace that they handle. Maybe something similar could be done, be it with static data or user input.
I plan to support the xml:id specification, but not store any knowledge of specific namespaces. Of course, there will be a way to feed the validator programmatically, so this could be implemented easily on top of that.
Should errors be reported as error events, or as exceptions?
We expect errors to happen, so we shouldn't use exceptions.
Do we? A SOAP server typically expects to receive programmatically generated XML, so it ought to be error-free. On the other hand, an XML-aware editor fully expects errors, because they're guaranteed to be there in incomplete documents.
We could allow them to be toggled on though, for users that don't want to check for such things and are not looking for super efficiency.

That's what I think, too.

Maybe they should be using a higher level API then though.

Perhaps, but some people might have memory as their main constraint, not speed. They would still want to use a low-level interface, yet not expect errors.

Sebastian Redl

Hello Sebastian, let me share my ideas inspired by your post.

First of all, I'm confused about why the 'event' term is used if one needs to call a 'next' method to get the next one. It looks like a good word here, but it should be used more consistently. In the current installment you can safely omit this word and think of your XMLReader as a

std::copy(std::istream_iterator<XMLObject>(std::cin), std::istream_iterator<XMLObject>(), MyIteratorAdapterForXMLObject());

where MyIteratorAdapterForXMLObject is a user-defined output iterator that does the actual processing. Of course, this view is simplistic, but I hope you catch the point: there is no place for the Event concept!

But don't get me wrong, please: my point is that Boost.XML should be made around a real Event concept. My background here is the ACE Reactor components (an event demultiplexing subsystem) and the more recent Boost.Channel proposal by Yigong Liu (search the Boost list for "another snapshot release of Channel framework").

In this light, the proper programming model for XML parsing would be a general event demultiplexing system, with the XMLReader seen as a source of XML events, DTD and Schema validators as event filters (and sources of validation events too), and user code as event sinks and/or filters. (Of course, in such an environment it would be a good idea to implement all other program logic by sending events and responding to them, but that is up to the user and not to you as an XML parsing library implementor.) The whole XML processing system would be a collection of Streams (in ACE terms, a Stream is a chain of event filters rooted in event sources and ending in event sinks). Of course, the whole system can operate asynchronously and optionally in a distributed environment.

To conclude, I suggest you take a look at ACE and try to figure out how it can be useful for accomplishing your goal.

Thank you for your interest in developing such a library for Boost,
Oleg Abrosimov

Oleg Abrosimov wrote:
Hello Sebastian, let me share my ideas inspired by your post.
First of all, I'm confused: why is the term 'event' used if one needs to call a 'next' method to get the next one? It looks like a good word here, but it should be used more consistently. In the current design you can safely omit this word and think of your XMLReader as
std::copy(std::istream_iterator<XMLObject>(std::cin), std::istream_iterator<XMLObject>(), MyIteratorAdapterForXMLObject());
where MyIteratorAdapterForXMLObject is a user-defined output iterator that does the actual processing.
Of course, this view is simplistic, but hope you catch the point: There is no place for the Event concept!
Right.
But don't get me wrong, please, my point is that Boost.XML should be made around the real Event concept.
Let me rephrase that to make sure I really understand what you are getting at here. This is about 'push' vs. 'pull', right? While the 'next()' approach suggests the user is indeed iterating over input 'tokens' that are getting pulled out of some input stream, you suggest a toplevel 'parse_input' resulting in some underlying reactor pushing events on the user. Is that correct?

If so, I disagree. I think it is perfectly fine to let the user keep control over the iteration process. The question is how to manage the fact that the tokens are polymorphic. Can a call to 'next()' be combined with some statically typed handlers that neither force the user to do the downcasting himself, nor force the base class interface to be the union of all the wrapped interfaces? (In fact, one could wonder whether there needs to be a common base class at all.)

Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Oleg Abrosimov wrote:
Hello Sebastian, let me share my ideas inspired by your post.
First of all, I'm confused why the 'event' term is used if one needs to call a 'next' method to get the next one?
Because that's what descriptions of similar interfaces use. I'm personally not very happy with the word. "Token", which Stefan Seefeld just used, seems more appropriate.
It looks like a good word here, but it should be used more consistently. In the current design you can safely omit this word and think of your XMLReader as
std::copy(std::istream_iterator<XMLObject>(std::cin), std::istream_iterator<XMLObject>(), MyIteratorAdapterForXMLObject());
No, I should not. Please don't get me wrong. My high-level plan for this library is to provide more than one interface. I want a "pull" interface that I'm designing right now, and which this thread is about. I want a "push" interface akin to SAX, which is what you're thinking about. So yes, there will be such an interface, but this thread is not about it. And finally, there will be one or more object model interfaces, like DOM. Thank you for your ideas, though. I will think about them when the time comes to design the push interface. Sebastian Redl

Sebastian Redl wrote:
There are two types of reader interfaces currently in use that I've found. I've come up with a third. I wonder which the people on this list would prefer, where they see their weaknesses and strengths. The names that I've given them are my own creation.
1) The Monolithic Interface Examples: .Net XMLReader, libxml2 XMLReader (modeled after the .Net one), Java Common API for XML Pull Parsing (XmlPull) (don't confuse with JSR 173 "StAX")
In the monolithic interface, the XML parser acts as a cursor over the event stream. You call next() and it points to the next event in the stream. From there, you can query its type (usually some integral constants) and call some methods to retrieve the data. All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.
I don't like the idea of an all-embracing interface that requires the user to figure out which methods are actually valid for the current type.
2) The Inheritance Interface Examples: JSR 173 "StAX"
In the inheritance interface, the event types are modeled as a group of classes that all inherit from an Event base class. The parser acts as an iterator, Java style; calling next() returns a reference/pointer to the event object for this event. You use RTTI or a similar mechanism to find the type of the event, then cast the reference to the appropriate subclass. The subclasses then provide access to the data that is actually available for this event type.
While this sounds better (the actual interface only provides what the actual type supports), it is still the user's responsibility to figure out the type and do the cast.
3) The Variant Interface Examples: None. I believe I came up with this entirely on my own.
The variant interface seeks to combine the strengths of the other two interfaces. It uses a non-monolithic interface; that is, the parser acts like an iterator, and the data is not stored within it. It does not return a reference to the event object, though, but instead a boost::variant of all possible events. This way, heap allocation of the event object is avoided, along with all the trouble that comes with it. The event type can be determined by calling variant::which, with a variant visitor (type-safe!), or with a special get_base() function that works like get() but retrieves a reference to a common base of all the variant types. (This is possible, although an implementation does not exist in Boost.)
Same here. You seem to assume that a single accessor is to be used to retrieve the current data, whether it is strongly / statically typed or not. What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one? For example:

void handle1(token1 const &);
void handle2(token2 const &);
...
typedef reader<handle1, handle2, ...> my_reader;
my_reader r(filename);
while (r.next()) r.process();

Please disregard the syntax; there are certainly multiple ways to declare and bind handlers to the reader, either at compile time or at runtime. My question is merely about whether it would be useful to use typed callbacks like this. What are the pros / cons?

Note that there is room between the two extremes, i.e. a single token type vs. independent token types: all tokens can be derived from a common base that provides access to common data, so an iterator is still possible, for example to 'fast-forward' to a particular position in the stream.

Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

loufoque wrote:
Stefan Seefeld wrote:
What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one ?
This is a discussion about the interface for a Pull parser. You're talking about a Push one.
No I'm not; at least, I don't think I am. :-) The user still has control over how the reader advances from one token to the next, since it is the user who calls 'next()' (or however it will be spelled). A push parser would be one where you call a single 'run' method and then the parser dispatches tokens (embedded into events). I don't think the fact that my design involves a dispatch qualifies it as push. Compare this to a visitor to resolve the token's type. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Hi Stefan, Stefan Seefeld <seefeld@sympatico.ca> writes:
What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one ? For example:
void handle1(token1 const &); void handle2(token2 const &); ...
typedef reader<handle1, handle2, ...> my_reader; my_reader r(filename); while (r.next()) r.process();
I think there is not much you can do in that while loop except calling process(), which then raises the question of why not use the push model (e.g., SAX), since that is what you are essentially emulating. Also note that you can get this behavior with a normal reader and a visitor, but with a modular design as a bonus:

visitor<handle1, handle2, ...> v;
reader r (filename);
while (node* n = r.next()) v.visit (n);
... so an iterator is still possible, for example to 'fast-forward' to a particular position in the stream.
In order to skip to a particular position you will need to examine the data (and thus cast, etc.). There is not much use in skipping 5 nodes from here. One common example would be skipping a sub-tree, for which you will need node types and/or names. HTH, -Boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding

Hi Boris, Boris Kolpackov wrote:
Hi Stefan,
Stefan Seefeld <seefeld@sympatico.ca> writes:
What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one ? For example:
void handle1(token1 const &); void handle2(token2 const &); ...
typedef reader<handle1, handle2, ...> my_reader; my_reader r(filename); while (r.next()) r.process();
I think there is not much you can do in that while loop except calling process(), which then raises the question of why not use the push model (e.g., SAX), since that is what you are essentially emulating.
Are you arguing against the pull model here or against my use of callbacks ?
Also note that you can get this behavior with a normal reader and a visitor but with a modular design as a bonus:
visitor<handle1, handle2, ...> v;
reader r (filename);
while (node* n = r.next()) v.visit (n);
That's right, but that is inefficient: The reader already does know the type of the token, but in the name of 'modular design' you throw it away only to recover it later with an extra round-robin dispatch through the visitor. What is the advantage of that ? Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
Hi Boris,
Boris Kolpackov wrote:
Also note that you can get this behavior with a normal reader and a visitor but with a modular design as a bonus:
visitor<handle1, handle2, ...> v;
reader r (filename);
while (node* n = r.next()) v.visit (n);
That's right, but that is inefficient: The reader already does know the type of the token, but in the name of 'modular design' you throw it away only to recover it later with an extra round-robin dispatch through the visitor. What is the advantage of that ?
The handlers are easily switchable. Imagine XML handling that looks something like this:

function basic(parser &): gets the events at the root. If a <foo> element is encountered, calls foo to handle its contents. If a <bar> element is encountered, calls bar to handle its contents.
function foo(parser &): handles <foo> and its contents.
function bar(parser &): handles <bar> and its contents.

If a token is returned and dispatched through a local visitor, this looks kind of like this (C++-like pseudo-code; note the lambdas to make things easier):

void basic(parser &p)
{
    visitor basic_visitor(start_tag = (start_tag_info &t) {
        if (t.tag_name() == "foo") foo(p);
        else if (t.tag_name() == "bar") bar(p);
    });
    while (node *n = p.next())
        basic_visitor.visit(n);
}

void foo(parser &p)
{
    visitor foo_visitor(..., end_tag = (end_tag_info &t) {
        if (t.tag_name() == "foo") return_parent;
    });
    foo_visitor.visit(p.current());
    while (node *n = p.next())
        foo_visitor.visit(n);
}

void bar(parser &p) is like foo().

With the visitor stored in the parser, foo would have to store the old visitor, install its new one, and then restore the old one when it's finished. Which is less nice. Sebastian Redl
participants (8)
-
Boris Kolpackov
-
Cory Nelson
-
Jose
-
loufoque
-
Matt Gruenke
-
Oleg Abrosimov
-
Sebastian Redl
-
Stefan Seefeld