[xml] XML Reader Interface

Hi,

Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.

The interface in question is the reader interface, also known as the pull interface. Like SAX, the pull interface is an event-based interface. There are a few event types (roughly, StartElement, EndElement, Characters, and a few more for other XML features), all of which come with some additional data: the element name, the character data, etc.

There are two types of reader interfaces currently in use that I've found. I've come up with a third. I wonder which the people on this list would prefer, and where they see their weaknesses and strengths. The names that I've given them are my own creation.

1) The Monolithic Interface

Examples: .NET XmlReader, libxml2 XmlReader (modeled after the .NET one), Java Common API for XML Pull Parsing (XmlPull) (not to be confused with JSR 173 "StAX")

In the monolithic interface, the XML parser acts as a cursor over the event stream. You call next() and it points to the next event in the stream. From there, you can query its type (usually some integral constants) and call some methods to retrieve the data. All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.

Pro: Event objects do not need to be allocated. The parser itself contains the entire state and can, for example, be passed down a recursive function group.

Contra: You cannot store raw events: calling next() overwrites the current data. The parser contains a lot of state. This interface does not protect you in any way from calling inappropriate methods.

2) The Inheritance Interface

Examples: JSR 173 "StAX"

In the inheritance interface, the event types are modeled as a group of classes that all inherit from an Event base class.
The parser acts as an iterator, Java-style; calling next() returns a reference/pointer to the event object for this event. You use RTTI or a similar mechanism to find the type of the event, then cast the reference to the appropriate subclass. The subclasses then provide access to the data that is actually available for this event type.

Pro: You cannot call methods that are inappropriate for the event. Event objects are independent of the parser and can be stored as they are. This is especially interesting if you have an event-based output system that uses the same event types: in this case, you can store the events, shuffle them, edit them, then pass them on to the writer. (A proper analog to this with the stateful parser is harder to design.) The parser contains less state, as it does not need to store the data that is currently queried.

Contra: Event objects need to be allocated on the heap. In a non-GC language like C++, this is even more of a problem than in Java, as you either have to use a smart pointer or make the user responsible for deleting the object. The scenario of a group of functions mentioned above is limited insofar as, if several functions want to process the same event, they need to be passed the current event along with the parser.

3) The Variant Interface

Examples: None. I believe I came up with this entirely on my own.

The variant interface seeks to combine the strengths of the other two interfaces. It uses a non-monolithic interface; that is, the parser acts like an iterator and the data is not stored within it. It does not return a reference to the event object, though, but instead a boost::variant of all possible events. This way, heap allocation of the event object is avoided, together with all the trouble that comes with it.
The event type can be determined either by calling variant::which, or with a variant visitor (type-safe!), or with a special get_base() function that works like get() but can retrieve a reference to a common base of all the variant types. (This is possible, although an implementation does not exist in Boost.)

Pro: You cannot call methods that are inappropriate. The visitor system allows type-safe usage. (Of course, it also loses you part of the advantage of a pull interface over a push interface.) It does not need heap allocation if the event classes are properly designed (i.e. they do not trigger the case where the variant allocates heap memory). Events can be stored, copied (another advantage over the inheritance interface, which would require a clone() method for that), and manipulated at will. They can even be pre-allocated and a reference passed to next(), to save the stack allocation as well.

Contra: The issue about a group of functions still applies.

Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing? Should this be a user choice, or should there perhaps be two interfaces, one "high-level" and one "low-level"? Should errors be reported as error events, or as exceptions? Should this, too, be a user choice? How about warnings? Exceptions are inappropriate for them. Should it be possible to disable them completely?

All comments are welcome.

Sebastian Redl

On 10/28/06, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Hi,
Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.
The interface in question is the reader interface, also known as the pull interface. Like SAX, the pull interface is an event-based interface. There are a few event types (roughly, StartElement, EndElement, Characters, and a few more for other XML features), all of which come with some additional data: the element name, the character data, etc.
The criteria I would apply for choosing the interface are:
- performance
- custom-parsing simplicity
- extensibility

Based on these, the best interface is 1). I think the best approach is to learn from the C++ XmlPullParser that you mention in your September 7 thread: http://www.extreme.indiana.edu/xgws/xsoap/xpp/

What is missing is the ability to handle partial files, good Unicode support, and optimizations. If you agree that performance is a key requirement, at which XPP already excels, then you might want to focus on the recent ideas for creating an XML tree without memory allocations: http://www.nathanm.com/2006/09/15/techniques-for-parsing-xml-documents.html

Once you have the XML stored in a tree, it is viable to implement path and query operators, which are really necessary.

I'm glad you're tackling the XML library, as this is an area where I find the C++ community is behind the Java/.NET camps.

regards

Sebastian Redl wrote:
The interface in question is the reader interface, also known as pull interface. Like SAX, the pull interface is an event-based interface.
This confused me. I've always heard event-driven or callback-based interfaces described as "push", since the user's code gets invoked by an external event source. Do I correctly understand that you're talking about a SAX-like interface (in that it processes the document in-order, and limits visibility to one node at a time) that's "pull" (i.e. user code calls the parser) instead of "push" (i.e. parser calls user provided methods)?
There are two types of reader interfaces currently in use that I've found.
You mean two types of "pull" interfaces, right?
1) The Monolithic Interface
All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.
Contra: You cannot store raw events: calling next() overwrites the current data. The parser contains a lot of state. This interface does not protect you in any way from calling inappropriate methods.
I don't see any fundamental reason why such an interface can't support inheritance. If I were using an interface that let me iterate over the document, I'd at least want to be able to decide whether to get the next sibling node or the next child node. A pull interface, like this, could not only support copying the current node, but could even copy entire subtrees (e.g. copyUntilNextSibling()) - though this operation would require dynamic memory allocation.
2) The Inheritance Interface
Contra: Event objects need to be allocated on the heap.
Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object. If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine. Another option (as you point out) is returning a shared_ptr, though this would slightly complicate the parser's management of its temporary objects.
It does not return a reference to the event object, though, but instead a boost::variant of all possible events.
That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO. Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Why re-invent more than necessary? Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).
Should errors be reported as error events, or as exceptions? Should this, too, be a user choice?
I think the biggest reason to avoid exceptions would be for the performance impact. I don't know whether the difference would be significant, in the case of XML parsing. However, to get the full performance benefit, I think you'd need to use empty exception specifications - in which case the choice would have to be made at compile-time (at the latest). Perhaps there are some other benefits to using an iostreams style error-handling model, where the parser is treated like a stream.
How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
What's a warning? A document is either well-formed, or it's not. The only possible distinction that comes to mind is perhaps treating bad syntax as errors and validation failures as warnings. However, you could basically get the same effect by providing a switch to disable validation. That way, rather than just ignore warnings, users who don't care about validation failures could disable validation and maybe also save some runtime overhead.

Matt

Matt Gruenke wrote:
> This confused me. I've always heard event-driven or callback-based interfaces described as "push", since the user's code gets invoked by an external event source. Do I correctly understand that you're talking about a SAX-like interface (in that it processes the document in-order, and limits visibility to one node at a time) that's "pull" (i.e. user code calls the parser) instead of "push" (i.e. parser calls user provided methods)?

Correct. I said "event-based" because that's what's "pulled" from the parser each time: an event.

> You mean two types of "pull" interfaces, right?

Yes. "Reader" and "pull" are often used synonymously.

>> 1) The Monolithic Interface

> I don't see any fundamental reason why such an interface can't support inheritance.

Because the parser object *is* the event object of the later models. If it were to use inheritance to hide inappropriate methods, it would have to change type dynamically. That's not possible.

> If I were using an interface that let me iterate over the document, I'd at least want to be able to decide whether to get the next sibling node or the next child node.

You could realize something like that by filtering events. The option to insert event filters is definitely part of the plan.

> A pull interface, like this, could not only support copying the current node, but could even copy entire subtrees (e.g. copyUntilNextSibling()) - though this operation would require dynamic memory allocation.

Copying in what way? To a writer? To a tree of objects? I don't understand.

>> 2) The Inheritance Interface
>>
>> Contra: Event objects need to be allocated on the heap.

> Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object.
True, that's possible. It voids the advantage of directly storing the returned objects, though.

> If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine.

Certainly an option.

>> It does not return a reference to the event object, though, but instead a boost::variant of all possible events.

> That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO.

It also depends on the type of string used. If it's something like the proposed const_string, the copy overhead would be small.

> Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.

I have no node type. I'm not building a DOM tree here - this interface is more low-level.

>> Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?

> Why re-invent more than necessary?

I don't understand. A generic XML library needs to cater to all users. This means that it must be able to report the structure of DTDs to those clients that want it, like graphical XML construction kits, while still being easy to use in the case of high-level demands such as an application reading an XML configuration file. This means the library either needs one interface that can do both, or one interface per use case. And that's pull parsing alone.
The same decision has to be made for push parsing (although the SAX specification pretty much takes care of the decisions there) and again for any in-memory object models that the library wants to support.

> Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).

I'm not picking any object models at the moment. I'm working strictly on the basis of a linear stream of events right now.

>> Should errors be reported as error events, or as exceptions? Should this, too, be a user choice?

> I think the biggest reason to avoid exceptions would be for the performance impact.

That is one. The other is a sort of consistency: if we're already handling events, we might as well add errors to those events.

> I don't know whether the difference would be significant, in the case of XML parsing. However, to get the full performance benefit, I think you'd need to use empty exception specifications - in which case the choice would have to be made at compile-time (at the latest).

Not necessarily. Because of the way the system works, an empty exception specification does not require that all called methods also have empty specifications. If only the outermost layer of the parser implementation uses exceptions (i.e. it translates error events to exceptions if that's the user's choice), then only that layer cannot have the specification. I think under these circumstances, the overhead would indeed be completely negligible. The other issue here is that no-overhead-until-thrown exception handling is possible, so the interface shouldn't really be compromised just so that an XML error can still be handled at high performance.

> Perhaps there are some other benefits to using an iostreams style error-handling model, where the parser is treated like a stream.

I don't know what you mean by that.

>> How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
> What's a warning?

Here are some warnings:

1) A non-validating parser encounters, in the internal document type subset, entity declarations or attribute-list declarations after a reference to an external entity. According to XML 1.0, section 5.1, it MUST NOT process these, but neither does the spec say that it must signal an error. This is an excellent place to generate a warning.

2) A parser that is not validating, or has validation disabled, encounters a reference to an undeclared entity. The requirement that all entities be declared is a validity constraint, not a well-formedness constraint, but issuing a warning for this case still makes sense.

3) A validating parser with validation enabled encounters any of the predefined entities "amp", "lt", "gt", etc. without them being explicitly declared. Sections 4.1 and 4.6 both mention that valid documents, for interoperability, SHOULD declare these entities explicitly, but don't require it. This would be a good place to issue a warning.

Other warnings about bad style can be thought up. If the interface doesn't have a way of reporting such warnings, an application that wants to lint XML files would have to use its own parser interface. Having the facility included in the interface of Boost.XML increases flexibility.

There is another thing here: most modern parser libraries operate on the philosophy of the XML Infoset, not the XML text serialization. This is very important for a new library, now that a binary serialization of XML is being formulated. Any parser library thus must provide a way of plugging a different parser behind the same interface. And once you have this option, why stop at XML? You could allow any format that represents the annotated content tree that XML uses (a tree of nodes, each node either a named element with an unordered set of key-value pairs and an arbitrary sequence of child nodes, or a text (leaf) node without name or attributes). Like HTML.
And HTML, with all the quirks that are not valid but still need to be accepted if there is to be any hope of interoperability, is a great source of warnings of all kinds. For example, mixing alphabetic and numeric characters in an unquoted attribute value. Not terminating an entity. That sort of stuff.

Sebastian Redl

Matt Gruenke wrote:
Why does an inheritance-based parser need to store objects on the heap? If the memory is owned by the parser, it can pre-allocate a temporary object for each type of node. Based on the type of node, it fills the temporary object of the appropriate type, and returns a const ref to that object.
Or better, a boost::variant of those, and we get the reference with the get_base() function. However, in that case the objects are owned by the parser, not the handler.
If the caller wants a copy, the copy would only have to be heap-allocated if copied via some virtual function in the node base-class. For the concrete classes (obtained via dynamic-casting), copy constructors and assignment operators would work just fine. Another option (as you point out) is returning a shared_ptr, though this would slightly complicate the parser's management of its temporary objects.
A clone_ptr (a smart pointer with deep copying, using some template and virtuality tricks to call the appropriate copy constructor) or poly_obj (the same, but with an object interface instead of a pointer one) could also be considered instead of shared_ptr. Those tools were being developed a while ago in Boost; I wonder what happened to them. Those are the two main ways to wrap polymorphism (where the value is actually owned): keep value semantics, or use pointers to avoid copies and give entity semantics, which complicates destruction.
That could be big, depending on how much text you buffer. Not only would it waste memory, but memcpy'ing around all of that could waste some of the performance savings gained by avoiding heap allocation. Maybe RVO eliminates some or all of the performance penalty, but it's probably unwise to depend so much on RVO.
Of course, passing in the result might be the solution - at least to the performance issues. A parser that allows the user to pass in the result would also facilitate copying subtrees, if your node type has addChild() and addSibling() methods.
I think you should rely on NRVO existing when returning a local variable by value. Writing void foo(T&); instead of T foo(); is just annoying, and possibly suboptimal. From what I have tested, NRVO in that case is performed by all modern compilers (MSVC6 only performs RVO). Upcoming move semantics will also make it possible to prevent copies for sure.
Why re-invent more than necessary? Use DOM and/or pick some other, existing object models (unless you have specific issues which they don't address).
- DOM has a lot of problems, like the ones related to namespaces, which were added after some time. It's not perfect.
- DOM looks like something made for Java and doesn't make smart use of what C++ can offer.
- DOM requires a lot of memory and preprocessing. This is often not needed; therefore, building DOM on top of a lower-level interface is a better choice.
- There is nothing wrong with trying to invent something new if it gives some interesting benefits over existing solutions. This is what Boost can be seen as: a laboratory where new C++ techniques are experimented with.
I think the biggest reason to avoid exceptions would be for the performance impact.
Exceptions should only be thrown when the situation is exceptional, not when an error is expected. Therefore they should be thrown very rarely, so the performance impact is not really relevant (almost -- this is implementation dependent). In our case, though, since we're building a low-level API and expecting errors, we should not throw exceptions.

Sebastian Redl wrote:
Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.
Have you thought about asynchronous parsing? How could that be available?
The interface in question is the reader interface, also known as the pull interface. Like SAX, the pull interface is an event-based interface. There are a few event types (roughly, StartElement, EndElement, Characters, and a few more for other XML features), all of which come with some additional data: the element name, the character data, etc.
There are two types of reader interfaces currently in use that I've found. I've come up with a third. I wonder which the people on this list would prefer, where they see their weaknesses and strengths. The names that I've given them are my own creation.
There are of course variations, like the one Matt Gruenke described. You could provide the inheritance interface but with the objects actually owned by the parser (making it kind of like the monolithic interface), and use variant to store those objects on the stack. This idea doesn't look so bad, actually, since you get the second solution without its drawbacks while also gaining the advantages of the first (if you provide the appropriate tools to allow copy construction of the referenced objects, that is).

I don't understand, though, whether you mean that the parser containing its state is a good thing or not.

Anyway, whatever is chosen, I think using variant with the ability to get a base will be a good idea somewhere. This provides both a type-safe `which' and visitors, and RTTI for those who want it.

Examples of how some basic operations could be done with those interfaces would come in handy, to compare them, for those like me who don't have much experience with parsing XML.
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Validation is quite costly: a way to prevent it would be nice. And it's not just DTD, there are other validation means. However, without validation you don't know what the `id' attribute is, which is quite annoying. It seems that's why they introduced xml:id. Browser engines like Gecko don't validate but they know what the id attributes are for each namespace that they handle. Maybe something similar could be done, be it with static data or user input.
Should this be a user choice,
Don't validate by default, and do it if the user asks for it. It seems like the better choice to me.
Should errors be reported as error events, or as exceptions?
We expect errors to happen, so we shouldn't use exceptions. We could allow them to be toggled on, though, for users who don't want to check for such things and are not looking for maximum efficiency. Then again, maybe those users should be using a higher-level API anyway.
How about warnings: exceptions are inappropriate for them. Should it be possible to disable them completely?
In exception mode, it should be possible to ignore warnings, and that should maybe even be the default.

loufoque wrote:
Independently of the type of interface chosen, another issue is important: the scope of the interface. Should it report all XML events, including those coming from DTD parsing?
Validation is quite costly: a way to prevent it would be nice. And it's not just DTD, there are other validation means.
I don't think the question is about validation at all. The fact that XML documents can contain an 'internal subset' means that the parser needs to understand it, and any XML reader has to handle the related events. Whether you use that to validate the document or not is an entirely different question. (And I agree, validation should be handled independently, so users don't have to pay for things they don't use.)

Regards, Stefan

-- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
(And I agree, validation should be handled independently, so users don't have to pay for things they don't use.)
Little detail: the XML specification actually requires that validation be at the user's discretion. I plan to implement it as a filter. But even then it must always be present: the spec requires quite a few things from it.

Sebastian Redl

On 10/30/06, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Stefan Seefeld wrote:
(And I agree, validation should be handled independently, so users don't have to pay for things they don't use.)
Little detail: the XML specification actually requires that validation is at user discretion. I plan to implement it as a filter. But even then it must always be present: the spec requires quite a few things from it.
As it is right now, I'd prefer using #1. I am interested in seeing a basic example (in code) of how using your #3 would work. If it is #1, it would be trivial to make a validating_reader that inherits from the basic reader and only wraps next().
Sebastian Redl
-- Cory Nelson http://www.int64.org

loufoque wrote:
Sebastian Redl wrote:
Once again I'm turning to the list for discussion about a design issue in the XML library. This time I hope to avoid any discussion about the implementation of the library and focus on the interface only.
Have you thought about asynchronous parsing? How could that be available?
I have thought about it. A pull interface is not very suited for asynchronous parsing, but it will provide non-blocking parsing (returning a "would-block" event if not enough input is available). The push interface will provide asynchronous parsing by somehow registering with an ASIO io_service. I'll have to take a closer look at ASIO to find out how exactly to realize this, though.
There are of course variations, like the one Matt Gruenke described. You could provide the inheritance interface but with the objects actually owned by the parser (making it kind of like the monolithic interface), and use variant to store those objects on the stack.
This idea doesn't look so bad, actually, since you get the second solution without its drawbacks while also gaining the advantages of the first (if you provide the appropriate tools to allow copy construction of the referenced objects, that is).
Yes, that sounds like a good solution indeed.
I don't understand, though, if you mean that the parser containing its state is a good thing or not.
Neither. Both modes have advantages and disadvantages.
Examples of how some basic operations could be done with those interfaces would come in handy to compare them for the ones, like me, that don't have much experience with parsing XML.
Yes, good idea. I'll work something up.
Validation is quite costly: a way to prevent it would be nice. And it's not just DTD, there are other validation means.
Like Relax NG and Schema. I know. But as Stefan Seefeld correctly posted, this is not about validation. This is about what to do with errors that come up during validation and/or well-formedness checking.
However, without validation you don't know what the `id' attribute is, which is quite annoying. It seems that's why they introduced xml:id. Browser engines like Gecko don't validate but they know what the id attributes are for each namespace that they handle. Maybe something similar could be done, be it with static data or user input.
I plan to support the xml:id specification, but not store any knowledge of specific namespaces. Of course, there will be a way to feed the validator programmatically, so this could be implemented easily on top of that.
Should errors be reported as error events, or as exceptions?
We expect errors to happen, so we shouldn't use exceptions.
Do we? A SOAP server typically expects to receive programmatically generated XML, so it ought to be error-free. On the other hand, an XML-aware editor fully expects errors, because they're guaranteed to be there in incomplete documents.
We could allow them to be toggled on though, for users that don't want to check for such things and are not looking for super efficiency.

That's what I think, too.

Maybe they should be using a higher level API then though.

Perhaps, but some people might have memory as their main constraint, not speed. They would still want to use a low-level interface, yet not expect errors.

Sebastian Redl

Hello Sebastian, let me share my ideas inspired by your post.

First of all, I'm confused about why the 'event' term is used if one needs to call a 'next' method to get the next one. It looks like a good word here, but it should be used more consistently. In the current installment you can safely omit this word and think of your XMLReader as a

std::copy(std::istream_iterator<XMLObject>(std::cin), std::istream_iterator<XMLObject>(), MyIteratorAdapterForXMLObject());

where MyIteratorAdapterForXMLObject is a user-defined output iterator that does the actual processing. Of course, this view is simplistic, but I hope you catch the point: there is no place for the Event concept!

But don't get me wrong, please: my point is that Boost.XML should be made around a real Event concept. My background here is the ACE Reactor components (an event demultiplexing subsystem) and the more recent Boost.Channel proposal by Yigong Liu (search the Boost list for "another snapshot release of Channel framework").

In this light, the proper programming model for XML parsing would be a general event demultiplexing system, with the XMLReader seen as a source of XML events, DTD and Schema validators as event filters (and sources of validation events too), and user code as event sinks and/or filters. (Of course, in such an environment it would be a good idea to implement all other program logic by sending events and responding to them, but that is up to the user and not to you as an XML parsing library implementor.) The whole XML processing system would be a collection of Streams (in ACE terms, a Stream is a chain of event filters rooted in event sources and ending in event sinks). Of course, the whole system can operate asynchronously and optionally in a distributed environment.

To conclude, I suggest you take a look at ACE and try to figure out how it can be useful for accomplishing your goal.

Thank you for your interest in developing such a library for Boost,
Oleg Abrosimov

Oleg Abrosimov wrote:
Hello Sebastian, let me share my ideas inspired by your post.
First of all, I'm confused: why is the term 'event' used if one needs to call a 'next' method to get the next one? It looks like a good word here, but it should be used more consistently. In the current design you can safely omit this word and think of your XMLReader as
std::copy(std::istream_iterator<XMLObject>(std::cin), std::istream_iterator<XMLObject>(), MyIteratorAdapterForXMLObject());
where MyIteratorAdapterForXMLObject is a user-defined output iterator that does the actual processing.
Of course, this view is simplistic, but hope you catch the point: There is no place for the Event concept!
Right.
But don't get me wrong, please, my point is that Boost.XML should be made around the real Event concept.
Let me rephrase that to make sure I really understand what you are getting at here. This is about 'push' vs. 'pull', right? While the 'next()' approach suggests the user is indeed iterating over input 'tokens' that are getting pulled out of some input stream, you suggest a toplevel 'parse_input' resulting in some underlying reactor pushing events on the user. Is that correct?

If so, I disagree. I think it is perfectly fine to let the user keep control over the iteration process. The question is how to manage the fact that the tokens are polymorphic. Can a call to 'next()' be combined with some statically typed handlers that neither force the user to do the downcasting himself, nor force the base class interface to be the union of all the wrapped interfaces? (In fact, one could wonder whether there needs to be a common base class at all.)

Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Oleg Abrosimov wrote:
Hello Sebastian, let me share my ideas inspired by your post.
First of all, I'm confused why the 'event' term is used if one needs to call a 'next' method to get the next one?
Because that's what descriptions of similar interfaces use. I'm personally not very happy with the word. "Token", which Stefan Seefeld just used, seems more appropriate.
It looks like a good word here, but it should be used more consistently. In the current design you can safely omit this word and think of your XMLReader as
std::copy(std::istream_iterator<XMLObject>(std::cin), std::istream_iterator<XMLObject>(), MyIteratorAdapterForXMLObject());
No, I should not. Please don't get me wrong. My high-level plan for this library is to provide more than one interface. I want a "pull" interface that I'm designing right now, and which this thread is about. I want a "push" interface akin to SAX, which is what you're thinking about. So yes, there will be such an interface, but this thread is not about it. And finally, there will be one or more object model interfaces, like DOM. Thank you for your ideas, though. I will think about them when the time comes to design the push interface. Sebastian Redl

Sebastian Redl wrote:
There are two types of reader interfaces currently in use that I've found. I've come up with a third. I wonder which the people on this list would prefer, where they see their weaknesses and strengths. The names that I've given them are my own creation.
1) The Monolithic Interface Examples: .Net XMLReader, libxml2 XMLReader (modeled after the .Net one), Java Common API for XML Pull Parsing (XmlPull) (don't confuse with JSR 173 "StAX")
In the monolithic interface, the XML parser acts as a cursor over the event stream. You call next() and it points to the next event in the stream. From there, you can query its type (usually some integral constants) and call some methods to retrieve the data. All methods are always available on the object; calling one that is not appropriate for the current event (e.g. getTagName() for a Characters event) returns a null value or signals an error.
I don't like the idea of an all-embracing interface that requires the user to figure out which methods are actually valid for the current type.
2) The Inheritance Interface Examples: JSR 173 "StAX"
In the inheritance interface, the event types are modeled as a group of classes that all inherit from an Event base class. The parser acts as an iterator, Java style; calling next() returns a reference/pointer to the event object for this event. You use RTTI or a similar mechanism to find the type of the event, then cast the reference to the appropriate subclass. The subclasses then provide access to the data that is actually available for this event type.
While this sounds better (the actual interface only provides what the actual type supports), it is still the user's responsibility to figure out the type and do the cast.
3) The Variant Interface Examples: None. I believe I came up with this entirely on my own.
The variant interface seeks to combine the strengths of the other two interfaces. It uses a non-monolithic interface; that is, the parser acts like an iterator, and the data is not stored within it. It does not return a reference to the event object, though, but instead a boost::variant of all possible events. This way, heap allocation of the event object is avoided, along with all the trouble that comes with it. The event type can be determined by calling variant::which, with a variant visitor (type-safe!), or with a special get_base() function that works like get() but retrieves a reference to a common base of all the variant types. (This is possible, although an implementation does not exist in Boost.)
Same here. You seem to assume that a single accessor is to be used to retrieve the current data, whether it is strongly / statically typed or not. What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one? For example:

void handle1(token1 const &);
void handle2(token2 const &);
...
typedef reader<handle1, handle2, ...> my_reader;
my_reader r(filename);
while (r.next()) r.process();

Please disregard the syntax; there are certainly multiple ways to declare and bind handlers to the reader, either at compile time or at runtime. My question is merely about whether it would be useful to use typed callbacks like this. What are the pros / cons?

Note that there is room between the two extremes, i.e. a single token type vs. independent token types: all tokens can be derived from a common base that provides access to common data, so an iterator is still possible, for example to 'fast-forward' to a particular position in the stream.

Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

loufoque wrote:
Stefan Seefeld wrote:
What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one ?
This is a discussion about the interface for a Pull parser. You're talking about a Push one.
No I'm not; at least, I don't think I am. :-) The user still has control over how the reader advances from one token to the next, since it is the user who calls 'next()' (or however it will be spelled). A push parser would be one where you call a single 'run' method and then the parser dispatches tokens (embedded into events). I don't think the fact that my design involves a dispatch qualifies it as push. Compare this to a visitor to resolve the token's type. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Hi Stefan, Stefan Seefeld <seefeld@sympatico.ca> writes:
What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one ? For example:
void handle1(token1 const &); void handle2(token2 const &); ...
typedef reader<handle1, handle2, ...> my_reader; my_reader r(filename); while (r.next()) r.process();
I think there is not much you can do in that while loop except calling process(), which then raises the question of why not use the push model (e.g., SAX), since that is what you are essentially emulating. Also note that you can get this behavior with a normal reader and a visitor, but with a modular design as a bonus:

visitor<handle1, handle2, ...> v;
reader r (filename);
while (node* n = r.next()) v.visit (n);
... so an iterator is still possible, for example to 'fast-forward' to a particular position in the stream.
In order to skip to a particular position you will need to examine the data (and thus cast, etc.). There is not much use in skipping 5 nodes from here. One common example would be skipping a sub-tree, for which you will need node types and/or names. HTH, -Boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding

Hi Boris, Boris Kolpackov wrote:
Hi Stefan,
Stefan Seefeld <seefeld@sympatico.ca> writes:
What about an interface similar to SAX, where the user provides a set of handlers, one per type, and then the reader calls the appropriate one ? For example:
void handle1(token1 const &); void handle2(token2 const &); ...
typedef reader<handle1, handle2, ...> my_reader; my_reader r(filename); while (r.next()) r.process();
I think there is not much you can do in that while loop except calling process(), which then raises the question of why not use the push model (e.g., SAX), since that is what you are essentially emulating.
Are you arguing against the pull model here or against my use of callbacks ?
Also note that you can get this behavior with a normal reader and a visitor but with a modular design as a bonus:
visitor<handle1, handle2, ...> v;
reader r (filename);
while (node* n = r.next()) v.visit (n);
That's right, but that is inefficient: The reader already does know the type of the token, but in the name of 'modular design' you throw it away only to recover it later with an extra round-robin dispatch through the visitor. What is the advantage of that ? Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
Hi Boris,
Boris Kolpackov wrote:
Also note that you can get this behavior with a normal reader and a visitor but with a modular design as a bonus:
visitor<handle1, handle2, ...> v;
reader r (filename);
while (node* n = r.next()) v.visit (n);
That's right, but that is inefficient: The reader already does know the type of the token, but in the name of 'modular design' you throw it away only to recover it later with an extra round-robin dispatch through the visitor. What is the advantage of that ?
The handlers are easily switchable. Imagine XML handling that looks something like this:

function basic(parser &): gets the events at the root. If a <foo> element is encountered, calls foo to handle its contents. If a <bar> element is encountered, calls bar to handle its contents.
function foo(parser &): handles <foo> and its contents.
function bar(parser &): handles <bar> and its contents.

If a token is returned and dispatched through a local visitor, this looks kind of like this (C++-like pseudo-code; note the lambdas to make things easier):

void basic(parser &p)
{
    visitor basic_visitor(start_tag = (start_tag_info &t) {
        if (t.tag_name() == "foo") foo(p);
        else if (t.tag_name() == "bar") bar(p);
    });
    while (node *n = p.next())
        basic_visitor.visit(n);
}

void foo(parser &p)
{
    visitor foo_visitor(..., end_tag = (end_tag_info &t) {
        if (t.tag_name() == "foo") return_parent;
    });
    foo_visitor.visit(p.current());
    while (node *n = p.next())
        foo_visitor.visit(n);
}

void bar(parser &p) is like foo().

With the visitor stored in the parser, foo would have to store the old visitor, install its new one, and then restore the old one when it's finished. Which is less nice. Sebastian Redl
participants (8)
-
Boris Kolpackov
-
Cory Nelson
-
Jose
-
loufoque
-
Matt Gruenke
-
Oleg Abrosimov
-
Sebastian Redl
-
Stefan Seefeld