
7 Sep 2006, 12:08 a.m.
Hi,

A few months ago I said I'd take on writing an XML library for Boost. Well, I've finally got some time on my hands and have started with a little bit of brainstorming about the library. I've written up my thoughts in this hopefully halfway comprehensible document and would like to hear everyone's suggestions, opinions, advice, requirements, etc. Especially real-world requirements, as my own are just those of a single person, and that isn't exactly a good basis for a general-purpose XML library.

The current brainstorming is only for the XML reading side of things. The writing side will come afterwards.

So here goes.

------------------------------------------------------------------

Purpose of document: Identify important decisions in the design of a C++ XML parser library.

1) API Type

Pull-API (StAX), Push-API (SAX), Object-Model-API (DOM)?
- All of them, of course! The main question is, which one is the base API?
- DOM is out of the question as the base (performance/memory overhead).
- Implementing a push parser on top of a pull parser is trivial:
  while (fetchEvent()) pushEvent();
- Implementing a pull parser on top of a push parser requires at least generator-style coroutines. This incurs a performance overhead at best and unusability at worst (in limited environments).
- It is therefore best to use a pull model at the lowest level, although this makes the parser implementation more complex.

2) Pull Interface

There are several models of pull interfaces in use. These are mostly for Java and C#, so a C++ parser does not necessarily have to use any of them.

Existing APIs:

--- Java
- XMLPull
- StAX (JSR 173)
- Xerces XNI Pull Configuration

--- .Net
- .Net XmlReader

--- Python
- Python has very little material on pull parsers. There seem to be some available, but they're not popular.

--- Ruby
- Ruby has a built-in XML library with pull support. The API is not yet stable but seems to resemble .Net's XmlReader.

--- C++
- An early version of XPP has a C++ implementation. The interface is Java-style.

API Styles:

From the above APIs, we can gather the following:
- Pull parsing always involves calling a method to obtain the next piece of document information (called an "event"), then processing that piece.
- Two main models seem typical:
-- StAX has a nextEvent() method that returns a reference to an object, identified by a base interface XMLEvent. This reference can then be cast to the appropriate sub-interface. This is the polymorphism approach.
-- XMLPull and .Net's XmlReader also have a next()/read() method. However, they do not return an event object but instead store the information internally, to be queried by special methods. This model is also used by REXML, Ruby's parser. This is the monolith approach.

Polymorphism pros:
- State is not held in the parser object. Calling next() does not necessarily discard the old information.
- Once the correct interface is obtained, all methods on it are guaranteed to work. With the monolith model, calling the wrong method may lead to exceptions or error returns.

Monolith pros:
- Does not need to allocate an object for each parse event. Can in fact hold the information in a very compact way internally.
- No casts necessary.
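To make the comparison a bit more concrete, here are very rough sketches of what the two styles might look like in C++. All names are invented for illustration only; this is not a proposed interface.

#include <boost/shared_ptr.hpp>
#include <string>

// Polymorphism style: each call to next_event() hands back a new event
// object; the caller inspects/downcasts it to the concrete event type.
struct xml_event {
    virtual ~xml_event() {}
};
struct start_element_event : xml_event {
    std::string name;
};
struct polymorphic_reader {
    // The ownership question: raw pointer, smart pointer, or a pointer
    // into parser-owned (static) storage?
    virtual boost::shared_ptr<xml_event> next_event() = 0;
    virtual ~polymorphic_reader() {}
};

// Monolith style: read() advances the parser; the caller then queries
// the parser object itself for the current event's data.
struct monolithic_reader {
    enum event_type { start_element, end_element, characters, end_document };
    virtual bool read() = 0;               // advance to the next event
    virtual event_type type() const = 0;   // kind of the current event
    virtual std::string name() const = 0;  // only meaningful for element events
    virtual ~monolithic_reader() {}
};

The polymorphic sketch pushes the ownership question to the foreground, while the monolithic one keeps all event state inside the parser object.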
Options:
- What are the options that a C++ API has?
- Polymorphism-style API. Return a smart pointer? Returning an allocated object through a raw pointer is unacceptable. Returning a pointer to static storage is possible, but that is basically the monolith in disguise. An existing object cannot be passed IN to be filled with data.
- Monolith-style API. This means, among other things, that passing the current data to a function means passing a reference to the entire parser. Furthermore, it is not possible to pass a small object containing only the data of the actual event to a function, unless that object is written by the API user. Thus, every function would have to either have its own switch on the event type or assert that the passed-in object contains the right data.
- Union of events, like a Boost.Variant. This seems a good compromise between the polymorphic and monolithic approaches:
  - State is not in the parser object, but separate.
  - No dynamic allocation: the variant is usually stack-based.
  - Obtain the actual object from it and use that. The check is required only once; other uses are statically checked.
  - Can pass in the variant as an out parameter, saving even the copy.
  Of course, the variant has downsides, mainly that you have to either cast or use a static visitor. (A rough sketch of this approach appears at the end of this post.)
- Other downsides?
- Other approaches?

3) Input/Output System

How does the library access the underlying storage?
- Since it needs to access resources from various sources, typically specified as URLs, it needs a flexible and runtime-switchable input system.
- In particular, it should be possible to plug scheme resolvers in at runtime, so that program extensions can provide support for, say, the ftp: scheme.
- Two basic options:
  - Iterator-based approach.
  - Stream-based approach.
  - Other?
- Iterators are tricky to switch at runtime and non-trivial to implement.
- Streams are easier to implement, especially in a polymorphic fashion, but they are a poor abstraction of things like memory-mapped files. Does that matter?
- Streams, not necessarily being random-access, require caching for backtracking libraries like Spirit to work. Alternative: hand-write the parser. XML is not, in my opinion, particularly suited to being implemented with Spirit anyway.
- Is it even possible to have iterators model non-blocking I/O?
- Having tried a few experiments, I favour streams. Iterators are somewhat icky to work with in a hand-written parser, especially as they always need to be passed in pairs (or as a range).

4) Integration With Other Boost Libraries

What other Boost libraries should Boost.Xml work/integrate with?
- For example, does it make sense to provide an interface to the parser that can be used for parsing streaming content? Either non-blocking, with the option to parse partial data and resume when the missing content arrives, or a completely asynchronous implementation that dispatches SAX events through e.g. ASIO?

5) Parser Back-End / Library Organization

- Should Boost.Xml be a complete XML solution, with a parser, a DOM implementation and everything?
- Or should it be split into two parts, one being a parser, the other a DOM implementation with various construction modes?
- Or should even the core parser be split into the actual text parser and the event/pull/whatever interface, so that an HTML or YAML or PYX parser or even an algorithmic content generator can be placed behind it?
- What, then, is the interface between that parser and the user interface?

6) Other Issues

???
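Finally, to make the Boost.Variant compromise from section 2 a bit more concrete, here is a very rough sketch. All names are invented for illustration only; this is not a proposed interface.

#include <boost/variant.hpp>
#include <string>
#include <utility>
#include <vector>

// One small struct per event kind; together they form the event union.
struct start_element {
    std::string name;
    std::vector<std::pair<std::string, std::string> > attributes;
};
struct end_element  { std::string name; };
struct characters   { std::string text; };
struct end_document {};

typedef boost::variant<start_element, end_element, characters, end_document>
    event;

// Usage as imagined above: the parser fills a caller-supplied variant, so
// there is no per-event allocation and the state lives outside the parser.
// (parser, next and handle_start are placeholders, not real names.)
//
//   event ev;
//   while (parser.next(ev)) {                        // fills ev in place
//       if (start_element const* e = boost::get<start_element>(&ev))
//           handle_start(*e);    // checked once, typed access afterwards
//       // ... or dispatch with boost::apply_visitor and a static visitor
//   }

The main point is that the event data lives outside the parser, no per-event allocation is needed, and the type check happens only once at the point of extraction.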