
7 Sep 2006, 12:08 a.m.
Hi,

A few months ago I said I'd take on writing an XML library for Boost. Well, I've finally got some time on my hands and have started with a little bit of brainstorming about the library. I've written up my thoughts in this hopefully halfway comprehensible document and would like to hear everyone's suggestions, opinions, advice, requirements, etc. Especially real-world requirements, as my own are just those of a single person, and that isn't exactly a good basis for a general-purpose XML library.

The current brainstorming is only for the XML reading side of things. The writing side will come afterwards.

So here goes.

------------------------------------------------------------------

Purpose of document: Identify important decisions in the design of a C++ XML parser library.

1) API Type

Pull-API (StAX), Push-API (SAX), Object-Model-API (DOM)?
- All of them, of course! The main question is, which one is the base API?
- DOM is out of the question as the base (performance/memory overhead).
- Implementing a push parser on top of a pull parser is trivial:
  while (fetchEvent()) pushEvent();
- Implementing a pull parser on top of a push parser requires at least generator-style coroutines. This incurs a performance overhead at best and unusability at worst (in limited environments).
- It is therefore best to use a pull model at the lowest level, although this makes the parser implementation more complex.

2) Pull Interface

There are several models of pull interfaces in use. These are mostly for Java and C#, so a C++ parser does not necessarily have to use any of them.

Existing APIs:

--- Java
- XMLPull
- StAX (JSR 173)
- Xerces XNI Pull Configuration

--- .Net
- .Net XmlReader

--- Python
- Python has very little material on pull parsers. There seem to be some available, but they're not popular.

--- Ruby
- Ruby has a built-in XML library with pull support. The API is not yet stable but seems to resemble .Net's XmlReader.

--- C++
- An early version of XPP has a C++ implementation. The interface is Java-style.

API Styles:

From the above APIs, we can gather the following:
- Pull parsing always involves calling a method to obtain the next piece of document information (called an "event"), then processing that piece.
- Two main models seem typical:
-- StAX has a nextEvent() method that returns a reference to an object, identified by a base interface XMLEvent. This reference can then be cast to the appropriate sub-interface. This is the polymorphism approach.
-- XMLPull and .Net's XmlReader also have a next()/read() method. However, they do not return an event object but instead store the information internally, to be queried by special methods. This model is also used by REXML, Ruby's parser. This is the monolith approach.

Polymorphism pros:
- State is not held in the parser object. Calling next() does not necessarily discard the old information.
- Once the correct interface is obtained, all methods on it are guaranteed to work. With the monolith model, calling the wrong method may lead to exceptions or error returns.

Monolith pros:
- Does not need to allocate an object for each parse event. Can in fact hold the information in a very compact way internally.
- No casts necessary.
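To make the comparison a bit more concrete, here are very rough sketches of what the two styles might look like in C++. All names are invented for illustration only; this is not a proposed interface.

#include <boost/shared_ptr.hpp>
#include <string>

// Polymorphism style: each call to next_event() hands back a new event
// object; the caller inspects/downcasts it to the concrete event type.
struct xml_event {
    virtual ~xml_event() {}
};
struct start_element_event : xml_event {
    std::string name;
};
struct polymorphic_reader {
    // The ownership question: raw pointer, smart pointer, or a pointer
    // into parser-owned (static) storage?
    virtual boost::shared_ptr<xml_event> next_event() = 0;
    virtual ~polymorphic_reader() {}
};

// Monolith style: read() advances the parser; the caller then queries
// the parser object itself for the current event's data.
struct monolithic_reader {
    enum event_type { start_element, end_element, characters, end_document };
    virtual bool read() = 0;               // advance to the next event
    virtual event_type type() const = 0;   // kind of the current event
    virtual std::string name() const = 0;  // only meaningful for element events
    virtual ~monolithic_reader() {}
};

The polymorphic sketch pushes the ownership question to the foreground, while the monolithic one keeps all event state inside the parser object.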
Options:
- What are the options that a C++ API has?
- Polymorphism-style API. Return a smart pointer? Returning an allocated object through a raw pointer is unacceptable. Returning a pointer to static storage is possible, but that is basically the monolith in disguise. An existing object cannot be passed IN to be filled with data.
- Monolith-style API. This means, among other things, that passing the current data to a function means passing a reference to the entire parser. Furthermore, it is not possible to pass a small object containing only the data of the actual event to a function, unless that object is written by the API user. Thus, every function would have to either have its own switch on the event type or assert that the passed-in object contains the right data.
- Union of events, like a Boost.Variant. This seems a good compromise between the polymorphic and monolithic approaches:
  - State is not in the parser object, but separate.
  - No dynamic allocation: the variant is usually stack-based.
  - Obtain the actual object from it and use that. The check is required only once; other uses are statically checked.
  - Can pass in the variant as an out parameter, saving even the copy.
  Of course, the variant has downsides, mainly that you have to either cast or use a static visitor. (A rough sketch of this approach appears at the end of this post.)
- Other downsides?
- Other approaches?

3) Input/Output System

How does the library access the underlying storage?
- Since it needs to access resources from various sources, typically specified as URLs, it needs a flexible and runtime-switchable input system.
- In particular, it should be possible to plug scheme resolvers in at runtime, so that program extensions can provide support for, say, the ftp: scheme.
- Two basic options:
  - Iterator-based approach.
  - Stream-based approach.
  - Other?
- Iterators are tricky to switch at runtime and non-trivial to implement.
- Streams are easier to implement, especially in a polymorphic fashion, but they are a poor abstraction of things like memory-mapped files. Does that matter?
- Streams, not necessarily being random-access, require caching for backtracking libraries like Spirit to work. Alternative: hand-write the parser. XML is not, in my opinion, particularly suited to being implemented with Spirit anyway.
- Is it even possible to have iterators model non-blocking I/O?
- Having tried a few experiments, I favour streams. Iterators are somewhat icky to work with in a hand-written parser, especially as they always need to be passed in pairs (or as a range).

4) Integration With Other Boost Libraries

What other Boost libraries should Boost.Xml work/integrate with?
- For example, does it make sense to provide an interface to the parser that can be used for parsing streaming content? Either non-blocking, with the option to parse partial data and resume when the missing content arrives, or a completely asynchronous implementation that dispatches SAX events through e.g. ASIO?

5) Parser Back-End / Library Organization

- Should Boost.Xml be a complete XML solution, with a parser, a DOM implementation and everything?
- Or should it be split into two parts, one being a parser, the other a DOM implementation with various construction modes?
- Or should even the core parser be split into the actual text parser and the event/pull/whatever interface, so that an HTML or YAML or PYX parser or even an algorithmic content generator can be placed behind it?
- What, then, is the interface between that parser and the user interface?

6) Other Issues

???
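Finally, to make the Boost.Variant compromise from section 2 a bit more concrete, here is a very rough sketch. All names are invented for illustration only; this is not a proposed interface.

#include <boost/variant.hpp>
#include <string>
#include <utility>
#include <vector>

// One small struct per event kind; together they form the event union.
struct start_element {
    std::string name;
    std::vector<std::pair<std::string, std::string> > attributes;
};
struct end_element  { std::string name; };
struct characters   { std::string text; };
struct end_document {};

typedef boost::variant<start_element, end_element, characters, end_document>
    event;

// Usage as imagined above: the parser fills a caller-supplied variant, so
// there is no per-event allocation and the state lives outside the parser.
// (parser, next and handle_start are placeholders, not real names.)
//
//   event ev;
//   while (parser.next(ev)) {                        // fills ev in place
//       if (start_element const* e = boost::get<start_element>(&ev))
//           handle_start(*e);    // checked once, typed access afterwards
//       // ... or dispatch with boost::apply_visitor and a static visitor
//   }

The main point is that the event data lives outside the parser, no per-event allocation is needed, and the type check happens only once at the point of extraction.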