
Hi,

Since there's a lot of discussion about XML parsers going on at the moment, I thought I'd mention the axemill XML parser that I've been working on. Development stalled a few years ago, but I've just revamped it to work with Boost 1.34 and added a few more features. It's not ready yet, but I would like to submit it to Boost when it is. For those interested, it's at http://www.sf.net/projects/axemill

It's a full validating parser, and it builds a model of the DTD as the DTD is parsed. It then uses that model when building the DOM: elements don't store their name directly; instead they store a reference to the corresponding entry in the model, which provides the name along with additional information such as the permitted attributes and the content model. This means that multiple elements with the same name each incur only a 4-byte overhead (on my system), rather than each storing the full name.

There's currently no support for XPath, XSLT or schemas, but it should be possible to add them without too much hassle.

The same model can be reused for multiple documents, and the structure allows a freshly-built document to be validated as it is constructed: it won't let you create an element that isn't in the model, or add an attribute to an element that the model doesn't specify. That's all the on-the-fly validation it does at the moment, but it wouldn't be too hard to add attribute value checking and element content checking (though the latter might be better suited to a post-construction phase).

Nodes within the tree are referenced by shared_ptr, though you can construct standalone nodes. Child nodes use a boost::variant to store the possible data types (comment, text, CDATA, PI, element), and the interface allows iteration through the child elements. The intention is to also allow separate iteration through all the nodes, as well as to provide a content() function to retrieve the text content.
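To make the idea concrete, here is a minimal sketch of the model-sharing and construction-time validation described above. All names here (ElementInfo, Model, declareElement, etc.) are hypothetical and not axemill's actual API; it also uses std::shared_ptr rather than boost::shared_ptr so the example stands alone:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical names for illustration; axemill's real API may differ.
struct ElementInfo {
    std::string name;                 // element name, stored once in the model
    std::set<std::string> attributes; // attributes the DTD permits
};

class Model {
    std::map<std::string, std::shared_ptr<ElementInfo>> entries_;
public:
    // Would normally be populated by the DTD parser.
    void declareElement(const std::string& name, std::set<std::string> attrs) {
        auto info = std::make_shared<ElementInfo>();
        info->name = name;
        info->attributes = std::move(attrs);
        entries_[name] = info;
    }
    std::shared_ptr<ElementInfo> find(const std::string& name) const {
        auto it = entries_.find(name);
        return it == entries_.end() ? nullptr : it->second;
    }
};

class Element {
    std::shared_ptr<ElementInfo> info_; // one pointer, not a copy of the name
    std::map<std::string, std::string> attrs_;
public:
    // Refuses to construct an element the model doesn't declare.
    explicit Element(std::shared_ptr<ElementInfo> info) : info_(std::move(info)) {
        if (!info_)
            throw std::runtime_error("element not declared in model");
    }
    const std::string& name() const { return info_->name; }
    // Refuses attributes the model doesn't permit on this element.
    void setAttribute(const std::string& key, const std::string& value) {
        if (!info_->attributes.count(key))
            throw std::runtime_error("attribute not permitted by model");
        attrs_[key] = value;
    }
};
```

Because every element with the same name shares one ElementInfo, each element pays only for the pointer, and validation at construction time is just a lookup against the shared entry.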
Internally, everything is processed as 32-bit Unicode, so input and output have to be converted. I provide convertTo<encoding> and convertFrom<encoding> functions to do this, as well as overloads of some functions that take std::string parameters and convert them using the "default" encoding (currently ASCII).

There's also scope for reading data off the web, using the URIs from public identifiers to retrieve the DTD, for example, but currently the only URI scheme supported is file://.

Like I said, it's not ready yet, but it might provide an interesting alternative direction.

Anthony
--
Anthony Williams
Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL