Re: [boost] New XML library

10 Dec 2008

On Wed, Dec 10, 2008 at 1:23 PM, Phil Endecott
<spam_from_boost_dev@chezphil.org> wrote:
> Themis Vassiliadis wrote:
>> I have been working on a C++ library like Apache Digester
>> (http://commons.apache.org/digester). I'm intending to convert it
>> following Boost policies described in Requirements and Guidelines.
>> What are the chances of it becoming a Boost library?
>
> Personally I would like to see something like RapidXML in Boost.
>
> It seems that Apache Digester provides an element matching infrastructure.
> This could be useful, as manually iterating through the parse tree that
> something like RapidXML generates can be a bit tiresome.  It should probably
> be layered on top of a lower-level XML parser.

I have a low-level iterator-based parser here:
http://svn.int64.org/viewvc/int64/xml/

The design I've been taking is something like this:

parser.hpp (xml::parser): the lowest level.  Given two UTF-32
compatible forward iterators, it returns one of (ok, done, need_more,
error), a node type (element/xmldecl/etc.), and an iterator range.
This parser performs no allocations, and as such does minimal
structural checking.  It does, however, offer full character validation
if you enable it (via a template parameter).  Really this does only
slightly more than a lexer, and is available if you need top
performance and don't need full XML compliance and validation.
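
To make the shape of that interface concrete, here is a minimal sketch.
It is not the actual xml::parser code: the names status, node_type, token,
next_token and count_elements are made up for illustration, and the
scanner only distinguishes "<...>" tags from character data.  It does show
the three-part result (status, node type, iterator range) and that nothing
needs to be allocated to produce it.

```cpp
#include <string>

namespace sketch {

enum class status { ok, done, need_more, error };
enum class node_type { none, element, text };

// Result of one scanning step: a status, the kind of node found, and the
// input range that covers it. The range points into the caller's buffer,
// so nothing is copied or allocated.
template <class Iter>
struct token {
    status st;
    node_type type;
    Iter first;
    Iter last;
};

// Hypothetical scanner over UTF-32 code points. It recognizes only
// "<...>" tags and runs of character data; a real parser handles far more.
template <class Iter>
token<Iter> next_token(Iter first, Iter last) {
    if (first == last)
        return {status::done, node_type::none, first, last};

    if (*first == U'<') {
        Iter it = first;
        for (++it; it != last; ++it) {
            if (*it == U'<')  // a second '<' inside a tag is malformed
                return {status::error, node_type::none, first, it};
            if (*it == U'>') {
                ++it;
                return {status::ok, node_type::element, first, it};
            }
        }
        // Input ended in the middle of a tag: the caller should supply more.
        return {status::need_more, node_type::none, first, last};
    }

    // Character data runs up to the next '<' or the end of the input.
    Iter it = first;
    while (it != last && *it != U'<')
        ++it;
    return {status::ok, node_type::text, first, it};
}

// Usage: walk a complete in-memory document and count its tags.
inline int count_elements(const std::u32string& doc) {
    int n = 0;
    auto it = doc.begin();
    for (;;) {
        auto t = next_token(it, doc.end());
        if (t.st != status::ok)
            break;
        if (t.type == node_type::element)
            ++n;
        it = t.last;
    }
    return n;
}

}  // namespace sketch
```

Because the result is just a pair of iterators into the input, the caller
decides whether to copy the text, compare it in place, or skip it.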

reader.hpp (xml::reader): the next level.  A UTF-32 push parser that
is fully XML 1.0 and 1.1 compliant, capable of validating the
document, tracking line/column numbers, entity substitution, and other
normal things you'd expect from a parser.

document.hpp (xml::document): a full in-memory document.  A modifiable
version, and constant version which uses an arena allocator to stay as
compact as possible.

As of now, only xml::parser is usable; everything but DTD parsing is
complete.  I have been really busy these past few months and haven't
had a chance to complete it.  The main goal I had when beginning this
was to have something I/O-agnostic that can drop out when it finds an
incomplete stream and be resumed later.  It was really important that
it work just as fantastically with parsing from memory, blocking I/O,
or async I/O.
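
As a rough illustration of the drop-out-and-resume idea, the sketch below
feeds input in arbitrary chunks and keeps whatever is incomplete for the
next call.  It is an assumption-laden simplification, not the library's
code: scan and resumable_scanner are hypothetical names, only "<...>" tags
and text runs are recognized, and it buffers into a std::u32string for
brevity, which the real parser avoids.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

enum class status { ok, need_more };

// Scans buf starting at pos for one complete "<...>" tag or text run.
// On ok, the token is copied to out and pos is moved past it.
// On need_more, pos and out are left untouched.
inline status scan(const std::u32string& buf, std::size_t& pos,
                   std::u32string& out) {
    if (pos == buf.size())
        return status::need_more;

    std::size_t end;
    if (buf[pos] == U'<') {
        end = buf.find(U'>', pos);
        if (end == std::u32string::npos)
            return status::need_more;  // tag is split across chunks
        ++end;                         // include the closing '>'
    } else {
        end = buf.find(U'<', pos);
        if (end == std::u32string::npos)
            return status::need_more;  // text may continue in the next chunk
    }
    out.assign(buf, pos, end - pos);
    pos = end;
    return status::ok;
}

// Hypothetical wrapper: the caller pushes chunks as they arrive (from
// memory, a blocking read, or an async completion handler) and gets back
// every token that has become complete.
class resumable_scanner {
public:
    std::vector<std::u32string> feed(const std::u32string& chunk) {
        buf_ += chunk;
        std::vector<std::u32string> tokens;
        std::u32string tok;
        std::size_t pos = 0;
        while (scan(buf_, pos, tok) == status::ok)
            tokens.push_back(tok);
        buf_.erase(0, pos);  // keep only the incomplete tail for next time
        return tokens;
    }

private:
    std::u32string buf_;
};

}  // namespace sketch
```

Feeding "<a>he", then "llo</a", then ">" yields "<a>", then "hello", then
"</a>": the scanner never blocks waiting for data and never cares where
the chunks came from.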

It should also be very performant, and it is.  With the parser being so
lightweight, UTF-8 decoding is actually a huge bottleneck in my tests,
which led me to allow the parser (via a template parameter) to work
directly with UTF-8 if you don't require full compliance.
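
For a sense of where that cost comes from, here is a sketch of the work a
strict UTF-8 to UTF-32 decoder does for every character, next to a
byte-level scan that skips decoding.  Both functions (decode_utf8 and
find_markup) are hypothetical and written for this example; they are not
taken from the library.  The byte-level scan relies on a property of
UTF-8: bytes below 0x80, such as '<', never occur inside a multi-byte
sequence.

```cpp
#include <cstddef>
#include <string>

namespace sketch {

// Decodes one UTF-8 sequence starting at it. On success, stores the code
// point in out, advances it past the sequence, and returns true. On
// truncated or malformed input, returns false and leaves it unchanged.
template <class Iter>
bool decode_utf8(Iter& it, Iter last, char32_t& out) {
    if (it == last)
        return false;

    const unsigned char b0 = static_cast<unsigned char>(*it);
    int extra;    // number of continuation bytes expected
    char32_t cp;  // code point being assembled
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte

    Iter p = it;
    ++p;
    for (int i = 0; i < extra; ++i, ++p) {
        if (p == last)
            return false;  // truncated sequence
        const unsigned char b = static_cast<unsigned char>(*p);
        if ((b & 0xC0) != 0x80)
            return false;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates, and values past U+10FFFF.
    static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    it = p;
    out = cp;
    return true;
}

// Byte-level alternative: locate the next '<' without decoding anything.
// Returns std::string::npos if there is none.
inline std::size_t find_markup(const std::string& utf8, std::size_t pos = 0) {
    return utf8.find('<', pos);
}

}  // namespace sketch
```

The trade-off is the one mentioned above: scanning raw bytes is cheap, but
it cannot reject malformed sequences or characters that XML disallows, so
it only suits callers who don't need full compliance.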

-- 
Cory Nelson