Re: [boost] RFC: Boost.XML API prototype in the sandbox

12 Jul 2007

      Mathias Gaunard wrote:
...
Stefan Seefeld wrote:
...
I would appreciate if anybody interested into a future boost.xml
submission would have a look, provide feedback, or even get
involved into the (ongoing) development.
I have to admit I am a bit confused by the interface.
For example, the first thing I see is that all nodes are actually 
pointers.
What's the best solution to this, and the memory management issues in general?

It's difficult, because a node could be a simple text node with just 
one character in it, or it could be the root of a multi-megabyte 
document.  Clearly in the first case it's OK to copy the data, while in 
the second case you would want to avoid deep copying in some way, e.g. 
with smart pointers and/or some sort of copy-on-write system.  
(Thinking aloud) I can imagine a tree where the top levels have 
different types from the bottom ones.

I have just dug out an old email that I wrote to the xmlwrapp list in 
2004.  I had observed that xmlwrapp, which is a C++ wrapper around 
libxml2 with much in common with Stefan's proposal, had an overhead of 
about 150 bytes per node.  I had been getting binary data from a 
database (with negligible space overhead compared to a C++ struct), 
converting it to in-memory XML and then converting that to HTML or SVG 
using XSLT.  The in-memory XML was terribly bloated to the extent that 
it was the bottleneck in the system, for a couple of reasons.  Firstly, 
as noted above, the data structures had a significant overhead (150 
bytes for the XML node compares to a 30 byte per element overhead in a 
std::list<std::string>).  And secondly there's the overhead for 
self-description i.e. all the strings making up the tag and attribute 
names.  This second issue could be helped if a symbol-table approach 
were used (e.g. the Boost.Spirit symbols class; each n-byte string 
would be reduced to hopefully one pointer).

The view I'm coming to is that I'd prefer an XML class with enough 
template parameters to make these sorts of issues resolvable by the 
user, depending on the requirements of their application:

struct node {};

template <typename STR_T = std::string>
struct textnode: public node, public STR_T { .... };

template <typename STR_T = std::string>
struct comment: public node, public STR_T { .... };

template <typename TAG_NAME_T  = std::string,
           typename ATTR_NAME_T = std::string,
           typename ATTR_VAL_T  = std::string,
           typename CHILD_T     = boost::shared_ptr<node> >
struct element: public node {
   TAG_NAME_T name;
   std::map<ATTR_NAME_T,ATTR_VAL_T> attributes;
   std::list<CHILD_T> children;
};

You would then need parser and writer functions that would work with 
any suitable string and child-pointer types.  I don't know whether the 
CHILD_T parameter can be used to support both pointers-to-children and 
inline-children, nor whether a copy_on_write_ptr<T> could somehow fit 
into this.  Any thoughts?

I'm also interested to hear what people think about XPath.  I have not 
used it much outside of XSLT.  My feeling is that it is easy to write 
"//foo" to find all the <foo> elements, but that this will descend all 
the way to the leaves of your tree even if you know that <foo> elements 
can only appear in certain contexts near the root.  To do efficient 
searches you either need a set of index structures (like XLST's keys, 
or Boost.MultiIndex), or a kind of tree-traversal-iterator that lets 
you be a bit more selective about when to descend into a subtree and 
when to skip it.

Regards,

Phil.

Re: [boost] RFC: Boost.XML API prototype in the sandbox

Phil Endecott