Proposal: XML APIs in boost

Some years ago I proposed an XML API for inclusion into boost (http://lists.boost.org/Archives/boost/2003/06/48955.php). Everybody agreed that such an API would be very useful as part of boost. Unfortunately, though, after some weeks of discussing a number of details, I got distracted, and so I never managed to submit an enhancement. I've now started to look into this topic again, and wrote down the start of a DOM-like API as I believe would be suitable for boost. Here are some highlights:

* Users access DOM nodes as <>_ptr objects. All memory management is hidden, and only the document itself needs to be managed.
* The API is modular, to allow incremental addition of new modules with minimal impact on existing ones. This implies that implementations may support only a subset of the API (for example, no validation).
* All classes are parametrized on the (Unicode) string type, so the code can be bound to arbitrary Unicode libraries, or even std::string if all potential input is ASCII only.
* The implementation uses existing libraries (libxml2, to be specific), since writing an XML DOM library requires substantial effort.

A first sketch of an XML API has been submitted to the boost file vault under the 'Programming Interfaces' category. It contains demo code as well as some auto-generated documentation. I'm aware that this needs some more work before I can attempt a formal submission. This is simply to see whether there is still any interest in such an API, and to get some discussion going on the design. Regards, Stefan
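To make the string-type parametrization from the highlights concrete, here is a minimal, hypothetical sketch; none of these class or member names come from the actual proposal in the file vault, they only illustrate how a node type templated on the string class could accept std::string or any Unicode string type:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: an element node parametrized on the string type,
// so any Unicode string class (or std::string for ASCII-only input)
// can be plugged in. The names are invented for this illustration.
template <typename String>
class element
{
public:
    explicit element(String const &name) : name_(name) {}

    String const &name() const { return name_; }

    // In the proposed design, children would be owned by the document
    // and users would only see lightweight pointer-like handles; here
    // we simply store them by value to keep the sketch self-contained.
    element &append_child(String const &child_name)
    {
        children_.push_back(element(child_name));
        return children_.back();
    }

    std::size_t size() const { return children_.size(); }

private:
    String name_;
    std::vector<element> children_;
};
```

Instantiating `element<std::string>` gives the ASCII-only variant mentioned above; a project using ICU or Glib::ustring would instantiate the same template with its own string type.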

At 10:44 PM -0500 10/31/05, Stefan Seefeld wrote:
Some years ago I proposed an XML API for inclusion into boost (http://lists.boost.org/Archives/boost/2003/06/48955.php). Everybody agreed that such an API would be very useful as part of boost. Unfortunately, though, after some weeks of discussing a number of details, I got distracted, and so I never managed to submit an enhancement.
[snip]
I'm aware that this needs some more work before I can attempt a formal submission. This is simply to see whether there is still any interest in such an API, and to get some discussion on the design.
Yes, I have interest. That's one. ;-) -- -- Marshall Marshall Clow Idio Software <mailto:marshall@idio.com> It is by caffeine alone I set my mind in motion. It is by the beans of Java that thoughts acquire speed, the hands acquire shaking, the shaking becomes a warning. It is by caffeine alone I set my mind in motion.

* Stefan Seefeld <seefeld@sympatico.ca> [2005-10-31 22:47]:
Some years ago I proposed an XML API for inclusion into boost (http://lists.boost.org/Archives/boost/2003/06/48955.php). Everybody agreed that such an API would be very useful as part of boost. Unfortunately, though, after some weeks of discussing a number of details, I got distracted, and so I never managed to submit an enhancement.
I've now started to look into this topic again, and wrote down the start of a DOM-like API as I believe would be suitable for boost.
Here are some highlights:
* Users access DOM nodes as <>_ptr objects. All memory management is hidden, and only the document itself needs to be managed.
* The API is modular to allow incremental addition of new modules with minimal impact on existing ones. This implies that implementations may only support a subset of the API. (For example, no validation.)
* All classes are parametrized on the (Unicode) string type, so the code can be bound to arbitrary Unicode libraries, or even std::string if all potential input is ASCII only.
* The implementation uses existing libraries (libxml2, to be specific), since writing an XML DOM library requires substantial effort.
A first sketch of an XML API has been submitted to the boost file vault under the 'Programming Interfaces' category. It contains demo code as well as some auto-generated documentation.
I'm aware that this needs some more work before I can attempt a formal submission. This is simply to see whether there is still any interest in such an API, and to get some discussion on the design.
I'm going to respond off-the-cuff, so excuse me if what I mention is covered in your sketch. Simply put, the Java APIs have moved away from the W3C DOM. In that language, developers have moved to JDOM, DOM4J, or XOM for node surgery. The W3C DOM predates namespaces, and namespaces feel kludgy in it. It permits the construction of documents that are unlikely in the wild; most documents conform to XML Namespaces. Of the alternate object models noted above, only DOM4J separates interface from implementation as rigidly as the W3C DOM, using the factory pattern to create all nodes.

More recent object models in Java, like XMLBeans, move away from modeling XML as a tree of nodes connected by links, and instead model XML as a target node with a set of axes that are traversed by iterators rather than node references. This model is the most C++-like. There are also document object models coming out of XPath and XSLT that are not as well known but are all axis-based: Saxon's NodeInfo model, Jaxen's Navigator model, and Groovy's GPath model. All of these models are immutable. They support transformations and queries, and for many applications that is all that is necessary. XQuery, XSLT, and XPath all generate new documents from immutable documents.

The need for document surgery in in-memory applications is not as common as one might think; transformation is often easier to express. I'd suggest, in any language-wide implementation of XML, attempting to separate transformation and query from update. They are two very different applications. I'd suggest starting with supporting XML documents that conform to the XPath and XQuery data model, and working backwards as the need arises. It makes for a much more concise library, and removes a lot of methods for rarely needed, often pathological, mutations. Implementing an object model would be much easier if you implement the 95% that is most frequently used.
And if you separate the complexity of document mutation from the relative simplicity of iteration and transformation. Cheers. -- Alan Gutierrez - alan@engrm.com - http://engrm.com/blogometer/

Alan, thank you for your interesting points. The API I suggest is modeled neither after the W3C DOM IDL nor its Java implementation. Many people have expressed discomfort both with the W3C DOM API and with the idea of simply transcribing the Java API to C++. Therefore, the API I suggest here is (so I hope) as C++-like as it can be, while still giving full flexibility to operate on (i.e. inspect as well as modify) XML documents. From the little I could gather about the alternatives you mention, it sounds like they would make very nice access layers on top of the base API (axis-oriented iterators, say).
I'd suggest, in any language-wide implementation of XML, attempting to separate transformation and query from update. They are two very different applications.
I'm not sure I understand what you mean by transformation. How is it different from update? Or is the former simply a (coarse-grained) special case of the latter, using a particular language (such as XSLT) to express the mapping?
I'd suggest starting with supporting XML documents that conform to the XPath and XQuery data model, and working backwards as the need arises. It makes for a much more concise library, and removes a lot of methods for rarely needed, often pathological, mutations.
There are clearly very different use cases to be considered. We should collect them and try to make sure that all of them can be expressed in a concise way. I'm not sure all of them operate on the same API layer. The code I posted supports XPath queries. While the result of an XPath query can have different types, right now only node-sets are supported (maybe boost::variant would be good to describe all of the possible types). I'm not quite sure I understand what you mean by 'XPath data model'.
Implementing an object model would be much easier if you implement the 95% that is most frequently used. And if you separate the complexity of document mutation from the relative simplicity of iteration and transformation.
Could you show an example of both, what you consider (overly) complex as well as simple? While the API in my code is certainly not complete (namespaces are missing, notably), I find it quite simple and intuitive. I don't think it needs to become much more complex to be complete. In particular, I'm hoping that we can make the API modular, so that document access and document validation are kept separate (for example). Maybe that is what you mean, I'm not sure. Regards, Stefan

* Stefan Seefeld <seefeld@sympatico.ca> [2005-11-01 10:18]:
Alan,
thank you for your interesting points. The API I suggest is modeled neither after the W3C DOM IDL nor its Java implementation.
Many people have expressed discomfort both with the W3C DOM API and with the idea of simply transcribing the Java API to C++.
Therefore, the API I suggest here is (so I hope) as C++-like as it can be, while still giving full flexibility to operate on (i.e. inspect as well as modify) XML documents.
From the little I could gather about the alternatives you mention, it sounds like they would make very nice access layers on top of the base API (axis-oriented iterators, say).
I'd suggest, in any language-wide implementation of XML, attempting to separate transformation and query from update. They are two very different applications.
I'm not sure I understand what you mean by transformation. How is it different from update? Or is the former simply a (coarse-grained) special case of the latter, using a particular language (such as XSLT) to express the mapping?
Transformation engines are XQuery, XSLT, STX, and Groovy GPath. They do not update the document provided; they produce a new document. That is what I mean by transformation. The input XML document is not changed: it is read, and a new document is emitted. The document object model does not need to be mutable, so you can perform all sorts of optimizations for navigation. The ability to add or remove a node makes a document object model far more complex. Many people prefer this mode of operation over adding and removing nodes.

Node insert/remove appears to be a common operation because of web programming, where changing the DOM in the browser changes the display of the page. When you are not programming for the pretty side-effects, node surgery becomes a real pain. Reading the document in, shuffling nodes, and writing it back out is cumbersome; a lot of repetitious code is spent on the add and remove. It's much easier to express an XML operation in terms of a query that returns a document, or as a reactor to a set of events.
I'd suggest starting with supporting XML documents that conform to the XPath and XQuery data model, and working backwards as the need arises. It makes for a much more concise library, and removes a lot of methods for rarely needed, often pathological, mutations.
There are clearly very different use cases to be considered. We should collect them and try to make sure that all of them can be expressed in a concise way. I'm not sure all of them operate on the same API layer.
I'm sure they could, but I'm sure it would make a heavier API than necessary. XSLT, XQuery, and XPath simply do not require "removeChild".
The code I posted supports XPath queries. While the result of an XPath query can have different types, right now only node-sets are supported
Which is cool, since in XPath an atomic value is the same thing as a node set that contains only that atomic value.
(Maybe boost::variant would be good to describe all of the possible types.)
Types are described by a qualified name in XPath. Someone who is implementing a host language for XPath, like XQuery or XSLT, will require a named type.
I'm not quite sure I understand what you mean by 'XPath data model'.
http://www.w3.org/TR/xpath-datamodel/
Implementing an object model would be much easier if you implement the 95% that is most frequently used. And if you separate the complexity of document mutation from the relative simplicity of iteration and transformation.
Could you show an example of both, what you consider (overly) complex as well as simple? While the API in my code is certainly not complete (namespaces are missing, notably), I find it quite simple and intuitive. I don't think it needs to become much more complex to be complete.
You are right on the money with the W3C DOM. That is an overly complex object model. It allows for the creation of documents that do not adhere to XML Namespaces. If it were up to me, I'd create a document object model that was an XML Namespaces document object model, instead of an XML document object model. The W3C DOM is designed to accept <a:b:c/> as a valid element name. For a good example of production code, I'd look at Saxon's NodeInfo object. The code is woolly, but it describes the subset of data used in XPath, XQuery, and XSLT, and the implementation gotchas. It really is an implementation of the XPath data model, and probably the best open-source example of how to implement it.
In particular, I'm hoping that we can make the API modular, so that document access and document validation are kept separate (for example). Maybe that is what you mean, I'm not sure.
Yes. There are different breakdowns. Validation is something that people will want to do without, for the sake of performance. An XML document can be very useful read-only. I find that in my work I don't have call to update nodes; since the XML comes from Atom feeds or SQL databases, replacing nodes makes little sense. http://www.w3.org/TR/xml-infoset/ I'd start by modeling the information, then move on to a separate interface for mutating it. I'd put axes high on the list, since that is how XML has come to be seen by many, and they are a natural fit for the C++ STL. That strikes me as the best way to work with XML in C++: using C++ STL algorithms as a query language, navigating a very efficient XML document object model, emitting a new document. Cheers. -- Alan Gutierrez - alan@engrm.com - http://engrm.com/blogometer/

Alan Gutierrez wrote:
I'm not sure I understand what you mean by transformation. How is it different from update? Or is the former simply a (coarse-grained) special case of the latter, using a particular language (such as XSLT) to express the mapping?
Transformation engines are XQuery, XSLT, STX, and Groovy GPath.
They do not update the document provided; they produce a new document. That is what I mean by transformation. The input XML document is not changed: it is read, and a new document is emitted.
Fine. And what API are you looking for to do such transformations? What's the granularity that makes the most sense?

  document_ptr document = parse_file(input);
  stylesheet_ptr transformer = load_stylesheet(stylesheet);
  document_ptr new_document = transformer->transform(document);

ought to be enough, no?
The document object model does not need to be mutable. Thus you can perform all sorts of optimizations for navigation.
Could you provide an example? I'm not sure I see what you have in mind.
The ability to add or remove a node makes a document object model far more complex.
Really? Look at my design: there aren't all that many methods to begin with, and only a fraction are about modifying documents / nodes. I don't believe non-modifiable documents are so frequent that they require an entirely different object model.
Many people prefer this mode of operation over adding and removing nodes.
As I said, the idea of an API to navigate based on axes (for example) sounds very interesting and elegant, but I believe that could easily be layered on top of the API I'm presently proposing.
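As a rough illustration of how such an axis layer could sit on top of a plain node tree, here is a hedged sketch: the axis is just a function producing a flat sequence in document order, which ordinary STL iterators and algorithms can then walk. The node type and function name are invented for this example, not taken from either party's proposal:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in node type: a name plus child nodes, nothing more.
struct node
{
    std::string name;
    std::vector<node> children;
};

// Collect the XPath-style descendant-or-self axis in document order.
// A real access layer would expose this lazily through iterators;
// materializing the sequence keeps the sketch short.
void descendant_or_self(node const &n, std::vector<std::string> &out)
{
    out.push_back(n.name);
    for (std::size_t i = 0; i < n.children.size(); ++i)
        descendant_or_self(n.children[i], out);
}
```

Because the axis is exposed as an ordinary sequence, std::find, std::count, and friends work on it directly, which is the "natural fit for the C++ STL" point made elsewhere in the thread.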
Node insert/remove appears to be a common operation, because of web programming, where chaning the dom in the browser changes the display of the page.
I don't quite agree that this is the number one use case. I haven't actually looked into boost::serialization, but I'd expect a DOM API to be useful there. I've written libraries for data management that used some kind of DOM API, too. Or configuration data, or...
When you are not programming for the pretty side-effects, node surgery becomes a real pain. Reading the document in, shuffling nodes, and writing it back out is cumbersome; a lot of repetitious code is spent on the add and remove.
Ok, so let's make it concise then. But someone has to create the document that you want to read in and transform. And may be that someone would prefer boost::xml if it existed. :-) Regards, Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
Some years ago I proposed an XML API for inclusion into boost (http://lists.boost.org/Archives/boost/2003/06/48955.php). Everybody agreed that such an API would be very useful as part of boost. Unfortunately, though, after some weeks of discussing a number of details, I got distracted, and so I never managed to submit an enhancement.
I've now started to look into this topic again, and wrote down the start of a DOM-like API as I believe would be suitable for boost.
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first? -- Dave Abrahams Boost Consulting www.boost-consulting.com

On Nov 1, 2005, at 12:33 PM, David Abrahams wrote:
Stefan Seefeld <seefeld@sympatico.ca> writes:
Some years ago I proposed an XML API for inclusion into boost (http://lists.boost.org/Archives/boost/2003/06/48955.php). Everybody agreed that such an API would be very useful as part of boost. Unfortunately, though, after some weeks of discussing a number of details, I got distracted, and so I never managed to submit an enhancement.
I've now started to look into this topic again, and wrote down the start of a DOM-like API as I believe would be suitable for boost.
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first?
Indeed this is the major stumbling block. Matthias

On Nov 1, 2005, at 6:33 AM, David Abrahams wrote:
Stefan Seefeld <seefeld@sympatico.ca> writes:
Some years ago I proposed an XML API for inclusion into boost (http://lists.boost.org/Archives/boost/2003/06/48955.php). Everybody agreed that such an API would be very useful as part of boost. Unfortunately, though, after some weeks of discussing a number of details, I got distracted, and so I never managed to submit an enhancement.
I've now started to look into this topic again, and wrote down the start of a DOM-like API as I believe would be suitable for boost.
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first?
I don't agree with this. We need Unicode for handling XML documents, yes, but we don't need a Unicode library first and an XML library second: we could accept a solid XML library (which is mainly about navigation and manipulation of the XML tree) and drop a Unicode string into it later on. Stefan is even proposing to parameterize over the string type, so it becomes a non-issue. Doug

Doug Gregor <dgregor@cs.indiana.edu> writes:
On Nov 1, 2005, at 6:33 AM, David Abrahams wrote:
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first?
I don't agree with this. We need Unicode for handling XML documents, yes, but we don't need a Unicode library first and an XML library second: we could accept a solid XML library (which is mainly about navigation and manipulation of the XML tree) and drop a Unicode string into it later on. Stefan is even proposing to parameterize over the string type, so it becomes a non-issue.
Cool! -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first?
I carefully worked around this issue by making the string type a template parameter. :-) That's not only because boost doesn't have a unicode library yet, but because people might use different libraries for that, or even use std::string if they are careful. Regards, Stefan

"Stefan Seefeld" <seefeld@sympatico.ca> wrote in message news:43678835.2010605@sympatico.ca...
David Abrahams wrote:
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first?
I carefully worked around this issue by making the string type a template parameter. :-)
That's not only because boost doesn't have a unicode library yet, but because people might use different libraries for that, or even use std::string if they are careful.
IMO, Unicode support goes way beyond a string template parameter. Unicode means different character sets to support, different encoding forms, different encoding schemes, and different tradeoffs in optimization on top of all that. Gennadiy.

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
"Stefan Seefeld" <seefeld@sympatico.ca> wrote in message news:43678835.2010605@sympatico.ca...
David Abrahams wrote:
IIUC the major obstacle to XML support is proper unicode support. Don't we need a unicode library first?
I carefully worked around this issue by making the string type a template parameter. :-)
That's not only because boost doesn't have a unicode library yet, but because people might use different libraries for that, or even use std::string if they are careful.
IMO, Unicode support goes way beyond a string template parameter. Unicode means different character sets to support, different encoding forms, different encoding schemes, and different tradeoffs in optimization on top of all that.
Sort of. For XML processing, the primary feature of Unicode is the extended character set. For XML 1.0, once an XML processor has decided whether or not a given character is whitespace, one of the special characters (such as <, >, and &), a name start character, a name character, or "other", the peculiarities of Unicode are mostly irrelevant. Obviously, there has to be code to handle the detection of the input encoding, and conversion to a stream of Unicode codepoints, in order to facilitate such classification. However, beyond that, the details don't matter. It may be that for schema processing, or XPath processing, you need more Unicode facilities; I never got that far when writing my XML processor. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
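A minimal sketch of the classification step Anthony describes, operating on Unicode codepoints after the input-encoding conversion he mentions. The ranges here cover only the common ASCII cases plus a crude stand-in for the extended name-start ranges; a real processor would use the full character-class tables from the XML 1.0 recommendation:

```cpp
#include <cassert>

// Coarse character categories an XML 1.0 processor distinguishes.
enum char_class { cc_whitespace, cc_markup, cc_name_start, cc_other };

// Classify a single Unicode codepoint. Simplified: the name-start test
// treats every codepoint >= 0xC0 as a name-start character, which is
// only an approximation of the real NameStartChar production.
char_class classify(unsigned long cp)
{
    // XML 1.0 whitespace: space, tab, LF, CR.
    if (cp == 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0D)
        return cc_whitespace;
    // Markup-significant special characters.
    if (cp == '<' || cp == '>' || cp == '&')
        return cc_markup;
    // ASCII letters, '_' and ':' begin names; many non-ASCII ranges
    // do too, which is where the extended character set matters.
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
        cp == '_' || cp == ':' || cp >= 0xC0)
        return cc_name_start;
    return cc_other;
}
```

The point of the sketch is the shape of the problem: once codepoints reach this function, the rest of the parser needs no further Unicode knowledge.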

IMO, Unicode support goes way beyond a string template parameter. Unicode means different character sets to support, different encoding forms, different encoding schemes, and different tradeoffs in optimization on top of all that.
Sort of. For XML processing, the primary feature of Unicode is the extended character set. For XML 1.0, once an XML processor has decided whether or not a given character is whitespace, one of the special characters (such as <, >, and &), a name start character, a name character, or "other", the peculiarities of Unicode are mostly irrelevant. Obviously, there has to be code to handle the detection of the input encoding, and conversion to a stream of Unicode codepoints, in order to facilitate such classification. However, beyond that, the details don't matter.
I think it's more than just that.

Scenario 1: I prefer to parse documents that use only the first plane, with UCS-2 as the encoding form and UTF-8 or UTF-16 as the encoding scheme. IOW, I will always use wchar_t and wstring.

Scenario 2: I prefer to parse documents that use only ASCII chars, with 8-bit as the encoding form and 7-bit as the encoding scheme. IOW, I prefer to use char and std::string, and I do not want to know about any transcoding, wide chars, etc.

Scenario 3: I prefer to parse documents that use the whole Unicode set, with UTF-16 as the encoding form and UTF-8 or UTF-16 as the encoding scheme, and I want the parser to be lazy. IOW, if it is a big (huge) XML document that uses UTF-8, I do not want the parser to convert any CDATA immediately into the native encoding form until requested, but only to do some local char-by-char conversion required for markup detection. (Essentially I want to limit memory usage and unnecessary work.)

Scenario 4: I prefer to parse documents that use the whole Unicode set, with UCS-4 as the encoding form, and support a wide variety (10 or more) of different encoding schemes. I do not care about performance and memory usage that much, but prefer a single parser that does it all.

I could list a lot of different usage schemes with different tradeoffs. Eventually it is bound to affect the XML parser interface in regards to Unicode support (instead of Unicode I would prefer to use the terms charsets and encoding scheme sets; Unicode is just one particular charset/encoding-scheme combination). Gennadiy

On Nov 1, 2005, at 1:26 PM, Gennadiy Rozental wrote:
Scenario 1: I prefer to parse documents Scenario 2: I prefer to parse documents Scenario 3: I prefer to parse documents Scenario 4: I prefer to parse documents
That's a lot of parsing.
I could list a lot of different usage schemes with different tradeoffs. Eventually it bound to affect XML parser interface in regards to Unicode
The parsing interface in an XML library will be a handful of functions, with a single overloaded name, that take in iterators/streams/filenames/URLs/whatever and produce an XML document. In user code, parsing will take about 5 lines of code:

  try {
    doc = parse_xml(input);
    process_xml_doc(doc);
  } catch (xml_input_error&) {
    // report failure
  }

A library should focus on what people really spend their time on. For XML, this is navigating, traversing, and manipulating XML documents, not parsing. Get a solid, usable interface for that and we'll have a Boost XML library. Leave XML parsing to the underlying library (Stefan wisely chose libxml2) and revisit it later for those few users who need the full parsing expressivity that you describe. Doug

Gennadiy Rozental wrote:
IMO, Unicode support goes way beyond a string template parameter. Unicode means different character sets to support, different encoding forms, different encoding schemes, and different tradeoffs in optimization on top of all that.
I think it is important to distinguish between how expressive the API is concerning its string type, encoding, etc., and how the library (or its backend) actually deals with the content. The implementation I propose deals with Unicode perfectly fine (storing UTF-8 internally), but as far as the C++ XML API is concerned I delegate Unicode access to a template parameter and associated traits that can be tuned for performance (to avoid copies, say), maintenance (to avoid certain runtime dependencies, for example), etc. It's just a reflection of the orthogonality of the two domains we are dealing with. Regards, Stefan

The implementation I propose deals with unicode perfectly fine (storing utf-8 internally),
What if I prefer UCS-4 or plain 8-bit?
but as far as the C++ XML API is concerned I delegate unicode access to a template parameter and associated traits that can be
So now it's a traits template parameter (which in itself isn't exactly correct; you probably meant policy). Yes, under a generic Policy template parameter you could hide almost anything. But a policy is much more than just a string type. And then I would be interested in the concept this policy represents. Gennadiy

Gennadiy Rozental wrote:
The implementation I propose deals with unicode perfectly fine (storing utf-8 internally),
What if I prefer UCS4 or plain 8 bit?
Why would you care? The API lets you plug in your own Unicode string type, so you can access the data and map it to whatever encoding you want. Of course, the choice of the encoding used by the backend will have an impact on performance and memory usage, and it would be nice if that could be tuned by the user. Yet, that is an implementation detail. If we required backends to *use* different encodings internally, I'm afraid that would reduce the choice a lot (I'm actually not aware of any implementation that gives such a choice).
And then I would be interested in the concept this policy represents.
Have a look into the code! I admit right now it's totally trivial, as it only provides functions to convert in and out between the externally visible Unicode type and the type libxml2 uses ('xmlChar *'). There isn't even an explicit second template parameter for it, as I deduce it from the first. If that proves to be insufficient, I'll add it as an explicit second parameter to all types. Regards, Stefan
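For readers following along, here is a hedged sketch of what such in/out conversion traits might look like. In libxml2, 'xmlChar' really is a typedef for unsigned char holding UTF-8; everything else here (the traits name, the member names, the std::string specialization) is illustrative, not the code Stefan posted, and the std::string case assumes the caller keeps the data UTF-8 or plain ASCII:

```cpp
#include <cassert>
#include <string>

// Stand-in for libxml2's typedef; avoids a libxml2 dependency here.
typedef unsigned char xmlChar;

// Primary template: specialized per user string type.
template <typename String> struct string_traits;

// Specialization for std::string: 'in' converts from the backend's
// internal form to the user's string type, 'out' converts back.
template <> struct string_traits<std::string>
{
    static std::string in(xmlChar const *s)
    {
        // libxml2 strings are NUL-terminated UTF-8 byte sequences.
        return std::string(reinterpret_cast<char const *>(s));
    }

    static xmlChar const *out(std::string const &s)
    {
        // Valid only as long as 's' is alive and unmodified.
        return reinterpret_cast<xmlChar const *>(s.c_str());
    }
};
```

A Unicode library's string type would get its own specialization doing real transcoding where its internal form is not UTF-8; deducing the traits from the string type, as Stefan describes, keeps the second template parameter implicit.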

--- Stefan Seefeld wrote:
This is simply to see whether there is still any interest in such an API, and to get some discussion on the design.
Count me in as an interested party. I'm currently studying the Arabica library <http://www.jezuk.co.uk/cgi-bin/view/arabica> for my XML needs. (It even uses Boost.LexicalCast and Boost.Regex, among other things.) In the meantime, anything that avoids a dependency on Glib::ustring would be welcome! Cromwell D. Enage __________________________________ Yahoo! FareChase: Search multiple travel sites in one click. http://farechase.yahoo.com

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Cromwell Enage Sent: Tuesday, November 01, 2005 8:58 AM To: boost@lists.boost.org Subject: Re: [boost] Proposal: XML APIs in boost
--- Stefan Seefeld wrote:
This is simply to see whether there is still any interest in such an API, and to get some discussion on the design.
Count me in as an interested party. I'm currently studying the Arabica library <http://www.jezuk.co.uk/cgi-bin/view/arabica> for my XML needs. (It even uses Boost.LexicalCast and Boost.Regex, among other things.) In the meantime, anything that avoids a dependency on Glib::ustring would be welcome!
Cromwell D. Enage
Maybe I'm too late in this thread. Could someone explain what the benefits of the proposed Boost XML APIs would be over the existing Xerces (http://xml.apache.org/)? --Suman

--- Suman Cherukuri wrote:
Maybe I'm too late in this thread.
Your timing is perfect.
Could someone explain what the benefits of the proposed Boost XML APIs would be over the existing Xerces (http://xml.apache.org/)?
I haven't looked at the design, but it seems that Boost.XML, like Arabica, will provide a C++ interface to make use of existing C parsers, e.g. libxml2 or Xerces. Cromwell D. Enage

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Cromwell Enage Sent: Tuesday, November 01, 2005 11:12 AM To: boost@lists.boost.org Subject: Re: [boost] Proposal: XML APIs in boost
--- Suman Cherukuri wrote:
Maybe I'm too late in this thread.
Your timing is perfect.
Could someone explain what the benefits of the proposed Boost XML APIs would be over the existing Xerces (http://xml.apache.org/)?
I haven't looked at the design, but it seems that Boost.XML, like Arabica, will provide a C++ interface to make use of existing C parsers, e.g. libxml2 or xerces.
Cromwell D. Enage
Xerces has good C++ support. I'm not sure if this discussion is about extending C-based XML parsers like libxml2 and Expat to provide a C++ interface, or about having a whole new XML parser within Boost. If it is the latter, why? --Suman

--- Suman Cherukuri wrote:
Xerces has good C++ support. I'm not sure if this discussion is about extending 'C' based XML parsers like libxml2 and Expat to provide C++ interface or to have a whole new XML parser within Boost.
The discussion is about providing a simple, effective C++ interface for navigating and manipulating XML document trees. Parsing will be left to the backend library of the user's choice. The Apache Software License, under which Xerces is distributed, has more restrictive terms than the Boost Software License. This turns off quite a few people from using it, myself included. Cromwell D. Enage

Cromwell Enage wrote:
--- Stefan Seefeld wrote:
This is simply to see whether there is still any interest into such an API, and to get some discussion on the design.
Count me in as an interested party.
And me :). One thing that I would like to see is the use of [] for performing XPath queries, e.g.: xmldom << L"<xml-data>...</xml-data>"; // set xml std::cout << "found " << xmldom[ L"xml-data/foo/@bar" ]; With current technologies like Lambda and Spirit, we can make a streamlined XML API that fits in with C++ paradigms. - Reece

* Reece Dunn <msclrhd@hotmail.com> [2005-11-01 13:17]:
Cromwell Enage wrote:
--- Stefan Seefeld wrote:
This is simply to see whether there is still any interest into such an API, and to get some discussion on the design.
Count me in as an interested party.
And me :).
One thing that I would like to see is the use of [] for performing XPath queries, e.g.:
xmldom << L"<xml-data>...</xml-data>"; // set xml
std::cout << "found " << xmldom[ L"xml-data/foo/@bar" ];
With current technologies like Lambda and Spirit, we can make a streamlined XML API that fits in with C++ paradigms.
This is nice syntax for both parsing and XQuery. I'd suggest syntax that returns an axis as well. axis descendant_or_self = xmldom[ L"/xml/foo" ].descendant_or_self; for (i = descendant_or_self.begin(); i != descendant_or_self.end(); ++i) { node n = *i; if (!n.namespace_uri().empty()) { std::cout << n.namespace_uri() << "\n"; } } Excuse my C++. I'm rusty. -- Alan Gutierrez - alan@engrm.com - http://engrm.com/blogometer/

On 10/31/05 10:44 PM, "Stefan Seefeld" <seefeld@sympatico.ca> wrote: [SNIP]
Here are some highlights:
* Users access DOM nodes as <>_ptr objects. All memory management is hidden, and only the document itself needs to be managed.
I hope there's support for custom allocators somewhere in there. (Maybe indirectly through the string type's allocator.) [SNIP]
* All classes are parametrized around the (unicode) string type, so the code can be bound to arbitrary unicode libraries, or even std::string, if all potential input will be ASCII only. [TRUNCATE]
-- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
On 10/31/05 10:44 PM, "Stefan Seefeld" <seefeld@sympatico.ca> wrote:
[SNIP]
Here are some highlights:
* Users access DOM nodes as <>_ptr objects. All memory management is hidden, and only the document itself needs to be managed.
I hope there's support for custom allocators somewhere in there. (Maybe indirectly through the string type's allocator.)
Not really, as the backend does the memory allocation. Here again, as with the internal character encoding: were we to impose fine-grained user control over these implementation policies, we would reduce the set of potential backends considerably. (libxml2 lets 'users' define their own memory (de)allocators, but that is a configuration choice, and I'm not sure whether we want to bind so tightly to that.) Regards, Stefan

On 11/2/05 8:08 AM, "Stefan Seefeld" <seefeld@sympatico.ca> wrote:
Daryle Walker wrote:
On 10/31/05 10:44 PM, "Stefan Seefeld" <seefeld@sympatico.ca> wrote:
[SNIP]
Here are some highlights:
* Users access DOM nodes as <>_ptr objects. All memory management is hidden, and only the document itself needs to be managed.
I hope there's support for custom allocators somewhere in there. (Maybe indirectly through the string type's allocator.)
Not really, as the backend does the memory allocation. Here again, as with the internal character encoding: were we to impose fine-grained user control over these implementation policies, we would reduce the set of potential backends considerably. (libxml2 lets 'users' define their own memory (de)allocators, but that is a configuration choice, and I'm not sure whether we want to bind so tightly to that.)
Why would we use a back end? Why bother making a Boost XML library if we're just going to make a wrapper that punts to an XML-specific open-source library? If we're not going to just mirror the back end, then we're going to have to configure translations between our C++ front end and the back end, which could include allocations. -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

On 11/8/05, Daryle Walker <darylew@hotmail.com> wrote:
Why would we use a back end? Why bother making a Boost XML library if we're just going to make a wrapper that punts to an XML-specific open-source library? If we're not going to just mirror the back end, then we're going to have to configure translations between our C++ front end and the back end, which could include allocations.
I haven't seen an answer to your questions, but I should pose the same ones. I thought a boost wrapper should only be used for system libraries, but not for things like libxml2. Personally, I strongly dislike the wrapper for things like libxml2. I think it devalues what boost is about! The only argument I've seen for not writing the back end is that it is a lot of additional work.

Jose wrote:
On 11/8/05, Daryle Walker <darylew@hotmail.com> wrote:
Why would we use a back end? Why bother making a Boost XML library if we're just going to make a wrapper that punts to an XML-specific open-source library? If we're not going to just mirror the back end, then we're going to have to configure translations between our C++ front end and the back end, which could include allocations.
I haven't seen an answer to your questions, but I should pose the same ones.
I'm not sure I understand the questions, or the assumptions they are based on. Have you bothered looking at the code ? Have you noted what 'translation' actually means there, i.e. that there is *nothing* but smart pointers being allocated there ? As to the 'why bother', well, the proposal is about an API, i.e. interface, which may or may not be implemented by a separate backend. If you are not happy with a libxml2 backend, well, good luck implementing it yourself ! Regards, Stefan

A first sketch at a XML API is submitted to the boost file vault under the 'Programming Interfaces' category. It contains demo code, as well as some auto-generated documentation.
http://www.boost-consulting.com/vault/ Has anybody been able to download it ? The vault website is incredibly slow !! (at least yesterday and today)

--- Jose wrote:
http://www.boost-consulting.com/vault/
Has anybody been able to download it ? The vault website is incredibly slow !! (at least yesterday and today)
Got my copy the day before yesterday. Cromwell D. Enage

Jose wrote:
On 11/1/05, Stefan Seefeld <seefeld@sympatico.ca> wrote:
* The implementation uses existing libraries (libxml2, to be specific), since writing an XML DOM library requires substantial effort.
Do you plan to support event-based parsing (SAX-like) ?
That's a good question. My initial proposal a couple of years ago had a SAX API. However, SAX has a number of shortcomings that make it hard or inappropriate to use (e.g., no namespaces !). A better API that still follows the cursor-style approach of SAX is the XMLReader. It uses a pull model instead of push, i.e. there are no callbacks; instead the application advances the reader's internal cursor to the next 'token'. See http://xmlsoft.org/xmlreader.html for a comparison to SAX. If there is enough interest I could add a boost::xml::reader API, though dom and reader are completely independent, as far as the API itself is concerned. Regards, Stefan

Stefan Seefeld wrote:
That's a good question. My initial proposal a couple of years ago had a SAX API. However, SAX has a number of shortcomings that make it hard or inappropriate to use (e.g., no namespaces !).
That's not true. SAX was extended to support namespaces over five years ago - see http://www.saxproject.org/ for the Java interface, or my Arabica library for a C++ version http://www.jezuk.co.uk/cgi-bin/view/arabica
A better API that still follows the cursor-style approach from SAX, is the XMLReader. It uses a pull model instead of push, i.e. there are no callbacks, but instead the application advances the reader's internal cursor to the next 'token'. See http://xmlsoft.org/xmlreader.html for a comparison to SAX.
For some definition of better. The unpleasantness with pull APIs is the token - you have to interrogate it for its actual type, and then dispatch. In certain circumstances this obviously applies to the DOM too. In other cases, and I accept that I might be unusual in this, SAX is the right thing to use. Jez

Jose wrote:
On 11/4/05, Jez Higgins <jez@jezuk.co.uk> wrote:
In other cases, and I accept that I might be unusual in this, SAX is the right thing to use.
For large XML files, an event or libxml2 reader API is the only option. Which other cases do you refer to?
I was referring to an event API over a reader API. Filtering, for example, adding or removing something from a document, is straightforward with SAX. I was being slightly sarcastic. There are a number of people who find SAX, or SAX-like, interfaces difficult to use, and advocate "better" alternatives. I'm not one of them. Similarly there are a number of people who regard the DOM as impossibly complex. There was an earlier poster who more-or-less described the DOM in Java as dead. That's not my experience, and it's not my opinion. It's a long-winded way of saying there is no one true way to process XML; no one way is universally better than any other. Jez

On Nov 4, 2005, at 6:26 AM, Jez Higgins wrote:
It's a long winded way of saying there is no one-true-way to process XML, no one way is universally better than any other.
... and we shouldn't delay a library because it doesn't support a particular way to process XML. It need only support one way, and support it well, and can later grow to support other methods. Doug

* Jez Higgins <jez@jezuk.co.uk> [2005-11-04 07:31]:
Jose wrote:
On 11/4/05, Jez Higgins <jez@jezuk.co.uk> wrote:
In other cases, and I accept that I might be unusual in this, SAX is the right thing to use.
For large XML files, an event or libxml2 reader API is the only option. Which other cases do you refer to?
I was referring to an event API over a reader API. Filtering, for example, adding or removing something from a document, is straightforward with SAX.
I was being slightly sarcastic. There are a number of people who find SAX, or SAX-like, interfaces difficult to use, and advocate "better" alternatives. I'm not one of them.
Pull and push are two different methods of looking at XML as events, and they both have their advantages. Pull parsers (XPP or StAX) are not "better" than SAX.
Similarly there are a number of people who regard the DOM as impossibly complex. There was an earlier poster who more-or-less described the DOM in Java as dead. That's not my experience, and it's not my opinion.
I didn't say it was dead. It is part of the JDK, which means it will zombie on for years to come. I'm only noting that most folks feel it is a stove-pipe API, too heavy for many applications. If you use the W3C DOM that ships with the JDK as is, it implements the W3C DOM Event API, and thus dispatches events when nodes are appended to a document. This will hurt you when you build a large document using appendNode. It is an example of how the W3C DOM, by supporting many different concepts, is a poor choice for an application that needs only one of them. (The browser is the only application I've worked with that uses them all effectively.) I offer the multitude of alternatives, XMLBeans, XOM, JDOM, Dom4J, as evidence that something is amiss. These are people who dislike the W3C DOM so much that they will offer a replacement. There are no real alternatives to SAX. The Simple API for XML is just that, simple. There is a Xerces parser event API, but otherwise, I'm not aware of any widely deployed SAX alternatives. -- Alan Gutierrez - alan@engrm.com - http://engrm.com/blogometer/

Jez Higgins wrote:
A better API that still follows the cursor-style approach from SAX, is the XMLReader. It uses a pull model instead of push, i.e. there are no callbacks, but instead the application advances the reader's internal cursor to the next 'token'. See http://xmlsoft.org/xmlreader.html for a comparison to SAX.
For some definition of better. The unpleasantness with pull APIs is the token - you have to interrogate it for its actual type, and then dispatch.
Granted. But the underlying parser which any SAX implementation would build on would have to do that, too. You can think of the reader as that lower layer, and thus a push API with type-safe dispatching can easily be built on top, if that is what you want. Of course, the other direction is possible, too. However, logistically it is easier to put the push layer over the pull layer, i.e. the SAX implementation on top of the reader: As it happens, the implementation I have in mind uses libxml2, a C library. As such, between the application calling 'parse()' and the callbacks there are two language boundaries (C++ -> C and C -> C++), so you couldn't even throw exceptions from inside the callbacks and catch them in the main application. If, on the other hand, the callback dispatcher itself was written in C++, no language boundaries would need to be crossed while unwinding the callback stack. Regards, Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
Jez Higgins wrote:
A better API that still follows the cursor-style approach from SAX, is the XMLReader. It uses a pull model instead of push, i.e. there are no callbacks, but instead the application advances the reader's internal cursor to the next 'token'. See http://xmlsoft.org/xmlreader.html for a comparison to SAX.
For some definition of better. The unpleasantness with pull APIs is the token - you have to interrogate it for its actual type, and then dispatch.
Granted. But the underlying parser which any SAX implementation would build on would have to do that, too. You can think of the reader as that lower layer, and thus a push API with type-safe dispatching can easily be built on top, if that is what you want.
Of course, the other direction is possible, too. However, logistically it is easier to put the push layer over the pull layer, i.e. the SAX implementation on top of the reader:
Surely it depends on which parser you use. My XML-parser-in-progress (sourceforge.net/projects/axemill) uses a callback mechanism akin to SAX; at the moment, that's all there is, as I haven't written a DOM yet. It is far easier to write a parser that calls user code (push model) than write a parser that can be continued (pull model), since in the pull model you have to save all the internal state in order to return to the user with each token; you basically have to write a "continuations" mechanism.
As it happens, the implementation I have in mind uses libxml2, a C library. As such between the application calling 'parse()' and the callbacks are two language boundaries (C++ -> C and C -> C++), so you couldn't even throw exceptions from inside the callbacks and catch them in the main application.
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser. My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
If, on the other hand, the callback dispatcher itself was written in C++, no language boundaries would need to be crossed while unwinding the callback stack.
Yes. Axemill would allow that, for example. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

Anthony Williams wrote:
It is far easier to write a parser that calls user code (push model) than write a parser that can be continued (pull model), since in the pull model you have to save all the internal state in order to return to the user with each token; you basically have to write a "continuations" mechanism.
Fair enough. But here we are (or should be) focused on the API, i.e. on the user. The question is whether to put the parser in control of the data flow, or the application. While the latter is harder to implement, it is also far more convenient for users.
As it happens, the implementation I have in mind uses libxml2, a C library. As such between the application calling 'parse()' and the callbacks are two language boundaries (C++ -> C and C -> C++), so you couldn't even throw exceptions from inside the callbacks and catch them in the main application.
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
Could you substantiate your claim ?
My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
Node types aren't (runtime-) polymorphic right now, but is that really a big deal ? Polymorphism is important for extensibility. However here the set of node types is well known (and rather limited). Making nodes polymorphic would imply that the library allocates nodes on the heap, instead of the stack (as it now does). That could well hurt performance. I'm not sure how much of an issue that is, though. Regards, Stefan

* Stefan Seefeld <seefeld@sympatico.ca> [2005-11-04 11:39]:
Anthony Williams wrote:
It is far easier to write a parser that calls user code (push model) than write a parser that can be continued (pull model), since in the pull model you have to save all the internal state in order to return to the user with each token; you basically have to write a "continuations" mechanism.
Fair enough. But here we are (or should be) focussed on the API, i.e. the user. The question is whether to put the parser in control of the data flow or the application. While the latter is harder to implement it is also far more convenient for users.
Harder to implement could also imply a complexity that affects performance. If the user is consuming a document object model, whether that document is built via a push parser or a pull parser is moot, and the overhead of maintaining pull parser state is nothing but a penalty.
As it happens, the implementation I have in mind uses libxml2, a C library. As such between the application calling 'parse()' and the callbacks are two language boundaries (C++ -> C and C -> C++), so you couldn't even throw exceptions from inside the callbacks and catch them in the main application.
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
Could you substantiate your claim ?
Sorting out exception handling through an event framework like a push parser framework is no small challenge. I've always been critical of the Java SAXException: it is checked, and it cannot wrap a runtime exception, two choices that maximize the challenges of tunneling exceptions.
My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
Node types aren't (runtime-) polymorphic right now, but is that really a big deal ?
Polymorphism is important for extensibility. However here the set of node types is well known (and rather limited).
What about a Post-Schema-Validation Infoset (PSVI)? With XML Schema the types of nodes are unlimited. -- Alan Gutierrez - alan@engrm.com - http://engrm.com/blogometer/

Alan Gutierrez wrote:
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
Could you substantiate your claim ?
Sorting out exception handling through an event framework like a push parser framework is no small challenge.
I've always been critical of the Java SAXException: it is checked, and it cannot wrap a runtime exception, two choices that maximize the challenges of tunneling exceptions.
The probable addition of a cursor / event based API aside, I'd really prefer we focus on the DOM API I proposed. It's totally independent from SAX / XmlReader, and so we may be able to agree on something fast. [...]
Polymorphism is important for extensibility. However here the set of node types is well known (and rather limited).
What about a Post-Schema-Validation Infoset (PSVI)? With XML Schema the types of nodes are unlimited.
That's an interesting point, and I have been waiting for this to get raised. :-) I'm not sure, though, that such a node (element ?) type system should live at the same level as the suggested DOM layer, as opposed to on top of it. Regards, Stefan

Stefan Seefeld wrote:
What about a Post-Schema-Validation Infoset (PSVI)? With XML Schema the types of nodes are unlimited.
That's an interesting point, and I have been waiting for this to get raised. :-)
I'm not sure, though, that such a node (element ?) type system should live at the same level as the suggested DOM layer, as opposed to on top of it.
To illustrate my point a bit, consider this: Since we are talking about domain-specific extensions here, users of the library would have to provide their own node factories, and the nodes they create would most likely contain some user-provided state which applications would manipulate later on. This is one example of why modularization is important: instead of designing such a facility right into the parser, complicating its design even for all those users who have no interest in XML Schema, I propose that the parser generate a generic dom as suggested, and then to provide a schema validator that *translates* the generic dom into a tree of typed nodes, where the actual types are provided by users for specific application domains. Regards, Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
It is far easier to write a parser that calls user code (push model) than write a parser that can be continued (pull model), since in the pull model you have to save all the internal state in order to return to the user with each token; you basically have to write a "continuations" mechanism.
Fair enough. But here we are (or should be) focussed on the API, i.e. the user. The question is whether to put the parser in control of the data flow or the application. While the latter is harder to implement it is also far more convenient for users.
Convenience for users depends on their application. As you say in another mail, this is a moot point for now, since we're discussing the API for the parsed DOM, not the API for parsing.
As it happens, the implementation I have in mind uses libxml2, a C library. As such between the application calling 'parse()' and the callbacks are two language boundaries (C++ -> C and C -> C++), so you couldn't even throw exceptions from inside the callbacks and catch them in the main application.
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
Could you substantiate your claim ?
Really, it was just a feeling from looking at the API docs. However: In order to use a particular external type, such as std::string, the user has to supply a specialization of converter<> for their type, which converts to and from the libxml xmlChar type. Also, there are lots of constructors around that take xmlNode pointers, or xmlDoc pointers, or similar. It may be that these are intended as private implementation details, but they show up in the documentation.
My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
Node types aren't (runtime-) polymorphic right now, but is that really a big deal ? Polymorphism is important for extensibility. However here the set of node types is well known (and rather limited).
Polymorphism is not just important for extensibility of the polymorphic set, but also for type-safe, convenient handling of objects whose exact type is only known at runtime. Check-type-flag-and-cast is a nasty smell that I would rather avoid where possible.
Making nodes polymorphic would imply that the library allocates nodes on the heap, instead of the stack (as it now does). That could well hurt performance. I'm not sure how much of an issue that is, though.
Since the set of node types is well-known and limited, you could use boost::variant to allow stack-based storage, whilst maintaining the polymorphic behaviour. One additional comment on re-reading the samples --- having to instantiate every template for the external string type seems rather awkward. One alternative is to accept and return an internal string type, and provide conversion functions to/from the user's external string type. This way, the library is not dependent on the string type, but it does add complexity to the interface. Another alternative is to make the functions that accept or return the user's string type into templates, whilst leaving the enclosing class as a non-template, since there are no data members of the user's string type. Template parameter type deduction can be used to determine the type when it is given as a parameter, and explicit specification can be used when it is needed for a return type. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

Anthony Williams wrote:
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
Could you substantiate your claim ?
Really, it was just a feeling from looking at the API docs. However:
In order to use a particular external type, such as std::string, the user has to supply a specialization of converter<> for their type, which converts to and from the libxml xmlChar type.
Correct. That's the price to pay for not forcing any particular unicode library on users who want to use the XML API.
Also, there are lots of constructors around that take xmlNode pointers, or xmlDoc pointers, or similar. It may be that these are intended as private implementation details, but they show up in the documentation.
Ok, let's fix the documentation. :-) (You are correct, these are all private constructors. As I already mentioned, users never instantiate nodes explicitly, but use various factory methods for that purpose.)
My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
Node types aren't (runtime-) polymorphic right now, but is that really a big deal ? Polymorphism is important for extensibility. However here the set of node types is well known (and rather limited).
Polymorphism is not just important for extensibility of the polymorphic set, but also for type-safe, convenient handling of objects whose exact type is only known at runtime. Check-type-flag-and-cast is a nasty smell that I would rather avoid where possible.
I understand and agree with you...in principle. There are a number of situations where a method call will return a node_ptr, and the user typically wants to cast to the exact type. Examples include element child iteration, node_sets (resulting from xpath lookup, etc.). However, doing this with a visitor-like visit/accept pair of methods incurs two virtual method calls, just to get hold of the real type. That's a lot ! As my node implementations already know their type (in terms of an enum tag), casting is a simple matter of rewrapping the implementation in a new proxy. Using RTTI to represent the node's type is definitely possible. I'm just not convinced of its advantages.
One additional comment on re-reading the samples --- having to instantiate every template for the external string type seems rather awkward.
One alternative is to accept and return an internal string type, and provide conversion functions to/from the user's external string type. This way, the library is not dependent on the string type, but it does add complexity to the interface.
Right, I considered that. One has to be careful with those string conversions, though, to avoid unnecessary copies.
Another alternative is to make the functions that accept or return the user's string type into templates, whilst leaving the enclosing class as a non-template, since there are no data members of the user's string type. Template parameter type deduction can be used to determine the type when it is given as a parameter, and explicit specification can be used when it is needed for a return type.
Actually the use I have in mind is really what I demonstrate in the examples: Developers who decide to use boost::xml::dom fix the unicode binding by specifying the string type, conversion, etc. in a single place (such as my 'string.hpp'), so that all the rest of the code doesn't need to be aware of this mechanism. In your case I couldn't encapsulate the binding in a single place, as you mention yourself. What would be possible, though, is to put all types into a single parametrized struct: template <typename S> struct types { typedef document<S> document_type; typedef node_ptr<element<S> > element_ptr; ... }; and then let the user simply write: typedef dom::types<my_unicode_string> types; ... std::auto_ptr<types::document_type> doc = dom::parse_file("input.xml"); types::element_ptr root = doc->root(); ... Regards, Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
In order to use a particular external type, such as std::string, the user has to supply a specialization of converter<> for their type, which converts to and from the libxml xmlChar type.
Correct. That's the price to pay for not forcing any particular unicode library on users who want to use the XML API.
Hmm. What is an xmlChar? From your string.hpp, it appears it is the same as a normal char, since you can cast a char* to an xmlChar*, but I don't know libxml2, so I wouldn't like to assume. I would rather that the boost::xml API defined a type (even if it was a typedef for the libxml xmlChar), and the requirements on that type (e.g. ASCII, UTF-32 or UTF-8 encoding). By exposing the underlying character type of the backend like this, you are restricting the backends to those that share the same internal character type and encoding, or imposing an additional conversion layer on backends with a different internal encoding. Just as an example, my axemill parser uses a POD struct containing an unsigned long as the character type, so that each Unicode code point is a single "character", and I don't have to worry about variable-length encodings such as UTF-8 internally. If I wanted to use axemill as the backend parser, and handle std::wstring input on a platform where wchar_t was UTF-32, but keep xmlChar in the API, the converter would have to change UTF-32 to UTF-8 (I assume), and then internally this would have to be converted back to UTF-32. I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one. I would imagine that a user that works in UTF-8 will choose to provide a UTF-8 conversion, someone that works with UCS-2 wchar_t characters will provide a UTF-16 conversion, and someone that works with UTF-32 wchar_t characters will provide a UTF-32 conversion. Someone who uses a different encoding, such as EBCDIC, will provide a conversion appropriate to their usage. This should produce the minimum of cross-encoding conversions.
My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
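A minimal sketch of the visitor-based dispatch being proposed, with invented class names standing in for the real API's node hierarchy:

```cpp
#include <cassert>
#include <string>

struct element;
struct text;

// The visitor interface: one handle() per concrete node type.
struct node_visitor
{
    virtual ~node_visitor() {}
    virtual void handle(element& e) = 0;
    virtual void handle(text& t) = 0;
};

// Base node: visit() replaces the type() tag and switch-on-type code.
struct node
{
    virtual ~node() {}
    virtual void visit(node_visitor& v) = 0;
};

struct element : node { void visit(node_visitor& v) { v.handle(*this); } };
struct text    : node { void visit(node_visitor& v) { v.handle(*this); } };

// A traversal is then just a visitor; no casts, no type tags.
struct traversal : node_visitor
{
    std::string last;
    void handle(element&) { last = "element"; }
    void handle(text&)    { last = "text"; }
};
```

The double dispatch (node.visit then visitor.handle) resolves the exact node type with no enum check and no cast.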
Node types aren't (runtime-) polymorphic right now, but is that really a big deal ? Polymorphism is important for extensibility. However, here the set of node types is well known (and rather limited).
Polymorphism is not just important for extensibility of the polymorphic set, but also for type-safe, convenient handling of objects whose exact type is only known at runtime. Check-type-flag-and-cast is a nasty smell that I would rather avoid where possible.
I understand and agree with you...in principle. There are a number of situations where a method call will return a node_ptr, and the user typically wants to cast to the exact type. Examples include element child iteration, node_sets (resulting from xpath lookup, etc.). However, doing this with a visitor-like visit/accept pair of methods incurs two virtual method calls, just to get hold of the real type. That's a lot !
Two virtual method calls (node.visit/visitor.handle) vs two plain function calls (node.type/handle_xxx), a switch, and a "cast" that constructs a new object. Have you run a profiler on it? Premature optimization is the root of all evil; I would rather have something that helps me write correct code, rather than fast code. I really dislike switch-on-type code, and I'm not convinced that it is necessarily faster in all cases.
As my node implementations already know their type (in terms of an enum tag), casting is a simple matter of rewrapping the implementation by a new proxy.
There's nothing stopping this from continuing to work.
Using RTTI to represent the node's type is definitely possible. I'm just not convinced of its advantages.
I'm not convinced of the advantage of not using it ;-)
One additional comment on re-reading the samples --- having to instantiate every template for the external string type seems rather awkward.
One alternative is to accept and return an internal string type, and provide conversion functions to/from the user's external string type. This way, the library is not dependent on the string type, but it does add complexity to the interface.
Right, I considered that. One has to be careful with those string conversions, though, to avoid unnecessary copies.
Yes. However, this problem exists with your proposed API anyway --- by converting on every call to the API, you are forcing possibly-unnecessary conversions on your users. For example, they may want to add the same attribute to 50 nodes; your proposed API requires that the attribute name is converted 50 times. Accepting an internal string type, and making the user do the conversion allows the user to do the conversion once, and then pass the converted string 50 times.
Another alternative is to make the functions that accept or return the user's string type into templates, whilst leaving the enclosing class as a non-template, since there are no data members of the user's string type. Template parameter type deduction can be used to determine the type when it is given as a parameter, and explicit specification can be used when it is needed for a return type.
Actually the use I have in mind is really what I demonstrate in the examples: developers who decide to use boost::xml::dom fix the unicode binding by specifying the string type, conversion, etc. in a single place (such as my 'string.hpp'), so that all the rest of the code doesn't need to be aware of this mechanism.
In your case I couldn't encapsulate the binding in a single place, as you mention yourself.
Agreed, but you wouldn't have to. It's also more flexible --- it would allow input to come in with one encoding/string type, and output to be generated with a different encoding/string type, but the same boost::xml::dom objects could be used.
What would be possible, though, is to put all types into a single parametrized struct:
template <typename S>
struct types
{
  typedef document<S> document_type;
  typedef node_ptr<element<S> > element_ptr;
  ...
};
This is preferable to the current proposed API, but I still prefer that the conversion happens at the boundary as per my suggestions, rather than the entire classes being parameterized.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk

Anthony Williams wrote:
In order to use a particular external type, such as std::string, the user has to supply a specialization of converter<> for their type, which converts to and from the libxml xmlChar type.
Correct. That's the price to pay for not forcing any particular unicode library on users who want to use the XML API.
Hmm. What is an xmlChar? From your string.hpp, it appears it is the same as a normal char, since you can cast a char* to an xmlchar*, but I don't know libxml2, so I wouldn't like to assume.
You shouldn't need to care ! :-) Seriously, though, xmlChar is indeed an alias for 'char', meaning the strings in libxml2 can be passed around as usual (in particular, they are null-terminated). The encoding, though, is UTF-8, so a simple cast to 'char' only makes sense if the document contains ASCII only.
I would rather that the boost::xml API defined a type (even if it was a typedef for the libxml xmlChar), and the requirements on that type (e.g. ASCII, UTF-32 or UTF-8 encoding).
By exposing the underlying character type of the backend like this, you are restricting the backends to those that share the same internal character type and encoding, or imposing an additional conversion layer on backends with a different internal encoding.
Why ? I propose a mechanism involving at most a single conversion. Why the additional layer ?
Just as an example, my axemill parser uses a POD struct containing an unsigned long as the character type, so that each Unicode Codepoint is a single "character", and I don't have to worry about variable-length encodings such as UTF-8 internally.
(that may consume considerably more memory for big documents, and a lot of waste if the content is ASCII)
If I wanted to use axemill as the backend parser, and handle std::wstring input on a platform where wchar_t was UTF-32, but keep xmlChar in
(some nit-picking: wchar_t and UTF-32 are unrelated concepts. The former provides a storage type of some (unfortunately platform-dependent) size, while the latter defines an encoding. See the various unicode-related threads in this ML.)
the API, the converter would have to change UTF-32 to UTF-8 (I assume), and then internally this would have to be converted back to UTF-32.
Well, we definitely need some 'xml char trait' for the backend to fill in that provides sufficient information for users to write their own converter. Again, the hope is to do that such that any redundant conversion / copying can be avoided.
I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one.
It already does. libxml2 provides conversion functions. I need to hook them up into such an 'xml char trait'.
I would imagine that a user that works in UTF-8 will choose to provide a UTF-8 conversion, someone that works with UCS-2 wchar_t characters will provide a UTF-16 conversion, and someone that works with UTF-32 wchar_t characters will provide a UTF-32 conversion. Someone who uses a different encoding, such as EBCDIC, will provide a conversion appropriate to their usage. This should produce the minimum of cross-encoding conversions.
Yup. [...]
Two virtual method calls (node.visit/visitor.handle) vs two plain function calls (node.type/handle_xxx), a switch, and a "cast" that constructs a new object.
Have you run a profiler on it?
Premature optimization is the root of all evil; I would rather have something that helps me write correct code, rather than fast code. I really dislike switch-on-type code, and I'm not convinced that it is necessarily faster in all cases.
Ok, fair enough. That can easily be tested, and the change is (almost) straightforward.
As my node implementations already know their type (in terms of an enum tag), casting is a simple matter of rewrapping the implementation by a new proxy.
There's nothing stopping this from continuing to work.
Right, though it becomes a bit more involved. node_ptr doesn't hold a 'node *' despite its name, but rather a 'node' (which itself, being a proxy, points to an xmlNode). Thus, casting node_ptr to element_ptr (or the other way around) will actually construct a new element (wrapper). The only way to make this work with polymorphic nodes is to heap-allocate nodes (inside node_ptr), which requires extra calls to new / delete. We could override these operators for node types, but it's not *that* trivial to optimize.
Using RTTI to represent the node's type is definitely possible. I'm just not convinced of its advantages.
I'm not convinced of the advantage of not using it ;-)
One additional comment on re-reading the samples --- having to instantiate every template for the external string type seems rather awkward.
One alternative is to accept and return an internal string type, and provide conversion functions to/from the user's external string type. This way, the library is not dependent on the string type, but it does add complexity to the interface.
Right, I considered that. One has to be careful with those string conversions, though, to avoid unnecessary copies.
Yes. However, this problem exists with your proposed API anyway --- by converting on every call to the API, you are forcing possibly-unnecessary conversions on your users. For example, they may want to add the same attribute to 50 nodes; your proposed API requires that the attribute name is converted 50 times. Accepting an internal string type, and making the user do the conversion allows the user to do the conversion once, and then pass the converted string 50 times.
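The "convert once, pass many times" idea could be sketched like this; internal_string, set_attribute and the member names are purely illustrative:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical internal string: holds the already-converted bytes,
// so repeated API calls need no further conversion.
struct internal_string
{
    explicit internal_string(std::string const& external)
        : utf8(external) {} // a real converter would transcode here
    std::string utf8;
};

struct node
{
    // Accepting internal_string means the conversion cost is paid once,
    // by the caller, no matter how many nodes receive the attribute.
    void set_attribute(internal_string const& name, internal_string const& value)
    { attrs.push_back(name.utf8 + "=" + value.utf8); }

    std::vector<std::string> attrs;
};
```

Converting "class" once and then handing the same internal_string to 50 nodes touches the converter exactly once.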
Good point ! I have to think about how to 'internalize' and reuse a string. [...]
In your case I couldn't encapsulate the binding in a single place, as you mention yourself.
Agreed, but you wouldn't have to. It's also more flexible --- it would allow input to come in with one encoding/string type, and output to be generated with a different encoding/string type, but the same boost::xml::dom objects could be used.
How big of an issue is that, really ? How many users use different unicode libraries in the same application ? If they really want different encodings, they may as well do the conversion within their unicode library.
What would be possible, though, is to put all types into a single parametrized struct:
template <typename S>
struct types
{
  typedef document<S> document_type;
  typedef node_ptr<element<S> > element_ptr;
  ...
};
This is preferable to the current proposed API, but I still prefer that the conversion happens at the boundary as per my suggestions, rather than the entire classes being parameterized.
I'm not sure I understand your requirement: do you really want to plug in multiple unicode libraries / string types ? Or do you want to use multiple encodings ?

Regards,
Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
In order to use a particular external type, such as std::string, the user has to supply a specialization of converter<> for their type, which converts to and from the libxml xmlChar type.
Correct. That's the price to pay for not forcing any particular unicode library on users who want to use the XML API.
Hmm. What is an xmlChar? From your string.hpp, it appears it is the same as a normal char, since you can cast a char* to an xmlchar*, but I don't know libxml2, so I wouldn't like to assume.
You shouldn't need to care ! :-) Seriously, though, xmlChar is indeed an alias for 'char', meaning the strings in libxml2 can be passed around as usual (in particular, they are null-terminated). The encoding, though, is UTF-8, so a simple cast to 'char' only makes sense if the document contains ASCII only.
I would rather that the boost::xml API defined a type (even if it was a typedef for the libxml xmlChar), and the requirements on that type (e.g. ASCII, UTF-32 or UTF-8 encoding).
By exposing the underlying character type of the backend like this, you are restricting the backends to those that share the same internal character type and encoding, or imposing an additional conversion layer on backends with a different internal encoding.
Why ? I propose a mechanism involving at most a single conversion. Why the additional layer ?
Assume I know the encoding and character type I wish to use as input. In order to specialize converter<> for my string type, I need to know what encoding and character type the library is using. If the encoding and character type are not specified in the API, but are instead open to the whims of the backend, I cannot write my conversion code. For example, your string.hpp converter only works if the native encoding for char on the current platform is UTF-8, or the application only uses a shared subset (e.g. ASCII). For platforms where EBCDIC is the default, this won't work.
Just as an example, my axemill parser uses a POD struct containing an unsigned long as the character type, so that each Unicode Codepoint is a single "character", and I don't have to worry about variable-length encodings such as UTF-8 internally.
(that may consume considerably more memory for big documents, and a lot of waste if the content is ASCII)
Indeed it might, but that's a decision I'm happy with for now --- it doesn't currently store entire documents in memory, as it's a SAX-style push parser (though the API is completely non-SAX). I'm just using it here as an example of a backend that doesn't use UTF-8.
If I wanted to use axemill as the backend parser, and handle std::wstring input on a platform where wchar_t was UTF-32, but keep xmlChar in
(some nit-picking: wchar_t and UTF-32 are unrelated concepts. The former provides a storage type of some (unfortunately platform-dependent) size, while the latter defines an encoding. See the various unicode-related threads in this ML.)
I know about the issues surrounding wchar_t and encodings. They are not entirely unrelated. The platform has to pick a default encoding for wchar_t, so we know whether or not 0x61==L'a', for example. On platforms where wchar_t is 32 bit, this can be (and often is) UTF-32.
the API, the converter would have to change UTF-32 to UTF-8 (I assume), and then internally this would have to be converted back to UTF-32.
Well, we definitely need some 'xml char trait' for the backend to fill in that provides sufficient information for users to write their own converter. Again, the hope is to do that such that any redundant conversion / copying can be avoided.
Good.
I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one.
It already does. libxml2 provides conversion functions. I need to hook them up into such an 'xml char trait'.
I don't understand how your response ties in with my comment, so I'll try again. I was suggesting that we have overloads like:

node::append_element(utf8_string_type);
node::append_element(utf16_string_type);
node::append_element(utf32_string_type);

With two of them (but unspecified which two) converting to the correct internal encoding. If the user has EBCDIC or shift-JIS data, then they need to convert to one of these three standard types. If their data is already correctly encoded, but not using the same string type, then they just need to recast to the appropriate string type.
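A rough sketch of this overload set, assuming (for the sake of the example) that UTF-8 is the "correct" internal encoding, so only the non-UTF-8 overloads pay a conversion; the typedefs are invented, and the UTF-16 path handles just the ASCII subset to keep the example short:

```cpp
#include <cassert>
#include <string>

// Hypothetical string typedefs for two of the three Unicode encodings.
typedef std::string    utf8_string_type;   // UTF-8 code units in char
typedef std::u16string utf16_string_type;  // UTF-16 code units

struct node
{
    // Native path: UTF-8 is assumed internal here, so store directly.
    void append_element(utf8_string_type const& name) { last = name; }

    // Converting path: real code would fully transcode UTF-16 to UTF-8;
    // this sketch only narrows ASCII code units.
    void append_element(utf16_string_type const& name)
    {
        utf8_string_type utf8;
        for (char16_t c : name)
            utf8 += static_cast<char>(c);
        append_element(utf8);
    }

    utf8_string_type last; // stands in for the stored element name
};
```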
As my node implementations already know their type (in terms of an enum tag), casting is a simple matter of rewrapping the implementation by a new proxy.
There's nothing stopping this from continuing to work.
Right, though it becomes a bit more involved. node_ptr doesn't hold a 'node *' despite its name, but rather a 'node' (which itself, being a proxy, points to an xmlNode). Thus, casting node_ptr to element_ptr (or the other way around) will actually construct a new element (wrapper). The only way to make this work with polymorphic nodes is to heap-allocate nodes (inside node_ptr), which requires extra calls to new / delete. We could override these operators for node types, but it's not *that* trivial to optimize.
Agreed, it will require careful thought.
In your case I couldn't encapsulate the binding in a single place, as you mention yourself.
Agreed, but you wouldn't have to. It's also more flexible --- it would allow input to come in with one encoding/string type, and output to be generated with a different encoding/string type, but the same boost::xml::dom objects could be used.
How big of an issue is that, really ? How many users use different unicode libraries in the same application ? If they really want different encodings, they may as well do the conversion within their unicode library.
Imagine, for example a web browser or XML editor. The XML comes in as a byte stream with an encoding tag such as a Charset-encoding field (if you're lucky). You then have to read this and convert it from whatever encoding is specified to the DOM library's internal encoding, do some processing and then output to the screen in the user's chosen encoding. If I specify the conversions to use directly on the input and output, then I can cleanly separate my application into three layers --- process input, and build DOM in internal encoding; process DOM as necessary; display result to user. If the string type and encoding is inherently part of the DOM types, this is not so simple.
What would be possible, though, is to put all types into a single parametrized struct:
template <typename S>
struct types
{
  typedef document<S> document_type;
  typedef node_ptr<element<S> > element_ptr;
  ...
};
This is preferable to the current proposed API, but I still prefer that the conversion happens at the boundary as per my suggestions, rather than the entire classes being parameterized.
I'm not sure I understand your requirement ? Do you really want to plug in multiple unicode libraries / string types ? Or do you want to use multiple encodings ?
Multiple encodings, generally. However, your converter<> template doesn't allow for that --- it only allows one encoding per string type.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk

Anthony Williams wrote:
Assume I know the encoding and character type I wish to use as input. In order to specialize converter<> for my string type, I need to know what encoding and character type the library is using. If the encoding and character type are not specified in the API, but are instead open to the whims of the backend, I cannot write my conversion code.
Ah, I think I understand what you mean by 'character type'. Yes, you are right. The code as I posted it to the vault is missing the bits that enable users to write converters without knowing backend-specific details. However, some 'dom::char_trait' should be enough, right ?
I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one.
It already does. libxml2 provides conversion functions. I need to hook them up into such an 'xml char trait'.
I don't understand how your response ties in with my comment, so I'll try again.
I was suggesting that we have overloads like:
node::append_element(utf8_string_type);
node::append_element(utf16_string_type);
node::append_element(utf32_string_type);
With two of them (but unspecified which two) converting to the correct internal encoding.
Oh, but that multiplies quite a chunk of the API by four ! Typically, a unicode library provides converter functions, so what advantage would such a rich interface have instead of asking the user to do the conversion before calling into the xml library ? If the internal storage encoding is a compile-time constant that can be queried from the proposed dom::char_trait, it should be simple for users to decide how to write the converter, and in particular, how to pass strings in the most efficient way. [...]
Imagine, for example a web browser or XML editor. The XML comes in as a byte stream with an encoding tag such as a Charset-encoding field (if you're lucky). You then have to read this and convert it from whatever encoding is specified to the DOM library's internal encoding, do some processing and then output to the screen in the user's chosen encoding.
Right.
If I specify the conversions to use directly on the input and output, then I can cleanly separate my application into three layers --- process input, and build DOM in internal encoding; process DOM as necessary; display result to user.
If the string type and encoding is inherently part of the DOM types, this is not so simple.
I still don't understand what you have in mind: Are you thinking of using two separate unicode libraries / string types for input and output ? Again unicode libraries should provide encoding conversion, if all you want is to use distinct encodings. I may not understand the details well enough, but asking for the API to integrate the string conversions as you seem to be doing sounds exactly like what you accused me of doing: premature optimization. ;-)
I'm not sure I understand your requirement ? Do you really want to plug in multiple unicode libraries / string types ? Or do you want to use multiple encodings ?
Multiple encodings, generally. However, your converter<> template doesn't allow for that --- it only allows one encoding per string type.
Ah, well, the converter is not even half-finished, as in its current form it is tied to the string type. It sure requires some substantial design to be of any practical use.

Regards,
Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
Assume I know the encoding and character type I wish to use as input. In order to specialize converter<> for my string type, I need to know what encoding and character type the library is using. If the encoding and character type are not specified in the API, but are instead open to the whims of the backend, I cannot write my conversion code.
Ah, I think I understand what you mean by 'character type'. Yes, you are right. The code as I posted it to the vault is missing the bits that enable users to write converters without knowing backend-specific details. However, some 'dom::char_trait' should be enough, right ?
Yes and no. Suppose my incoming data is a stream of 8-bit "characters", using Shift-JIS encoding. I need to write a converter to convert this to whatever encoding is accepted by the XML API. I need to know which encoding to use when I write my converter --- if the API is expecting UTF-16 stored in a string of unsigned shorts, my converter is going to be quite different to if the API is expecting UTF-8 stored in a string of unsigned chars. I also need to know how to construct the final string --- whether I need to provide a boost::xml::char_type*, or whether I need to construct a boost::xml::string_type from a pair of iterators, or something else.
I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one.
It already does. libxml2 provides conversion functions. I need to hook them up into such an 'xml char trait'.
I don't understand how your response ties in with my comment, so I'll try again.
I was suggesting that we have overloads like:
node::append_element(utf8_string_type);
node::append_element(utf16_string_type);
node::append_element(utf32_string_type);
With two of them (but unspecified which two) converting to the correct internal encoding.
Oh, but that multiplies quite a chunk of the API by four !
What's the fourth option? Yes, I agree it multiplies the API, but for the convenience of users.
Typically, a unicode library provides converter functions, so what advantage would such a rich interface have instead of asking the user to do the conversion before calling into the xml library ?
It avoids the user doing any conversion in many cases.
If the internal storage encoding is a compile-time constant that can be queried from the proposed dom::char_trait, it should be simple for users to decide how to write the converter, and in particular, how to pass strings in the most efficient way.
If the encoding is only available as a compile-time constant, that won't help me write a converter. I need it available as a software-writing-time constant for that (i.e. specified in the documentation). If you don't want to fix the encoding in the docs, maybe we should require that the user supply conversions to each of UTF-8, UTF-16 and UTF-32, and the library will use whichever is most convenient.
If I specify the conversions to use directly on the input and output, then I can cleanly separate my application into three layers --- process input, and build DOM in internal encoding; process DOM as necessary; display result to user.
If the string type and encoding is inherently part of the DOM types, this is not so simple.
I still don't understand what you have in mind: Are you thinking of using two separate unicode libraries / string types for input and output ? Again unicode libraries should provide encoding conversion, if all you want is to use distinct encodings.
I may not understand the details well enough, but asking for the API to integrate the string conversions as you seem to be doing sounds exactly like what you accused me of doing: premature optimization. ;-)
It seems I am failing to communicate my thoughts correctly, since optimization is certainly far from my thoughts. It is separation of concerns that I am currently thinking about.

In the input layer of an application, you need to deal with all the variety of encodings that the user might supply. I'm quite happy to use a single Unicode library to deal with the conversions, but I can imagine having to deal with numerous external encodings. I would like the rest of the application to have no need to know about the complications of the input handling, and the variety of encodings used --- provided I get a set of DOM objects from somewhere, the rest of the application shouldn't care.

Once the input has been handled, and the DOM built, there might be additional input in terms of XPath expressions, or element names, which might be in another encoding still. Again, the choice of input encoding here should have no impact on the rest of the application.

With the current design, the whole API is tied to a single external string type, with a single converter function for converting to the internal string type. This implies that if you wish to use different encodings, you need a different external string type, and therefore you end up with different template instantiations for different encodings, and my nice separate application parts suddenly need to know what encodings are used for input and output.
I'm not sure I understand your requirement ? Do you really want to plug in multiple unicode libraries / string types ? Or do you want to use multiple encodings ?
Multiple encodings, generally. However, your converter<> template doesn't allow for that --- it only allows one encoding per string type.
Ah, well, the converter is not even half-finished, as in its current form it is tied to the string type. It sure requires some substantial design to be of any practical use.
Ok. I'm trying to raise issues which will affect the design.

For axemill, I decided to provide a set of conversion templates for converting between encodings. Firstly, there is the Decode template that takes a pair of input iterators, and returns the first UTF-32 character in the sequence, advancing the start iterator in the process. Secondly, there is the Encode template that takes a single UTF-32 character, and an output iterator, and writes the character to the output iterator in the appropriate encoding. These templates are then specialized for each encoding, by using types as tags (so there are types axemill::encoding::ASCII, axemill::encoding::UTF8, axemill::encoding::UTF32_LE, axemill::encoding::ISO_8859_1, etc.)

Then I provide template functions convertFrom<someEncoding>(start,end) and convertFrom<someEncoding>(std::string), which convert to the internal string type, and convertFrom<someEncoding>(start,end,out), which converts an input sequence and appends it to the specified output sequence (which must be a sequence of the internal UTF-32 characters). Complementing that are the convertTo overloads, which convert an internal string to a std::string in some encoding, and convert an input range of internal UTF-32 characters, writing to an output iterator.

Finally, there is a recode<inputEncoding,outputEncoding>(start,end,out) template function that takes an input range in some encoding, and writes it out as a different encoding, going through the internal UTF-32 character type in the middle.

This allows for the full complement of input and output encodings to be used (provided appropriate specializations of Encode<> and Decode<> are provided), but the main part of the library is oblivious to this, and just uses the internal UTF-32 string type.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk
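The axemill scheme described above might look roughly like the following skeleton; only the ASCII specializations are shown, and all names are guesses rather than axemill's real interfaces:

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Tag types naming encodings; axemill has more (UTF8, UTF32_LE, ...).
namespace encoding { struct ASCII {}; }

typedef unsigned long codepoint; // the internal UTF-32 "character"

// Decode: read one codepoint from [first, last), advancing first.
template <typename Encoding> struct Decode;

// Encode: write one codepoint to an output iterator.
template <typename Encoding> struct Encode;

template <> struct Decode<encoding::ASCII>
{
    template <typename InIter>
    static codepoint apply(InIter& first, InIter /*last*/)
    { return static_cast<unsigned char>(*first++); } // one byte per codepoint
};

template <> struct Encode<encoding::ASCII>
{
    template <typename OutIter>
    static void apply(codepoint c, OutIter& out)
    { *out++ = static_cast<char>(c); }
};

// recode: pipe Decode into Encode through the UTF-32 middle ground.
template <typename InEnc, typename OutEnc, typename InIter, typename OutIter>
void recode(InIter first, InIter last, OutIter out)
{
    while (first != last)
        Encode<OutEnc>::apply(Decode<InEnc>::apply(first, last), out);
}
```

Adding an encoding then means providing one Decode<> and one Encode<> specialization; the main library never sees anything but UTF-32 codepoints.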

Anthony Williams wrote:
Ah, I think I understand what you mean by 'character type'. Yes, you are right. The code as I posted it to the vault is missing these bits. that enable users to write converters without knowing backend-specific details. However, some 'dom::char_trait' should be enough, right ?
Yes and no. Suppose my incoming data is a stream of 8-bit "characters", using Shift-JIS encoding. I need to write a converter to convert this to whatever encoding is accepted by the XML API. I need to know which encoding to use when I write my converter --- if the API is expecting UTF-16 stored in a string of unsigned shorts, my converter is going to be quite different to if the API is expecting UTF-8 stored in a string of unsigned chars. I also need to know how to construct the final string --- whether I need to provide a boost::xml::char_type*, or whether I need to construct a boost::xml::string_type from a pair of iterators, or something else.
[...]
I was suggesting that we have overloads like:
node::append_element(utf8_string_type);
node::append_element(utf16_string_type);
node::append_element(utf32_string_type);
With two of them (but unspecified which two) converting to the correct internal encoding.
Oh, but that multiplies quite a chunk of the API by four !
What's the fourth option? Yes, I agree it multiplies the API, but for the convenience of users.
Sorry, that's because I can't count.
Typically, a unicode library provides converter functions, so what advantage would such a rich interface have instead of asking the user to do the conversion before calling into the xml library ?
It avoids the user doing any conversion in many cases.
Well, I would phrase it differently: instead of encapsulating unicode-related functionality, you are suggesting spreading it across various APIs. Really, we are now exclusively arguing about unicode-related issues, which I deliberately designed out of my API. Let me rephrase the relevant part of my suggestion: the XML API provides a means to write a converter to internalize (and externalize) unicode strings, but otherwise remains agnostic to unicode issues. This allows the library to collaborate with any external unicode library without duplicating its functionality.
If the encoding is only available as a compile-time constant, that won't help me write a converter. I need it available as a software-writing-time constant for that (i.e. specified in the documentation).
If you don't want to fix the encoding in the docs, maybe we should require that the user supply conversions to each of UTF-8, UTF-16 and UTF-32, and the library will use whichever is most convenient.
Exactly. Unicode libraries provide these conversion functions, and the user should be able to implement the boost::xml::dom::converter with these.
In the input layer of an application, you need to deal with all the variety of encodings that the user might supply. I'm quite happy to use a single Unicode library to deal with the conversions, but I can imagine having to deal with numerous external encodings. I would like the rest of the application to have no need to know about the complications of the input handling, and the variety of encodings used --- provided I get a set of DOM objects from somewhere, the rest of the application shouldn't care.
Access to the dom content is routed through the unicode library, by means of the converter. Thus, whatever requirement you have for dealing with encodings etc. should all be taken care of by that.
Once the input has been handled, and the DOM built, there might be additional input in terms of XPath expressions, or element names, which might be in another encoding still. Again, the choice of input encoding here should have no impact on the rest of the application.
Correct. Same reasoning as above.
With the current design, the whole API is tied to a single external string type, with a single converter function for converting to the internal string type. This implies that if you wish to use different encodings, you need a different external string type, and therefore you end up with different template instantiations for different encodings, and my nice separate application parts suddenly need to know what encodings are used for input and output.
Oh, now I see your point ! You argue that multiple encodings will be tied to multiple C++ types, even if they are part of the same unicode library. I'm not quite sure what to say. I suspect there are ways around this issue with a clever choice of the string type template argument for the library. But if not, let's fix that once it becomes a problem. I'd rather start simple and let the system evolve once we see users plug real unicode libraries into it.
For axemill, I decided to provide a set of conversion templates for converting between encodings.
What unicode libraries are you working with ? As I said above, I'd suspect these to provide all conversions, no matter whether that would generate a new C++ type or not. Regards, Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
If the encoding is only available as a compile-time constant, that won't help me write a converter. I need it available as a software-writing-time constant for that (i.e. specified in the documentation).
If you don't want to fix the encoding in the docs, maybe we should require that the user supply conversions to each of UTF-8, UTF-16 and UTF-32, and the library will use whichever is most convenient.
Exactly. Unicode libraries provide these conversion functions, and the user should be able to implement the boost::xml::dom::converter with these.
Ok. So boost::xml::dom::converter will have convert_to_utf8, convert_to_utf16 and convert_to_utf32 functions (and their inverses) which the user will have to implement. I'm happy with that.
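To make the shape of this concrete, here is a minimal sketch of what such a converter could look like. All names here (my_converter, utf8_t, utf16_t, utf32_t) are illustrative assumptions, not part of the actual proposal, and the demo conversions only handle ASCII input; a real implementation would delegate to a Unicode library such as ICU.

```cpp
#include <string>
#include <vector>

// Hypothetical shape of the converter discussed above: the user supplies
// conversions to each transformation format, and the library picks
// whichever is cheapest for its backend.
using utf8_t  = std::string;                 // bytes, UTF-8 encoded
using utf16_t = std::vector<unsigned short>; // UTF-16 code units
using utf32_t = std::vector<unsigned long>;  // UTF-32 code points

struct my_converter
{
    // ASCII-only demo conversions; a real implementation would delegate
    // the actual transcoding to a Unicode library.
    static utf8_t  to_utf8 (const std::string& s) { return s; }
    static utf16_t to_utf16(const std::string& s)
    {
        return utf16_t(s.begin(), s.end());
    }
    static utf32_t to_utf32(const std::string& s)
    {
        return utf32_t(s.begin(), s.end());
    }
};
```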
With the current design, the whole API is tied to a single external string type, with a single converter function for converting to the internal string type. This implies that if you wish to use different encodings, you need a different external string type, and therefore you end up with different template instantiations for different encodings, and my nice separate application parts suddenly need to know what encodings are used for input and output.
Oh, now I see your point ! You argue that multiple encodings will be tied to multiple C++ types, even if they are part of the same unicode library. I'm not quite sure what to say. I suspect there are ways around this issue with a clever choice of the string type template argument for the library. But if not, let's fix that once it becomes a problem. I'd rather start simple and let the system evolve once we see users plug real unicode libraries into it.
I guess we disagree over what's simple: I see simple => API not tied to external string type; it's up to the user to do the conversions to/from the internal string type as and when they see fit. As I understand what you're saying, you see simple => user supplies data in their own external string type, and library calls back to the user-supplied converter to convert to/from the internal string type when needed. If we start off with the internal string type being UTF-8 encoded strings, and have the API accept/return internal strings, then we can discard the converter stuff for now, get rid of all the template parameters specifying the external string type, and focus on the API details. If we're going to allow choice of backends (and I'm not happy being tied to libxml2), it would be nice to allow for this internal string type to have a different encoding and character type (in order to avoid unnecessary conversions), but we could leave that for now.
For axemill, I decided to provide a set of conversion templates for converting between encodings.
What unicode libraries are you working with ? As I said above, I'd suspect these to provide all conversions, no matter whether that would generate a new C++ type or not.
Personally I don't use a separate Unicode library; I write the functions I need as and when I need them. With axemill, I again make no assumptions about the Unicode library, but expect the user to provide appropriate specializations of the Encode and Decode templates, using their Unicode library of choice. The axemill API itself expects everything to be in the internal string type; the Encode and Decode templates, and the convert to/from functions are provided as a convenience to the user, to assist them in working with a Unicode library of their choice. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

Anthony, thinking more about the involved issues, and especially taking into account one comment regarding the need to pass 'internalized' strings around to avoid duplicate conversions, I'm tempted by a radically different option: Let the XML API provide a 'string_type' (or maybe 'cdata_type'), and exclusively operate on that. This type would be self-descriptive such that it could vary with the backend, but always provide the same (compile-time) interface to users (with tags for encoding, etc.). Users then explicitly convert their strings back and forth on each call: document_ptr doc(new document()); element_ptr root = doc->set_root(cdata_type("book")); my_unicode_type name = ...; root->append_element(internalize(name)); Here, 'internalize' is a user-provided converter that generates dom::cdata_type from my_unicode_type. Thus you can write as many converters as you have unicode types, and have explicit control over when and where to convert. (Well, and if you prefer implicit conversion, say, because you want to operate on a lot of ASCII data, just add implicit cast and non-explicit constructors to your unicode type.) What about that ? Regards, Stefan

Stefan Seefeld <seefeld@sympatico.ca> writes:
thinking more about the involved issues, and especially taking into account one comment regarding the need to pass 'internalized' strings around to avoid duplicate conversions, I'm tempted by a radically different option:
Let the XML API provide a 'string_type' (or maybe 'cdata_type'), and exclusively operate on that. This type would be self-descriptive such that it could vary with the backend, but always provide the same (compile-time) interface to users (with tags for encoding, etc.).
Users then explicitly convert their strings back and forth on each call:
document_ptr doc(new document()); element_ptr root = doc->set_root(cdata_type("book"));
my_unicode_type name = ...;
root->append_element(internalize(name));
Here, 'internalize' is a user-provided converter that generates dom::cdata_type from my_unicode_type. Thus you can write as many converters as you have unicode types, and have explicit control over when and where to convert. (Well, and if you prefer implicit conversion, say, because you want to operate on a lot of ASCII data, just add implicit cast and non-explicit constructors to your unicode type.)
What about that ?
Looks good. You get a clean separation between the internal API string type, and the external string type. I like it. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
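As a concrete illustration of the design being agreed on here, the explicit-internalization idiom could be sketched as follows. cdata_type and internalize are hypothetical stand-ins (this demo simply wraps a UTF-8 byte string); a real backend would store whatever representation its parser uses internally.

```cpp
#include <string>
#include <utility>

// Sketch of the internal string type the XML API would exclusively operate
// on. In this toy version it just wraps UTF-8 bytes.
class cdata_type
{
public:
    explicit cdata_type(std::string utf8) : data_(std::move(utf8)) {}
    const std::string& utf8() const { return data_; }
private:
    std::string data_; // backend-internal representation
};

// A user-provided converter from the user's string type (plain std::string
// in this demo) into the library's internal type. Users write one of these
// per unicode type they use, and call it explicitly at the API boundary.
inline cdata_type internalize(const std::string& external)
{
    return cdata_type(external); // real code would convert the encoding here
}
```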

On 11/7/05, Steinar Bang <sb@dod.no> wrote:
Stefan Seefeld <seefeld@sympatico.ca>:
Good point ! I have to think about how to 'internalize' and reuse a string.
A string internalizer is something that might have general interest outside of the XML API.
Indeed. There is a class called "name" in the Adobe Open Source libraries that is an interned string and I've recently had call to use it in a project. The implementation is simple (it doesn't bother with reference counts, and thus never garbage collects) but it works nicely. Thanks Sean! -- Caleb Epstein caleb dot epstein at gmail dot com
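A minimal interned-string class in the spirit of the one described (no reference counting, entries never reclaimed) could be sketched like this; the class name and pooling strategy are illustrative, not Adobe's actual implementation:

```cpp
#include <set>
#include <string>

// Minimal interned string: equal strings share one stored copy, so equality
// comparison reduces to a pointer comparison. There is no reference
// counting, so pool entries are never reclaimed -- exactly the trade-off
// mentioned in the post above.
class interned_string
{
public:
    explicit interned_string(const std::string& s)
        : ptr_(&*pool().insert(s).first) {} // set iterators stay valid
    bool operator==(const interned_string& o) const { return ptr_ == o.ptr_; }
    const std::string& str() const { return *ptr_; }
private:
    static std::set<std::string>& pool()
    {
        static std::set<std::string> p; // never garbage collected, by design
        return p;
    }
    const std::string* ptr_;
};
```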

On 11/4/05, Anthony Williams <anthony_w.geo@yahoo.com> wrote:
Surely it depends on which parser you use. My XML-parser-in-progress (http://sourceforge.net/projects/axemill) uses a callback mechanism akin to SAX; at the moment, that's all there is, as I haven't written a DOM yet.
Any chance to check out your library ? It looks like what I am looking for. And if there is no dependency on libxml2 then it's even better !!

Jose <jmalv04@gmail.com> writes:
On 11/4/05, Anthony Williams <anthony_w.geo@yahoo.com> wrote:
Surely it depends on which parser you use. My XML-parser-in-progress (http://sourceforge.net/projects/axemill) uses a callback mechanism akin to SAX; at the moment, that's all there is, as I haven't written a DOM yet.
Any chance to check out your library ? It looks like what I am looking for. And
By all means. The CVS instructions are at the sourceforge site given above; I haven't made a release yet, as it's still quite unfinished. Sadly, it'll need a bit of work to make it use the latest version of boost.
if there is no dependency on libxml2 then it's even better !!
It does not depend on any other XML library. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk

Jose wrote:
On 11/3/05, Stefan Seefeld <seefeld@sympatico.ca> wrote:
If there is enough interest I could add a boost::xml::reader API, though dom and reader are completely independent, as far as the API itself is concerned.
I am interested in the reader API !!
Fine. I eventually will add that API, though I prefer to restrict the current discussion (and submission, if we ever get this far) to be about boost::xml::dom, to avoid unrelated issues getting in the way. Both APIs are orthogonal, so there is no reason to bundle them together. Now I hope that someone will actually start to send comments about the actual API / code I propose. ;-) Regards, Stefan

Stefan Seefeld wrote:
Now I hope that someone will actually start to send comments about the actual API / code I propose. ;-)
I'm not sure why you do not use overloaded append and insert members. And, I'd like to change begin_children and end_children to std::pair<children_iterator, children_iterator> children (the same with attributes). -- Regards, Janusz

Janusz Piwowarski wrote:
Stefan Seefeld wrote:
Now I hope that someone will actually start to send comments about the actual API / code I propose. ;-)
I'm not sure why you do not use overloaded append and insert members. And,
Assuming you are talking about the element class and its methods to add various child node types, I don't quite see how I could overload the function name 'append' and 'insert', as most versions have the same signature. E.g. 'insert_element("foo")' will add (and return) the element <foo/> while 'insert_comment("foo")' will add (and return) the comment <!--foo--> etc.
I'd like to change begin_children and end_children to std::pair<children_iterator, children_iterator> children (the same with attributes).
...or add a container proxy so one would write element->children().begin() instead of element->begin_children(). I'm not sure there is any gain. It seems to be a purely aesthetic issue. Regards, Stefan
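For illustration, the container proxy mentioned here could be sketched roughly as below. node, children_iterator and the vector storage are hypothetical stand-ins, not taken from the actual proposal; the point is only that the proxy makes children() usable with range-based for loops and STL algorithms.

```cpp
#include <string>
#include <vector>

// Stand-in node type and iterator; a real DOM would wrap backend nodes.
struct node { std::string name; };
using children_iterator = std::vector<node>::const_iterator;

// Lightweight view holding the begin/end pair, so callers can write
// element.children().begin() or iterate directly.
class children_view
{
public:
    children_view(children_iterator b, children_iterator e) : b_(b), e_(e) {}
    children_iterator begin() const { return b_; }
    children_iterator end()   const { return e_; }
private:
    children_iterator b_, e_;
};

class element
{
public:
    children_view children() const
    {
        return children_view(kids_.begin(), kids_.end());
    }
    std::vector<node> kids_; // demo storage only
};
```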

Stefan Seefeld
Janusz Piwowarski wrote:
Stefan Seefeld wrote:
Now I hope that someone will actually start to send comments about the actual API / code I propose. ;-)
I'm not sure why you do not use overloaded append and insert members. And,
Assuming you are talking about the element class and its methods to add various child node types, I don't quite see how I could overload the function name 'append' and 'insert', as most versions have the same signature. E.g. 'insert_element("foo")' will add (and return) the element
<foo/>
node.insert( element( "foo" ));
while 'insert_comment("foo")' will add (and return) the comment
<!--foo-->
node.insert( comment( "foo" ));
I'd like to change begin_children and end_children to std::pair<children_iterator, children_iterator> children (the same with attributes).
...or add a container proxy so one would write element->children().begin() instead of element->begin_children(). I'm not sure there is any gain. It seems to be a purely aesthetic issue.
I prefer the container proxy as it is more familiar to STL users - with ( children.first, children.second ) it is harder to tell what first and second are meant to be. I would like to see STL techniques (like the children and attributes types being Container-like types) that would allow leveraging existing STL and boost code. Since an XML document is a tree structure, it might be useful to have a generic tree library first (using STL principles) so you could do something like: typedef nary_tree< xml::node > document_fragment; or even leverage the Boost.Graph library. - Reece

Reece Dunn wrote:
Assuming you are talking about the element class and its methods to add various child node types, I don't quite see how I could overload the function name 'append' and 'insert', as most versions have the same signature. E.g. 'insert_element("foo")' will add (and return) the element
<foo/>
node.insert( element( "foo" ));
while 'insert_comment("foo")' will add (and return) the comment
<!--foo-->
node.insert( comment( "foo" ));
That assumes you can create a freestanding 'element' and 'comment' object, which I deliberately try to avoid, as it would necessitate the API to become more complex. In the simple model I have in mind users would *never* have to care for node ownership, as nodes always live in the context of a document. You can still move or copy nodes from one document to another, but being able to create nodes that aren't attached to a document complicates this simple model quite a bit. [...]
Since an XML document is a tree structure, it might be useful to have a generic tree library first (using STL principles) so you could do something like:
typedef nary_tree< xml::node > document_fragment;
or even leverage the Boost.Graph library.
Yes, you certainly could implement an XML API on top of the boost graph library, but as I said in my original mail, an XML library is *much* more than a means to manipulate trees. Not only would it be a lot of work to (re)implement all the XML specs, but also the domain-specific knowledge that went into the specific tree structure(s) used in libxml2 makes it particularly performant. So, while the boost::xml::dom interface must not be targeted at any particular backend, I don't think that thinking of it as 'yet another tree library' is particularly helpful. Regards, Stefan
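The 'nodes always live in the context of a document' model described in this exchange could be sketched roughly as follows. The types are simplified stand-ins; a real implementation would wrap backend (e.g. libxml2) nodes rather than own them directly.

```cpp
#include <memory>
#include <string>
#include <vector>

// Factory-style node creation: children are created *inside* a parent via
// append_* members and owned by it, so users never manage node lifetimes.
class element
{
public:
    explicit element(std::string name) : name_(std::move(name)) {}
    // The new child is owned by this element; a raw non-owning pointer is
    // handed back for further manipulation.
    element* append_element(const std::string& name)
    {
        children_.push_back(std::make_unique<element>(name));
        return children_.back().get();
    }
    const std::string& name() const { return name_; }
    std::size_t size() const { return children_.size(); }
private:
    std::string name_;
    std::vector<std::unique_ptr<element>> children_;
};

class document
{
public:
    // Only the document itself needs to be managed by the user.
    element* set_root(const std::string& name)
    {
        root_ = std::make_unique<element>(name);
        return root_.get();
    }
private:
    std::unique_ptr<element> root_;
};
```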

Friday, November 4, 2005, 4:19:45 PM, Stefan Seefeld wrote:
Assuming you are talking about the element class and its methods to add various child node types, I don't quite see how I could overload the function name 'append' and 'insert', as most versions have the same signature. E.g. 'insert_element("foo")' will add (and return) the element <foo/>
node.insert( element( "foo" ));
while 'insert_comment("foo")' will add (and return) the comment <!--foo-->
node.insert( comment( "foo" ));
That assumes you can create a freestanding 'element' and 'comment' object, which I deliberately try to avoid, as it would necessitate the API to become more complex.
I disagree - you have classes for different node types, but you don't use it for creating their instances. -- Regards, Janusz

Janusz Piwowarski wrote:
while 'insert_comment("foo")' will add (and return) the comment <!--foo-->
node.insert( comment( "foo" ));
That assumes you can create a freestanding 'element' and 'comment' object, which I deliberately try to avoid, as it would necessitate the API to become more complex.
I disagree - you have classes for different node types, but you don't use it for creating their instances.
Huh ? What is 'comment("foo")' then supposed to be, if not a constructor call ? Regards, Stefan

Janusz Piwowarski wrote:
Saturday, November 5, 2005, 9:49:17 PM, Stefan Seefeld wrote:
Huh ? What is 'comment("foo")' then supposed to be, if not a constructor call ?
Ok, but you don't want to use it. I disagree with placing constructing code in the append_comment function.
So we agree then, fine. I suggest 'append_comment("foo")' while someone else proposed to overload the 'append' name, using something like 'append(comment("foo"))' instead, which would imply explicit node creation, as opposed to nodes being created by factories (such as document and element). Regards, Stefan

Stefan Seefeld wrote:
So we agree then, fine. I suggest 'append_comment("foo")' while someone else proposed to overload the 'append' name, using something like 'append(comment("foo"))' instead, which would imply explicit node creation, as opposed to nodes being created by factories (such as document and element).
It is possible to support append( comment( "foo" )) *and* factories:

    template< typename NodeType >
    struct node_constructor
    {
        typedef NodeType type;
        std::string value;
        node_constructor( const std::string & str ) : value( str ) { }
    };

    node_constructor< comment_factory > comment( const std::string & str )
    {
        return node_constructor< comment_factory >( str );
    }

    node_constructor< cdata_factory > cdata( const std::string & str )
    {
        return node_constructor< cdata_factory >( str );
    }

    struct element
    {
        template< typename Factory >
        void append( const node_constructor< Factory > & cons )
        {
            // call the factory here...
            Factory::make_node( ..., cons.value );
        }
    };

This would make it easier to support push_back( cdata( "foo<bar>" )), etc. without having _comment, _cdata, etc. variants for each insertion function. - Reece

Reece Dunn wrote:
This would make it easier to support push_back( cdata( "foo<bar>" )), etc. without having _comment, _cdata, etc. variants for each insertion function.
Are you seriously suggesting all this overhead just for having a single name being overloaded ? It isn't even less typing for users, so what gain does this provide ? Regards, Stefan

Stefan Seefeld wrote:
Reece Dunn wrote:
This would make it easier to support push_back( cdata( "foo<bar>" )), etc. without having _comment, _cdata, etc. variants for each insertion function.
Are you seriously suggesting all this overhead just for having a single name being overloaded ? It isn't even less typing for users, so what gain does this provide ?
I was just suggesting a way it could be done. I agree that it has additional overhead, so would not really be practical, but it would make the handling of CDATA, comment, text, etc. nodes separate from the document or element type. - Reece

On 11/4/05, Janusz Piwowarski <jpiw@go2.pl> wrote:
And, I'd like to change begin_children and end_children to std::pair<children_iterator, children_iterator> children (the same with attributes).
Or perhaps the even more Boost-y iterator_range<children_iterator>? -- Caleb Epstein caleb dot epstein at gmail dot com

On Fri, Nov 04, 2005 at 08:47:53AM -0500, Stefan Seefeld wrote:
Jose wrote:
On 11/3/05, Stefan Seefeld <seefeld@sympatico.ca> wrote:
If there is enough interest I could add a boost::xml::reader API, though dom and reader are completely independent, as far as the API itself is concerned.
I am interested in the reader API !!
Fine. I eventually will add that API, though I prefer to restrict the current discussion (and submission, if we ever get this far) to be about boost::xml::dom, to avoid unrelated issues getting in the way. Both APIs are orthogonal, so there is no reason to bundle them together.
Now I hope that someone will actually start to send comments about the actual API / code I propose. ;-)
Hi, Here are my thoughts on the subject, somewhat disorganised:

I would also be very interested in seeing a boost XML reader API. Having worked with such APIs I think a 'standard' pull parser model for C++ would be really beneficial - you only have to look at .NET to see how much having a streaming interface to XML has influenced the design of code that builds on top of it, usually in a good way. I think a C++ reader interface should draw a lot from the basic iterator ideas employed in the STL.

IMO a streaming interface is much more important than DOM as a starting point - one can easily and efficiently build a DOM from a stream, but starting with an in-memory representation of a document usually precludes streaming. There are a number of XML applications where it is not desirable or possible to hold the entire document in memory at once. A reader interface has advantages over SAX in that it is much easier to program with. It's very easy to do things like implement decorators around readers, and to write generic code that just understands how to use a reader and doesn't care how the XML is actually stored.

That's not to say I don't think a Boost DOM implementation is a good idea. One thing I would like to see from such an implementation is for it to be policy based, since there are many different use cases for a DOM library. For example some scenarios might only need a read-only tree, which means optimisations can be made in how the nodes are stored. Others might call for efficient access to child elements of a node (e.g. by index) for query, such as when XPath is used. If this kind of thing could be extracted into policies I think it would differentiate such a library from the others that exist already.

An XPath implementation should be completely separated from the XML representation, since it's effectively just an algorithm that can be applied to anything that has the correct data model and iterator interface.

Thanks, Graham -- Graham Bennett
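For illustration, the pull-parser ('reader') interface advocated here could be sketched along these lines. The event model and member names are loose assumptions modelled on .NET's XmlReader, and this demo reads from a pre-tokenised event list rather than lexing real XML; a real reader would produce events lazily from an input stream.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One parse event at a time: the caller drives the parse by calling read()
// and inspecting current(), instead of receiving SAX-style callbacks.
enum class event_kind { start_element, end_element, text, end_of_document };

struct event
{
    event_kind kind;
    std::string value; // element name or text content
};

class reader
{
public:
    explicit reader(std::vector<event> events) : events_(std::move(events)) {}
    // Advance to the next event; returns false once the input is exhausted.
    bool read()
    {
        if (pos_ >= events_.size()) return false;
        current_ = events_[pos_++];
        return true;
    }
    const event& current() const { return current_; }
private:
    std::vector<event> events_; // demo only: a real reader lexes lazily
    std::size_t pos_ = 0;
    event current_{event_kind::end_of_document, ""};
};
```

A DOM builder, a validator, or a decorator that filters events can all be written against this one interface without caring where the events come from.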

On Nov 8, 2005, at 7:44 PM, Graham Bennett wrote:
IMO a streaming interface is much more important than DOM as a starting point - one can easily and efficiently build a DOM from a stream, but starting with an in-memory representation of a document usually precludes streaming. There are a number of XML applications where it is not desirable or possible to hold the entire document in memory at once. A reader interface has advantages over SAX in that it is much easier to program with. It's very easy to do things like implement decorators around readers, and to write generic code that just understands how to use a reader and doesn't care how the XML is actually stored.
Readers are important for some things, DOM is important for other things, but there's no reason to tie the two together in one library or predicate one on the other. We can have a XML DOM library that allows reading, traversing, modifying, and writing XML documents, then later turn the reading part into a full-fledged streaming interface for those applications.
That's not to say I don't think a Boost DOM implementation is a good idea. One thing I would like to see from such an implementation is for it to be policy based, since there are many different use cases for a DOM library. For example some scenarios might only need a read-only tree, which means optimisations can be made in how the nodes are stored. Others might call for efficient access to child elements of a node (e.g. by index) for query, such as when XPath is used. If this kind of thing could be extracted into policies I think it would differentiate such a library from the others that exist already.
[Standard anti-policy rant] Policies should be used very, very carefully. They introduce a huge amount of mental overhead, are very hard to combine sensibly, and create very fragile implementations.
An XPath implementation should be completely separated from the XML representation, since it's effectively just an algorithm that can be applied to anything that has the correct data model and iterator interface.
This is probably the case. However, one can think of places where a tighter integration might give a more natural interface, e.g., xml::element_ptr books = ...; xml::node_set cheap_books = books[attr("price") < 30]; But, like the reader interface, a library that supports something DOM-like can be augmented with XPath support. Doug

[Standard anti-policy rant]
Policies should be used very, very carefully. They introduce a huge amount of mental overhead, are very hard to combine sensibly, and create very fragile implementations.
I don't really agree with any of this points. Would you care to elaborate? Gennadiy

On Nov 8, 2005, at 9:30 PM, Gennadiy Rozental wrote:
[Standard anti-policy rant]
Policies should be used very, very carefully. They introduce a huge amount of mental overhead, are very hard to combine sensibly, and create very fragile implementations.
I don't really agree with any of this points. Would you care to elaborate?
Actually, no. Let's just leave it at this: I've seen the eye-glazing properties of various policy-based approaches firsthand, and had more than my fair share of debugging and maintaining policy-based implementations. Yes, policies are interesting; yes, they are useful in some designs. And yes; you should be very careful when choosing to use them. Doug

"Douglas Gregor" <doug.gregor@gmail.com> wrote in message news:48AD9A3F-398E-43A4-BD36-60D2C9556574@cs.indiana.edu...
On Nov 8, 2005, at 9:30 PM, Gennadiy Rozental wrote:
[Standard anti-policy rant]
Policies should be used very, very carefully. They introduce a huge amount of mental overhead, are very hard to combine sensibly, and create very fragile implementations.
I don't really agree with any of this points. Would you care to elaborate?
Actually, no.
Let's just leave it at this: I've seen the eye-glazing properties of various policy-based approaches firsthand, and had more than my fair share of debugging and maintaining policy-based implementations. Yes, policies are interesting; yes, they are useful in some designs. And yes; you should be very careful when choosing to use them.
Yes, one should be careful. But that's about the only statement in the original post I agree with. Gennadiy

Douglas Gregor wrote:
An XPath implementation should be completely separated from the XML representation, since it's effectively just an algorithm that can be applied to anything that has the correct data model and iterator interface.
This is probably the case. However, one can think of places where a tighter integration might give a more natural interface, e.g.,
xml::element_ptr books = ...; xml::node_set cheap_books = books[attr("price") < 30];
But, like the reader interface, a library that supports something DOM- like can be augmented with XPath support.
I'm not sure I agree. While the syntax you suggest above is quite cute, and probably follows the 'xpath object model' as implied by the XPath spec, the latter also specifies syntax, which your suggestion doesn't adhere to. So, while what you suggest may be useful in itself, it isn't really compatible with xpath. (Think of applications such as xslt processors, where xpath expressions are strings embedded into attributes.) Note that my API already supports XPath. It may be a nice add-on to overload various operators for the xpath type to allow your suggested syntax, but I don't see this becoming really useful. It's just cute. Regards, Stefan

On Nov 8, 2005, at 9:49 PM, Stefan Seefeld wrote:
Douglas Gregor wrote:
This is probably the case. However, one can think of places where a tighter integration might give a more natural interface, e.g.,
xml::element_ptr books = ...; xml::node_set cheap_books = books[attr("price") < 30];
But, like the reader interface, a library that supports something DOM-like can be augmented with XPath support.
I'm not sure I agree. While the syntax you suggest above is quite cute, and probably follows the 'xpath object model' as implied by the XPath spec, the latter also specifies syntax, which your suggestion doesn't adhere to.
We can't match the syntax precisely, of course, but we can apply a "similar" syntax.
So, while what you suggest may be useful in itself, it isn't really compatible with xpath. (Think of applications such as xslt processors, where xpath expressions are strings embedded into attributes.)
I wasn't actually excluding the use of strings as XPath expressions.
Note that my API already supports XPath. It may be a nice add-on to overload various operators for the xpath type to allow your suggested syntax, but I don't see this becoming really useful. It's just cute.
Have you perchance looked at Xpressive? The static constructs in it allow static checking of regexes (because they are built with C++ operators), whereas the dynamic (from-string) constructs allow one more flexibility. The same can occur with XPath: using some C++ operator overloading, we can have statically-checked XPath expressions that also provide more accurate type information (e.g., an element-set instead of a node-set). Of course, we'll always need to be able to fall back to strings as XPath expressions, which can only be checked at run-time. Doug

Douglas Gregor wrote:
Have you perchance looked at Xpressive? The static constructs in it allow static checking of regexes (because they are built with C++ operators), whereas the dynamic (from-string) constructs allow one more flexibility. The same can occur with XPath: using some C++ operator overloading, we can have statically-checked XPath expressions that also provide more accurate type information (e.g., an element-set instead of a node-set). Of course, we'll always need to be able to fall back to strings as XPath expressions, which can only be checked at run-time.
I totally agree. And yes, it would be nice if the xpath sub-type could be used to narrow down the result type, i.e. attribute_set instead of node_set, etc. Regards, Stefan
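A toy sketch of how operator overloading could yield such statically-checked predicates, mirroring the books[attr("price") < 30] example from this exchange, follows. All types here are illustrative assumptions; a real design would build a proper expression tree evaluated against backend nodes, and would narrow result types (attribute_set vs. node_set) as discussed.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in node: a list of (attribute name, numeric value) pairs.
struct node
{
    std::vector<std::pair<std::string, double>> attributes;
};

// Predicate object produced at compile time by attr(...) < bound.
struct attr_less
{
    std::string name;
    double bound;
    bool operator()(const node& n) const
    {
        for (const auto& a : n.attributes)
            if (a.first == name) return a.second < bound;
        return false;
    }
};

struct attr
{
    explicit attr(std::string n) : name(std::move(n)) {}
    std::string name;
};

// The overloaded operator is what makes the expression statically checked:
// a typo like atr("price") fails to compile instead of failing at run time.
inline attr_less operator<(const attr& a, double bound)
{
    return attr_less{a.name, bound};
}

// node_set::operator[] applies the predicate, as in books[attr("price") < 30].
struct node_set
{
    std::vector<node> nodes;
    node_set operator[](const attr_less& pred) const
    {
        node_set out;
        for (const auto& n : nodes)
            if (pred(n)) out.nodes.push_back(n);
        return out;
    }
};
```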

Hi Doug, On Tue, Nov 08, 2005 at 09:28:09PM -0500, Douglas Gregor wrote:
On Nov 8, 2005, at 7:44 PM, Graham Bennett wrote:
IMO a streaming interface is much more important than DOM as a starting point - one can easily and efficiently build a DOM from a stream, but starting with an in-memory representation of a document usually precludes streaming. There are a number of XML applications where it is not desirable or possible to hold the entire document in memory at once. A reader interface has advantages over SAX in that it is much easier to program with. It's very easy to do things like implement decorators around readers, and to write generic code that just understands how to use a reader and doesn't care how the XML is actually stored.
Readers are important for some things, DOM is important for other things, but there's no reason to tie the two together in one library or predicate one on the other.
Well, there is at least one reason - if the DOM is built on top of a reader interface then the DOM library doesn't have to know how to parse XML, and is not tied to any particular parser. Even if you don't agree with using a reader interface for this separation layer, I'd hope you would agree that some separation is at least necessary.
We can have a XML DOM library that allows reading, traversing, modifying, and writing XML documents, then later turn the reading part into a full-fledged streaming interface for those applications.
Can you elaborate on how you would enable a DOM structure to present a streaming interface? Are you talking about lazy tree building or something else? In any case, I would think it's inherently difficult to retrofit a streaming interface. Much better to build the streaming interface from the start, and build the DOM on top of it. This can only be good for both sides - the reader gets to just be a reader, and the DOM gets to just be a DOM.
That's not to say I don't think a Boost DOM implementation is a good idea. One thing I would like to see from such an implementation is for it to be policy based, since there are many different use cases for a DOM library. For example some scenarios might only need a read-only tree, which means optimisations can be made in how the nodes are stored. Others might call for efficient access to child elements of a node (e.g. by index) for query, such as when XPath is used. If this kind of thing could be extracted into policies I think it would differentiate such a library from the others that exist already.
[Standard anti-policy rant]
Ah, so you have an anti-policy policy :o)
Policies should be used very, very carefully. They introduce a huge amount of mental overhead, are very hard to combine sensibly, and create very fragile implementations.
I don't think the things you list here are properties of all policy-based implementations, but agreed they are potential pitfalls to be avoided. The reason I was suggesting using policies for how a DOM is created is simply from experience of working on C++ DOM libraries myself. I don't think it's possible to make a one size fits all library and, unless one wants to create multiple different libraries for different use cases, policies seem like the way to go. But agreed much care would have to be taken.
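A toy sketch of what a storage policy for node children might look like, in the spirit of the read-only versus indexed-access scenarios mentioned above. Names and interfaces are invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Compact policy: children kept in a plain vector, lookups are linear.
struct vector_storage {
    std::vector<std::string> children;
    void add(const std::string& name) { children.push_back(name); }
    std::size_t count(const std::string& name) const {
        std::size_t n = 0;
        for (const auto& c : children)
            if (c == name) ++n;
        return n;
    }
};

// Indexed policy: extra bookkeeping buys fast lookup by name,
// the kind of trade-off an XPath-heavy use case might want.
struct indexed_storage {
    std::vector<std::string> children;
    std::map<std::string, std::size_t> index;  // name -> occurrence count
    void add(const std::string& name) {
        children.push_back(name);
        ++index[name];
    }
    std::size_t count(const std::string& name) const {
        auto it = index.find(name);
        return it == index.end() ? 0 : it->second;
    }
};

// The element template delegates child bookkeeping to the policy.
template <typename StoragePolicy = vector_storage>
class element {
public:
    void append_child(const std::string& name) { storage_.add(name); }
    std::size_t count_children(const std::string& name) const {
        return storage_.count(name);
    }
private:
    StoragePolicy storage_;
};
```

Both instantiations expose the same interface, so code written against `element<>` is unaware of which storage strategy was chosen.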
An XPath implementation should be completely separated from the XML representation, since it's effectively just an algorithm that can be applied to anything that has the correct data model and iterator interface.
This is probably the case. However, one can think of places where a tighter integration might give a more natural interface, e.g.,
xml::element_ptr books = ...;
xml::node_set cheap_books = books[attr("price") < 30];
But, like the reader interface, a library that supports something DOM- like can be augmented with XPath support.
I'm not convinced something separate wouldn't be better and more widely useful. Thanks, Graham -- Graham Bennett
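Doug's `books[attr("price") < 30]` snippet could be realised roughly as below. The node representation (a plain attribute map) and all names are invented for the sketch:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a DOM node: just a bag of attributes.
struct node {
    std::map<std::string, std::string> attributes;
};
using node_set = std::vector<node>;

// attr("price") yields a proxy naming an attribute...
struct attr_proxy { std::string name; };
inline attr_proxy attr(std::string name) { return {std::move(name)}; }

// ...and operator< turns the proxy into a predicate object.
struct attr_less {
    std::string name;
    double bound;
    bool operator()(const node& n) const {
        auto it = n.attributes.find(name);
        return it != n.attributes.end() && std::stod(it->second) < bound;
    }
};
inline attr_less operator<(attr_proxy a, double bound) {
    return {std::move(a.name), bound};
}

// Applying the predicate to a node set returns the matching subset;
// a real library would hang this off operator[] on the node set.
inline node_set filter(const node_set& s, const attr_less& pred) {
    node_set out;
    for (const auto& n : s)
        if (pred(n)) out.push_back(n);
    return out;
}
```

The same expression-object machinery generalises to other comparisons and to combining predicates, which is where the tighter DOM/XPath integration Doug mentions would pay off.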

Graham Bennett wrote:
Hi Doug,
On Tue, Nov 08, 2005 at 09:28:09PM -0500, Douglas Gregor wrote:
Readers are important for some things, DOM is important for other things, but there's no reason to tie the two together in one library or predicate one on the other.
Well, there is at least one reason - if the DOM is built on top of a reader interface then the DOM library doesn't have to know how to parse XML, and is not tied to any particular parser. Even if you don't agree with using a reader interface for this separation layer, I'd hope you would agree that some separation is at least necessary.
I wish people would stop being so parser-focussed. I reiterate: the API I suggest is about manipulating a DOM tree. The fact that you *might* want to construct it from an XML file by means of a parser is almost coincidental. Yes indeed, an implementation of such an XML parser will most likely use either a SAX or an XmlReader layer beneath, and in fact, libxml2 does exactly that; it would be quite natural to expose those APIs to C++ in a way similar to the DOM wrapper I propose.
We can have a XML DOM library that allows reading, traversing, modifying, and writing XML documents, then later turn the reading part into a full-fledged streaming interface for those applications.
Can you elaborate on how you would enable a DOM structure to present a streaming interface?
Not the DOM structure, but the parser! It's exactly what you are saying above: each sensible XML parser will use an API underneath that a public SAX or XmlReader (or both) can be built on top of. But instead of requiring the parser to be built on such a C++ API, I use a C implementation that already contains multiple APIs, and I wrap them *separately* into C++ APIs. For a user of the C++ DOM API it is totally irrelevant whether the implementation is based on the C++ SAX API or an internal C SAX API, as long as it adheres to the specification.
Are you talking about lazy tree building or something else? In any case, I would think it's inherently difficult to retrofit a streaming interface. Much better to build the streaming interface from the start, and build the DOM on top of it. This can only be good for both sides - the reader gets to just be a reader, and the DOM gets to just be a DOM.
You haven't talked about the DOM yet, only about a parser. You still need to provide all the other missing bits, such as an XPath lookup mechanism, XInclude processing, http support for URI lookup, etc., etc. I can't stress it enough: the parser is really just a tiny bit of it all. Regards, Stefan
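The layering Stefan describes, a C callback API wrapped separately into a C++ handler interface, might look roughly like this. The callback signature is invented for the sketch; it is not libxml2's actual SAX API:

```cpp
#include <cassert>
#include <string>

// A C parser typically reports events through function pointers that
// carry a void* user-data argument. This typedef stands in for one
// such callback slot.
typedef void (*start_element_fn)(void* user_data, const char* name);

// The C++ side exposes a virtual-function handler interface instead.
struct sax_handler {
    virtual ~sax_handler() = default;
    virtual void start_element(const std::string& name) = 0;
};

// Trampoline handed to the C parser together with &handler as user
// data; it dispatches each C callback into the C++ handler.
inline void start_element_trampoline(void* user_data, const char* name) {
    static_cast<sax_handler*>(user_data)->start_element(name);
}

// One possible handler among many; a DOM builder would be another,
// which is why the DOM need not know anything about parsing.
struct element_counter : sax_handler {
    int elements = 0;
    void start_element(const std::string&) override { ++elements; }
};
```

The C++ SAX wrapper and the C++ DOM wrapper are independent layers over the same C library, so neither depends on the other.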

On Wed, Nov 09, 2005 at 09:21:45PM -0500, Stefan Seefeld wrote:
Graham Bennett wrote:
Hi Doug,
On Tue, Nov 08, 2005 at 09:28:09PM -0500, Douglas Gregor wrote:
Readers are important for some things, DOM is important for other things, but there's no reason to tie the two together in one library or predicate one on the other.
Well, there is at least one reason - if the DOM is built on top of a reader interface then the DOM library doesn't have to know how to parse XML, and is not tied to any particular parser. Even if you don't agree with using a reader interface for this separation layer, I'd hope you would agree that some separation is at least necessary.
I wish people would stop being so parser-focussed. I reiterate: the API I suggest is about manipulating a DOM tree. The fact that you *might* want to construct it from an XML file by means of a parser is almost coincidental.
I agree that the way the DOM is created doesn't really have anything to do with a parser or anything else. It's perfectly possible to put the DOM together any way you want. I think people have expressed concern that the intention might be to ship the library with a libxml2 (or any specific parser) implementation for building the DOM from text, which I don't think would be a good idea. I was suggesting that having a way to build the DOM from a standardised interface, like a reader, would be a way to separate these concerns.
Yes indeed, an implementation of such an XML parser will most likely use either a SAX or an XmlReader layer beneath, and in fact, libxml2 does exactly that and it would be quite natural to expose those APIs
to C++ in a similar way I propose the DOM wrapper.
Ok, I agree.
We can have a XML DOM library that allows reading, traversing, modifying, and writing XML documents, then later turn the reading part into a full-fledged streaming interface for those applications.
Can you elaborate on how you would enable a DOM structure to present a streaming interface?
Not the DOM structure, but the parser ! It's exactly what you are saying above: Each sensible XML parser will use an API underneath that can be used to build a public SAX or XmlReader (or both) on top of.
But instead of requiring the parser to be built on such a C++ API, I use a C implementation that already contains multiple APIs, and I wrap them *separately* into C++ APIs. For a user of the C++ DOM API it is totally irrelevant whether the implementation is based on the C++ SAX API or an internal C SAX API, as long as it adheres to the specification.
Are you talking about lazy tree building or something else? In any case, I would think it's inherently difficult to retrofit a streaming interface. Much better to build the streaming interface from the start, and build the DOM on top of it. This can only be good for both sides - the reader gets to just be a reader, and the DOM gets to just be a DOM.
You haven't talked about the DOM yet, only about a parser.
I think I wasn't clear in my previous mail. I'm not at all concerned with parsers, there are plenty of them and they do a good job. I'm not suggesting a parser should be implemented. The only thing I am concerned about is that Boost define a standard streaming XML reader API. That is where I think there is a distinct need in C++ at the moment.
You still need to provide all the other missing bits, such as an XPath lookup mechanism, XInclude processing, http support for URI lookup, etc., etc. I can't stress it enough: the parser is really just a tiny bit of it all.
Agreed that the parser is a small part, but so is the DOM. All of the things you mention above can and should be implemented independently of a DOM model, IMO. Please don't think that I'm against a Boost DOM implementation, I think it's a worthy effort and what you have submitted is a good start. I just think that a standardised reader interface is a much more important integration point than DOM, and I'm suggesting that it would be worthwhile putting effort into that area sooner rather than later. cheers, Graham -- Graham Bennett

Graham Bennett wrote:
I wish people would stop being so parser-focussed. I reiterate: the API I suggest is about manipulating a DOM tree. The fact that you *might* want to construct it from an XML file by means of a parser is almost coincidental.
I agree that the way the DOM is created doesn't really have anything to do with a parser or anything else. It's perfectly possible to put the DOM together any way you want. I think people have expressed concern that the intention might be to ship the library with a libxml2 (or any specific parser) implementation for building the DOM from text, which I don't think would be a good idea. I was suggesting that having a way to build the DOM from a standardised interface, like a reader, would be a way to separate these concerns.
Well, I seem to express myself poorly: the creation of the DOM tree itself is only half of the story. You also need to be able to traverse and process it, in a very domain-specific way. That processing involves a lot of knowledge that would have to be coded (XPath, XInclude, XPointer, ...). So, when talking about 'backend' I'm not talking about a DOM factory, but a DOM implementation. I'm not sure how many people really are concerned about the libxml2 backend. One or two have voiced their opposition, not because of my specific choice of backend, but because it wouldn't be an integral part of boost. That's the old 'Not Invented Here' syndrome, and there isn't much to do about it. People will always try to reinvent wheels. That's ok, as long as they don't hinder others from using existing wheels. And finally, I'm not too concerned about backend dependencies, as the real point of the proposal is the API, not the implementation, i.e. it must be possible to replace one backend with another. If people are dissatisfied with having an external library dependency, they can 'easily' write a new implementation and drop that in instead; the user must not be able to tell the difference, or else the API design was a failure.
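The backend-replaceable design described here could be sketched as follows. All names are hypothetical; a real backend would delegate to libxml2 structures:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Abstract implementation interface: everything the public API needs,
// nothing about how it is provided.
struct document_backend {
    virtual ~document_backend() = default;
    virtual std::string root_name() const = 0;
};

// The public document class the user sees; it owns a backend but never
// exposes it, so backends can be swapped without the user noticing.
class document {
public:
    explicit document(std::unique_ptr<document_backend> impl)
        : impl_(std::move(impl)) {}
    std::string root_name() const { return impl_->root_name(); }
private:
    std::unique_ptr<document_backend> impl_;
};

// Stand-in backend for the sketch; a libxml2-based one or a pure C++
// one would implement the same interface.
struct toy_backend : document_backend {
    explicit toy_backend(std::string root) : root_(std::move(root)) {}
    std::string root_name() const override { return root_; }
private:
    std::string root_;
};
```

Since client code only ever touches `document`, replacing `toy_backend` with a libxml2-backed implementation is invisible to the user, which is the success criterion stated above.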
You still need to provide all the other missing bits, such as an XPath lookup mechanism, XInclude processing, http support for URI lookup, etc., etc. I can't stress it enough: the parser is really just a tiny bit of it all.
Agreed that the parser is a small part, but so is the DOM. All of the things you mention above can and should be implemented independently of a DOM model, IMO.
Why should they? 'DOM' is exactly that: a specific 'model'. It's there for a purpose, so other services can be hooked into it.
Please don't think that I'm against a Boost DOM implementation, I think it's a worthy effort and what you have submitted is a good start. I just think that a standardised reader interface is a much more important integration point than DOM, and I'm suggesting that it would be worthwhile putting effort into that area sooner rather than later.
Good. Interdependency between DOM and SAX / Reader aside, I understand your point. I'm tempted to add an XmlReader to the library, as it would be relatively simple to implement on top of the libxml2 API. But for the sake of keeping the discussion focussed on the DOM API I won't. Let's start a different thread for other APIs. I agree that they would be useful. Regards, Stefan

By the way, do you think there is any merit to the idea of introducing policies for the way the DOM is created/stored? Graham -- Graham Bennett

Graham Bennett wrote:
By the way, do you think there is any merit to the idea of introducing policies for the way the DOM is created/stored?
See my other post. A central point of my design is to allow 'backends' to be used to implement the DOM and associated services. Having too fine-grained a control over internals would restrict the possibility of decoupling interface and implementation. Also, I don't see any need for that. What parameters do you have in mind that you would like to represent as policies? Regards, Stefan

On 11/9/05 9:21 PM, "Stefan Seefeld" <seefeld@sympatico.ca> wrote: [SNIP]
I wish people would stop being so parser-focussed. I reiterate: the API I suggest is about manipulating a DOM tree. The fact that you *might* want to construct it from an XML file by means of a parser is almost coincidental. [TRUNCATE]
So, if we had a Tree container class (template), you could build the DOM class out of that? Maybe a tree type is something else we should be looking at for Boost. (If you don't want to wait, you could build your DOM type, internal tree and all, and later consider promoting that tree type as a fully acknowledged Boost type.) -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com
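The tree-container idea could look something like this minimal generic n-ary tree, which a DOM node type could then specialise. The interface is invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Generic n-ary tree node: a value plus owned children. A DOM would
// instantiate T with its node payload (name, attributes, text, ...).
template <typename T>
struct tree_node {
    T value;
    std::vector<std::unique_ptr<tree_node>> children;

    // Append a child and return a pointer to it for further building.
    tree_node* add_child(T v) {
        auto child = std::make_unique<tree_node>();
        child->value = std::move(v);
        children.push_back(std::move(child));
        return children.back().get();
    }

    // Number of nodes in this subtree, including this one.
    std::size_t size() const {
        std::size_t n = 1;
        for (const auto& c : children) n += c->size();
        return n;
    }
};
```

Whether such a container belongs in Boost on its own, as Daryle suggests, is a separate question from whether a DOM should be required to use it.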

Daryle Walker wrote:
On 11/9/05 9:21 PM, "Stefan Seefeld" <seefeld@sympatico.ca> wrote:
[SNIP]
I wish people would stop being so parser-focussed. I reiterate: the API I suggest is about manipulating a DOM tree. The fact that you *might* want to construct it from an XML file by means of a parser is almost coincidental.
[TRUNCATE]
So, if we had a Tree container class (template), you could build the DOM class out of that? Maybe a tree type is something else we should be looking for Boost. (If you don't want to wait, you could build your DOM type, internal tree and all, and later consider promoting that tree type as a fully acknowledged Boost type.)
The question of whether the boost graph library and spirit could be used came up before; I already commented on it in a previous posting in this thread. The argument is essentially the same here. Regards, Stefan

On 11/9/05, Graham Bennett <graham-boost@simulcra.org> wrote:
IMO a streaming interface is much more important than DOM as a starting point - one can easily and efficiently build a DOM from a stream, but starting with an in-memory representation of a document usually precludes streaming. There are a number of XML applications where it is not desirable or possible to hold the entire document in memory at once. A reader interface has advantages over SAX in that it is much easier to program with. It's very easy to do things like implement decorators around readers, and to write generic code that just understands how to use a reader and doesn't care how the XML is actually stored.
I completely agree. I think the basic interface should be a reader interface. Also, as per my previous post, this basic interface would not have any dependencies on external libraries like libxml2, which I think is very bad for a boost library, not to mention the licensing issues! I think a very simple way to look at this is based on the XML file size:
reader API --> any file
SAX API --> any file
DOM API --> small files in memory
So we should start with the general solution and build on top of that. Also, has anybody checked the Spirit XML examples? Wouldn't a reader API + Spirit give an entry point to a big percentage of the XML problem domain?

On Wed, Nov 09, 2005 at 11:45:29AM +0100, Jose wrote:
On 11/9/05, Graham Bennett <graham-boost@simulcra.org> wrote:
IMO a streaming interface is much more important than DOM as a starting point - one can easily and efficiently build a DOM from a stream, but starting with an in-memory representation of a document usually precludes streaming. There are a number of XML applications where it is not desirable or possible to hold the entire document in memory at once. A reader interface has advantages over SAX in that it is much easier to program with. It's very easy to do things like implement decorators around readers, and to write generic code that just understands how to use a reader and doesn't care how the XML is actually stored.
I completely agree. I think the basic interface should be a reader interface. Also as per my previous post, this basic interface would not have any dependencies on external libraries like libxml2, which I think is very bad for a boost library, not to mention the licensing issues !
Yes, some degree of separation is a must. Libxml2 is a very good and portable library, but explicit dependencies are a bad thing, especially if they can be avoided easily.
I think a very simple way to look at this is based on the XML file size
reader API --> any file SAX API --> any file DOM API --> small files in memory
So we should start with the general solution and build on top of that. Also, has anybody checked the Spirit XML examples?
Wouldn't a reader API + Spirit give an entry point to a big percentage of the XML problem domain?
The reader API would give you the entry point; a Spirit-based parser implementation (I'm no Spirit expert) would seem to make sense as the default implementation to provide with Boost, so as not to add any more dependencies. Thanks, Graham -- Graham Bennett

Graham Bennett wrote:
Yes, some degree of separation is a must. Libxml2 is a very good and portable library, but explicit dependencies are a bad thing, especially if they can be avoided easily.
Easily?!? You can't be serious. 'No dependencies' would imply reimplementing the functionality provided by libxml2 entirely inside boost. If that were easy, I wonder why it hasn't happened so far. Regards, Stefan

On Wed, Nov 09, 2005 at 09:24:22PM -0500, Stefan Seefeld wrote:
Graham Bennett wrote:
Yes, some degree of separation is a must. Libxml2 is a very good and portable library, but explicit dependencies are a bad thing, especially if they can be avoided easily.
Easily ?!? You can't be serious. 'No dependencies' would imply to reimplement the functionality provided by libxml2 entirely inside boost. If that is easy I wonder why it hasn't happened so far.
No, no. I'm not suggesting that at all. I'm just saying that Boost should provide the standard interface and the implementation of parsing is a separate issue which shouldn't concern Boost at all. I think that's the same thing you are saying. Boost doesn't need to ship with a way of parsing XML at all. Graham -- Graham Bennett
participants (19)
- Alan Gutierrez
- Anthony Williams
- Caleb Epstein
- Cromwell Enage
- Daryle Walker
- David Abrahams
- Doug Gregor
- Douglas Gregor
- Gennadiy Rozental
- Graham Bennett
- Janusz Piwowarski
- Jez Higgins
- Jose
- Marshall Clow
- Matthias Troyer
- Reece Dunn
- Stefan Seefeld
- Steinar Bang
- Suman Cherukuri