[xml] Brainstorming / Request for Comments, Suggestions, Opinions ...

Hi,

A few months ago I said I'd take on writing an XML library for Boost. Well, I've finally got some time on my hands and started with a little bit of brainstorming about the library. I've written up my thoughts in this hopefully halfway comprehensible document and would like to hear everyone's suggestions, opinions, advice, requirements, etc. Especially real-world requirements, as my own are just those of a single person, and that isn't exactly a good basis for a general-purpose XML library.

The current brainstorming is only for the XML reading side of things. The writing side will come afterwards. So here goes.

------------------------------------------------------------------

Purpose of document: Identify important decisions in the design of a C++ XML parser library.

1) API TYPE

Pull-API (StAX), Push-API (SAX), Object-Model-API (DOM)?

- All of them, of course! The main question is, which one is the base API?
- DOM is out of the question (performance/memory overhead).
- Implementing a push parser on top of a pull parser is trivial:
    while(fetchEvent()) pushEvent()
- Implementing a pull parser on top of a push parser requires at least generator-style coroutines. This incurs a performance overhead at best, unusability at worst (in limited environments).
- It is therefore best to use a pull model at the lowest level, although this makes the parser implementation more complex.

2) Pull Interface

There are several models of pull interfaces in use. These are for Java and C#, so a C++ parser does not necessarily have to use any of them.

Existing APIs:
--- Java
- XMLPull
- StAX (JSR 173)
- Xerces XNI Pull Configuration
--- .Net
- .Net XmlReader
--- Python
- Python has very little material on pull parsers. There seem to be some available, but they're not popular.
--- Ruby
- Ruby has a built-in XML library with pull support. The API is not yet stable but seems to resemble .Net's XmlReader.
--- C++
- An early version of XPP has a C++ implementation. The interface is Java-style.

API Styles:

From the above APIs, we can gather the following:
- Pull parsing always involves calling a method to obtain the next piece of document information (called "event"), then processing that piece.
- Two main models seem typical:
-- StAX has a nextEvent() method that returns a reference to an object, identified by a base interface XMLEvent. This reference can then be cast to the appropriate sub-interface. This is the polymorphism approach.
-- XMLPull and .Net's XmlReader also have a next()/read() method. However, they do not return an event object but instead store the information internally, to be queried by special methods. This model is also used by REXML, Ruby's parser. This is the monolith model.

Polymorphism pro:
- State is not held in parser object. Calling next() does not necessarily discard the old information.
- Once the correct interface is obtained, all methods on it are guaranteed to work. With the monolith model, calling the wrong method may lead to exceptions or error returns.

Monolith pro:
- Does not need to allocate an object for each parse event. Can in fact hold information in a very compact way internally.
- No casts necessary.

Options:
- What are the options that a C++ API has?
- Polymorphism-style API. Return smart pointer? Returning allocated object in raw pointer unacceptable. Returning pointer to static storage possible, but is basically monolith in disguise. Cannot pass an existing object IN to be filled with data.
- Monolith-style API. Means, among other things, that passing the current data to a function means passing a reference to the entire parser. Furthermore, it is not possible to pass a small object containing only the data from the actual event to a function, unless that object is written by the API user. Thus, every function would have to either have its own switch on the event type or assert that the passed-in object contains the right data.
- Union of events, like a Boost.Variant. This seems a good compromise between polymorphic and monolithic approaches:
-- State not in parser object, but separate.
-- No dynamic allocation: Variant is usually stack-based.
-- Obtain the actual object from it and use that. Check is required only once, other uses are statically checked.
-- Can pass in the variant as an out parameter, saving even the copy.
  Of course, the variant has downsides, mainly that you have to either cast or use a static visitor.
- Other downsides?
- Other approaches?

3) Input/Output System

How does the library access underlying storage?

- Since it needs to access resources from various sources, typically specified as URLs, it needs a flexible and runtime-switchable input system.
- In particular, it should be possible to plug schema resolvers in at runtime, so that program extensions can provide support for, say, the ftp: schema.
- Two basic options:
-- Iterator-based approach.
-- Stream-based approach.
-- Other?
- Iterators are tricky to switch at runtime, and non-trivial to implement.
- Streams are easier to implement, especially in a polymorphic fashion, but they are a poor abstraction of things like memory-mapped files. Does that matter?
- Streams, not necessarily being random-access, require caching for backtracking libraries like Spirit to work. Alternative: hand-write the parser. XML is not, in my opinion, particularly suited to being implemented with Spirit anyway.
- Is it even possible to have iterators model non-blocking I/O?
- Having tried a few experiments, I favour streams. Iterators are somewhat icky to work with in a hand-written parser, especially as they always need to be passed in pairs (or as a range).

4) Integration With Other Boost Libraries

What other Boost libraries should Xml work/integrate with?

- For example, does it make sense to provide an interface to the parser that can be used for parsing streaming content? Either non-blocking, with the option to parse partial data and hop back on missing content, or a completely asynchronous implementation that dispatches SAX events through e.g. ASIO?

5) Parser Back-End / Library Organization

- Should Boost.Xml be a complete XML solution, with a parser, DOM implementation and everything?
- Or should it be split into two parts, one being a parser, the other a DOM implementation with various construction modes?
- Or should even the core parser be split into the actual text parser and the event/pull/whatever interface, so that an HTML or YAML or PYX parser or even an algorithmic content generator can be placed behind?
- What, then, is the interface between that parser and the user interface?

6) Other Issues

???
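
To make the "union of events" option from section 2 more concrete, here is a minimal sketch of what such a pull interface might look like. It is only an illustration under stated assumptions: std::variant (C++17) stands in for Boost.Variant, and every name (start_element, toy_pull_parser, next, describe, render) is hypothetical and not taken from any existing library. The toy parser merely replays a prepared event list instead of reading XML text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical event types, one per kind of document information.
struct start_element { std::string name; };
struct end_element   { std::string name; };
struct characters    { std::string text; };

// The "union of events": state lives here, not in the parser object.
using event = std::variant<start_element, end_element, characters>;

// Toy stand-in for a parser: replays a prepared list of events.
// A real parser would produce them from its input instead.
class toy_pull_parser {
public:
    explicit toy_pull_parser(std::vector<event> events)
        : events_(std::move(events)) {}

    // Fills the caller's variant (the out parameter idea) and
    // returns false once no events are left.
    bool next(event& out) {
        if (pos_ == events_.size()) return false;
        out = events_[pos_++];
        return true;
    }

private:
    std::vector<event> events_;
    std::size_t pos_ = 0;
};

// Static visitor: the event type is checked once, by std::visit;
// inside each overload the concrete type is known at compile time.
struct describe {
    std::string operator()(const start_element& e) const { return "<" + e.name + ">"; }
    std::string operator()(const end_element& e) const { return "</" + e.name + ">"; }
    std::string operator()(const characters& e) const { return e.text; }
};

// Pulls every event and concatenates its textual description.
std::string render(toy_pull_parser& parser) {
    std::string out;
    event ev;  // reused for every call to next()
    while (parser.next(ev)) out += std::visit(describe{}, ev);
    return out;
}
```

A function that only cares about one event kind can take, say, a const start_element& directly, which is the "small object containing only the data from the actual event" that the monolith model cannot offer.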

Sebastian Redl wrote:
Hi,
A few months ago I said I'd take on writing an XML library for Boost. Well, I've finally got some time on my hands and started with a little bit of brainstorming about the library.
Have you followed the discussions around my proposal for an XML API in boost that I implemented on top of libxml2 (http://xmlsoft.org/) ?

I think that starting a new implementation from scratch is the wrong way to approach this (rather big) topic. This in particular since an 'XML library' shouldn't just provide ways to de- and encode XML documents into generic tree structures, but instead needs to provide quite a substantial amount of functionality in order to be considered complete (even if you approach this in a modular way). As an example, imagine querying your DOM-like structure with an XPath expression. Think about all this involves, from regular expression handling, over XPath pattern matching, http lookup, entity handling, unicode, etc., etc.

This is why I don't think that you should think about such a project one step at a time (e.g. the 'XML reading side of things').

Regards, Stefan
-- ...ich hab' noch einen Koffer in Berlin...

On Wed, September 6, 2006 7:30 pm, Stefan Seefeld wrote:
Have you followed the discussions around my proposal for an XML API in boost that I implemented on top of libxml2 (http://xmlsoft.org/) ?
I wasn't around for the early discussions, but have caught up on them now. It's an interesting discussion, though I'm not sure how relevant it is. I'll elaborate later.
I think that starting a new implementation from scratch is the wrong way to approach this (rather big) topic.
This in particular since an 'XML library' shouldn't just provide ways to de- and encode XML documents into generic tree structures, but instead needs to provide quite a substantial amount of functionality in order to be considered complete (even if you approach this in a modular way). As an example, imagine querying your DOM-like structure with an XPath expression. Think about all this involves, from regular expression handling, over XPath pattern matching, http lookup, entity handling, unicode, etc., etc.
This is why I don't think that you should think about such a project one step at a time (e.g. the 'XML reading side of things').
It seems to me that your earlier proposal was mainly about a few API specifications, that were then supposed to be implemented somehow - preferably on top of an existing XML library, in order to avoid reinventing the wheel. This idea certainly has a lot of merit, but it also has some distinct disadvantages.

First, an API specification is nice for standardization, but not very usable within the context of Boost. In order to be useful, there must be at least one implementation of the API. Otherwise, the specification is worth nothing to the end user. This implementation must exist within Boost, i.e. it must be completely contained within Boost. Libraries like Regex and Iostreams offer enhanced functionality if certain external libraries are available, but they will work without them, too. Obviously, the Xml library could not work without the external XML implementation if it is just a wrapper around it.

This means that, if the library is a wrapper around an external one, the external library (let's for argument's sake assume libxml2, which seems to bring less licensing trouble compared to Xerces, the only other sufficiently complete XML library I can think of) must be distributed with Boost. What does this entail?

The library must build as part of Boost. I haven't checked, but I assume libxml2's build system right now is based on automake. That would have to be translated to Boost.Build. As part of this process, configuration macros might need to be translated. This could easily lead to a real fork of the code base. Unless Boost wants to rely on the regression testing done by the authors of libxml2, regression tests, portability tests and everything else must be written and maintained.

And last but certainly not least, there's the licensing issue. Boost is working hard to get all code under the Boost license. Would we want an external library under any other license, no matter how permissive, in that code base? Or would the authors of libxml2 permit relicensing of the source? (As a programmer, I'd rather reimplement a library than pursue such goals. ;) )

Second, the recommendation focused on a DOM-style API. As at least two people [1][2] pointed out, DOM-style APIs are not as universally useful as other APIs. That said, I do intend to provide a DOM-style API, but only after having completed the event-based API and thought long and hard about what a DOM-style API means in C++.

Still, this is one of the main reasons why I asked for real-world use cases. My own uses of XML have usually been satisfied by SAX, although I would have preferred a pull-style API. I'd love to hear how other people use XML. I know that two Boost-internal uses could work with a pull API very well: Property Tree's XML reader and the Serialization XML archive.

To sum up, I do believe we should reinvent the wheel here. But we should create an improved wheel, and I think the Boost community is uniquely suited to create a wheel that works particularly well with C++.

To maintain thread integrity, I'll reply to each post individually.

[1] http://lists.boost.org/Archives/boost/2005/11/96131.php
[2] http://lists.boost.org/Archives/boost/2005/11/96521.php

Sebastian Redl wrote:
It seems to me that your earlier proposal was mainly about a few API specifications, that were then supposed to be implemented somehow - preferably on top of an existing XML library, in order to avoid reinventing the wheel.
This idea certainly has a lot of merit, but it also has some distinct disadvantages.
First, an API specification is nice for standardization, but not very usable within the context of Boost. In order to be useful, there must be at least one implementation of the API. Otherwise, the specification is worth nothing to the end user.
Indeed.
This implementation must exist within Boost, i.e. it must be completely contained within Boost.
Huh ? While I can see advantages in code being contained within the boost source / binary packages, I think that is by no means a requirement. (FWIW, we had a discussion about that specific point, and agreed that there was no such requirement.)
Libraries like Regex and Iostreams offer enhanced functionality if certain external libraries are available, but they will work without them, too. Obviously, the Xml library could not work without the external XML implementation if it is just a wrapper around it.
Quite correct, and in order to make sure the API can be implemented by other means (i.e. without libxml2) we have to make sure it is neutral, i.e. no implementation-specific aspects percolate through to the API. But it doesn't mean the reference implementation has to be free of third-party code. Otherwise, would you prefer boost.python to ship its own 'Bython' implementation ? ;-)
This means that, if the library is a wrapper around an external one, the external library (let's for argument's sake assume libxml2, which seems to bring less licensing trouble compared to Xerces, the only other sufficiently complete XML library I can think of) must be distributed with Boost. What does this entail? The library must build as part of Boost. I haven't checked, but I assume libxml2's build system right now is based on automake. That would have to be translated to Boost.Build. As part of this process, configuration macros might need to be translated. This could easily lead to a real fork of the code base. Unless Boost wants to rely on the regression testing done by the authors of libxml2, regression tests, portability tests and everything else must be written and maintained. And last but certainly not least, there's the licensing issue. Boost is working hard to get all code under the Boost license. Would we want an external library under any other license, no matter how permissive, in that code base? Or would the authors of libxml2 permit relicensing of the source? (As a programmer, I'd rather reimplement a library than pursuing such goals. ;) )
Second, the recommendation focused on a DOM-style API. As at least two people [1][2] pointed out, DOM-style APIs are not as universally useful as other APIs.
Indeed, I was focussing on the DOM-style API, but that's only because I happened to use it at that time, and so had an immediate need. I indicated that a SAX-like (better: XMLReader-like) API would follow, if there was enough interest. (libxml2 naturally builds its own tree API on top of such an xml_reader.)
To sum up, I do believe we should reinvent the wheel here. But we should create an improved wheel, and I think the Boost community is uniquely suited to create a wheel that works particularly well with C++.
What improvements would your implementation offer ? (Note that I'm specifically asking for the implementation, as the whole argument seems to be much less concerned about the API. Or, would there be API improvements that couldn't be implemented using third-party libraries such as libxml2 ?)

Thanks, Stefan
-- ...ich hab' noch einen Koffer in Berlin...

Sebastian Redl wrote:
I know that two Boost-internal uses could work with a pull API very well: Property Tree's XML reader and the Serialization XML archive.
To sum up, I do believe we should reinvent the wheel here. But we should create an improved wheel, and I think the Boost community is uniquely suited to create a wheel that works particularly well with C++.
If you've looked at serialization of xml you'll see that it uses spirit and an xml grammar derived from a complete one included in the spirit library. If you really don't want to re-invent the wheel, then why not just use the same approach? The spirit library contains a very complete XML grammar already. I don't recall this even having been considered - much less rejected.

Creating XML is easy. Handling issues of code conversion (e.g. from locale specific code to UTF-8 or UTF-16) is easily handled with i/o stream facets - some of which are also already available. It is also already "Boost Friendly" and works with all boost platforms. It also does most of the heavy lifting at compile time - very much in line with other boost tools.

I can't understand why any other approach would be attractive for users of other boost libraries. To me the whole idea is "re-inventing the wheel"

Robert Ramey

On 9/8/06, Robert Ramey <ramey@rrsd.com> wrote:
If you've looked at serialization of xml you'll see that it uses spirit and an xml grammar derived from a complete one included in the spirit library. If you really don't want to re-invent the wheel, then why not just use the same approach? The spirit library contains a very complete XML grammar already. I don't recall this even having been considered - much less rejected. Creating XML is easy. Handling issues of code conversion (e.g. from locale specific code to UTF-8 or UTF-16) is easily handled with i/o stream facets - some of which are also already available. It is also already "Boost Friendly" and works with all boost platforms. It also does most of the heavy lifting at compile time - very much in line with other boost tools. I can't understand why any other approach would be attractive for users of other boost libraries. To me the whole idea is "re-inventing the wheel"
The only argument I'd have against it is that libraries like expat and libxml2 are hugely optimized, and I have found in the past that XML parsing can account for a significant percentage of runtime in some applications. Has anyone done any performance benchmarks on the spirit xml parser? In the past I've found that, past a certain level of complexity, spirit parsers are substantially slower than handcrafted alternatives, but that could simply be because I'm not an expert in writing "good" spirit parsers.

"Sebastian Redl" <sebastian.redl@getdesigned.at> writes:
This implementation must exist within Boost, i.e. it must be completely contained within Boost.
Not if you put any stock in past Boost discussions and the intention behind Boost. We've discussed this many times and we've always said that a library whose implementation just happens to be as a C++ wrapper over code available elsewhere could be valuable and appropriate for Boost. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Sebastian Redl" <sebastian.redl@getdesigned.at> writes:
This implementation must exist within Boost, i.e. it must be completely contained within Boost.
Not if you put any stock in past Boost discussions and the intention behind Boost. We've discussed this many times and we've always said that a library whose implementation just happens to be as a C++ wrapper over code available elsewhere could be valuable and appropriate for Boost.
And just to reinforce the point, the MPI library under review can't operate without another package. Nor can the Python lib. iostreams and other libraries use external code for some features. So there is certainly precedent for the use of external libraries.

Jeff

Sebastian Redl wrote :
1) API TYPE Pull-API (StAX), Push-API (SAX), Object-Model-API (DOM)?
- All of them, of course! The main question is, which one is the base API? - DOM is out of the question (performance/memory overhead). - Implementing a push parser on top of a pull parser is trivial: while(fetchEvent()) pushEvent() - Implementing a pull parser on top of a push parser requires at least generator-style coroutines. This incurs a performance overhead at best, unusability at worst (in limited environments). - It is therefore best to use a pull model at the lowest level, although this makes the parser implementation more complex.
It could also be possible to make the push and pull parsers more or less independent, so that each one can be as efficient as it can be.
3) Input/Output System How does the library access underlying storage?
- Since it needs to access resources from various sources, typically specified as URLs, it needs a flexible and runtime-switchable input system. - In particular, it should be possible to plug schema resolvers in at runtime, so that program extensions can provide support for, say, the ftp: schema.
That would be the work of another library, one that would provide a way to read any kind of resource from a URL, a bit like what PHP has. That kind of library would also be very useful outside of the XML library.
- Two basic options: - Iterator-based approach. - Stream-based approach. - Other?
Maybe a more low-level approach like what boost asio provides could be interesting, especially since this model also provides asynchronous I/O.
4) Integration With Other Boost Libraries What other Boost libraries should Xml work/integrate with?
Since XML needs good Unicode support and the like, maybe there is work to be done in that area first in boost.
- For example, does it make sense to provide an interface to the parser that can be used for parsing streaming content? Either non-blocking, with the option to parse partial data and hop back on missing content, or a completely asynchronous implementation that dispatches SAX events through e.g. ASIO?
The ability to parse partial content would be a great plus.
5) Parser Back-End / Library Organization
- Should Boost.Xml be a complete XML solution, with a parser, DOM implementation and everything?
Writing a complete XML solution is a lot of work, especially if you want to support all XML technologies (XMLSchema, RelaxNG, XPath, XLink, XInclude, XPointer...) Maybe it could be interesting to reuse libxml2, which is under the MIT license, to build something on top of it. Of course first we need to weigh the gains of a new C++ implementation.
- Or should it be split into two parts, one being a parser, the other a DOM implementation with various construction modes? - Or should even the core parser be split into the actual text parser and the event/pull/whatever interface, so that an HTML or YAML or PYX parser or even an algorithmic content generator can be placed behind? - What, then, is the interface between that parser and the user interface?
6) Other Issues ???

loufoque wrote:
Writing a complete XML solution is a lot of work, especially if you want to support all XML technologies (XMLSchema, RelaxNG, XPath, XLink, XInclude, XPointer...) Maybe it could be interesting to reuse libxml2, which is under the MIT license, to build something on top of it. Of course first we need to weigh the gains of a new C++ implementation.
FWIW, we had discussions to that effect in the past. See
http://aspn.activestate.com/ASPN/Mail/Message/boost/1653426
http://aspn.activestate.com/ASPN/Mail/Message/boost/1684091

I think a frequent mistake people make when thinking about XML is that they assume it's all about parsing. As you point out, there are quite a lot of more or less interdependent aspects of XML. While it would be good to keep a potential C++ API as modular as possible, there are certain limits. In particular, I think that, while the API may be kept modular, an implementation may actually want to share code to make certain operations faster. libxml2 for example has been heavily tuned for performance, and I believe it would be foolish to even think about starting anew, as opposed to leveraging this knowledge.

FWIW, the DOM API / implementation I proposed (see above) was 'only' lacking in its parametrization for unicode types (specifically, the importing / exporting to and from user unicode types to the library internal types). I hoped to find the time to finish that work, but so far didn't manage to. I'd appreciate any help !

Regards, Stefan
-- ...ich hab' noch einen Koffer in Berlin...

On Thu, September 7, 2006 11:04 am, loufoque wrote:
It could also be possible to make the push and pull parsers more or less independent, so that each one can be as efficient as it can be.
True. Basically, by making a direct push parser implementation, you can avoid the overhead of state saving that a pull parser requires. However, that effectively means duplicating the work, so the library should be written in such a way that the client can easily substitute the push parser that is implemented on top of a pull parser with a direct push parser.
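
For reference, the layered variant (a push parser implemented on top of a pull parser, i.e. the `while(fetchEvent()) pushEvent()` loop from the original posting) might be sketched as follows. All names here (event_kind, pull_source, handler, push_parse) are made up for illustration, and the pull source just replays a prepared event list rather than parsing text. Because clients only see the handler interface, a direct push parser could later be substituted without touching them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class event_kind { start_element, end_element, characters };

// Monolith-style event record: a kind tag plus one payload string.
struct event {
    event_kind kind;
    std::string data;  // element name or character data
};

// Hypothetical stand-in for a pull parser: replays prepared events.
class pull_source {
public:
    explicit pull_source(std::vector<event> events)
        : events_(std::move(events)) {}

    // Stores the next event in 'out'; returns false at end of input.
    bool fetch_event(event& out) {
        if (pos_ == events_.size()) return false;
        out = events_[pos_++];
        return true;
    }

private:
    std::vector<event> events_;
    std::size_t pos_ = 0;
};

// SAX-like callback interface seen by push-style clients.
struct handler {
    virtual ~handler() = default;
    virtual void on_start_element(const std::string& name) = 0;
    virtual void on_end_element(const std::string& name) = 0;
    virtual void on_characters(const std::string& text) = 0;
};

// The push layer: pull each event, then dispatch it to the handler.
void push_parse(pull_source& source, handler& h) {
    event ev{event_kind::characters, std::string()};
    while (source.fetch_event(ev)) {
        switch (ev.kind) {
            case event_kind::start_element: h.on_start_element(ev.data); break;
            case event_kind::end_element:   h.on_end_element(ev.data);   break;
            case event_kind::characters:    h.on_characters(ev.data);    break;
        }
    }
}

// Example handler that records the callbacks it receives.
struct recording_handler : handler {
    std::string log;
    void on_start_element(const std::string& n) override { log += "S:" + n + ";"; }
    void on_end_element(const std::string& n) override { log += "E:" + n + ";"; }
    void on_characters(const std::string& t) override { log += "C:" + t + ";"; }
};
```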
- Since it needs to access resources from various sources, typically specified as URLs, it needs a flexible and runtime-switchable input system. - In particular, it should be possible to plug schema resolvers in at runtime, so that program extensions can provide support for, say, the ftp: schema.
That would be the work of another library, one that would provide a way to read any kind of resource from a URL, a bit like what PHP has. That kind of library would also be very useful outside of the XML library.
True, and I have no intention of providing such a library within the Xml library. But it is something that needs to be kept in mind when thinking about I/O.
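
As a rough picture of what "runtime-switchable" input could mean with the stream-based approach, here is a small sketch of a resolver registry keyed by URL scheme (the thread calls it "schema"). This is an assumption-laden illustration, not a proposed interface: resolver_registry, register_scheme and open are invented names, and the only thing the registry does is map the text before the first ':' to a callback that returns a std::istream.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <map>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// A resolver turns a URL into an input stream. Hypothetical signature.
using opener =
    std::function<std::unique_ptr<std::istream>(const std::string& url)>;

class resolver_registry {
public:
    // Program extensions can call this at runtime to add a scheme.
    void register_scheme(const std::string& scheme, opener fn) {
        openers_[scheme] = std::move(fn);
    }

    // Picks the resolver by the text before the first ':' in the URL.
    std::unique_ptr<std::istream> open(const std::string& url) const {
        const std::string::size_type colon = url.find(':');
        if (colon == std::string::npos)
            throw std::invalid_argument("URL has no scheme: " + url);
        const std::string scheme = url.substr(0, colon);
        const auto it = openers_.find(scheme);
        if (it == openers_.end())
            throw std::runtime_error("no resolver for scheme: " + scheme);
        return it->second(url);
    }

private:
    std::map<std::string, opener> openers_;
};
```

The parser itself would then only ever see a std::istream, so any transcoding or caching layer could be stacked in the same place.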
Maybe a more low-level approach like what boost asio provides could be interesting, especially since this model also provides asynchronous I/O.
I'm not sure how useful async I/O is for a parser. Parsing of incomplete data seems more important. If you have that and you want asynchronous parsing events, you can start the async I/O and have a handler that parses the newly received data, posting events to the async queue. The Xml library could provide such a handler, but that would be a very independent feature. The problem with the really low-level approach is the one you mention next.
Since XML needs good Unicode support and the like, maybe there is work to be done in that area first in boost.
Oh, yes, the Unicode problem. It would take an examination of systems in use, but my impression is that most programs use either UTF-16 or UTF-8 as their internal coding. With that in mind, I think it might be best to have the XML library internally support exactly these two encodings (perhaps as two template specializations) and interact with the user only in these two encodings. The transcoding of whatever external character set/encoding is used would then be an issue for the I/O interface. However, such transcoding requires the I/O interface to be sufficiently abstractable to provide it transparently - which is an obstacle for the low-level approach you suggest above.
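
To illustrate the "two template specializations" idea, a sketch along these lines is conceivable. The names encoding, append and encode are hypothetical, the choice of char for UTF-8 and char16_t for UTF-16 is an assumption, and the code does no validation (surrogate code points and values above U+10FFFF are not rejected). It only shows how one internal code point sequence could be written out in either of the two encodings.

```cpp
#include <cassert>
#include <string>

// Primary template intentionally left undefined: only the two
// supported internal encodings get a specialization.
template <typename CharT>
struct encoding;

// UTF-8, stored in std::string.
template <>
struct encoding<char> {
    using string_type = std::string;
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

// UTF-16, stored in std::u16string.
template <>
struct encoding<char16_t> {
    using string_type = std::u16string;
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
};

// Encodes a sequence of code points in the encoding chosen by CharT.
template <typename CharT>
typename encoding<CharT>::string_type encode(const std::u32string& code_points) {
    typename encoding<CharT>::string_type out;
    for (char32_t cp : code_points) encoding<CharT>::append(out, cp);
    return out;
}
```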
The ability to parse partial content would be a great plus.
Yes, that seems to be important. However, it should be at the discretion of the user to switch it off, enabling the parser to work with a single lookahead character. (To support partial content, either the parser needs to support extremely complex state saving, or cache content until a complete event has been generated.)
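
The second strategy (cache content until a complete event is available) could be sketched like this. It is a deliberately naive illustration: chunk_tokenizer, feed and pending are invented names, and the splitting rule (a tag ends at the next '>', text ends at the next '<') ignores comments, CDATA sections and '>' inside attribute values, all of which a real parser would have to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Buffers incoming data and hands out only tokens that are complete.
class chunk_tokenizer {
public:
    // Appends newly received data and returns every token that is
    // now complete; anything unfinished stays cached for next time.
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> tokens;
        std::size_t pos = 0;
        while (pos < buffer_.size()) {
            if (buffer_[pos] == '<') {
                const std::size_t close = buffer_.find('>', pos);
                if (close == std::string::npos) break;  // tag not finished yet
                tokens.push_back(buffer_.substr(pos, close - pos + 1));
                pos = close + 1;
            } else {
                const std::size_t open = buffer_.find('<', pos);
                if (open == std::string::npos) break;  // text may continue
                tokens.push_back(buffer_.substr(pos, open - pos));
                pos = open;
            }
        }
        buffer_.erase(0, pos);  // keep only the incomplete remainder
        return tokens;
    }

    // Data received so far that does not yet form a complete token.
    const std::string& pending() const { return buffer_; }

private:
    std::string buffer_;
};
```

Feeding "<doc><it", then "em>hel", then "lo</item>" yields "<doc>", then "<item>", then "hello" and "</item>"; the cost is the buffer, which is exactly what a user could avoid by switching partial-content support off.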
Writing a complete XML solution is a lot of work, especially if you want to support all XML technologies (XMLSchema, RelaxNG, XPath, XLink, XInclude, XPointer...)
It is. However, it is something that, I think, can be done very well in steps, i.e. first release supports only pull parsing, second release adds push parsing, third adds a DOM, fourth another technology, etc. As long as you consider all possible technologies when implementing the basic ones, this ought to be feasible. And yes, it's a lot of work. I'm willing to put a lot of work into it.
Maybe it could be interesting to reuse libxml2, which is under the MIT license, to build something on top of it. Of course first we need to weigh the gains of a new C++ implementation.
See my reply to Stefan Seefeld. I think that within Boost, depending on an external library, no matter what license, is a very bad idea. I also think that an implementation intended from the ground up to work with C++ is a better choice. Thank you for your comments. They've given me some ideas.

Hi,

Perhaps you can get some additional ideas from my xml lib. It's not finished yet, but it already works. You can download it here: http://download.nicai-systems.com/xmlpp/xmlpp.tar.gz

I am using my library to transmit data over a tcp stream (wrapped in an std::istream and an std::ostream), to load and store measurement data, and to handle configuration data in xml files. I wrote a class generator which generates derived xml element nodes. The generator itself uses the library, and generates its own source code ;-)

I started developing an xml library for c++ because the existing libraries did not fit my taste. I think they are written for other languages than c++: DOM and SAX are good and nice for java and .net applications. Libxml is too big, and TinyXml supports only the DOM interface...

When I use xml, I can ensure that everything is coded in UTF-8, so it would be possible to work with char based std::streams and std::strings, and if the file were coded in "real Unicode", there would be the possibility to use the wchar based classes.

My design of the library has 3 different layers:

The requirements for the first (tag based) layer were (demo1):
- Compatibility with std::istream and std::ostream
- Extremely lightweight, not a huge xml framework...

The requirements for the second layer were (demo2):
- Access to an element tree, which may contain a whole xml document or a part of it.
- The parser can load the whole document, or a single element with or without its contents.

The requirements for the third layer were (demo3 and xclassgen):
- Access to the elements of the tree by derived classes which can be automatically generated from an xml:schema definition file.

The current limitations of the library are:
- no encoding support, but UTF-8 can be handled with std::string
- no namespaces, but it should be possible to change this...
- the automatic generation of the derived classes needs at the moment a special xml definition file, it does not work with the xml:schema file

Loading an xml document is always done by a pull parser based on an std::istream. Pushing will be handled by using a thread wrapped around the pull parser (so there is no need to use a self-made stack...):

    std::ifstream in("test.xml");
    xml::ixmlstream xin(in);
    while(true) {
        xml::node n = xin.getNextNode();
        if (n==0) break;
        handle(n);
        delete n;
    }

Regards, Nils

On 9/7/06, Nils Springob <nils.springob@nicai-systems.de> wrote:
the link is broken !!

A few thoughts on adding XML to the boost libraries: I'll start with what my issues are with existing alternatives, since I think that indicates why there's a rationale for a boost XML library, and then move on to a few other points specific to boost.

1) My problems with existing C++ DOM/XML libraries:
a) Most are inadequate/incomplete or don't really support DOM (thin DOM-like wrappers around event-based APIs don't really cut it).
b) Those that are complete have very annoying and/or intractable dependencies. Examples: Libxml++ is simply a beast to build (lots of GNOME/Glib dependencies, and it produces compiler errors on MSVC that are left to the implementor to fix); Microsoft's XML implementation is pretty nice but it is tied to the whole .NET/Windows environment.

2) Using another library for the under-the-hood technology seems reasonable, considering the amount of effort involved in getting all of the features of XML working. We currently use Expat in our products, simply because it's easy to incorporate, but of course that wouldn't be adequate for many of the recent features of XML. Libxml strikes me as perhaps the lesser of evils, since it has the needed support, isn't hard to build (at least compared to libxml++) and the binary libraries are easy to obtain.

3) I think DOM support is critical. There are ample C++ libraries and wrappers for doing non-DOM XML manipulations, but there don't seem to be adequate options for good DOM libraries. If the library implementor doesn't tackle the challenge of making a nice C++ DOM class interface, then I'm left with the impression that boost wouldn't really be adding anything new to the world.

Jon Radoff wrote:
3) I think DOM support is critical. There's ample C++ libraries and wrappers for doing non-DOM XML manipulations, but there doesn't seem to be adequate options for good DOM libraries. It seems to me that if the library implementor doesn't tackle the challenge of making a nice C++ DOM class interface, then I'm left with the impression that boost wouldn't really be adding anything new to the world.
Could you detail a bit what you mean by 'DOM' and 'DOM-like' here? We all are probably thinking of some tree structure that can be navigated, queried, etc. However, some have already argued that the DOM API as it exists for Java is inappropriate for C++, or even that the DOM API is already conceptually broken. Thus, I think it might help if we could detail a bit what it should and should not be, and what use cases it should support.
Regards, Stefan
--
...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
Could you detail a bit what you mean by 'DOM' and 'DOM-like' here? We all are probably thinking of some tree structure that can be navigated, queried, etc. However, some have already argued that the DOM API as it exists for Java is inappropriate for C++, or even that the DOM API is already conceptually broken. Thus, I think it might help if we could detail a bit what it should and should not be, and what use cases it should support.
I don't know why the Java "Document" interface wouldn't be an appropriate model to work with. It simply defines an XML document as a hierarchy of nodes, individually representing such objects as elements, character content, comments, and so forth. The Element objects contain useful methods for inspecting attributes and the like. The two most common use cases for a program interacting with XML are (a) easily extracting data from a document and (b) modifying and saving an existing XML document.
I guess I'm not acquainted with the arguments that the DOM model is broken. It's a good model for saving data up to fairly large sizes. Random-access or streamed reading (essentially Expat-type methods) is better if all you want to do is extract data. Are critics of DOM looking for some record-based approach that allows them to lock, modify and save parts of an XML document without the need to load the entire document into memory? If so, I suppose I could see some advantages in that, but it also seems like that sort of functionality could be an extension of the DOM concept rather than an entirely new approach.
A good C++ implementation, in my mind, would attempt to utilize a Document class hierarchy, perhaps based directly upon the object hierarchy presented in the Java Document interface. In C++ we can enhance it by using familiar STL containers for a lot of things. For example, Java provides a getElementsByTagName API (gets me a list of all the elements given a particular name). In C++ it would be nice if the equivalent function gave me a hash_multimap<string,Element> so I could work whatever iteration magic I felt like.
If the DOM class inherited from some intermediary random-access class, the latter might provide an interface for those unwilling to load the whole document into memory, yet also save people the trouble of writing the event handling and function callbacks that are required in a C-style implementation based on Expat.
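To make the class-hierarchy idea a bit more concrete, here is a minimal sketch of a node hierarchy covering use case (a), extracting data. It is only an illustration under stated assumptions, not a proposed Boost interface: the names Node, TextNode, CommentNode, ElementNode, append and text are all invented here, and attributes, namespaces and encodings are left out.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not an existing library's API.
struct Node {
    virtual ~Node() = default;
    // Character data of this node and all of its descendants, in document order.
    virtual std::string text() const = 0;
};

struct TextNode : Node {
    std::string data;
    explicit TextNode(std::string d) : data(std::move(d)) {}
    std::string text() const override { return data; }
};

struct CommentNode : Node {
    std::string data;
    explicit CommentNode(std::string d) : data(std::move(d)) {}
    // Comments contribute no character content.
    std::string text() const override { return std::string(); }
};

struct ElementNode : Node {
    std::string name;
    std::vector<std::unique_ptr<Node>> children;  // kept in document order

    explicit ElementNode(std::string n) : name(std::move(n)) {}

    // Constructs a child of type T in place and returns a reference to it.
    template <class T, class... Args>
    T& append(Args&&... args) {
        std::unique_ptr<T> child(new T(std::forward<Args>(args)...));
        T& ref = *child;
        children.push_back(std::move(child));
        return ref;
    }

    // Concatenates the text of all children, recursing into nested elements.
    std::string text() const override {
        std::string out;
        for (const auto& c : children) out += c->text();
        return out;
    }
};
```

The parent owns its children through std::unique_ptr, so deleting an element releases its whole subtree; whether a real library would want shared or intrusive ownership instead is an open design question.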

Stefan Seefeld wrote:
Could you detail a bit what you mean by 'DOM' and 'DOM-like' here? We all are probably thinking of some tree structure that can be navigated, queried, etc. However, some have already argued that the DOM API as it exists for Java is inappropriate for C++, or even that the DOM API is already conceptually broken.
The XML DOM has a standardized API (http://www.w3.org/DOM/DOMTR). It's designed to be implemented across different languages and mentions C++ in some places. But I don't think this is what you're after.
Jon Radoff wrote:
A good C++ implementation, in my mind, would attempt to utilize a Document class hierarchy, perhaps based directly upon the object hierarchy presented in the Java Document interface. In C++ we can enhance it by using familiar STL containers for a lot of things. For example, Java provides a getElementsByTagName API (gets me a list of all the elements given a particular name). In C++ it would be nice if the equivalent function gave me a hash_multimap<string,Element> so I could work whatever iteration magic I felt like.
A hash_multimap (or unordered_multimap) would be inappropriate. The DOM standard requires that getElementsByTagName return a NodeList, which is live, i.e. it updates dynamically when the XML document is updated. http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929/level-one-core.html#ID-536...
If you're implementing a different interface, then be careful with unordered_multimap: it isn't guaranteed to maintain the order of the elements, and in XML documents the order is important. If the implementation you're using happens to maintain the order, you might find that your code isn't portable.
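For a non-DOM interface, one way to sidestep the ordering problem is to return a plain sequence filled by a pre-order walk. The sketch below assumes a hypothetical Element struct and function names made up for this example. It returns a snapshot, so unlike a DOM NodeList it is not live and will not reflect later changes to the tree.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal element type, invented for this illustration.
struct Element {
    std::string name;
    std::vector<std::unique_ptr<Element>> children;  // document order

    explicit Element(std::string n) : name(std::move(n)) {}

    // Appends a child element and returns a reference to it.
    Element& add(std::string child_name) {
        children.push_back(
            std::unique_ptr<Element>(new Element(std::move(child_name))));
        return *children.back();
    }
};

// Depth-first, pre-order traversal: each match is appended before its own
// descendants are visited, which is exactly document order.
void collect_by_tag(const Element& root, const std::string& tag,
                    std::vector<const Element*>& out) {
    for (const auto& child : root.children) {
        if (child->name == tag) out.push_back(child.get());
        collect_by_tag(*child, tag, out);
    }
}

// Returns all descendants of root named tag, as a snapshot in document order.
std::vector<const Element*> elements_by_tag_name(const Element& root,
                                                 const std::string& tag) {
    std::vector<const Element*> out;
    collect_by_tag(root, tag, out);
    return out;
}
```

A std::vector keeps the iteration freedom of an STL container while making the order guarantee explicit. The returned pointers are only valid as long as the tree is not modified.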

Jon Radoff wrote:
b) Those that are complete have very annoying and/or intractable dependencies. Examples: Libxml++ is simply a beast to build (lots of GNOME/Glib dependencies, and it produces compiler errors on MSVC that are left to the implementor to fix);
That's because libxml++ wanted to use a special string type for Unicode that was available in glibmm (the glib C++ wrapper). You can disable this, though, and handle the UTF-8 through std::string, which removes the dependencies and probably the build problems with older MSVC versions. (Current versions of gtkmm/glibmm are known to support MSVC8 only.) I think using such a string type is a good idea though, but something available as a self-supporting library would be better.
3) I think DOM support is critical. There's ample C++ libraries and wrappers for doing non-DOM XML manipulations, but there doesn't seem to be adequate options for good DOM libraries. It seems to me that if the library implementor doesn't tackle the challenge of making a nice C++ DOM class interface, then I'm left with the impression that boost wouldn't really be adding anything new to the world.
While DOM is memory-consuming, it's still the simplest way I can think of to edit an existing XML document. Moreover, it is the standard API. So of course supporting it is very important.

loufoque wrote:
I think using such a string type is a good idea though, but something available as a self-supporting library would be better.
As I suggested earlier, I think the best approach is to parametrize the XML library for the (Unicode) string type, as dealing with the content and dealing with the structure are mostly orthogonal aspects of the overall functionality.
Regards, Stefan
--
...ich hab' noch einen Koffer in Berlin...
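A rough sketch of what parametrizing on the string type could look like follows. Everything in it (basic_attribute, basic_element, set_attribute, find_attribute and the two aliases) is a name assumed for illustration, in the spirit of std::basic_string. The only requirements placed on String here are that it is copyable/movable and comparable with ==.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: structure code is written once, for any string type.
template <typename String>
struct basic_attribute {
    String name;
    String value;
};

template <typename String>
class basic_element {
public:
    explicit basic_element(String name) : name_(std::move(name)) {}

    const String& name() const { return name_; }

    // Overwrites the value if the attribute exists, otherwise appends it.
    void set_attribute(String name, String value) {
        for (auto& a : attributes_) {
            if (a.name == name) {
                a.value = std::move(value);
                return;
            }
        }
        attributes_.push_back(
            basic_attribute<String>{std::move(name), std::move(value)});
    }

    // Returns a pointer to the attribute value, or nullptr if it is absent.
    const String* find_attribute(const String& name) const {
        for (const auto& a : attributes_) {
            if (a.name == name) return &a.value;
        }
        return nullptr;
    }

private:
    String name_;
    std::vector<basic_attribute<String>> attributes_;
};

// Two possible instantiations; the choice of encoding is left to the user.
using element = basic_element<std::string>;        // e.g. UTF-8 bytes
using u16element = basic_element<std::u16string>;  // e.g. UTF-16 code units
```

The element code never inspects individual characters, which is the orthogonality being argued for: only the parser and serializer would need to know about the encoding.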

We use libxml2 with our own C++ wrappers (for SAX and XSLT mainly). The wrappers are incomplete (they only cover what we use) and they are not documented very well, so I don't think it would help much to publish them here. Instead I'll try to summarize what we would like that libxml2 does not provide.
We have some code that parses using the push parser within async callbacks (ISAPI). The only problem with this is that we can't easily perform async actions in the libxml2 SAX callbacks. Making an async call from a SAX callback is not possible without adding a queue. That is not very nice, and it would not work well with SAX functions that return objects (I think libxml2 has some of these; though I have not used them, they look like just the kinds of things you would want to use asio for). For that reason I think the ideal low-level interface would have the option of async callbacks. I think data should be pushed to it with an async write, to give the caller control over the buffer size (and to avoid unnecessary copying).
I don't think it is a good idea to focus on input alone. I think a good interface could be used for both input and output of XML. For instance, we have implemented a SAX handler that writes to a std::ostream, and we use that as the lowest level of our XML output system. This is useful because it means we can easily use SAX for in-process communication and an XML stream for inter-process communication (avoiding writing and parsing XML when possible).
I think one good test case for a boost.xml would be a boost.asio application that reads XML from a socket and writes it to an XML parser. This would cause SAX-like async events, and these events would write the XML back to the socket, also using asio. A kind of XML echo.
I hope this helps,
Hamish

"Sebastian Redl" <sebastian.redl@getdesigned.at> writes:
A few months ago I said I'd take on writing an XML library for Boost. Well, I've finally got some time on my hand and started with a little bit of brainstorming about the library.
I have a half-finished validating push parser on SourceForge, at http://sourceforge.net/projects/axemill
It needs a bit of work, because it was built using an older version of Boost and the API has changed in the meantime, but it was working: it could parse the XHTML DTD and validate XHTML docs. I'll try to find the time to update it so it works again.
Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk
participants (13)
- Anthony Williams
- Damien Fisher
- Daniel James
- David Abrahams
- Hamish Mackenzie
- Jeff Garland
- Jon Radoff
- Jose
- loufoque
- Nils Springob
- Robert Ramey
- Sebastian Redl
- Stefan Seefeld