XML APIs: use cases, and wishes

Hi there,

Stefan Seefeld recently presented his set of DOM classes. Personally I do not have any need for a DOM API. To focus the discussion on other APIs I started this new thread (and also because I could not find a good entry point in the previous discussions).

Last time I had to process XML I wrote my own wrapper on top of SAX, which forwarded higher-level events to XML-unaware structures. The model served well to directly map a fixed XML format to C++ structures (XML writing was also supported). The wrapper that forwarded all SAX events to the data structures was generated from a single C++ expression, or several if recursion was required. I adopted a lot of Spirit's techniques to make it look nice. There were no intermediate data structures.

In retrospect I also dislike my attempt, although it was better than working on XML document classes. It only worked with full XML documents; no partial parsing was supported. Furthermore, it required a nearly direct mapping of XML elements to data structures; use cases that do not involve an XML<->C++ mapping weren't considered.

I currently do not need to fiddle with XML, but I would like to see at least these different APIs:

* on-demand parsing: a parser driven by a cursor that allows navigating through a document without loading it completely (I don't see a need for prior validation here)
* an XPath / query-like API to quickly grep for certain subtrees or elements of the XML
* a direct mapper of C++ structures to a certain format, so a kind of XML serialization

As you can see, my view on XML is very limited; it boils down to a flexible form of reading XML.

Regards,
Andreas Pokorny
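To make the first wish concrete, here is a minimal sketch of what a cursor-driven (pull) interface could look like. Everything in it is hypothetical: the names XmlCursor, Event, skip_element and first_child_text are invented for illustration, it understands only plain elements and text (no attributes, comments, CDATA, entities or self-closing tags), and it keeps the whole input in a string, whereas a real implementation would presumably read from a stream in chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical pull-style cursor over a tiny XML subset (elements and text only).
enum class Event { StartElement, EndElement, Text, EndOfInput };

class XmlCursor {
public:
    explicit XmlCursor(std::string input) : in_(std::move(input)) {}

    // Advance to the next event; nothing is tokenized until the caller asks.
    Event next() {
        if (pos_ >= in_.size()) return Event::EndOfInput;
        if (in_[pos_] == '<') {
            const std::size_t close = in_.find('>', pos_);
            if (close == std::string::npos) {          // malformed input: give up
                pos_ = in_.size();
                return Event::EndOfInput;
            }
            const bool is_end = in_.compare(pos_, 2, "</") == 0;
            const std::size_t start = pos_ + (is_end ? 2 : 1);
            value_ = in_.substr(start, close - start); // tag name
            pos_ = close + 1;
            return is_end ? Event::EndElement : Event::StartElement;
        }
        std::size_t lt = in_.find('<', pos_);
        if (lt == std::string::npos) lt = in_.size();
        value_ = in_.substr(pos_, lt - pos_);          // character data
        pos_ = lt;
        return Event::Text;
    }

    // Skip the remainder of the element whose start tag was just returned.
    void skip_element() {
        int depth = 1;
        while (depth > 0) {
            const Event e = next();
            if (e == Event::StartElement) ++depth;
            else if (e == Event::EndElement) --depth;
            else if (e == Event::EndOfInput) return;
        }
    }

    // Tag name or text of the most recent event.
    const std::string& value() const { return value_; }

private:
    std::string in_;
    std::size_t pos_ = 0;
    std::string value_;
};

// Return the text of the first child of the root element named `wanted`,
// skipping every other child subtree without interpreting its contents.
std::string first_child_text(const std::string& doc, const std::string& wanted) {
    XmlCursor cur(doc);
    if (cur.next() != Event::StartElement) return {};  // expect the root start tag
    for (Event e = cur.next();
         e != Event::EndOfInput && e != Event::EndElement;
         e = cur.next()) {
        if (e != Event::StartElement) continue;        // text between children
        if (cur.value() == wanted)
            return cur.next() == Event::Text ? cur.value() : std::string();
        cur.skip_element();
    }
    return {};
}

void example() {
    const std::string doc = "<cfg><a><b>x</b></a><port>8080</port></cfg>";
    assert(first_child_text(doc, "port") == "8080");
}
```

The point of the design is that the caller, not the parser, decides how far to read: first_child_text stops as soon as it has found its element and never looks at the rest of the document.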

I've only looked at some of the posts on this thread so I haven't followed everything there. I'm curious as to why spirit - which includes an XML parser - has not been mentioned. I used it to great effect with the serialization library. Yes, it is an effort to learn to use spirit - and I only learned the bare minimum to get it to work for the serialization library. But it was a lot better than spending (a lot more) time writing XML parsing from scratch. Even more importantly, it's much, much easier to maintain as it's basically data-driven code.

It seems that lots of the functionality being discussed is available "right out of the box" with spirit and the included XML examples. Want a DOM-style parser? I believe that something similar is available "out of the box" with the included parse tree generator. Want a SAX-style parser? Specify your own action routines to be called as tokens are recognized.
but I would like to see at least these different APIs: * on-demand parsing: a parser driven by a cursor that allows navigating through a document without loading it completely (I don't see a need for prior validation here)
The serialization library does exactly this with the spirit xml parser. ...
* direct mapper of c++ structures to a certain format, so a kind of xml serialization,
I'm not sure how this differs from the XML serialization already in the serialization library. I'm not exactly sure what you have in mind, but I can see the need for a program which reads an XML schema and generates C++ data structures which can be navigated with previously compiled code modules. I think this would be fairly easily achieved using spirit XML parsing.

As you can see, I think spirit is underrated. It IS hard to learn - and I'm in no way an expert - but I have managed to find it very useful.

Robert Ramey
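The "action routines" idea mentioned above can be illustrated without Spirit itself. The following is a hand-rolled sketch, not Spirit code and not what the serialization library does: Handlers, parse and read_data are hypothetical names, and the driver understands only plain elements and text. It is meant only to show the push-style shape, where the parser owns the loop and calls back into user code as each token is recognized.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical handler set: one action routine per recognized token kind.
struct Handlers {
    std::function<void(const std::string&)> start_element;
    std::function<void(const std::string&)> end_element;
    std::function<void(const std::string&)> text;
};

// Push-style driver over a tiny XML subset (elements and text only).
// The parser walks the whole input and invokes the actions as it goes.
void parse(const std::string& in, const Handlers& h) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in[pos] == '<') {
            const std::size_t close = in.find('>', pos);
            if (close == std::string::npos) return;    // malformed input: stop
            const bool is_end = in.compare(pos, 2, "</") == 0;
            const std::size_t start = pos + (is_end ? 2 : 1);
            const std::string name = in.substr(start, close - start);
            if (is_end) {
                if (h.end_element) h.end_element(name);
            } else {
                if (h.start_element) h.start_element(name);
            }
            pos = close + 1;
        } else {
            std::size_t lt = in.find('<', pos);
            if (lt == std::string::npos) lt = in.size();
            if (h.text) h.text(in.substr(pos, lt - pos));
            pos = lt;
        }
    }
}

// Collect the integer content of every <data> element into a plain vector;
// the target structure knows nothing about XML.
std::vector<int> read_data(const std::string& xml) {
    std::vector<int> out;
    bool in_data = false;
    Handlers h;
    h.start_element = [&](const std::string& n) { in_data = (n == "data"); };
    h.end_element   = [&](const std::string&)   { in_data = false; };
    h.text          = [&](const std::string& t) { if (in_data) out.push_back(std::stoi(t)); };
    parse(xml, h);
    return out;
}

void example() {
    const std::vector<int> v =
        read_data("<some_node><data>1</data><data>41</data></some_node>");
    assert((v == std::vector<int>{1, 41}));
}
```

Note the contrast with a cursor-driven interface: here parse() runs to the end of the input on its own, and the handlers can only react to what they are given.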

On Mon, Nov 07, 2005 at 09:46:24AM -0800, Robert Ramey <ramey@rrsd.com> wrote:
but I would like to see at least these different APIs: * on-demand parsing: a parser driven by a cursor that allows navigating through a document without loading it completely (I don't see a need for prior validation here)
The serialization library does exactly this with the spirit xml parser. ...
I always assumed that spirit has full control over the parsing process, so the parse() function itself is the driving force that walks towards the end of the input. The above really was about jumping through the file (provided that the file is well-formed XML) and only examining the chunks around the interesting data fields. Maybe I am a bit too optimistic about parsing XML files :).
* direct mapper of c++ structures to a certain format, so a kind of xml serialization,
I'm not sure how this differs from the xml serialization already in the serialization library.
The boost::serialization archives try to encode C++ objects in data streams, which might be XML documents as well. So the archive defines the format, and adds sufficient meta information to be able to recreate the objects with the aid of the metadata found in the serialize functions. I was talking about a use case in which a certain known XML format has to be mapped onto C++ structures. Thus the C++ structures were already designed to represent the model described by that XML format. Still, the binding between the format and the structures ought to be separate. The purpose of the archive is to provide persistence for the objects, while the purpose of the binding library is to provide a high-level document parsing system.

I really considered boost::serialization for that binding library, but I found the NVP system not flexible enough to represent all XML possibilities. As far as I understood, every complex type gets converted into an XML node, while every primitive leaf type gets converted into an attribute. That restriction is too strict to represent the full XML format space. E.g. the model

    struct some_node { std::vector<int> data; };

could be encoded like this:

    <some_node>
      <data value="1"/>
      <data value="41"/>
      <data value="2"/>
      <data value="9"/>
    </some_node>

or:

    <some_node>
      <data>1</data>
      <data>41</data>
      <data>2</data>
      <data>9</data>
    </some_node>

Both variants are representable in the binding library I talked about. To extend the binding idea one might add the possibility to map only a certain view of the XML format onto these C++ structures.
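The two encodings can be produced from the same structure if the choice of leaf representation is kept outside the data type. The sketch below is only an illustration of that separation, not the binding library described above: LeafStyle and to_xml are hypothetical names, and the output is written without whitespace between elements.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct some_node { std::vector<int> data; };

// Hypothetical binding option: how a primitive leaf is written.
enum class LeafStyle { Attribute, ElementText };

// Write the same C++ structure in either of the two XML encodings.
std::string to_xml(const some_node& n, LeafStyle style) {
    std::ostringstream out;
    out << "<some_node>";
    for (int v : n.data) {
        if (style == LeafStyle::Attribute)
            out << "<data value=\"" << v << "\"/>";
        else
            out << "<data>" << v << "</data>";
    }
    out << "</some_node>";
    return out.str();
}

void example() {
    some_node n;
    n.data = {1, 41};
    assert(to_xml(n, LeafStyle::Attribute) ==
           "<some_node><data value=\"1\"/><data value=\"41\"/></some_node>");
    assert(to_xml(n, LeafStyle::ElementText) ==
           "<some_node><data>1</data><data>41</data></some_node>");
}
```

The structure itself stays unchanged; only the binding decides whether a leaf becomes an attribute or element content.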
I'm not exactly sure what you have in mind, but I can see the need for a program which reads an XML schema and generates C++ data structures which can be navigated with previously compiled code modules. I think this would be fairly easily achieved using spirit XML parsing.
As you can see, I think spirit is underrated. It IS hard to learn - and I'm in no way an expert but I have managed to find it very useful.
Agreed. My binding library used libxml2 and later expat for parsing, but I initially planned to use spirit as the backend.

Regards,
Andreas Pokorny

I really considered boost::serialization for that binding library, but I found the NVP system not flexible enough to represent all XML possibilities. As far as I understood, every complex type gets converted into an XML node, while every primitive leaf type gets converted into an attribute. That restriction is too strict to represent the full XML format space. E.g. the model

    struct some_node { std::vector<int> data; };

could be encoded like this:

    <some_node>
      <data value="1"/>
      <data value="41"/>
      <data value="2"/>
      <data value="9"/>
    </some_node>

or:

    <some_node>
      <data>1</data>
      <data>41</data>
      <data>2</data>
      <data>9</data>
    </some_node>
That's not quite correct. In the current xml_archives, all primitives are tagged as well. The only things that are attributes are the "extra" things that the serialization library needs to reconstruct the C++ structures (class_id, object_id, index, etc.). In the course of implementing this, it became apparent to me that there were lots of ways it could be done and still be valid XML. I just decided more or less arbitrarily what should go as attributes and what should go as tagged data. Someone else might want it done differently, in which case he would have to make his own implementation of xml_archive.
To extend the binding idea one might add the possibility to map only a certain view of the XML format onto these C++ structures.
I'm NOT exactly sure what you have in mind, but I can see the need for a program which reads an XML schema and generates C++ data structures which can be navigated with previously compiled code modules. I think this would be fairly easily achieved using spirit XML parsing.
Agreed. My binding library used libxml2 and later expat for parsing, but I initially planned to use spirit as the backend.
Robert Ramey

On Tue, Nov 08, 2005 at 12:09:03AM +0100, Andreas Pokorny wrote:
On Mon, Nov 07, 2005 at 09:46:24AM -0800, Robert Ramey <ramey@rrsd.com> wrote:
but I would like to see at least these different APIs: * on-demand parsing: a parser driven by a cursor that allows navigating through a document without loading it completely (I don't see a need for prior validation here)
The serialization library does exactly this with the spirit xml parser. ...
I always assumed that spirit has full control over the parsing process, so the parse() function itself is the driving force that walks towards the end of the input. The above really was about jumping through the file (provided that the file is well-formed XML) and only examining the chunks around the interesting data fields. Maybe I am a bit too optimistic about parsing XML files :).
* direct mapper of c++ structures to a certain format, so a kind of xml serialization,
I'm not sure how this differs from the xml serialization already in the serialization library.
The boost::serialization archives try to encode C++ objects in data streams, which might be XML documents as well. So the archive defines the format, and adds sufficient meta information to be able to recreate the objects with the aid of the metadata found in the serialize functions.
I was talking about a use case in which a certain known XML format has to be mapped onto C++ structures. Thus the C++ structures were already designed to represent the model described by that XML format. Still, the binding between the format and the structures ought to be separate.
Another case is where you have an existing XML vocabulary and you want to create a C++ representation of it for data manipulation. There are a lot of tools that allow you to generate class hierarchies from XML Schema or similar, for example xsd (http://codesynthesis.com/products/xsd/). I don't really think a Boost library should stray into this area, but it should probably be considered as a use case for an XML API.

cheers,
Graham
--
Graham Bennett

Robert Ramey wrote:
I've only looked at some of the posts on this thread so I haven't followed everything there.
I'm curious as to why spirit - which includes an XML parser - has not been mentioned.
Probably because the focus is more on a DOM API, i.e. an in-memory representation of an XML infoset, and ways to manipulate that. How to actually construct the document is secondary, and only relevant because parsers are typically an integral part of XML (DOM) libraries.

It was suggested a number of times to use BGL to represent the DOM, and to build it with spirit. I think whatever API we settle on, such a configuration should definitely work, though I doubt it will be worth the effort, given how many domain-specific additions are required, for example when implementing XPath support, or the HTTP support required to fetch DTDs, XIncluded documents, etc.

Regards,
Stefan

Andreas, Andreas Pokorny <andreas.pokorny@gmx.de> writes:
Last time I had to process XML I wrote my own wrapper on top of SAX, which forwarded higher-level events to XML-unaware structures. The model served well to directly map a fixed XML format to C++ structures (XML writing was also supported). The wrapper that forwarded all SAX events to the data structures was generated from a single C++ expression, or several if recursion was required. I adopted a lot of Spirit's techniques to make it look nice. There were no intermediate data structures.
In retrospect I also dislike my attempt, although it was better than working on XML document classes. It only worked with full XML documents; no partial parsing was supported. Furthermore, it required a nearly direct mapping of XML elements to data structures; use cases that do not involve an XML<->C++ mapping weren't considered.
We tried to solve this exact problem with the C++/Parser mapping for XML Schema. The basic idea boils down to generating parser templates for data types defined in XML Schema. Using these parser templates you can build your own in-memory representations or perform immediate processing of XML instance documents. The following document has a quick introduction to the mapping: http://codesynthesis.com/projects/xsd/documentation/cxx/parser/quick-guide/ hth, -boris
participants (6): Andreas Pokorny, Boris Kolpackov, Graham Bennett, Robert Ramey, Stefan Seefeld, Steinar Bang