[ANN] libstudxml - modern XML API for C++
Hi,

libstudxml is an XML library for modern, standard C++. It has an API that I believe should have already been in Boost or even in the C++ standard library. The API was first presented at the C++Now 2014 conference. Based on the positive feedback and encouragement I received during the talk, I've decided to make the implementation generally available.

As an example, we can parse this XML:

  <person id="123">
    <name>John Doe</name>
    <age>23</age>
    <gender>male</gender>
  </person>

With the following C++ code, which performs all the validation necessary for this XML vocabulary:

  enum class gender {...};

  ifstream ifs (argv[1]);
  parser p (ifs, argv[1]);

  p.next_expect (parser::start_element, "person", content::complex);

  long id = p.attribute<long> ("id");

  string n = p.element ("name");
  short a = p.element<short> ("age");
  gender g = p.element<gender> ("gender");

  p.next_expect (parser::end_element); // person

The API has the following interesting features:

  * Streaming pull parser and streaming serializer
  * Two-level API: minimum overhead low-level & more convenient high-level
  * Content model-aware (empty, simple, complex, mixed)
  * Whitespace processing based on content model
  * Validation based on content model
  * Validation of missing/extra attributes
  * Validation of unexpected events (elements, etc)
  * Data extraction to value types
  * Attribute map with extended lifetime (high-level API)

libstudxml is compact, external dependency-free, and reasonably efficient. The XML parser is a conforming, non-validating XML 1.0 implementation that is based on tested and proven code. The library is released under the MIT license.

More information, documentation, and source code are available from:

  http://www.codesynthesis.com/projects/libstudxml/

Or, you can jump directly to the API description with examples:

  http://www.codesynthesis.com/projects/libstudxml/doc/intro.xhtml#2

Enjoy,
Boris
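The feature list above mentions a streaming serializer, but the announcement only shows the parsing side. The following is a rough sketch of what serializing the same vocabulary might look like; the serializer class, header, and member names (start_element, attribute, element, end_element) are assumed to mirror the parser API shown above, so check the library documentation for the exact interface:

  #include <iostream>

  #include <xml/serializer> // assumed header, matching the parser examples

  using namespace std;
  using namespace xml;

  int main ()
  {
    serializer s (cout, "output");

    s.start_element ("person");
    s.attribute ("id", 123);          // value formatted to text by the library (assumed)
    s.element ("name", "John Doe");
    s.element ("age", 23);
    s.element ("gender", "male");
    s.end_element ();                 // person
  }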
On 2014-05-21 10:53, Boris Kolpackov wrote:
Hi,
libstudxml is an XML library for modern, standard C++. It has an API that I believe should have already been in Boost or even in the C++ standard library.
Hi Boris, very interesting ! There is an XML library in the boost sandbox (now https://github.com/stefanseefeld/boost.xml), which has been around for many years.
The API was first presented at the C++Now 2014 conference. Based on the positive feedback and encouragement I received during the talk, I've decided to make the implementation generally available.
As an example, we can parse this XML:
<person id="123">
  <name>John Doe</name>
  <age>23</age>
  <gender>male</gender>
</person>
With the following C++ code, which performs all the validation necessary for this XML vocabulary:
enum class gender {...};
ifstream ifs (argv[1]);
parser p (ifs, argv[1]);
p.next_expect (parser::start_element, "person", content::complex);
long id = p.attribute<long> ("id");
string n = p.element ("name");
short a = p.element<short> ("age");
gender g = p.element<gender> ("gender");
p.next_expect (parser::end_element); // person
The API has the following interesting features:
* Streaming pull parser and streaming serializer
* Two-level API: minimum overhead low-level & more convenient high-level
* Content model-aware (empty, simple, complex, mixed)
* Whitespace processing based on content model
* Validation based on content model
* Validation of missing/extra attributes
* Validation of unexpected events (elements, etc)
* Data extraction to value types
* Attribute map with extended lifetime (high-level API)
libstudxml is compact, external dependency-free, and reasonably efficient. The XML parser is a conforming, non-validating XML 1.0 implementation that is based on tested and proven code. The library is released under the MIT license.
Does it support a DOM-like API, i.e. an in-memory representation of the document ?

I have always strongly argued against the idea that an "XML API" was only about parsing XML data, as there are many useful features that involve manipulation of XML data (including transformations between documents, xpath-based search, etc.). I have also argued that re-implementing all that functionality from scratch is foolish with so many existing implementations, so any boost.xml project should focus on wrapping such implementations, rather than reinventing them. In fact, I believe such an API should be robust enough to be able to wrap different backends, rather than depending on a particular implementation choice.

As it happens, we have a project working on those features as part of this year's Google Summer of Code program. It would be great if we could collaborate towards a shared goal to define a feature-rich and robust Boost.XML library.
More information, documentation, and source code are available from:
http://www.codesynthesis.com/projects/libstudxml/
Or, you can jump directly to the API description with examples:
http://www.codesynthesis.com/projects/libstudxml/doc/intro.xhtml#2
Thanks for sharing !

Stefan

--
...ich hab' noch einen Koffer in Berlin...
Hi,
On Wed, May 21, 2014 at 11:15 AM, Stefan Seefeld
I have always strongly argued against the idea that an "XML API" was only about parsing XML data, as there are many useful features that involve manipulation of XML data (including transformations between documents, xpath-based search, etc.). I have also argued that re-implementing all that functionality from scratch is foolish with so many existing implementations, so any boost.xml project should focus on wrapping such implementations, rather than reinventing them. In fact, I believe such an API should be robust enough to be able to wrap different backends, rather than depending on a particular implementation choice.
I agree with that sentiment completely. Different implementations have different strengths and weaknesses. I generally choose pugixml, but I understand Xerces is also very popular.

Boost.Multiprecision is a good example of this kind of library, but, like Boost.Multiprecision, I'd say there should be a "default" implementation included with the Boost.XML library for the case where the user doesn't want to add another dependency besides Boost. I guess there are also wrapper-type libraries like Boost.MPI, which don't really lend themselves to a default implementation, but I'd argue that XML is standardized well enough that a default implementation is appropriate.

Just thought I'd throw in my 0.02.

-g
On 05/21/2014 08:15 AM, Stefan Seefeld wrote:
I have also argued that re-implementing all that functionality from scratch is foolish with so many existing implementations, so any boost.xml project should focus on wrapping such implementations, rather than reinventing them. In fact, I believe such an API should be robust enough to be able to wrap different backends, rather than depending on a particular implementation choice.
I personally like the idea of having a small, fast, lightweight XML parser with no external dependencies under the Boost Software License.

Eric
Hi Eric,
Eric Niebler
I personally like the idea of having a small, fast, lightweight XML parser with no external dependencies under the Boost Software License.
Just to clarify, libstudxml consists of three parts: the interface implementation that I wrote (MIT) and two other components that are not mine: the Expat XML parser (MIT) and the Genx XML serializer (MIT). The Expat and Genx source code is included in libstudxml as an implementation detail (see Implementation Notes[1] for more information). So the whole library is under the MIT license.

Should libstudxml end up in Boost, I have no problem changing the license for the bits I wrote to the Boost License. However, we won't be able to change the license for Expat or Genx. MIT is probably even more permissive than the Boost License, but whether this is a potential problem I don't know.

[1] http://codesynthesis.com/projects/libstudxml/doc/intro.xhtml#6

Boris
----- Original Message -----
From: "Eric Niebler"
I personally like the idea of having a small, fast, lightweight XML parser with no external dependencies under the Boost Software License.
I would also--for both XML 1.0 and the less often used XML 1.1.

On top of that, I would also like to see modern C++ implementations of *current* versions of XInclude (1.0 Second Edition + the XPath-based xpointer() scheme), XPath (3.0, 3.1 forthcoming), XML Schema (1.1), XSLT (schema-aware 2.0, streaming 3.0 forthcoming), XQuery (3.0, 3.1 forthcoming), XProc (1.0 + a few of the obvious extensions), Relax NG, ISO Schematron, XML Catalogs, etc. There are actually *very few* implementations of these things at current versions, much less the 3.0 versions of XPath, XQuery, and (forthcoming) XSLT.

Regards,
Paul Mensonides
Hi Stefan,

In gmane.comp.lib.boost.devel you write:
Does it support a DOM-like API, i.e. an in-memory representation of the document ?
No, it does not. I spent quite a bit of time on the in-memory vs streaming debate in my talk. How I wish the video was already available... Until then, to summarize the key points:

* Most people think they need DOM. I believe it is not because in-memory is conceptually better but because of the really awful and inconvenient streaming APIs (like SAX). So I tried to convince the audience that a well designed streaming pull API is actually sufficient for the majority of cases. I didn't hear many objections. Take a look at the API Introduction[1], it shows how to handle everything from converters/filters that don't care about the data, to applications that process the data without creating any kind of in-memory object model, to C++ classes that know how to persist themselves in XML.

* On that last point (C++ class persistence), a lot of applications extract XML data into some kind of object model (C++ classes that correspond to the XML vocabulary). Creating an intermediate representation of XML (DOM) just to throw it away moments later seems kind of pointless.

* Of course there will always be applications that need to revisit the bulk of raw XML data and for them in-memory would probably always be a better choice.

* Which brings us to this point: it is easy to go from streaming to in-memory but not the other way around.

* In fact, an even better approach would be to support hybrid, partially streaming/partially in-memory parsing and serialization (also discussed in the talk). Then the fully in-memory model would simply be a special case.

* libstudxml has the 'hybrid' example which shows how to implement this hybrid approach (a sketch of the general idea follows below). You would be shocked how short and simple the code is (I know I was once I wrote it ;-)).

[1] http://www.codesynthesis.com/projects/libstudxml/doc/intro.xhtml#2
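The following is not libstudxml's actual 'hybrid' example, just a minimal sketch of the idea: stream through the document with a pull parser and build a small in-memory tree only for the subtree of interest. The header and the parser members used (next, next_expect, name, value, and the event names) follow the announcement's API; treat them as assumptions and consult the documentation for exact signatures. Attribute handling is omitted for brevity.

  #include <string>
  #include <vector>
  #include <fstream>

  #include <xml/parser> // assumed header, matching the announcement's examples

  // A tiny in-memory element node, built only for the subtree of interest.
  struct element
  {
    std::string name;
    std::string text;
    std::vector<element> children;
  };

  // Materialize the element the parser is currently positioned on (a
  // start_element event), consuming events up to and including the
  // matching end_element.
  element
  load_element (xml::parser& p)
  {
    element e;
    e.name = p.name ();

    for (xml::parser::event_type ev (p.next ());
         ev != xml::parser::end_element;
         ev = p.next ())
    {
      if (ev == xml::parser::start_element)
        e.children.push_back (load_element (p)); // go in-memory on demand
      else if (ev == xml::parser::characters)
        e.text += p.value ();
    }

    return e;
  }

  int main (int argc, char* argv[])
  {
    std::ifstream ifs (argv[1]);
    xml::parser p (ifs, argv[1]);

    // Stream until the subtree we care about, then load just that subtree.
    p.next_expect (xml::parser::start_element, "person");
    element person (load_element (p));
  }

The surrounding code keeps streaming; load_element() is only called for the pieces worth keeping in memory, which is what makes the fully in-memory model a special case of this approach.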
I have always strongly argued against the idea that an "XML API" was only about parsing XML data, as there are many useful features that involve manipulation of XML data (including transformations between documents, xpath-based search, etc.).
You need to start somewhere. And support for (relatively) low-level XML parsing and serialization seems like a good place.
In fact, I believe such an API should be robust enough to be able to wrap different backends, rather than depending on a particular implementation choice.
I don't think it will be robust. I think it will be awful and inconvenient. Try adapting a straight SAX API to anything other than callback-based with inversion of control (i.e., SAX again).

Boris
On 2014-05-21 13:57, Boris Kolpackov wrote:
Hi Stefan,
In gmane.comp.lib.boost.devel you write:
Does it support a DOM-like API, i.e. an in-memory representation of the document ?

No, it does not. I spent quite a bit of time on the in-memory vs streaming debate in my talk. How I wish the video was already available...
Let me know when it is, I'm looking forward to hearing your arguments. :-)
Until then, to summarize the key points:
* Most people think they need DOM. I believe it is not because in-memory is conceptually better but because of the really awful and inconvenient streaming APIs (like SAX). So I tried to convince the audience that a well designed streaming pull API is actually sufficient for the majority of cases. I didn't hear many objections.
Take a look at the API Introduction[1], it shows how to handle everything from converters/filters that don't care about the data, to applications that process the data without creating any kind of in-memory object model, to C++ classes that know how to persist themselves in XML.
* On that last point (C++ class persistence), a lot of applications extract XML data into some kind of object model (C++ classes that correspond to the XML vocabulary). Creating an intermediate representation of XML (DOM) just to throw it away moments later seems kind of pointless.
* Of course there will always be applications that need to revisit the bulk of raw XML data and for them in-memory would probably always be a better choice.
Right. I can agree with you that a good API over SAX (or reader) could be better than DOM in certain cases, but not all. Just think of someone wanting to write an XML editor (e.g., to edit XHTML or DocBook documents), with support for standard XML features such as xinclude, xpath-based search, perhaps even xslt-based transformations. Again, I'm definitely not suggesting everyone needs those features, but there has to be a place where these can be added in boost.xml.
* Which brings us to this point: it is easy to go from streaming to in-memory but not the other way around.
Yes, of course, a DOM API can be implemented on top of a streaming API. But you are pushing us down the road of yet another implementation of XML, which I strongly object to. I'm not against anyone re-implementing an XML library. But as I said, I don't think Boost.XML should mandate a new implementation with so many existing choices. There just is no point in such an exercise, other than self-education.
* In fact, an even better approach would be to support hybrid, partially streaming/partially in-memory parsing and serialization (also discussed in the talk). Then, the fully in-memory would simply be a special case.
* libstudxml has the ‘hybrid’ example which shows how to implement this hybrid approach. You would be shocked how short and simple the code is (I know I was once I wrote it ;-)).
[1] http://www.codesynthesis.com/projects/libstudxml/doc/intro.xhtml#2
Again, I'm resisting getting dragged into a discussion about implementation A vs. implementation B. I don't want to argue about that. I'm arguing for a Boost.XML API that supports multiple choices of backends.

This is mostly a maintainability question. XML is a complex standard, with occasional updates and new feature additions. Just adding a few new wrappers around existing implementations is far easier than having to re-implement things just because of a bad design decision when Boost.XML first came into being...
I have always strongly argued against the idea that an "XML API" was only about parsing XML data, as there are many useful features that involve manipulation of XML data (including transformations between documents, xpath-based search, etc.).

You need to start somewhere. And support for (relatively) low-level XML parsing and serialization seems like a good place.
In fact, I believe such an API should be robust enough to be able to wrap different backends, rather than depending on a particular implementation choice.

I don't think it will be robust. I think it will be awful and inconvenient. Try to adapt straight SAX API to anything other than callback-based with inversion of control (i.e., SAX again).
Have you looked at existing XML libraries before you started libstudxml ? Did you know about Boost.XML, or Arabica ?

Anyhow, I'm not trying to convince you that you should change anything. I'm trying to show you what a thin wrapper can look like.

Stefan

--
...ich hab' noch einen Koffer in Berlin...
Hi Stefan,
Stefan Seefeld
Just think of someone wanting to write an XML editor...
Quoting my presentation verbatim: "If you are writing an XML editor, your needs are not common".
But you are pushing down the road of yet another implementation of XML, which I strongly object to.
Actually, libstudxml is based on an existing XML parser (Expat) and even an existing XML serializer (Genx). See Implementation Notes[1] for details.
Have you looked at existing XML libraries before you started libstudxml?
Oh, yes, I've been looking at existing C++ XML libraries for the past 10 years. Let's be honest, they are all pretty bad.
Did you know about Boost.XML
Strictly speaking there is no Boost.XML. There are periodic attempts to have one but they seldom go past the SAX vs DOM debate.
or Arabica ?
Same old SAX and DOM.

I think this debate can go on forever so let's just agree that some people prefer DOM, some are pretty happy with SAX, and some would prefer something more convenient and light-weight like a streaming pull parser. I am with the last group.

[1] http://codesynthesis.com/projects/libstudxml/doc/intro.xhtml#6

Boris
On 2014-05-21 16:06, Boris Kolpackov wrote:
I think this debate can go on forever so let's just agree that some people prefer DOM, some are pretty happy with SAX, and some would prefer something more convenient and light-weight like a streaming pull parser. I am with the last group.
Yes, that's all fine. What I have been trying to say is that a Boost.XML API will have to be flexible enough to support all of the above.

(Note that my sandbox Boost.XML project (which you rightly pointed out as not being a Boost library !) supports a "reader" API, which I think is very much in line with your streaming pull parser. So it should be possible to use libstudxml as a backend for it.)

Stefan

--
...ich hab' noch einen Koffer in Berlin...
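To make the "libstudxml as a backend" idea a bit more concrete, here is a rough sketch of adapting a pull parser to a small reader-style interface. The reader class below is entirely hypothetical; it is not the sandbox Boost.XML reader API (whose shape isn't shown in this thread), and the libstudxml calls follow the announcement's parser API as an assumption:

  #include <string>
  #include <istream>

  #include <xml/parser> // assumed libstudxml header

  // A hypothetical, minimal reader-style interface layered over a pull parser.
  class reader
  {
  public:
    enum class node { element_start, element_end, text, end };

    reader (std::istream& is, const std::string& name)
      : p_ (is, name) {}

    // Advance to the next node and report its kind.
    node
    read ()
    {
      switch (p_.next ())
      {
      case xml::parser::start_element: return node::element_start;
      case xml::parser::end_element:   return node::element_end;
      case xml::parser::characters:    return node::text;
      default:                         return node::end; // eof and anything else
      }
    }

    std::string name ()  {return p_.name ();}   // current element name
    std::string value () {return p_.value ();}  // current text content

  private:
    xml::parser p_;
  };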
On Wed, May 21, 2014 at 7:57 PM, Boris Kolpackov
In fact, I believe such an API should be robust enough to be able to wrap different backends, rather than depending on a particular implementation choice.
I don't think it will be robust. I think it will be awful and inconvenient. Try to adapt straight SAX API to anything other than callback-based with inversion of control (i.e., SAX again).
SAX is not that bad, once you have a layer on top to push/pop handlers for various XML elements. That's the technique the Java Ant build tool used (on top of the Java SAX APIs), and I've adapted the same technique on top of Qt's SAX API, before Qt's pull parser came along. Basically each function of a recursive descent parser is replaced by a handler instance, and the C/C++ function stack is replaced by an explicit stack (a sketch of this pattern follows below). But that's beside the point; I also agree with you that a pull parser is much nicer to program against, and DOM-like APIs can easily be layered on top of those.

But it's actually harder than it looks to properly implement a standard-compliant XML parser dealing correctly with DTDs, character and system entities, encodings, namespaces, space normalizations, default attributes from inline or out-of-line DTDs, etc, etc... That you base your library on the long established Expat parser, from James Clark, one of the world's XML experts, is probably a good thing, although the fact it hasn't seen any release since 2007 is a bit worrying (and the license might indeed be an issue).

Many people don't care about these XML "details", but any library worthy of boost that wants to be a foundational building block (in Niall's term) at the bottom of a Boost/C++ XML ecosystem should strive for full conformance IMHO, or at least provide all the low-level tools to allow another library on top to be conformant. Some apps want very low-level knowledge of the structure of an XML document, including all the low-level irrelevant whitespace, processing instructions, character entities, etc... (something even the XML standards don't necessarily allow), while others don't care and want the XML *InfoSet* as specified in the XPath/XSL standards. Then there's also schema-aware processing which associates XSD types to elements, validating parsers (DTDs or XSDs), etc...

Sounds like your library targets the lower-level parsing part, but even that is non-trivial and rarely truly conformant in the many XML libraries out there, so hopefully you're aware of all this, and will explicitly document your conformance level, or lack thereof.

My $0.02. --DD
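A minimal sketch of the handler-stack pattern described above, written against a generic callback interface rather than any particular SAX binding; the class and member names are illustrative, not taken from Ant, Qt, or libstudxml:

  #include <memory>
  #include <stack>
  #include <string>

  // Illustrative handler interface, standing in for whatever a SAX-style
  // parser delivers; the names here are made up for this sketch.
  struct handler
  {
    virtual ~handler () = default;

    // Called when a child element starts; return the handler that will
    // process that child (possibly a no-op handler for uninteresting elements).
    virtual std::unique_ptr<handler> start_child (const std::string& name) = 0;

    // Called when this handler's own element ends.
    virtual void end (const std::string& name) {}
  };

  // The explicit handler stack replaces the call stack of a recursive
  // descent parser: one push per start_element, one pop per end_element.
  class dispatcher
  {
  public:
    explicit dispatcher (std::unique_ptr<handler> root)
    {
      stack_.push (std::move (root));
    }

    void start_element (const std::string& name)
    {
      stack_.push (stack_.top ()->start_child (name)); // "call" into a child handler
    }

    void end_element (const std::string& name)
    {
      stack_.top ()->end (name);
      stack_.pop ();                                   // "return" from the child handler
    }

  private:
    std::stack<std::unique_ptr<handler>> stack_;
  };

Wiring dispatcher::start_element/end_element into the SAX callbacks of the underlying parser is then a few lines; each concrete handler reads like one production of a recursive descent parser.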
Hi Dominique,
Just a few comments/corrections without getting into the whole SAX vs
DOM vs Pull debate:
Dominique Devienne
But it's actually harder than it looks to properly implement a standard-compliant XML parser dealing correctly with DTDs, character and system entities, encodings, namespaces, space normalizations, default attributes from inline or out-of-line DTDs, etc, etc...
Completely agree. Implementing a conforming, non-validating XML 1.0 parser and making sure that it actually does conform (that is, it is well tested) is a big job.
That you base your library on the long established Expat parser, from James Clark, one of the world's XML experts, is probably a good thing,
Expat is probably the world's most widely used XML parser if you consider how many scripting languages and mobile/embedded platforms use it under the hood for their XML APIs.
although the fact it hasn't seen any release since 2007 is a bit worrying...
There was actually a 2.1.0 release in 2012 that somehow didn't make it to the website's news section:

http://sourceforge.net/projects/expat/files/expat/

But, generally, Expat is very mature and the bulk of the changes these days are bug fixes.
Many people don't care about these XML "details", but any library worthy of boost that wants to be a foundational building block (in Niall's term) at the bottom of a Boost/C++ XML ecosystem should strive for full conformance IMHO, or at least provide all the low-level tools to allow another library on top to be conformant.
100% agree.
Sounds like your library targets the lower-level parsing part, but even that is non-trivial and rarely truly conformant in the many XML libraries out there, so hopefully you're aware of all this, and will explicitly document your conformance level, or lack thereof.
Already do. See Implementation Notes:

http://www.codesynthesis.com/projects/libstudxml/doc/intro.xhtml#6

Boris
participants (6)

- Boris Kolpackov
- Dominique Devienne
- Eric Niebler
- Greg Rubino
- pmenso57@comcast.net
- Stefan Seefeld