RFC: Boost.XML API prototype in the sandbox

Stefan Seefeld

6 Jul 2007

9:06 p.m.

Hello,

over the last couple of years we have discussed possible XML APIs for inclusion into Boost. As I already had an early prototype for such an API, I kept evolving it based on feedback from those discussions. A couple of weeks ago I checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml). Today I adjusted the source layout to conform to the sandbox layout we agreed on, including a Boost.Build-based build system.

I would appreciate it if anybody interested in a future Boost.XML submission would have a look, provide feedback, or even get involved in the (ongoing) development.

Best regards, Stefan

PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

--
...ich hab' noch einen Koffer in Berlin...


Mathias Gaunard

9 Jul

5:14 p.m.

Stefan Seefeld wrote:

...

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Sorry, I don't have much time to look at the sources at the moment. It would be nice, however, if you wrote some actual documentation, with cross-linked references, design rationales and examples.

Jake Voytko

6:11 p.m.

Stefan,

I'm working on the Boost.Plot GSoC project, and the first image format that it will support is SVG, an XML-based format. My project would benefit greatly from this effort, so I'll be checking on it periodically (most of my current logic is in the area of plotting, and very little in XML generation, so changes can still easily be made).

Looking through dom.cpp, I like how your headers work. It's a similar approach to how I'm manipulating an SVG document, but much better thought out (mine acts more as a kludge than anything). I especially like the iterator approach to enumerating children.

I'm looking through the example code and I have a few questions.

- Is it possible to have parameters in the tags? For example:

    element_ptr g_elem = root->insert_element("g");
    g_elem->add_parameter("stroke_color", "red");

  produces:

    <g stroke_color = "red"> .. </g>

  Is this what append_element() does, or is there another function to do this?

- The operator->() syntax. It would appear that you did this because all of the elements act as pointers. However, for some of the operations it is unclear to me why they should be implemented using the -> syntax:

    my_elem -> write_to_file("my_file.xml")

  for example.

- Also, how far does the pointer analogy carry? If I call

    ++my_elem();

  does this go to the next child of my_elem->parent()?

I second the call for documentation from Mathias. This notwithstanding, good work so far!

Jake

On 7/6/07, Stefan Seefeld <seefeld@sympatico.ca> wrote:

...

Hello,

over the last couple of years we have discussed possible XML APIs for inclusion into boost. As I already had an early prototype for such an API, I kept evolving it, based on feedback from those discussions. A couple of weeks ago I actually checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml). Today, I adjusted the source layout to conform to the sandbox layout we agreed on, including a boost.build - based build-system.

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Best regards, Stefan

PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

--

...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld

10 Jul

1:11 a.m.

Jake Voytko wrote:

...

I'm looking through the example code and I have a few questions.

- Is it possible to have parameters in the tags? For example:

element_ptr g_elem = root->insert_element("g"); g_elem->add_parameter("stroke_color", "red");

produces:

<g stroke_color = "red"> .. </g>

Is this what append_element() does, or is there another function to do this?

What you call a 'parameter' is actually called an 'attribute', and it gets set like g_elem->set_parameter("stroke_color", "red"); (it can't be "added" since it may only exist once. Thus, you can set and unset it.)

...

- The operator->() syntax.. It would appear that you did this because all of the elements act as pointers. However, I feel that some of the operations

precisely.

...

are unclear as to why they should be implemented using the -> syntax

my_elem -> write_to_file("my_file.xml"), for example.

'write_to_file' is a method of 'document', not 'element'. The reason why I expect applications to hold document pointers is that they often get instantiated by a parser (acting as a factory).

...

- Also, how far does the pointer analogy carry? If I call

++my_elem();

does this go to the next child of my_elem->parent()?

No, an element doesn't have iterator semantics.

...

I second the call for documentation from Mathias. This nonwithstanding, good work so far!

Thanks! And yes, I will do my best to write some documentation (besides some API reference that can be readily extracted via synopsis or doxygen).

Thanks for the feedback, Stefan

--
...ich hab' noch einen Koffer in Berlin...

David Abrahams

11 Jul

2 a.m.

on Mon Jul 09 2007, Stefan Seefeld <seefeld-AT-sympatico.ca> wrote:

...

Jake Voytko wrote:

...
I'm looking through the example code and I have a few questions.

- Is it possible to have parameters in the tags? For example:

element_ptr g_elem = root->insert_element("g"); g_elem->add_parameter("stroke_color", "red");

produces:

<g stroke_color = "red"> .. </g>

Is this what append_element() does, or is there another function to do this?

What you call a 'parameter' is actually called an 'attribute', and it gets set like

g_elem->set_parameter("stroke_color", "red");

Surely

    g_elem->set_attribute(...)

?

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

Stefan Seefeld

2:08 a.m.

David Abrahams wrote:

...

on Mon Jul 09 2007, Stefan Seefeld <seefeld-AT-sympatico.ca> wrote:

...

...
What you call a 'parameter' is actually called an 'attribute', and it gets set like

g_elem->set_parameter("stroke_color", "red");

Surely

g_elem->set_attribute(...)

?

Indeed. Sorry for the confusion.

Stefan

--
...ich hab' noch einen Koffer in Berlin...
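The distinction made in this sub-thread (an attribute name exists at most once on an element, so it is set or unset rather than "added") can be illustrated with a small stand-in class. This is only a sketch under assumptions: `toy_element`, `unset_attribute` and `start_tag` are hypothetical names invented for the illustration, and only the name `set_attribute` is confirmed in this thread.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an XML element; NOT the sandbox class.
class toy_element {
public:
    explicit toy_element(std::string name) : name_(std::move(name)) {}

    // Setting the same name twice overwrites the value:
    // an attribute name exists at most once per element.
    void set_attribute(const std::string& key, const std::string& value) {
        attributes_[key] = value;
    }

    // Assumed counterpart to set_attribute; returns true if something was removed.
    bool unset_attribute(const std::string& key) {
        return attributes_.erase(key) > 0;
    }

    // Serialise the start tag only, attributes in key order, no escaping.
    std::string start_tag() const {
        std::string out = "<" + name_;
        for (const auto& kv : attributes_)
            out += " " + kv.first + "=\"" + kv.second + "\"";
        return out + ">";
    }

private:
    std::string name_;
    std::map<std::string, std::string> attributes_;
};
```

With this toy, calling `set_attribute("stroke_color", "red")` and then `set_attribute("stroke_color", "blue")` leaves a single `stroke_color` attribute, which is why "add" would be a misleading verb.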

Phil Endecott

9 Jul

10:22 p.m.

Stefan Seefeld wrote:

...

over the last couple of years we have discussed possible XML APIs for inclusion into boost. As I already had an early prototype for such an API, I kept evolving it, based on feedback from those discussions. A couple of weeks ago I actually checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml).

...

PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

Hi Stefan,

My comments follow; these are based on maybe half an hour looking at your code, but it's quite possible that I have missed something. As others have pointed out, it would be easier to evaluate with some more docs...

I certainly agree that C++ would benefit from an XML API and Boost is a good place to develop it.

As far as I can see, what you have is a wrapper around the GNOME libxml2 (which has an MIT license and is cross-platform) that implements something that you call dom, but is not the standardised "DOM" API for XML (http://www.w3.org/DOM/).

I think that two C++ APIs for XML document manipulation could be justified:

(a) DOM. This has the benefit of being standardised, so you can transfer at least your experience and to some extent actual code from one language to another (e.g. C++ to/from Javascript in my case). On the other hand it is a rather verbose and unenjoyable API that isn't a great match to 'modern' C++.

(b) A standard-library-like API (e.g. attributes are a map, child nodes are a sequence). This would have the benefit of familiarity to users of the C++ standard library, and I think it would be a more concise and usable API.

As far as I can see, what you have created is something that isn't (a) or (b) but falls somewhere between. For example, you provide iterators rather than the nextSibling-style functions of DOM, but you provide custom functions like append_element and set_attribute rather than standard-library-like append() and operator[] implementations. For example, compare:

- DOM:
    e.setAttribute("color","red");
    e.appendChild(doc.createElement("P"));

- Yours:
    e.set_attribute("color","red");
    e.append_element("P");

- STL-like:
    e.attributes["color"]="red";
    e.children.push_back(new Element("P"));

In the past I have used a library called xmlwrapp. You should take a look at it if you have not done so already. It has a very liberal license (boost-like). It is also a C++ libxml2 wrapper and as I recall its style is similar to yours. It seemed to do nearly everything that I wanted.

I remember being confused about the ownership semantics of pointed-to objects sometimes; what is your policy? (e.g. if I copy a subtree to another place in the document, is it a deep copy or a pointer copy? Copy-on-write? When is it freed? Reference counted?)

I was also surprised once by the memory inefficiency: you might like to consider how many MB of RAM are needed to store in memory a document that is X MB on disk, for examples with many small nodes or fewer larger nodes. In my case, it would have helped to use some sort of dictionary for element and attribute names.

One thing that xmlwrapp did not offer was a way to access the underlying libxml2 C 'object'. While this is normally an implementation detail that you would like to hide, note that there are other C libraries that you might want to use; I think the one that I was looking at was the SVG renderer librsvg [attn Jake!]. I wanted to build an in-memory XML/SVG document in my C++ code and then convert it to a bitmap, but because xmlwrapp wouldn't let me get at the raw libxml2 stuff, I couldn't, and had to go via a temporary file. (Or maybe I hacked it, can't remember.) Doing XSLT transformations would be another example where this would be necessary.

I hope these comments are useful; what do others think?

Regards,

Phil.
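Option (b) above, where attributes are a map and child nodes are a sequence, might look roughly like the following. This is a sketch only: `stl_like_element` and `to_xml` are hypothetical names, no escaping is done, and the choice of `std::unique_ptr` is one possible answer to the ownership question (the parent owns its children outright), not a statement about what the sandbox code does.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type for option (b); names are illustrative only.
struct stl_like_element {
    std::string name;
    std::map<std::string, std::string> attributes;            // attributes are a map
    std::vector<std::unique_ptr<stl_like_element>> children;  // children are a sequence

    explicit stl_like_element(std::string n) : name(std::move(n)) {}
};

// Minimal serialiser so the effect of the container operations is visible.
inline std::string to_xml(const stl_like_element& e) {
    std::string out = "<" + e.name;
    for (const auto& kv : e.attributes)
        out += " " + kv.first + "=\"" + kv.second + "\"";
    if (e.children.empty())
        return out + "/>";
    out += ">";
    for (const auto& child : e.children)
        out += to_xml(*child);
    return out + "</" + e.name + ">";
}
```

Usage then reads like ordinary container code: `e.attributes["color"] = "red";` followed by `e.children.push_back(std::make_unique<stl_like_element>("P"));`. Because the containers are exposed directly, a backend such as libxml2 could not simply be slotted in underneath without copying or proxies.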

Stefan Seefeld

10 Jul

3:12 a.m.

Phil Endecott wrote:

...

As far as I can see, what you have is a wrapper around the GNOME libxml2 (which has an MIT-license and is cross-platform) that implements something that you call dom, but is not the standardised "DOM" API for XML (http://www.w3.org/DOM/).

Right. The "DOM" API isn't actually standardized for anything but Java. So in my attempt to provide a "DOM-like" API I have been trying to capture the semantics, not the syntax, and map that to "modern C++ style" idioms.

To me, DOM stands for an API that allows in-memory manipulation of, well, structured documents. That includes full (generic) tree traversal, as well as XML-specific operations such as XPath-based node lookup and validation, and other domain-specific operations.

Finally, it is important to keep a balance between encapsulation and transparency. While it is true that, semantically at least, one could map some of the helpers to STL (or Boost, for that matter) types, I deliberately refrained from such an approach, to allow existing implementations to be bound to the API. In particular, I'm using libxml2 here as the backend, which has been fine-tuned for best performance over a couple of years. I'm not courageous / naive enough to attempt to match such an implementation with a new one. Instead, I want a seamless mapping that doesn't impose any unnecessary copying or other indirection to mitigate the 'impedance mismatch'.

I will try to write my thoughts and design rationales up in a document, so things become a little clearer.

Thanks for your comments!

Stefan

--
...ich hab' noch einen Koffer in Berlin...

Michael Caisse

6:15 a.m.

Stefan Seefeld wrote:

...

Right. The "DOM" API isn't actually standardized for anything but Java. So in my attempt to provide a "DOM-like" API I have been trying to capture the semantics, not the syntax, and map that to "modern C++ style" idioms.

<-------- snip ---------->

Stefan -

I am interested in reviewing your work. I have written wrappers of the Xerces DOM and SAX to make it more C++ friendly to my needs. I think a Boost-reviewed library will be beneficial to the community.

I hope this doesn't sound pedantic but the DOM is a well-defined API using OMG IDL. The interface is standardized. The W3C has only described language bindings for Java and Javascript; however, the DOM API has well-defined levels of conformance that are language independent.

Perhaps a different *name* than DOM would be useful. I had similar issues with my API because users had an expectation from the term DOM that included a defined API. Maybe something like XMLDoc or XOM (sorry, Mr. Namespace is not available right now).

I personally think the DOM is annoying to use and I look forward to studying your library.

Best regards -
Michael

--
----------------------------------
Michael Caisse
Object Modeling Designs
www.objectmodelingdesigns.com

Doug Gregor

12:32 p.m.

On Jul 9, 2007, at 11:12 PM, Stefan Seefeld wrote:

...

Finally, it is important to keep a balance between encapsulation and transparency. While it is true that, semantically at least, one could map some of the helpers by STL (or boost, for that matter) types, I deliberately refrained from such an approach, to allow existing implementations to be bound to the API. In particular, I'm using here libxml2 as backend, which has been fine-tuned for best performance over a couple of years. I'm not courageous / naive enough to attempt to match such an implementation with a new one. Instead, I want a seamless mapping that doesn't impose any unnecessary copying or other indirection to mitigate the 'impedance mismatch'.

This is a very, very, very, very, very good approach. - Doug

Phil Endecott

4:31 p.m.

Doug Gregor wrote:

...

On Jul 9, 2007, at 11:12 PM, Stefan Seefeld wrote:

...
I'm using here libxml2 as backend

...

...
I want a seamless mapping that doesn't impose any unnecessary copying or other indirection to mitigate the 'impedance mismatch'.

...

This is a very, very, very, very, very good approach.

[I have quoted Stefan's words in the hope that I've captured the core of what Doug is agreeing so emphatically with - I hope I've got it right.]

Aiming for the minimum overhead in your libxml2 wrapper is a valid objective. But perhaps in that case you should be selling this as a "C++ wrapper for libxml2", not as a "Boost XML library"? I would have thought that a largely backend-independent (or self-contained) library with STL-like interface would be more "Boost-compatible".

Does anyone have any experience of how little overhead could be involved in going from

    e.attributes["foo"]="blah";

to

    e.set_attribute("foo","blah");

?

Sketch of implementation:

    class Element;

    class AttributeProxy {
      Element& e;
      string name;
    public:
      AttributeProxy(Element& e_, string name_): e(e_), name(name_) {}
      AttributeProxy& operator=(string value);  // defined below, once Element is complete
    };

    class Attributes {
      Element& e;
    public:
      Attributes(Element& e_): e(e_) {}
      AttributeProxy operator[](string name) { return AttributeProxy(e,name); }  // hmm, returns temporary!
    };

    class Element {
    public:
      Attributes attributes;
      Element(): attributes(*this) {}
    private:
      friend class AttributeProxy;  // the proxy must reach the private setter
      void set_attribute(string name, string value) { .... }
    };

    AttributeProxy& AttributeProxy::operator=(string value) {
      e.set_attribute(name,value);
      return *this;
    }

What do current compilers do with that? What can we expect in the future?

Phil.

Stefan Seefeld

4:37 p.m.

Phil Endecott wrote:

...

Doug Gregor wrote:

...
On Jul 9, 2007, at 11:12 PM, Stefan Seefeld wrote:

...
I'm using here libxml2 as backend

...
...
I want a seamless mapping that doesn't impose any unnecessary copying or other indirection to mitigate the 'impedance mismatch'.

...
This is a very, very, very, very, very good approach.

[I have quoted Stefan's words in the hope that I've captured the core of what Doug is agreeing so emphatically with - I hope I've got it right.]

Aiming for the minimum overhead in your libxml2 wrapper is a valid objective. But perhaps in that case you should be selling this as a "C++ wrapper for libxml2", not as a "Boost XML library"? I would have thought that a largely backend-independent (or self-contained) library with STL-like interface would be more "Boost-compatible".

What about my proposed interface is libxml2-specific, prompting you to call it a 'libxml2 wrapper' ? Making the wrapper as thin as possible, yet making the API itself backend-agnostic (and thus allow it to be reimplemented with other backends) is part of the balance I was talking about, too.

...

Does anyone have any experience of how little overhead could be involved in going from

e.attributes["foo"]="blah"; to e.set_attribute("foo","blah"); ?

That is pure syntactic sugar. Yes, this can be done using some proxy classes. Why should we worry about such details at this point?

Regards, Stefan

--
...ich hab' noch einen Koffer in Berlin...
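For readers who want to try the proxy idea, here is a self-contained variant that compiles. It is a sketch under assumptions, not the sandbox code: the class names (`element`, `attribute_view`, `attribute_proxy`), `get_attribute`, and the `set_calls` counter are all hypothetical and exist only to show that `e.attributes["foo"] = "blah"` forwards to exactly one `set_attribute` call.

```cpp
#include <map>
#include <string>
#include <utility>

class element;

// Returned by attributes[name]; assignment forwards to element::set_attribute.
class attribute_proxy {
public:
    attribute_proxy(element& e, std::string name) : e_(e), name_(std::move(name)) {}
    attribute_proxy& operator=(const std::string& value);  // defined after element
    operator std::string() const;                          // read access
private:
    element& e_;
    std::string name_;
};

// Lightweight view object that makes e.attributes[...] spell-able.
class attribute_view {
public:
    explicit attribute_view(element& e) : e_(e) {}
    attribute_proxy operator[](const std::string& name) {
        return attribute_proxy(e_, name);
    }
private:
    element& e_;
};

class element {
public:
    attribute_view attributes;
    element() : attributes(*this) {}
    element(const element&) = delete;             // the view refers back to *this
    element& operator=(const element&) = delete;

    void set_attribute(const std::string& name, const std::string& value) {
        ++set_calls;
        store_[name] = value;
    }
    std::string get_attribute(const std::string& name) const {
        auto it = store_.find(name);
        return it == store_.end() ? std::string() : it->second;
    }
    int set_calls = 0;  // counts forwarded calls, for demonstration only
private:
    std::map<std::string, std::string> store_;
};

inline attribute_proxy& attribute_proxy::operator=(const std::string& value) {
    e_.set_attribute(name_, value);
    return *this;
}
inline attribute_proxy::operator std::string() const {
    return e_.get_attribute(name_);
}
```

The proxy here copies the attribute name into a `std::string`, which is a real cost on top of the direct call; whether an optimiser removes the rest of the indirection has not been measured for this sketch.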

Michael Marcin

5:19 p.m.

Stefan Seefeld wrote:

...

Phil Endecott wrote:

...
Aiming for the minimum overhead in your libxml2 wrapper is a valid objective. But perhaps in that case you should be selling this as a "C++ wrapper for libxml2", not as a "Boost XML library"? I would have thought that a largely backend-independent (or self-contained) library with STL-like interface would be more "Boost-compatible".

What about my proposed interface is libxml2-specific, prompting you to call it a 'libxml2 wrapper' ?

If you are designing the interface to minimize overhead using a libxml2 backend then it will likely incur undue overhead using some other backend and thus make the library essentially viable only with a libxml2 backend.

...

Making the wrapper as thin as possible, yet making the API itself backend-agnostic (and thus allow it to be reimplemented with other backends) is part of the balance I was talking about, too.

In my experience until I actually have 2 or more dissimilar backends implemented the interface is not implementation agnostic. If that is truly a goal of this library (and I'm not convinced it needs to be) then it should actually be exercised. - Michael Marcin

Stefan Seefeld

5:43 p.m.

Michael Marcin wrote:

...

Stefan Seefeld wrote:

...
Phil Endecott wrote:

...
Aiming for the minimum overhead in your libxml2 wrapper is a valid objective. But perhaps in that case you should be selling this as a "C++ wrapper for libxml2", not as a "Boost XML library"? I would have thought that a largely backend-independent (or self-contained) library with STL-like interface would be more "Boost-compatible".

What about my proposed interface is libxml2-specific, prompting you to call it a 'libxml2 wrapper' ?

If you are designing the interface to minimize overhead using a libxml2 backend then it will likely incur undue overhead using some other backend and thus make the library essentially viable only with a libxml2 backend.

I don't follow that argument. Minimizing overhead doesn't mean I try to keep as close as possible to the libxml2 API, but instead, I allow for enough latitude in the spec to adjust to backend-specific handling. That appears to be a common theme in standardization: Be specific enough to be actually useful for end-users, and flexible enough to make implementers happy.

...

...
Making the wrapper as thin as possible, yet making the API itself backend-agnostic (and thus allow it to be reimplemented with other backends) is part of the balance I was talking about, too.

In my experience until I actually have 2 or more dissimilar backends implemented the interface is not implementation agnostic.

That's a fair point. I'm not saying the API actually is implementation agnostic, but I am trying to make it so. I would certainly appreciate it if others tried to provide alternative bindings, so we can compare. Maybe it's a little early to do that, though.

Regards, Stefan

--
...ich hab' noch einen Koffer in Berlin...
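The suggestion to exercise the API against two dissimilar backends can be sketched in miniature. Everything below is hypothetical: `map_backend`, `list_backend`, `basic_element` and the `put`/`find` contract are invented for the illustration, and neither backend resembles libxml2. The point is only the shape: one user-facing class, one contract check, run unchanged against each backend.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Two deliberately dissimilar toy backends (neither is libxml2).
struct map_backend {
    std::map<std::string, std::string> data;
    void put(const std::string& k, const std::string& v) { data[k] = v; }
    const std::string* find(const std::string& k) const {
        auto it = data.find(k);
        return it == data.end() ? nullptr : &it->second;
    }
};

struct list_backend {
    std::vector<std::pair<std::string, std::string>> data;
    void put(const std::string& k, const std::string& v) {
        for (auto& kv : data)
            if (kv.first == k) { kv.second = v; return; }
        data.emplace_back(k, v);
    }
    const std::string* find(const std::string& k) const {
        for (const auto& kv : data)
            if (kv.first == k) return &kv.second;
        return nullptr;
    }
};

// The user-facing API is written once against the backend's small contract.
template <class Backend>
class basic_element {
public:
    void set_attribute(const std::string& k, const std::string& v) { impl_.put(k, v); }
    bool has_attribute(const std::string& k) const { return impl_.find(k) != nullptr; }
    std::string get_attribute(const std::string& k) const {
        const std::string* p = impl_.find(k);
        return p ? *p : std::string();
    }
private:
    Backend impl_;
};

// One check, run unchanged against every backend.
template <class Backend>
bool attribute_contract_holds() {
    basic_element<Backend> e;
    e.set_attribute("color", "red");
    e.set_attribute("color", "blue");  // second set overwrites the first
    return e.has_attribute("color") && !e.has_attribute("size")
        && e.get_attribute("color") == "blue";
}
```

If a second real backend forced changes to `basic_element` itself, that would be the evidence that the interface had been shaped by the first one.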

Martin Wille

3:20 p.m.

Stefan Seefeld wrote:

...

Right. The "DOM" API isn't actually standardized for anything but Java.

<nitpick> and ECMAScript and OMG IDL (http://www.w3.org/TR/DOM-Level-2-Core/) </nitpick>

...

So in my attempt to provide a "DOM-like" API I have been trying to capture the semantics, not the syntax, and map that to "modern C++ style" idioms.

That's certainly better than to apply IDL-C++ mappings ;) Regards, m

Stuart Dootson

9:32 a.m.

On 06/07/07, Stefan Seefeld <seefeld@sympatico.ca> wrote:

...

Hello,

over the last couple of years we have discussed possible XML APIs for inclusion into boost. As I already had an early prototype for such an API, I kept evolving it, based on feedback from those discussions. A couple of weeks ago I actually checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml). Today, I adjusted the source layout to conform to the sandbox layout we agreed on, including a boost.build - based build-system.

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Best regards, Stefan

PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

Stefan - I'll certainly have a look - I've got an XML processing project that I need to translate from C# to C++ (for targets without .NET framework installed - it's a locked down corporate environment). I only used C# 'cause it had an easy to use interface for XML processing - it'd be nice to have one for C++! Stuart

David Abrahams

11 Jul

2:20 a.m.

on Fri Jul 06 2007, Stefan Seefeld <seefeld-AT-sympatico.ca> wrote:

...

over the last couple of years we have discussed possible XML APIs for inclusion into boost. As I already had an early prototype for such an API, I kept evolving it, based on feedback from those discussions. A couple of weeks ago I actually checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml). Today, I adjusted the source layout to conform to the sandbox layout we agreed on, including a boost.build - based build-system.

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Okay, this is going to sound very opinionated:

I find XML horrible to read, however I find most of the procedural code I've seen for manipulating it even more horrible.

I would like to see a more declarative syntax for much of this stuff.

    element_ptr info = root->insert_element(root->begin_children(), "articleinfo");
    if (title)
    {
      info->insert(info->begin_children(), title);
      info->insert_comment(info->begin_children(), "This title was moved");
    }
    element_ptr author = info->append_element("author");
    element_ptr firstname = author->append_element("firstname");
    firstname->set_content("Joe");
    element_ptr surname = author->append_element("surname");
    surname->set_content("Random");

could be something like:

    root.push_front(
      tag("articleinfo")[
        title ? (comment("This title was moved"), title) : NULL
      , tag("author")[
          tag("firstname")["Joe"],
          tag("surname")["Random"]
        ]
      ]
    )

You could use Boost.Parameter to do attributes, or use runtime attributes like

    tag("div", attr("class") = "someclass")[ ... ]

You might be familiar with Nevow's STAN, which I had a hand in. This suggestion is reminiscent of that.
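One way such a `tag(...)[...]` syntax could be made to compile is sketched below. This is not Dave's implementation or any existing library: the value-type `node`, `node_list`, the overloaded comma operator and `to_xml` are assumptions made for the illustration, and the conditional `title ? ... : NULL` part, attributes, and escaping are left out.

```cpp
#include <string>
#include <utility>
#include <vector>

struct node_list;

// Hypothetical value-type node; only the names tag/comment come from the post.
struct node {
    enum kind_t { element_kind, text_kind, comment_kind };
    kind_t kind;
    std::string value;           // tag name, text content, or comment body
    std::vector<node> children;

    node(kind_t k, std::string v) : kind(k), value(std::move(v)) {}
    node(const char* s) : kind(text_kind), value(s) {}  // lets "Joe" act as a text node

    node operator[](const node& child) const;      // tag("x")[one_child]
    node operator[](const node_list& kids) const;  // tag("x")[(a, b, c)]
};

struct node_list { std::vector<node> items; };

inline node tag(const std::string& name) { return node(node::element_kind, name); }
inline node comment(const std::string& s) { return node(node::comment_kind, s); }

// The overloaded comma collects siblings into a list.
inline node_list operator,(const node& a, const node& b) {
    node_list l;
    l.items.push_back(a);
    l.items.push_back(b);
    return l;
}
inline node_list operator,(node_list l, const node& b) {
    l.items.push_back(b);
    return l;
}

inline node node::operator[](const node& child) const {
    node copy(*this);
    copy.children.push_back(child);
    return copy;
}
inline node node::operator[](const node_list& kids) const {
    node copy(*this);
    copy.children.insert(copy.children.end(), kids.items.begin(), kids.items.end());
    return copy;
}

// Serialise without escaping or indentation.
inline std::string to_xml(const node& n) {
    if (n.kind == node::text_kind) return n.value;
    if (n.kind == node::comment_kind) return "<!--" + n.value + "-->";
    std::string out = "<" + n.value + ">";
    for (const node& c : n.children) out += to_xml(c);
    return out + "</" + n.value + ">";
}
```

A usage such as `tag("author")[(tag("firstname")["Joe"], tag("surname")["Random"])]` then builds the subtree by value. The extra parentheses around the sibling list are deliberate: an unparenthesised comma inside `[]` was deprecated in C++20 and has a different meaning from C++23 on, so the exact spelling from the post only works with older language modes.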

...

PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

Another suggestion: use the .rst extension for ReST documents -- Trac will preview them formatted via ReST.

[Oh, and I suggest you get the tab characters out of your code]

Cheers,

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

Stefan Seefeld

1:14 p.m.

David Abrahams wrote:

...

Okay, this is going to sound very opinionated:

I find XML horrible to read, however I find most of the procedural code I've seen for manipulating it even more horrible.

I would like to see a more declarative syntax for much of this stuff.

    element_ptr info = root->insert_element(root->begin_children(), "articleinfo");
    if (title)
    {
      info->insert(info->begin_children(), title);
      info->insert_comment(info->begin_children(), "This title was moved");
    }
    element_ptr author = info->append_element("author");
    element_ptr firstname = author->append_element("firstname");
    firstname->set_content("Joe");
    element_ptr surname = author->append_element("surname");
    surname->set_content("Random");

could be something like:

    root.push_front(
      tag("articleinfo")[
        title ? (comment("This title was moved"), title) : NULL
      , tag("author")[
          tag("firstname")["Joe"],
          tag("surname")["Random"]
        ]
      ]
    )

You could use Boost.Parameter to do attributes, or use runtime attributes like

tag("div", attr("class") = "someclass")[ ... ]

You might be familiar with Nevow's STAN, which I had a hand in. This suggestion is reminiscent of that.

Do you think the above syntax would replace the procedural API, or merely complement it ? While I can see the appeal of such a declarative approach, I'm not sure how well that fits into a broader picture where users want to use the same API not only to build a document, but traverse it, remove and replace elements, etc. To me, right now, what you propose looks mostly like syntactic sugar, which can be worked on as a refinement once the basic (and common) API is established.

...

...
PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

Another suggestion: use the .rst extension for ReST documents -- Trac will preview them formatted via ReST

Sure, will do. I hadn't even thought of the README as a ReST document. :-) I'll migrate more things to boost conventions as I have time to work on it.

...

[Oh, and I suggest you get the tab characters out of your code]

That, too. Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Felipe Magno de Almeida

4:37 p.m.

On 7/11/07, Stefan Seefeld <seefeld@sympatico.ca> wrote:

...

David Abrahams wrote:

[snipped]

...

Do you think the above syntax would replace the procedural API, or merely complement it ? While I can see the appeal of such a declarative approach, I'm not sure how well that fits into a broader picture where users want to use the same API not only to build a document, but traverse it, remove and replace elements, etc.

FWIW, if a declarative syntax were available, I wouldn't find much need for a procedural API, except where the declarative form isn't possible at all. Pursuing a very easy-to-use syntax should be the goal, IMHO, in an XML Boost library.

...

To me, right now, what you propose looks mostly like syntactic sugar, which can be worked on as a refinement once the basic (and common) API is established.

It is mostly syntactic sugar that you're defining. Anyone can use libxml2 directly if syntactic sugar isn't needed or desirable. [snip]

...

Stefan

-- Felipe Magno de Almeida

Mathias Gaunard

6:19 p.m.

David Abrahams wrote:

...

I find XML horrible to read, however I find most of the procedural code I've seen for manipulating it even more horrible.

I would like to see a more declarative syntax for much of this stuff.

    element_ptr info = root->insert_element(root->begin_children(), "articleinfo");
    if (title)
    {
      info->insert(info->begin_children(), title);
      info->insert_comment(info->begin_children(), "This title was moved");
    }
    element_ptr author = info->append_element("author");
    element_ptr firstname = author->append_element("firstname");
    firstname->set_content("Joe");
    element_ptr surname = author->append_element("surname");
    surname->set_content("Random");

could be something like:

    root.push_front(
      tag("articleinfo")[
        title ? (comment("This title was moved"), title) : NULL
      , tag("author")[
          tag("firstname")["Joe"],
          tag("surname")["Random"]
        ]
      ]
    )

Interesting. It doesn't look a lot like raw XML but it certainly integrates better into C++.

David Abrahams

12 Jul

3:12 a.m.

on Wed Jul 11 2007, Mathias Gaunard <mathias.gaunard-AT-etu.u-bordeaux1.fr> wrote:

...

David Abrahams wrote:

...
I find XML horrible to read, however I find most of the
^^^^^^^^^^^^^^^^^^^^^^^^^^^
procedural code I've seen for manipulating it even more horrible.

I would like to see a more declarative syntax for much of this stuff.

    root.push_front(
      tag("articleinfo")[
        title ? (comment("This title was moved"), title) : NULL
      , tag("author")[
          tag("firstname")["Joe"],
          tag("surname")["Random"]
        ]
      ]
    )

Interesting. It doesn't look a lot like raw XML

Maybe that's the point ;-)

...

but it certainly integrates better into C++.

That too. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Peter Dimov

8:15 a.m.

David Abrahams wrote:

...

on Wed Jul 11 2007, Mathias Gaunard <mathias.gaunard-AT-etu.u-bordeaux1.fr> wrote:

...
David Abrahams wrote:

...
I find XML horrible to read, however I find most of the
^^^^^^^^^^^^^^^^^^^^^^^^^^^
procedural code I've seen for manipulating it even more horrible.

I would like to see a more declarative syntax for much of this stuff.

    root.push_front(
      tag("articleinfo")[
        title ? (comment("This title was moved"), title) : NULL
      , tag("author")[
          tag("firstname")["Joe"],
          tag("surname")["Random"]
        ]
      ]
    )

Interesting. It doesn't look a lot like raw XML

Maybe that's the point ;-)

...
but it certainly integrates better into C++.

That too.

You're optimizing the wrong case, a toy example. It obviously depends on the actual XML schema, but I typically average one Node::add_element call per function when the format is sensible and can be mapped to C++.

    void add_element( node * p, char const * n, ArticleInfo const & ai )
    {
        node * p2 = p->add_element( n );
        add_element( p2, "author", ai.author() );
    }

    void add_element( node * p, char const * n, Author const & a )
    {
        node * p2 = p->add_element( n );
        add_element( p2, "firstname", a.firstname() );
        add_element( p2, "surname", a.surname() );
    }

    void add_element( node * p, char const * n, std::string const & s )
    {
        node * p2 = p->add_element( n );
        p2->add_text( s );
    }

    int main()
    {
        ArticleInfo ai( ... );
        XmlDocument xd( "test.xml" );
        add_element( xd, "articleinfo", ai );
    }

(this is C++ pseudocode and out of order but you get the point.)

This is how we can move to a list of authors:

    void add_element( node * p, char const * n, ArticleInfo const & ai )
    {
        node * p2 = p->add_element( n );
        add_element( p2, "authors", ai.authors() );
    }

    void add_element( node * p, char const * n, vector<Author> const & v )
    {
        node * p2 = p->add_element( n );
        for( auto i = v.begin(); i != v.end(); ++i )
        {
            add_element( p2, "author", *i );
        }
    }
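The overload-per-type pattern above can be tried out with a small self-contained variant. It is a sketch, not Peter's code: the toy `node` class with `add_element`, `add_text` and `to_xml`, and the plain-struct `Author` and `ArticleInfo` with public data members, are assumptions made so that the example compiles on its own, and the overloads are ordered so each one is declared before it is used.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy node type standing in for whatever the XML library provides.
class node {
public:
    explicit node(std::string name) : name_(std::move(name)) {}
    node* add_element(const std::string& name) {
        children_.push_back(std::make_unique<node>(name));
        return children_.back().get();
    }
    void add_text(const std::string& s) { text_ += s; }
    std::string to_xml() const {  // no escaping, no indentation
        std::string out = "<" + name_ + ">" + text_;
        for (const auto& c : children_) out += c->to_xml();
        return out + "</" + name_ + ">";
    }
private:
    std::string name_, text_;
    std::vector<std::unique_ptr<node>> children_;
};

// Hypothetical application types that mirror the XML schema.
struct Author { std::string firstname, surname; };
struct ArticleInfo { std::vector<Author> authors; };

// One overload per C++ type; each adds one element and delegates the rest.
inline void add_element(node* p, const char* n, const std::string& s) {
    p->add_element(n)->add_text(s);
}
inline void add_element(node* p, const char* n, const Author& a) {
    node* p2 = p->add_element(n);
    add_element(p2, "firstname", a.firstname);
    add_element(p2, "surname", a.surname);
}
inline void add_element(node* p, const char* n, const std::vector<Author>& v) {
    node* p2 = p->add_element(n);
    for (const Author& a : v) add_element(p2, "author", a);
}
inline void add_element(node* p, const char* n, const ArticleInfo& ai) {
    node* p2 = p->add_element(n);
    add_element(p2, "authors", ai.authors);
}
```

Moving from one author to a list of authors only touched the `ArticleInfo` overload and added the `std::vector<Author>` one, which is the maintenance property the post is arguing for.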

David Abrahams

6:02 p.m.

on Thu Jul 12 2007, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:

...

David Abrahams wrote:

...
on Wed Jul 11 2007, Mathias Gaunard <mathias.gaunard-AT-etu.u-bordeaux1.fr> wrote:

...
David Abrahams wrote:

...
I find XML horrible to read, however I find most of the
^^^^^^^^^^^^^^^^^^^^^^^^^^^
procedural code I've seen for manipulating it even more horrible.

I would like to see a more declarative syntax for much of this stuff.

    root.push_front(
      tag("articleinfo")[
        title ? (comment("This title was moved"), title) : NULL
      , tag("author")[
          tag("firstname")["Joe"],
          tag("surname")["Random"]
        ]
      ]
    )

Interesting. It doesn't look a lot like raw XML

Maybe that's the point ;-)

...
but it certainly integrates better into C++.

That too.

You're optimizing the wrong case

I don't know about that. The kind of interface used above has become very popular in the Python world... because it's useful. It may not cover the entire domain of XML manipulations well, but it does cover an important corner. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Peter Dimov

6:26 p.m.

David Abrahams wrote:

...

on Thu Jul 12 2007, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:

...

...
...
...
...
I find XML horrible to read, however I find most of the procedural code I've seen for manipulating it even more horrible.

...

...
You're optimizing the wrong case

I don't know about that. The kind of interface used above has become very popular in the Python world... because it's useful. It may not cover the entire domain of XML manipulations well, but it does cover an important corner.

Maybe it is, maybe it does. But you are still optimizing the wrong case; horrible procedural code for creating a toy XML. Now, if you find the code I posted horrible, that'd be another matter. :-)

David Abrahams

10:12 p.m.

on Thu Jul 12 2007, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:

...

...
...
You're optimizing the wrong case

I don't know about that. The kind of interface used above has become very popular in the Python world... because it's useful. It may not cover the entire domain of XML manipulations well, but it does cover an important corner.

Maybe it is, maybe it does. But you are still optimizing the wrong case; horrible procedural code for creating a toy XML.

I don't understand what you're objecting to. The XML schema being generated is not realistic?

...

Now, if you find the code I posted horrible, that'd be another matter. :-)

Well, I didn't understand the point you were trying to make, whereas (of course) I find my own code immediately understandable. Not being able to understand the code is one form of horror ;-) -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Peter Dimov

10:39 p.m.

David Abrahams wrote:

...

on Thu Jul 12 2007, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:

...
...
...
You're optimizing the wrong case

I don't know about that. The kind of interface used above has become very popular in the Python world... because it's useful. It may not cover the entire domain of XML manipulations well, but it does cover an important corner.

Maybe it is, maybe it does. But you are still optimizing the wrong case; horrible procedural code for creating a toy XML.

I don't understand what you're objecting to. The XML schema being generated is not realistic?

...

Well, I didn't understand the point you were trying to make, whereas (of course) I find my own code immediately understandable.

The point I was trying to make is that when you write the XML generation code in a modular way, there's much less incentive to DSEL things up. The advantages of this style are not apparent in simple examples where there's no reuse within the generated XML, either in the same document, or across documents. In other words, if the <author> tag occurs several times in the schema(s), in different contexts, I'm able to just reuse my add_element overload that takes a C++ author class, without repeating the same tags over and over in a declarative way. Similarly, if several elements in the schema have a lot in common, I can use C++ inheritance to factor the common core into a base class and have it have its own add_element function, which would save me the trouble of repeating the common parts in each subclass element.
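The inheritance argument above can be sketched as follows. This is an illustration under assumptions, not code from the thread: `node`, `Person`, `Author`, `Editor` and the helper names are all hypothetical. The function for the shared base class is written once and reused by each derived element.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal tree node; not the sandbox API.
struct node {
    std::string name, text;
    std::vector<std::unique_ptr<node>> children;

    node* add_element(char const* n) {
        assert(n != nullptr);
        children.push_back(std::make_unique<node>());
        children.back()->name = n;
        return children.back().get();
    }
    std::string str() const {
        std::string out = "<" + name + ">" + text;
        for (auto const& c : children) out += c->str();
        return out + "</" + name + ">";
    }
};

// Hypothetical application types sharing a common core.
struct Person { std::string firstname, surname; };
struct Author : Person { std::string affiliation; };
struct Editor : Person { std::string role; };

void add_text(node* p, char const* n, std::string const& s) {
    p->add_element(n)->text = s;
}

// Written once for the base class, reused by every derived element.
void add_person_parts(node* p, Person const& x) {
    add_text(p, "firstname", x.firstname);
    add_text(p, "surname", x.surname);
}

void add_element(node* p, char const* n, Author const& a) {
    node* p2 = p->add_element(n);
    add_person_parts(p2, a);
    add_text(p2, "affiliation", a.affiliation);
}

void add_element(node* p, char const* n, Editor const& e) {
    node* p2 = p->add_element(n);
    add_person_parts(p2, e);
    add_text(p2, "role", e.role);
}

std::string render_example() {
    Author a;
    a.firstname = "Joe"; a.surname = "Random"; a.affiliation = "Boost";
    Editor e;
    e.firstname = "Ann"; e.surname = "Other"; e.role = "series";

    node book;
    book.name = "book";
    add_element(&book, "author", a);
    add_element(&book, "editor", e);
    return book.str();
}
```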

David Abrahams

13 Jul 13 Jul

8:54 p.m.

on Thu Jul 12 2007, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:

...

David Abrahams wrote:

...
on Thu Jul 12 2007, "Peter Dimov" <pdimov-AT-pdimov.com> wrote:

...
...
...
You're optimizing the wrong case

I don't know about that. The kind of interface used above has become very popular in the Python world... because it's useful. It may not cover the entire domain of XML manipulations well, but it does cover an important corner.

Maybe it is, maybe it does. But you are still optimizing the wrong case; horrible procedural code for creating a toy XML.

I don't understand what you're objecting to. The XML schema being generated is not realistic?

...

...
Well, I didn't understand the point you were trying to make, whereas (of course) I find my own code immediately understandable.

The point I was trying to make is that when you write the XML generation code in a modular way, there's much less incentive to DSEL things up.

I'm afraid I still don't see. Could you add some comments to your code? What about that code makes it "modular"?

...

The advantages of this style are not apparent in simple examples where there's no reuse within the generated XML, either in the same document, or across documents.

In other words, if the <author> tag occurs several times in the schema(s), in different contexts, I'm able to just reuse my add_element overload that takes a C++ author class, without repeating the same tags over and over in a declarative way.

Aren't you just repeating your procedural calls to add_element over and over?

...

Similarly, if several elements in the schema have a lot in common, I can use C++ inheritance to factor the common core into a base class and have it have its own add_element function, which would save me the trouble of repeating the common parts in each subclass element.

I'm sorry, I know you must be making a good point, but I'm still not seeing it. Help... please... :) -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Mathias Gaunard

6:11 p.m.

Peter Dimov a écrit :

...

David Abrahams wrote:

...

...
...
...
root.push_front( tag("articleinfo")[ title ? (comment("This title was moved"), title) : NULL , tag("author")[ tag("firstname")["Joe"], tag("surname")["Random"] ] ] )

Interesting. It doesn't look a lot like raw XML Maybe that's the point ;-)

...
but it certainly integrates better into C++. That too.

You're optimizing the wrong case, a toy example.

Its best usage is obviously for an XML template, where variables map directly to the contents of an XML structure. To do that, people usually invent pseudo-languages that integrate into XML, but then they end up duplicating full programming languages, with an even uglier syntax. Being able to write an XML document and integrate it with C++ variables and logic is what that kind of stuff is for.

David Abrahams

8:55 p.m.

on Fri Jul 13 2007, Mathias Gaunard <mathias.gaunard-AT-etu.u-bordeaux1.fr> wrote:

...

Peter Dimov a écrit :

...
David Abrahams wrote:

...
...
...
...
root.push_front( tag("articleinfo")[ title ? (comment("This title was moved"), title) : NULL , tag("author")[ tag("firstname")["Joe"], tag("surname")["Random"] ] ] )

Interesting. It doesn't look a lot like raw XML Maybe that's the point ;-)

...
but it certainly integrates better into C++. That too.

You're optimizing the wrong case, a toy example.

Its best usage is obviously for an XML template, where variables map directly to the contents of an XML structure.

Yes, that's where I've used it most.

...

To do that, people usually invent pseudo-languages that integrate into XML, but then they end up duplicating full programming languages, with an even uglier syntax.

Yep. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Phil Endecott

11 Jul 11 Jul

6:23 p.m.

David Abrahams wrote:

...

on Fri Jul 06 2007, Stefan Seefeld <seefeld-AT-sympatico.ca> wrote:

...
over the last couple of years we have discussed possible XML APIs for inclusion into boost. As I already had an early prototype for such an API, I kept evolving it, based on feedback from those discussions. A couple of weeks ago I actually checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml). Today, I adjusted the source layout to conform to the sandbox layout we agreed on, including a boost.build - based build-system.

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Okay, this is going to sound very opinionated:

I find XML horrible to read

Agreed; I think it's just sufficiently good that a better alternative would fail.

...

however I find most of the procedural code I've seen for manipulating it even more horrible.

You've seen my code, haven't you :-)

...

I would like to see a more declarative syntax for much of this stuff.

I prefer to read any constant fragments from files or from literal strings, which is about as declarative as you can get.

...

root.push_front( tag("articleinfo")[ title ? (comment("This title was moved"), title) : NULL , tag("author")[ tag("firstname")["Joe"], tag("surname")["Random"] ] ] )

How close can we get to that using Boost.Assign, if the element class is a model of a suitable container?

But I think that using subclasses is the key to making this sort of code easier on the eye:

    struct firstname: public element {
        firstname(n): element("firstname",n) {};
    };

etc.

    articleinfo ai;
    if (title) ai.push_back(comment("This title was moved"));
    author a;
    a.push_back(firstname("Joe"));
    a.push_back(surname("Random"));
    ai.push_back(a);
    root.push_back(ai);

Just adding some non-standard indentation to that makes it almost as clear as I need. You don't need the {}, but it keeps your editor's auto-indent happy:

    articleinfo ai; {
        if (title) ai.push_back(comment("This title was moved"));
        author a; {
            a.push_back(firstname("Joe"));
            a.push_back(surname("Random"));
            ai.push_back(a);
        }
        root.push_back(ai);
    }

You can also reduce the declare-populate-add to declare-populate if you pass the parent to the sub-elements' constructors:

    articleinfo ai(root); {
        if (title) comment(ai,"This title was moved");
        author a(ai); {
            firstname(a,"Joe");
            surname(a,"Random");
        }
    }

Of course there is a question of the lifespan of these apparently-temporary objects to resolve. Anyway, I think that nice-looking XML-generating code can be possible without "resorting to" operator overloading.

Regards, Phil.
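One possible answer to the lifespan question raised above, as a sketch under assumptions rather than a proposal for the sandbox API: make `element` a light handle onto shared node data, so that a temporary such as `firstname(a, "Joe")` can go out of scope while the node it created stays in its parent. The `element_data` struct and the shared-ownership scheme are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical design: the tree owns the data, `element` is only a handle.
struct element_data {
    std::string name, text;
    std::vector<std::shared_ptr<element_data>> children;
};

class element {
public:
    explicit element(std::string const& name)
        : d_(std::make_shared<element_data>()) {
        d_->name = name;
    }
    // Creating an element with a parent links it into the parent at once.
    element(element& parent, std::string const& name,
            std::string const& text = "")
        : element(name) {
        d_->text = text;
        assert(parent.d_);
        parent.d_->children.push_back(d_);
    }
    std::string str() const { return render(*d_); }

private:
    static std::string render(element_data const& d) {
        std::string out = "<" + d.name + ">" + d.text;
        for (auto const& c : d.children) out += render(*c);
        return out + "</" + d.name + ">";
    }
    std::shared_ptr<element_data> d_;
};

struct articleinfo : element {
    explicit articleinfo(element& p) : element(p, "articleinfo") {}
};
struct author : element {
    explicit author(element& p) : element(p, "author") {}
};
struct firstname : element {
    firstname(element& p, std::string const& n) : element(p, "firstname", n) {}
};
struct surname : element {
    surname(element& p, std::string const& n) : element(p, "surname", n) {}
};

std::string render_example() {
    element root("root");
    articleinfo ai(root); {
        author a(ai); {
            firstname(a, "Joe");    // temporary handle; the node lives on in `a`
            surname(a, "Random");
        }
    }
    return root.str();
}
```

A caveat of this shape: constructing a child from a parent of the same type (for example an `author` inside an `author`) would select the copy constructor instead, so a real design would need a different spelling for that case.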

Felipe Magno de Almeida

6:51 p.m.

On 7/11/07, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:

...

[snip]

...

...
I would like to see a more declarative syntax for much of this stuff.

I prefer to read any constant fragments from files or from literal strings, which is about as declarative as you can get.

But you lose locality and sometimes performance. What if you have lots of little XML snippets that must be added in parts of an XML document? Is saving them in files really a good option? Wouldn't it be better to just have four or five more lines of code inside the function? There you can see both the code logic *and* what it is adding. I wouldn't want to have some kind of J2EE directory structure with god-knows how many XML files all over the place. [snip]

...

But I think that using subclasses is the key to making this sort of code easier on the eye:

struct firstname: public element { firstname(n): element("firstname",n) {}; };

etc.

articleinfo ai; if (title) ai.push_back(comment("This title was moved")); author a; a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); root.push_back(ai);

Wow! I can't see how this comes anywhere close to being as readable as the interface David proposed. There are a lot of names which mean nothing to anybody reading this code. author a; ? then a.push_back() a.push_back() ai.push_back() Doesn't it strike you that this code is very error-prone?

...

Just adding some non-standard indentation to that makes it almost as clear as I need. You don't need the {}, but it keeps your editor's auto-indent happy:

articleinfo ai; { if (title) ai.push_back(comment("This title was moved")); author a; { a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); } root.push_back(ai); }

Instantiating every node you have to add is just too much unnecessary burden. I can't see how the user gains with this, instead of the declarative approach. [snip]

...

Of course there is a question of the lifespan of these apparently-temporary objects to resolve. Anyway, I think that nice-looking XML-generating code can be possible without "resorting to" operator overloading.

What is the advantage of not resorting to operator overloading? What exactly does anybody gain by not using it?

...

Regards,

Phil.

Best regards, -- Felipe Magno de Almeida

Phil Endecott

10:36 p.m.

Felipe Magno de Almeida wrote:

...

On 7/11/07, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:

...
I prefer to read any constant fragments from files or from literal strings, which is about as declarative as you can get.

But you lose locality

Not with literal strings, e.g. element("<author>" "<firstname>Joe</firstname>" "<surname>Random</surname>" "</author>");

...

and sometimes performance.

Ideally these would be static, i.e. parsed once per program invocation. But yes, this is not ideal for performance.

...

...
But I think that using subclasses is the key to making this sort of code easier on the eye:

struct firstname: public element { firstname(n): element("firstname",n) {}; };

etc.

articleinfo ai; if (title) ai.push_back(comment("This title was moved")); author a; a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); root.push_back(ai);

Wow! I can't see how this comes anywhere close to being as readable as the interface David proposed. There are a lot of names which mean nothing to anybody reading this code.

author a; ?

Well if you're generating an XML fragment with an <author> tag, it should mean something to you. If you prefer, try author_element or element("author").

...

then

a.push_back() a.push_back() ai.push_back()

I hope that push_back() is familiar to all C++ users.

...

Doesn't it strike you that this code is very error-prone?

The main types of error are creating an element and then forgetting to add it to its parent, and stuff like that. Yes, this happens. But it's fairly easy to debug.

...

...
Just adding some non-standard indentation to that makes it almost as clear as I need. You don't need the {}, but it keeps your editor's auto-indent happy:

articleinfo ai; { if (title) ai.push_back(comment("This title was moved")); author a; { a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); } root.push_back(ai); }

Instantiating every node you have to add is just too much unnecessary burden. I can't see how the user gains with this, instead of the declarative approach.

[snip]

...
Of course there is a question of the lifespan of these apparently-temporary objects to resolve. Anyway, I think that nice-looking XML-generating code can be possible without "resorting to" operator overloading.

What is the advantage of not resorting to operator overloading? What exactly does anybody gain by not using it?

The advantage of not using operator overloading is that it is easier to see what is going on; the code is less magical. We all know what "OBJECT.METHOD(PARAMETER)" does, but what does X [ Y , Z ] do? It looks like 2D array indexing to me. The overloading of operator[] and operator, that David presented would be justified if it made a substantial saving in LOC and increase in clarity compared to what you could achieve using conventional syntax. But my feeling (and I'm happy to see counter-examples) is that the syntax that I presented gets close to the LOC and clarity (see below), and that in some cases you can use inline string literals for maximum clarity.

    root.push_front(                     articleinfo ai(root); {
      tag("articleinfo")[                  author a(ai); {
        tag("author")[                       firstname(a,"Joe");
          tag("firstname")["Joe"],           surname(a,"Random");
          tag("surname")["Random"]         }
        ]                                }
      ]
    )

(The syntax on the right is pretty much what I use to build XML in Javascript, and I'm fairly happy with it.) (Oh no, I confessed to writing Javascript on the Boost list, now my secret is out...)

Phil.

Felipe Magno de Almeida

12 Jul 12 Jul

midnight

On 7/11/07, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:

...

Felipe Magno de Almeida wrote:

[snipped]

...

...
But you lose locality

Not with literal strings, e.g.

element("<author>" "<firstname>Joe</firstname>" "<surname>Random</surname>" "</author>");

But no syntax checking whatsoever.

...

...
and sometimes performance.

[snip]

...

Well if you're generating an XML fragment with an <author> tag, it should mean something to you. If you prefer, try author_element or element("author").

...
then

a.push_back() a.push_back() ai.push_back()

I hope that push_back() is familiar to all C++ users.

Have you noticed that the first two are a, and the third is ai? It is from your own code.

...

...
Doesn't it strike you that this code is very error-prone?

The main types of error are creating an element and then forgetting to add it to its parent, and stuff like that. Yes, this happens. But it's fairly easy to debug.

I believe that it is not only a very frequent error, but there are a lot of other errors that compile and then fail at run time. Debugging XML generation is a PITA when the XML logic is quite complex. Some may never be tested. [snip]

...

...
What is the advantage of not resorting to operator overloading? What exactly does anybody gain by not using it?

The advantage of not using operator overloading is that it is easier to see what is going on; the code is less magical. We all know what "OBJECT.METHOD(PARAMETER)" does, but what does X [ Y , Z ] do? It looks like 2D array indexing to me. The overloading of operator[] and operator, that David presented would be justified if it made a substantial saving in LOC and increase in clarity compared to what you could achieve using conventional syntax. But my feeling (and I'm happy to see counter-examples) is that the syntax that I presented gets close to the LOC and clarity (see below), and that in some cases you can use inline string literals for maximum clarity.

With fear of sounding too hasty, IMO, boost is way beyond this abstraction prejudice. [snip]

...

Phil.

-- Felipe Magno de Almeida

Oliver.Kowalke＠qimonda.com

4:54 a.m.

...

...
Just adding some non-standard indentation to that makes it almost as clear as I need. You don't need the {}, but it keeps your editor's auto-indent happy:

articleinfo ai; { if (title) ai.push_back(comment("This title was moved")); author a; { a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); } root.push_back(ai); }

Instantiating every node you have to add is just too much unnecessary burden. I can't see how the user gains with this, instead of the declarative approach.

What about type safety? Isn't it better to insert firstname objects into articleinfo objects than to insert tag objects into tag objects? Oliver

Felipe Magno de Almeida

4:58 a.m.

On 7/12/07, Oliver.Kowalke@qimonda.com <Oliver.Kowalke@qimonda.com> wrote:

...

[snip]

...

...
Instantiating every node you have to add is just too much unnecessary burden. I can't see how the user gains with this, instead of the declarative approach.

What about type safety? Isn't it better to insert firstname objects into articleinfo objects than to insert tag objects into tag objects?

How are you going to enforce only firstnames to articleinfo? Or did I miss something?

...

Oliver

Best regards, -- Felipe Magno de Almeida

Oliver.Kowalke＠qimonda.com

7:21 a.m.

...

...
...
Instantiating every node you have to add is just too much unnecessary burden. I can't see how the user gains with this, instead of the declarative approach.

What about type safety? Isn't it better to insert firstname objects into articleinfo objects than to insert tag objects into tag objects?

How are you going to enforce only firstnames to articleinfo? Or did I miss something?

You could do it over templates like

    typedef tag<
        mpl::vector< title, author >,  // tags
        mpl::vector< isbn >            // attributes
    >
    article_info;

Kind regards, Oliver
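A minimal sketch of the compile-time idea above, with several assumptions stated up front: it uses a C++17 parameter pack and fold expression in place of `mpl::vector`, it checks only child elements (the attribute list is left out), and every name in it is hypothetical. Inserting a child type that is not in the allowed list fails a `static_assert`.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical sketch; `Allowed...` lists the child types an element accepts.
template <class... Allowed>
class tag {
public:
    explicit tag(std::string name) : name_(std::move(name)) {
        assert(!name_.empty());
    }

    template <class Child>
    void push_back(Child const& child) {
        // Rejected at compile time unless Child is one of Allowed...
        static_assert((std::is_same_v<Child, Allowed> || ...),
                      "this child element is not allowed here");
        children_.push_back(child.str());
    }

    void text(std::string const& t) { text_ = t; }

    std::string str() const {
        std::string out = "<" + name_ + ">" + text_;
        for (std::string const& c : children_) out += c;
        return out + "</" + name_ + ">";
    }

private:
    std::string name_, text_;
    std::vector<std::string> children_;  // children stored already rendered
};

struct title : tag<> { title() : tag<>("title") {} };
struct firstname : tag<> { firstname() : tag<>("firstname") {} };
struct surname : tag<> { surname() : tag<>("surname") {} };
struct author : tag<firstname, surname> {
    author() : tag<firstname, surname>("author") {}
};
struct article_info : tag<title, author> {
    article_info() : tag<title, author>("article_info") {}
};

std::string render_example() {
    firstname f; f.text("Joe");
    surname s;   s.text("Random");
    author a;    a.push_back(f); a.push_back(s);
    title t;     t.text("bar");

    article_info ai;
    ai.push_back(t);
    ai.push_back(a);
    // ai.push_back(f);  // would not compile: firstname is not allowed here
    return ai.str();
}
```

This checks which children may appear, not their order or multiplicity, and it says nothing about documents read from a file; those would still need run-time validation.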

David Abrahams

3:13 a.m.

on Wed Jul 11 2007, "Phil Endecott" <spam_from_boost_dev-AT-chezphil.org> wrote:

...

I think that using subclasses is the key to making this sort of code easier on the eye:

struct firstname: public element { firstname(n): element("firstname",n) {}; };

etc.

You're already headed off in a harder-to-read direction from my point of view. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Jake Voytko

3:58 a.m.

On 7/11/07, David Abrahams <dave@boost-consulting.com> wrote:

...

on Wed Jul 11 2007, "Phil Endecott" <spam_from_boost_dev-AT-chezphil.org> wrote:

...
I think that using subclasses is the key to making this sort of code easier on the eye:

struct firstname: public element { firstname(n): element("firstname",n) {}; };

etc.

You're already headed off in a harder-to-read direction from my point of view.

As well as a harder-to-implement direction. Having to make a struct for every tag you implement seems a little user-unfriendly. For a constrained format (like SVG) I could agree, but I think for Boost.XML library, there should be easy naming.

Splitting the difference (or so I hope) between easy trees/expressiveness and standard syntax, how about something like this?

    root.push_front (
        tag("Article Info",
            (title ? (comment("This title was moved"), title) : NULL)
            (tag("author",
                (tag("firstname", "Joe"))
                (tag("surname", "Random"))
            ))
        )
    );

Oliver.Kowalke＠qimonda.com

4:52 a.m.

New subject: RFC: Boost.XML API prototype in the sandbox -> generation form xsd

...

As well as a harder-to-implement direction. Having to make a struct for every tag you implement seems a little user-unfriendly. For a constrained format (like SVG) I could agree, but I think for Boost.XML library, there should be easy naming.

Splitting the difference (or so I hope) between easy trees/expressiveness and standard syntax, how about something like this?

root.push_front ( tag("Article Info", (title ? (comment("This title was moved"), title) : NULL) (tag("author", (tag("firstname", "Joe")) (tag("surname", "Random")) )) ) );

I find it really hard to read and I'm missing the type safety. Nothing prevents you from putting tag("firstname", "Joe") in place of the article info tag! You are forced to validate it at runtime. Phil's suggestion is more typesafe:

    articleinfo ai; {
        if (title) ai.push_back(comment("This title was moved"));
        author a; {
            a.push_back(firstname("Joe"));
            a.push_back(surname("Random"));
            ai.push_back(a);
        }
        root.push_back(ai);
    }

I would like to see translating xsd (xml schemas) to c++ classes (but this could be a further improvement). A good idea would be providing an iterator which can be used to traverse the xml-tree.

Oliver

John Moeller

7:38 a.m.

New subject: RFC: Boost.XML API prototype in the sandbox -> generation form xsd

Oliver.Kowalke@qimonda.com wrote:

...

...
root.push_front ( tag("Article Info", (title ? (comment("This title was moved"), title) : NULL) (tag("author", (tag("firstname", "Joe")) (tag("surname", "Random")) )) ) );

I find it really hard to read and I'm missing the type safety. Nothing prevents you from putting tag("firstname", "Joe") in place of the article info tag!

I agree that it's harder to read; I like Dave's syntax better. Somehow, the square brackets do it for me.

...

You are forced to validate it at runtime.

Nothing's really new here. You can't validate XML at compile time. Unless you mean something different.

...

Phil's suggestion is more typesafe:

No, it's just more verbose, and there's no way to eliminate all the intermediate variables. Type is enforced only by the structure of the data and the way that it's interpreted by the parser/writer. You may as well pick an interface that just lets you read/write hierarchical data easily, or support an extension that already deals with interpretation, like schemas.

...

articleinfo ai; { if (title) ai.push_back(comment("This title was moved")); author a; { a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); } root.push_back(ai); }

I don't really see how it's more type-safe. The structures "articleinfo", "firstname" and "surname" would all inherit from "element", which is presumably a type accepted by "push_back". So nothing prevents you from doing: root.push_back(firstname("Joe")); which is exactly what you had hoped to avoid. If you really want type-safety at this level, you'd have to define a push_back for all the sub-element types accepted by an element (and push_front, insert, etc.). No thanks.

...

I would like to see translating xsd (xml schemas) to c++ classes (but this could be a further improvement).

I'd like to see schema support as well. One other thing that I'd like to see is an XPath library that lets me traverse *any* hierarchical structure, not just XML trees. That would have been remarkably handy in a project I worked on a couple years ago. -- John Moeller fishcorn@gmail.com

Oliver.Kowalke＠qimonda.com

8:06 a.m.

New subject: RFC: Boost.XML API prototype in the sandbox -> generation form xsd

...

...
You are forced to validate it at runtime.

Nothing's really new here. You can't validate XML at compile time. Unless you mean something different.

please see below

...

...
Phil's suggestion is more typesafe:

No, it's just more verbose, and there's no way to eliminate all the intermediate variables.

Type is enforced only by the structure of the data and the way that it's interpreted by the parser/writer. You may as well pick an interface that just lets you read/write hierarchical data easily, or support an extension that already deals with interpretation, like schemas.

...
articleinfo ai; { if (title) ai.push_back(comment("This title was moved")); author a; { a.push_back(firstname("Joe")); a.push_back(surname("Random")); ai.push_back(a); } root.push_back(ai); }

I don't really see how it's more type-safe. The structures "articleinfo", "firstname" and "surname" would all inherit from "element", which is presumably a type accepted by "push_back". So nothing prevents you from doing:

root.push_back(firstname("Joe"));

which is exactly what you had hoped to avoid.

If you really want type-safety at this level, you'd have to define a push_back for all the sub-element types accepted by an element (and push_front, insert, etc.). No thanks.

I would do it over templates - then it's type-safe and a part of the xml validation can be done at compile time.

    typedef tag<
        mpl::vector< title, author >,  // tags
        mpl::vector< isbn >            // attributes
    >
    article_info;

etc. Regards, Oliver

Oliver.Kowalke＠qimonda.com

8:17 a.m.

New subject: RFC: Boost.XML API - typesafety

Hello, the suggestions about the design of the Boost.XML API don't utilize the strong type safety of C++. Why not create a class tree representing the xml document instead of generating it out of generic element classes or strings? Benefit: a part of the xml validation can be done at compile time - the xml should be more well-defined (no missed tags etc.) and as a further improvement the class tree could be generated out of an xsd (for instance with boost.spirit). regards, Oliver

Jake Voytko

3:22 p.m.

New subject: RFC: Boost.XML API prototype in the sandbox -> generation form xsd

On 7/12/07, Oliver.Kowalke@qimonda.com <Oliver.Kowalke@qimonda.com> wrote:

...

I would do it over templates - then it's type-safe and a part of the xml validation can be done at compile time.

    typedef tag<
        mpl::vector< title, author >,  // tags
        mpl::vector< isbn >            // attributes
    >
    article_info;

I hadn't considered compile-time type safety when I wrote my above example. However, I feel an example like this might also preclude XML features like comments. If we were to parse an existing article: <article_info isbn="foo">  <title>bar</title> <author>Testy McTest</author> </article_info> your definition doesn't immediately make it clear that the comment, when parsed, ends up in the document tree. I may, however, just be missing something. Complicating matters, according to the standard, they can also appear within the document type declaration, so comments significantly complicate the matter: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd And the standard also lists retrieval of comment text as optional: http://www.w3.org/TR/REC-xml/#sec-comments But I think that a Boost.XML library should support the retrieval of the text of comments. Jake

Oliver.Kowalke＠qimonda.com

13 Jul 13 Jul

4:47 a.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

...

...
I would do it over templates - then it's type-safe and a part of the xml validation can be done at compile time.

    typedef tag<
        mpl::vector< title, author >,  // tags
        mpl::vector< isbn >            // attributes
    >
    article_info;

I hadn't considered compile-time type safety when I wrote my above example. However, I feel an example like this might also preclude XML features like comments. If we were to parse an existing article:

<article_info isbn="foo">  <title>bar</title> <author>Testy McTest</author> </article_info>

your definition doesn't immediately make it clear that the comment, when parsed, ends up in the document tree. I may, however, just be missing something. Complicating matters, according to the standard, they can also appear within the document type declaration, so comments significantly complicate the matter: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd

And the standard also lists retrieval of comment text as optional: http://www.w3.org/TR/REC-xml/#sec-comments

But I think that a Boost.XML library should support the retrieval of the text of comments.

I believe that comments could be supported with a typesafe implementation. I don't know where you see problems with comments - article_info could contain optional instances of comment-class. Oliver

Jake Voytko

1:31 p.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

My concern, as I indicated in the other thread, is that

1) <?xml version="1.0"?> <article_info isbn="foo"> <title>bar</title> <author>Testy McTest</author> </article_info>

2) <?xml version="1.0"?> <article_info isbn="foo">  <title>bar</title> <author>Testy McTest</author> </article_info>

and 3) <?xml  version="1.0"?> <article_info isbn="foo">  <author>Testy McTest</author>  <title>bar</title>  </article_info>

are all valid XML documents, and represent the same data. If this library were solely to be used for writing, I would say that this is not an issue. However, what is written must be read.

The thing I don't understand about the type-safety proposals is how they intend to take into account data that is read from a file. Suppose we write a program to neatly display XML data from any generic XML file:

    boost::xml my_doc_reader(filename);

would certainly be a step, and I would expect it to contain a full XML tree after this step is said and done with no extra intervention. The XML library certainly can't instantiate the proper structs in memory if we don't know what they are. Do they exist solely for writing the document?

To me, type safety is a more specific problem than what the XML standard deals with. Yet, it is a common enough problem to warrant a library solution. Perhaps:

    class typesafe_xml: public xml { };

that requires you to define the document structure in class format, and can throw runtime exceptions when a document is read in with the wrong format?

Jake

On 7/13/07, Oliver.Kowalke@qimonda.com <Oliver.Kowalke@qimonda.com> wrote:

...

...
...
I would do it over templates - then it's type-safe and a part of the xml validation can be done at compile time.

    typedef tag<
        mpl::vector< title, author >,  // tags
        mpl::vector< isbn >            // attributes
    >
    article_info;

I hadn't considered compile-time type safety when I wrote my above example. However, I feel an example like this might also preclude XML features like comments. If we were to parse an existing article:

<article_info isbn="foo">  <title>bar</title> <author>Testy McTest</author> </article_info>

your definition doesn't immediately make it clear that the comment, when parsed, ends up in the document tree. I may, however, just be missing something. Complicating matters, according to the standard, they can also appear within the document type declaration, so comments significantly complicate the matter: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd

And the standard also lists retrieval of comment text as optional: http://www.w3.org/TR/REC-xml/#sec-comments

But I think that a Boost.XML library should support the retrieval of the text of comments.

I believe that comments could be supported with a typesafe implementation. I don't know where you see problems with comments - article_info could contain optional instances of comment-class.

Oliver _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
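A minimal sketch of the compile-time idea discussed above, written with plain variadic templates instead of Boost.MPL. All names here (`element`, `add_child`, `contains`, and the tag types) are hypothetical and are not part of the sandbox code; attributes such as isbn are left out. A comment is accepted in any element, which is one way the "optional instances of comment-class" could be modelled.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical tag types; each names one kind of XML node.
struct title   { static const char* name() { return "title"; } };
struct author  { static const char* name() { return "author"; } };
struct comment { static const char* name() { return "#comment"; } };

// contains<T, Ts...>::value is true if T is one of Ts.
template <typename T, typename... Ts>
struct contains : std::false_type {};

template <typename T, typename Head, typename... Tail>
struct contains<T, Head, Tail...>
    : std::integral_constant<bool,
          std::is_same<T, Head>::value || contains<T, Tail...>::value> {};

// An element type parameterised on the child tags it accepts.
template <typename... AllowedChildren>
class element {
public:
    // Fails to compile unless Child is listed, or is a comment.
    template <typename Child>
    void add_child(const std::string& text) {
        static_assert(std::is_same<Child, comment>::value ||
                      contains<Child, AllowedChildren...>::value,
                      "child tag not permitted in this element");
        children_.push_back(std::string(Child::name()) + ":" + text);
    }
    std::size_t size() const { return children_.size(); }
    const std::string& child(std::size_t i) const { return children_[i]; }

private:
    std::vector<std::string> children_;  // flattened "name:text" records
};

using article_info = element<title, author>;
```

With this, `article_info().add_child<title>("bar")` compiles, while adding a tag type that is not listed is rejected by the compiler. It says nothing about documents read at runtime, which is the concern raised in the thread.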

Sebastian Redl

1:38 p.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

Jake Voytko wrote:

...

My concern, as I indicated in the other thread, is that 1) <?xml version="1.0"?> <article_info isbn="foo"> <title>bar</title> <author>Testy McTest</author> </article_info>

2) <?xml version="1.0"?> <article_info isbn="foo">  <title>bar</title> <author>Testy McTest</author> </article_info>

and 3) <?xml  version="1.0"?> <article_info isbn="foo">  <author>Testy McTest</author>  <title>bar</title>  </article_info>

are all valid XML documents, and represent the same data.

The third isn't valid - the xml declaration cannot contain comments - and whether they represent the same data is a matter of interpretation. If your document type says that article_info can only contain elements, then yes. But you have no document type declaration, so the processor doesn't actually know it, and must preserve the character data. The character data here is just whitespace, but it differs between the three documents.

Sebastian Redl

Jake Voytko

3:04 p.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

...

The third isn't valid - the xml declaration cannot contain comments - and whether they represent the same data is a matter of interpretation. If your document type says that article_info can only contain elements, then yes. But you have no document type declaration, so the processor doesn't actually know it, and must preserve the character data. The character data here is just whitespace, but it differs between the three documents.

Sebastian Redl

http://www.w3.org/TR/REC-xml/#sec-prolog-dtd

Am I misinterpreting this section? It looks like it allows comments in the header.

Sebastian Redl

3:09 p.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

Jake Voytko wrote:

...

http://www.w3.org/TR/REC-xml/#sec-prolog-dtd Am I misinterpreting this section? It looks like it allows comments in the header

You must be misinterpreting it. There is no such thing as a header. The prolog consists of the xml declaration and the document type declaration, both optional.

<?xml ... ?>
<!DOCTYPE ...>

After the xml decl and after the dtdecl, the grammar allows misc, which can include comments. But it doesn't allow comments inside the xml decl, nor in the dtdecl, except in the internal subset.

Sebastian Redl

Stefan Seefeld

3:15 p.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

Jake Voytko wrote:

...

My concern, as I indicated in the other thread, is that

[...]

...

are all valid XML documents, and represent the same data. If this library were solely to be used for writing, I would say that this is not an issue. However, what is written must be read.

I'm not sure what you are driving at. The parsing of XML files is well specified, and that's what libxml2 implements.

...

The thing I don't understand about the type-safety proposals is how they intend to take into account data that is read from a file. Suppose we write a program to neatly display XML data from any generic XML file:

boost::xml my_doc_reader(filename);

would certainly be a step, and I would expect it to contain a full XML tree after this step is said and done with no extra intervention. The XML library certainly can't instantiate the proper structs in memory if we don't know what they are. Do they exist solely for writing the document?

To me, type safety is a more specific problem than what the XML standard deals with.

Indeed, and so, for the purpose of keeping this XML proposal on track, I'd suggest disregarding any type safety issues that don't directly derive from XML well-formedness. XML validation (no matter what schema is used to define the document type, dtd, relaxng, xsd, etc.) is an add-on, and a runtime mechanism. If people want to map some specific (X)Schema to the C++ type system, let them do that, on top of boost.xml.

Regards, Stefan

-- ...ich hab' noch einen Koffer in Berlin...
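A minimal sketch of validation as a runtime add-on layered on a generic tree, as suggested above. The `node` struct stands in for whatever a parser would return, and `require_children` is a hypothetical helper; neither is taken from the sandbox API, and `std::vector<node>` inside `node` assumes a C++17 standard library.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical generic tree node: the parser knows nothing about schemas.
struct node {
    std::string name;
    std::vector<node> children;
};

// Hypothetical add-on check: throws if 'n' has a child whose name is
// not in 'allowed'. The tree type itself stays schema-agnostic.
inline void require_children(const node& n,
                             const std::vector<std::string>& allowed) {
    for (const node& c : n.children) {
        bool ok = false;
        for (const std::string& a : allowed) {
            if (c.name == a) ok = true;
        }
        if (!ok) {
            throw std::runtime_error("unexpected <" + c.name +
                                     "> in <" + n.name + ">");
        }
    }
}
```

A document-specific class could call such checks after parsing, which keeps the structural rules out of the core library.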

Oliver.Kowalke＠qimonda.com

16 Jul 16 Jul

6:03 a.m.

New subject: RFC: Boost.XML API prototype in the sandbox ->generation form xsd

...

...
To me, type safety is a more specific problem than what the XML standard deals with.

Indeed, and so, for the purpose of keeping this XML proposal on track, I'd suggest to disregard any type safety issues that don't directly derive from XML well-formedness.

I don't understand why one of C++'s main features - type safety - should not be an option for boost.xml.

...

XML validation (no matter what schema is used to define the document type, dtd, relaxng, xsd, etc.) is an add-on, and a runtime mechanism.

Why shouldn't parts of the validation be done by the compiler? Doing correctness checks at compile time is what the books by Sutter, Meyers, Alexandrescu etc. recommend.

Oliver

Oliver.Kowalke＠qimonda.com

6:05 a.m.

New subject: RFC: Boost.XML API prototype in the sandbox->generation form xsd

An optional class 'comment' which can occur in arbitrary positions - my thoughts.

Oliver


David Abrahams

12 Jul 12 Jul

6:04 p.m.

on Wed Jul 11 2007, "Jake Voytko" <jakevoytko-AT-gmail.com> wrote:

...

Splitting the difference (or so I hope) between easy trees/expressiveness and standard syntax, how about something like this?

root.push_front ( tag("Article Info", (title ? (comment("This title was moved"), title) : NULL) (tag("author", (tag("firstname", "Joe")) (tag("surname", "Random")) )) ) );

How does that improve things? What do you mean by "standard syntax?" -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

Jake Voytko

7:05 p.m.

Perhaps "Standard" was a bad term, but rather, "Boost accepted". While perusing the Boost.Bimap docs, I found that the following syntax is allowed for defining a Bimap:

typedef bimap<int,std::string> bm;
bm b = list_of< bm::relation > (1,"one") (2,"two") (3,"three");

http://cablemodem.fibertel.com.ar/mcape/boost/libs/bimap/boost_bimap/bimap_a...

And I found that this felt like a natural way to define pairs. Looking over my syntax again, it doesn't appear to have any specific advantages over your proposal (except alternative syntax, which I won't list as an advantage).

It does strike at my major concern, which is the ability to generically parse XML documents. I feel the proposals that set up compile-time invariants for building XML documents would require a lot of thought in this area. How is the document tree best represented during a read? Does the compile-time check exist solely to build the document, and have a different internal structure? Or can it be used to validate an article that is read (to use the above example)?

Consider my example an expression of support for either a declarative syntax, or a C++ map/push_back() syntax as Phil Endecott suggested. My opinion is that if the writer has specific needs as far as document structure is concerned, they can make a class that uses its own compile time invariants for building a document (using Boost.XML to literally write the document), and can use runtime checks when parsing a document. To me, requiring that "authors" exist under "articleinfo"s, for instance, is a member of a problem domain more restrictive than XML worries about, so Boost.XML should not worry about it.

Jake

On 7/12/07, David Abrahams <dave@boost-consulting.com> wrote:

...

on Wed Jul 11 2007, "Jake Voytko" <jakevoytko-AT-gmail.com> wrote:

...
Splitting the difference (or so I hope) between easy trees/expressiveness and standard syntax, how about something like this?

root.push_front ( tag("Article Info", (title ? (comment("This title was moved"), title) : NULL) (tag("author", (tag("firstname", "Joe")) (tag("surname", "Random")) )) ) );

How does that improve things? What do you mean by "standard syntax?"

-- Dave Abrahams Boost Consulting http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com


Mathias Gaunard

13 Jul 13 Jul

6:16 p.m.

Jake Voytko wrote:

...

Perhaps "Standard" was a bad term, but rather, "Boost accepted". While perusing the Boost.Bimap docs, I found that the following syntax is allowed for defining a Bimap:

typedef bimap<int,std::string> bm; bm b = list_of< bm::relation > (1,"one") (2,"two") (3,"three");

http://cablemodem.fibertel.com.ar/mcape/boost/libs/bimap/boost_bimap/bimap_a...

And I found that this felt like a natural way to define pairs.

It's the natural lisp way.

David Abrahams

8:59 p.m.

on Fri Jul 13 2007, Mathias Gaunard <mathias.gaunard-AT-etu.u-bordeaux1.fr> wrote:

...

...
And I found that this felt like a natural way to define pairs.

It's the natural lisp way.

Well, one of Lisp's limitations is its syntactic uniformity. You get no visual clues from looking at the symbols that make up a program about what that program means. Curiously, Lispers often tout this uniformity as an expressivity *advantage*, one that I just don't see. For maximum expressivity, I want lots of syntactic flexibility. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

John Moeller

12 Jul 12 Jul

7:38 a.m.

David Abrahams wrote:

...

Okay, this is going to sound very opinionated:

I find XML horrible to read, however I find most of the procedural code I've seen for manipulating it even more horrible.

Agreed. Much of the time, I find myself defining wrapper RAII classes to handle the tags.

...

[snip]

root.push_front( tag("articleinfo")[ title ? (comment("This title was moved"), title) : NULL , tag("author")[ tag("firstname")["Joe"], tag("surname")["Random"] ] ] )

I like this syntax; it was a little awkward for me at first, but it's grown on me. Would this allow you to also save a "template" for later modification? Something like:

tag articleinfo = tag("articleinfo") [
    tag("author") [
        tag("firstname") [""],
        tag("surname") [""]
    ]
];
...
articleinfo["author"]["firstname"] = "Joe";
articleinfo["author"]["surname"] = "Random";

I would find this use case to be handy.

-- John Moeller fishcorn@gmail.com
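A rough sketch of how the "template for later modification" use case could behave. It does not reproduce the bracket-and-comma construction syntax from the thread; children are passed as a braced list instead, and the `tag` type, its members, and its lookup-or-create `operator[]` are all assumptions made for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type: a name, text content, and child elements.
struct tag {
    std::string name;
    std::string text;
    std::vector<tag> children;

    explicit tag(std::string n, std::vector<tag> kids = {})
        : name(std::move(n)), children(std::move(kids)) {}

    // Look up a child element by name, creating it if it is missing.
    tag& operator[](const std::string& child_name) {
        for (tag& c : children) {
            if (c.name == child_name) return c;
        }
        children.push_back(tag(child_name));
        return children.back();
    }

    // Assigning a string sets the text content of this element.
    tag& operator=(const std::string& value) {
        text = value;
        return *this;
    }
};
```

Because `tag` has value semantics here, copying the template and filling in the copy leaves the original untouched, for example `tag filled = articleinfo; filled["author"]["firstname"] = "Joe";`.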

gchen

9:40 a.m.

David Abrahams wrote:

...

I find XML horrible to read,

Yes. And that is a good reason why we need a library to read it.

...

however I find most of the procedural code I've seen for manipulating it even more horrible.

Agree.

...

I would like to see a more declarative syntax for much of this stuff.

[...]

...

could be something like:

root.push_front( tag("articleinfo")[ title ? (comment("This title was moved"), title) : NULL , tag("author")[ tag("firstname")["Joe"], tag("surname")["Random"] ] ] )

This looks very interesting. I am just wondering if this declarative syntax could also apply to an XML reader. From my experience, creating XML doesn't seem to be a big problem, but writing a reader for XML created by other software (RSS feeds, XHTML, logs in XML format) is a time-consuming task.

Peder Holt

25 Jul 25 Jul

8:03 p.m.

<snip>

...

I would like to see a more declarative syntax for much of this stuff.

element_ptr info = root->insert_element(root->begin_children(), "articleinfo"); if (title) { info->insert(info->begin_children(), title); info->insert_comment(info->begin_children(), "This title was moved"); } element_ptr author = info->append_element("author"); element_ptr firstname = author->append_element("firstname"); firstname->set_content("Joe"); element_ptr surname = author->append_element("surname"); surname->set_content("Random");

could be something like:

root.push_front( tag("articleinfo")[ title ? (comment("This title was moved"), title) : NULL , tag("author")[ tag("firstname")["Joe"], tag("surname")["Random"] ] ] )

I have implemented something similar for my company. We already have an xml parser that we use, but until now we have had no formalized schema (except documents describing them). Instead of creating a tool that generates C++ code based on an xml schema, we write the xml schema directly in C++. The import/export code is independent of this schema. The xml schema is implemented using expression templates. Here is how it looks:

//Class definition
class xml_position : public attributes
{
public:
    xml_position();
    position import() const;
    bool export(const position& pos);
private:
    double m_x,m_y,m_z;
};

//Implementation
xml_position::xml_position()
{
    required(attribute<double>("x",&xml_position::m_x));
    required(attribute<double>("y",&xml_position::m_y));
    required(attribute<double>("z",&xml_position::m_z));
}
//import/export code is trivial.

Now, another class, indexedPoints, has the following member:

typedef std::map<int,position> position_map;
position_map m_positions;

This can be exposed as follows:

required(
    sequence(
        "positions",
        element<xml_position>("position",&position_map::second)
            .required(attribute<int>("index",&position_map::first)),
        &indexedPoints::m_positions
    )
);

The nice thing about this is that we get schema validation for free, and we are able to auto-generate a valid W3C XML schema.

Regards, Peder

...

Cheers,

-- Dave Abrahams Boost Consulting http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com


Mathias Gaunard

11 Jul 11 Jul

6:14 p.m.

Stefan Seefeld wrote:

...

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

I have to admit I am a bit confused by the interface. For example, the first thing I see is that all nodes are actually pointers. Of course, they don't really appear as pointers, but they are. If I copy a node object, both copies still reference the same node. While it is natural for the node_ptr type, it is not really for the node one. (to be honest, I don't really understand their relationship after a few minutes looking at the code) Plus, we're stuck with a pseudo-upcasting thingy (which cannot return NULL/valid pointer and necessarily throws) instead of a nicer visitation mechanism. There doesn't seem to be an iteration/cursor mechanism that can iterate through the whole tree either, like boost.tree has.

Stefan Seefeld

7:02 p.m.

Mathias Gaunard wrote:

...

Stefan Seefeld wrote:

...
I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

I have to admit I am a bit confused by the interface.

For example, the first thing I see is that all nodes are actually pointers. Of course, they don't really appear as pointers, but they are.

A word on terminology: A node is a node, not a node pointer. However, the thing that a user accesses is a node pointer. The reason for this is simple: resource management is handled by the library, not the user.

...

If I copy a node object, both copies still reference the same node.

What do you mean by 'copy'? Are you sure you copy the node, not the node pointer? (If they reference the same node, it's the pointer you copy, so everything is consistent, once you adjust your terminology :-) )

...

While it is natural for the node_ptr type, it is not really for the node one. (to be honest, I don't really understand their relationship after a few minutes looking at the code)

Plus, we're stuck with a pseudo-upcasting thingy (which cannot return NULL/valid pointer and necessarily throws) instead of a nicer visitation mechanism.

There doesn't seem to be an iteration/cursor mechanism that can iterate through the whole tree either, like boost.tree has.

The example code contains a traversal class to do that. Something like that should eventually make it into the API proper. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...
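The node versus node-pointer distinction made earlier in this message can be illustrated with an analogy. The sketch below uses `std::shared_ptr` as a stand-in handle; this is an assumption for illustration only, and the sandbox's own `node_ptr` is not claimed to be implemented this way. `copy_handle` and `copy_node` are hypothetical helper names.

```cpp
#include <memory>
#include <string>

// Hypothetical node type whose lifetime is managed through handles.
struct node {
    std::string content;
};

// Hypothetical handle: what user code holds instead of a node.
using node_ptr = std::shared_ptr<node>;

// Copying the handle: both handles refer to one and the same node.
inline node_ptr copy_handle(const node_ptr& p) {
    return p;
}

// Copying the node: a new, independent node is created.
inline node_ptr copy_node(const node_ptr& p) {
    return std::make_shared<node>(*p);
}
```

A change made through a copied handle is visible through the original handle, while a change made to a copied node is not.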

Phil Endecott

12 Jul 12 Jul

10:58 a.m.

Mathias Gaunard wrote:

...

Stefan Seefeld wrote:

...
I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

I have to admit I am a bit confused by the interface.

For example, the first thing I see is that all nodes are actually pointers.

What's the best solution to this, and the memory management issues in general? It's difficult, because a node could be a simple text node with just one character in it, or it could be the root of a multi-megabyte document. Clearly in the first case it's OK to copy the data, while in the second case you would want to avoid deep copying in some way, e.g. with smart pointers and/or some sort of copy-on-write system. (Thinking aloud) I can imagine a tree where the top levels have different types from the bottom ones.

I have just dug out an old email that I wrote to the xmlwrapp list in 2004. I had observed that xmlwrapp, which is a C++ wrapper around libxml2 with much in common with Stefan's proposal, had an overhead of about 150 bytes per node. I had been getting binary data from a database (with negligible space overhead compared to a C++ struct), converting it to in-memory XML and then converting that to HTML or SVG using XSLT. The in-memory XML was terribly bloated to the extent that it was the bottleneck in the system, for a couple of reasons. Firstly, as noted above, the data structures had a significant overhead (150 bytes for the XML node compares to a 30 byte per element overhead in a std::list<std::string>). And secondly there's the overhead for self-description i.e. all the strings making up the tag and attribute names. This second issue could be helped if a symbol-table approach were used (e.g. the Boost.Spirit symbols class; each n-byte string would be reduced to hopefully one pointer).

The view I'm coming to is that I'd prefer an XML class with enough template parameters to make these sorts of issues resolvable by the user, depending on the requirements of their application:

struct node {};

template <typename STR_T = std::string>
struct textnode: public node, public STR_T { .... };

template <typename STR_T = std::string>
struct comment: public node, public STR_T { .... };

template <typename TAG_NAME_T = std::string,
          typename ATTR_NAME_T = std::string,
          typename ATTR_VAL_T = std::string,
          typename CHILD_T = boost::shared_ptr<node> >
struct element: public node {
    TAG_NAME_T name;
    std::map<ATTR_NAME_T,ATTR_VAL_T> attributes;
    std::list<CHILD_T> children;
};

You would then need parser and writer functions that would work with any suitable string and child-pointer types. I don't know whether the CHILD_T parameter can be used to support both pointers-to-children and inline-children, nor whether a copy_on_write_ptr<T> could somehow fit into this. Any thoughts?

I'm also interested to hear what people think about XPath. I have not used it much outside of XSLT. My feeling is that it is easy to write "//foo" to find all the <foo> elements, but that this will descend all the way to the leaves of your tree even if you know that <foo> elements can only appear in certain contexts near the root. To do efficient searches you either need a set of index structures (like XSLT's keys, or Boost.MultiIndex), or a kind of tree-traversal-iterator that lets you be a bit more selective about when to descend into a subtree and when to skip it.

Regards, Phil.
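The symbol-table idea mentioned in this message can be sketched with the standard library alone. This is not the Boost.Spirit symbols class; `symbol_table` and `intern` are hypothetical names, and the point is only that every element sharing a tag name can hold one pointer to a single stored string.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical symbol table: stores each distinct name exactly once.
class symbol_table {
public:
    // Returns a pointer to the single stored copy of 'name'.
    // Pointers to elements of std::unordered_set stay valid when
    // further elements are inserted, so the result can be kept in nodes.
    const std::string* intern(const std::string& name) {
        return &*names_.insert(name).first;
    }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_set<std::string> names_;
};
```

A node type could then use `const std::string*` for TAG_NAME_T-like members, and comparing two tag names becomes a pointer comparison.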

Peter Dimov

11:28 a.m.

Phil Endecott wrote:

...

I have just dug out an old email that I wrote to the xmlwrapp list in 2004. I had observed that xmlwrapp, which is a C++ wrapper around libxml2 with much in common with Stefan's proposal, had an overhead of about 150 bytes per node. I had been getting binary data from a database (with negligible space overhead compared to a C++ struct), converting it to in-memory XML and then converting that to HTML or SVG using XSLT. The in-memory XML was terribly bloated to the extent that it was the bottleneck in the system, for a couple of reasons.

...

The view I'm coming to is that I'd prefer an XML class with enough template parameters to make these sorts of issues resolvable by the user, depending on the requirements of their application:

... After such an experience my thoughts would be less oriented towards changing the XML in-memory class and more towards refactoring the application to not build the entire XML document in memory. This obviously wouldn't be possible if your next processing phase demands an in-memory XML as input; but if it does, it's possible that you wouldn't be allowed to change it anyway; you'll have to live with whatever template parameters have been picked by its author. If, on the other hand, the HTML/SVG backend takes an element_iterator range, you'll be able to feed it on demand without ever constructing the whole tree. Of course this would likely mean no XSLT. Doing an XSL transform on a "virtual" document would require an abstract node interface that you implement on top of your existing data to provide an XML view for it.

Phil Endecott

3:15 p.m.

Peter Dimov wrote:

...

Phil Endecott wrote:

...
overhead of about 150 bytes per node.

...

After such an experience my thoughts would be less oriented towards changing the XML in-memory class and more towards refactoring the application to not build the entire XML document in memory.

Yes - but...

...

this would likely mean no XSLT.

..which was exactly what I needed to do.

That brings up another question; which of these approaches do people prefer:

- Use a C++ XML library that runs on a libxml2 backend, so that it can also use libxslt to do XSLT transformations.
- Use a standalone C++ XML library that is incompatible with libxslt, and instead do XSLT-like transformations in C++.

This question brings us back a bit closer to the features of Stefan's proposal (which I think could be extended to do XSLT using libxslt). I think that if I were starting another project of the sort that I described before I would probably avoid XSLT - with hindsight I overstretched it. But what would ideal C++ XML transforming (or just XML reading) code look like? As gchen writes: "creating xml doesn't seem a big problem, but writing xml-reader [..] is a time-consuming task". Boost.Spirit can match things; can we use something vaguely like Spirit syntax to match XML fragments, and define actions to apply to them?

rule input = * person;
rule person = element("person")(name, birth, father, mother) // meaning all needed but in any order
  [ // this is a Spirit-style "action" [].
    return h::html[ h::body [ // these are declarative-XML [].
                              // h:: is an html element namespace
      h::h1("Timeline for "+_1), // _1 refers to 'name' above; did Spirit 2 add this?
      _2, _3, _4
    ] ]
  ];
rule name = element("name")(firstname, surname)
  [ return x::textnode(firstname+" "+surname); ];

etc. etc. Maybe that could be made to work, but writing out the example above has made me a bit less optimistic about it. How would it compare with XSLT in terms of capabilities, performance, syntax, and so on?

Here's another approach. Say I have

<library>
  <document>
    <author>..interesting stuff..</author>
    ...lots and lots of uninteresting stuff...
  </document>
  ...more documents...
</library>

and I just want to extract the authors' names. So, I start off by parsing it into a tree of generic xml elements, and I then (somehow) convert those element objects into element-name-specific subclasses. Or maybe I parse directly into the subclasses, it doesn't matter. These subclasses implement an extract_authors virtual method; for the library and document classes, they recurse into their children, for author it returns the content, and for all other subclasses it returns without doing anything. So I can just call root.extract_authors().

Peter Dimov also wrote:

...

Doing an XSL transform on a "virtual" document would require an abstract node interface that you implement on top of your existing data to provide an XML view for it

I wonder if any serialisation or introspection experts have any suggestions? I think someone else has also mentioned using XPath-like expressions for exploring non-XML tree structures. Regards, Phil.
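The extract_authors approach described earlier in this message could look roughly like the sketch below. The class names and the output-parameter signature are assumptions; the message only says that library and document recurse, author yields its content, and everything else does nothing.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class for element-name-specific subclasses.
struct element {
    std::vector<std::unique_ptr<element>> children;
    std::string content;

    virtual ~element() = default;

    // Default: uninteresting elements add nothing and do not recurse,
    // so their whole subtree is skipped.
    virtual void extract_authors(std::vector<std::string>&) const {}

protected:
    // Helper for container elements: visit every child.
    void recurse(std::vector<std::string>& out) const {
        for (const auto& c : children) c->extract_authors(out);
    }
};

struct library_element : element {
    void extract_authors(std::vector<std::string>& out) const override {
        recurse(out);
    }
};

struct document_element : element {
    void extract_authors(std::vector<std::string>& out) const override {
        recurse(out);
    }
};

struct author_element : element {
    void extract_authors(std::vector<std::string>& out) const override {
        out.push_back(content);
    }
};

// Stands in for every other element name.
struct other_element : element {};
```

Calling `extract_authors` on the root then collects author content only from the library/document/author path, without descending into the uninteresting parts of each document.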

Peter Dimov

5:24 p.m.

Phil Endecott wrote: ...

...

As gchen writes: "creating xml doesn't seem a big problem, but writing xml-reader [..] is a time-consuming task". Boost.Spirit can match things; can we use something vaguely like Spirit syntax to match XML fragments, and define actions to apply to them?

... You can do that in principle. In practice I've been using a mirror approach of the write example in my other mail, with 'add_element' replaced with 'get_element', and a bit of error checking thrown in. It's a bit like writing recursive descent parsers by hand, only much, much easier. I can post an example if you like.

...

Peter Dimov also wrote:

...
Doing an XSL transform on a "virtual" document would require an abstract node interface that you implement on top of your existing data to provide an XML view for it

I wonder if any serialisation or introspection experts have any suggestions?

Neither serialisation nor introspection is what you need here. The idea is that you have a foreign data structure and you need to present it as a virtual XML document with a predefined schema. In general, if you reflect or serialise the structure, you'll get XML that doesn't match the schema you need. Let's say that you have

map< string, pair<string, string> > m;

and you want to present it as

<data>
  <item id="x1">
    <foo>...</foo>
    <bar>...</bar>
  </item>
  <item id="x2">
    ...
  </item>
  ...
</data>

To achieve the mapping, you'll define data_element, item_element, foo_element and bar_element, each of which points to its corresponding C++ equivalent. data_element will store a pointer to the map, item_element will store map::iterator, foo_element and bar_element will store a pointer to std::string.

For maximum benefit the XML node abstraction needs to be written carefully, though; it must not make implicit assumptions that the nodes always exist in memory. (Even if it does you'll likely need to only keep the node path in memory and not the whole tree, though.)

Stefan Seefeld

11:47 p.m.

Phil Endecott wrote:

...

That brings up another question; which of these approaches do people prefer:

- Use a C++ XML library that runs on a libxml2 backend, so that it can also use libxslt to do XSLT transformations.

- Use a standalone C++ XML library that is incompatible with libxslt, and instead do XSLT-like transformations in C++.

That sounds like a false dichotomy. If we have a compliant XML API, XSLT can be implemented on top of it. If, additionally, the XML API sits on top of an XML backend library that also provides XSLT functionality, it is straightforward to provide bindings for those, too.

...

This question brings us back a bit closer to the features of Stefan's proposal (which I think could be extended to do XSLT using libxslt). I think that if I were starting another project of the sort that I described before I would probably avoid XSLT - with hindsight I overstretched it. But what would ideal C++ XML transforming (or just XML reading) code look like? As gchen writes: "creating xml doesn't seem a big problem, but writing xml-reader [..] is a time-consuming task". Boost.Spirit can match things; can we use something vaguely like Spirit syntax to match XML fragments, and define actions to apply to them?

That's all interesting to consider, but it is different from my proposal, so I'd like to get us back on focus. (And incidentally, David didn't answer my question yet whether his tree-constructing syntax was something on top of a procedural API, or whether it would replace it.)

Yes, there are many things that can be done to "write XML" in C++ that particularly appeal to those among us with strong opinions on syntax. However, as I tried to point out numerous times, I do believe it is important to be able to bind highly efficient XML library backends, and not reinvent everything from scratch. Having some nice-looking syntax bind to the API I propose should be straightforward, but may incur performance overhead. (Some of that may be mitigated by clever proxying tricks, but that likely involves other penalties, such as code complexity, etc.)

To put it a little more bluntly: I do believe that XML is one of those topics where almost everybody has a strong opinion, in one direction or another. The syntax does look simple, and so everybody 'knows' how to do it best. That's a typical bike-shed question (http://www.bikeshed.com/).

Hoping that this discussion will lead somewhere,

Stefan

-- ...ich hab' noch einen Koffer in Berlin...

Phil Endecott

13 Jul

12:15 p.m.

Stefan Seefeld wrote: [snip stuff about spirit-like matching]

...

That's all interesting to consider, but it is different from my proposal, so I'd like to get us back on focus.

Yes this thread is diverging in many directions, but I think it's important to consider how people would want to use an XML API. As you say, once you have the right basic API you can implement other stuff on top of it, but if you have the wrong API you can't. For example, I might need to efficiently find all the <foo> nodes in a large tree. How would I do that with your library? I'd like to write something like this:

    class xml_doc_with_index: public xml_doc {
      ... extends the library's xml_doc to add an index of elements by their name, maintained automatically as elements are added and removed ...
    };

    void f() {
      xml_doc_with_index d;
      d << cin;  // parses XML document from input into d; index is built
      for (node_iterator i = d.elements_by_name.lower_bound("foo");
           i != d.elements_by_name.upper_bound("foo"); ++i) {
        ...
      }
    }

I think that there are a lot of useful techniques that can't be done using a libxml2 backend, including the symbol-table idea that I have mentioned before and the whole business of pointer semantics i.e. deep vs. shallow copy, copy-on-write and so on.

...

However, as I tried to point out numerous times, I do believe it is important to be able to bind highly efficient XML library backends, and not reinvent everything from scratch.

There is certainly an opportunity for a C++ XML library that "binds highly efficiently to an XML library backend", but I feel that that opportunity is already filled by xmlwrapp (http://sourceforge.net/projects/xmlwrapp/) and libxml++ (http://libxmlplusplus.sourceforge.net/) both of which have quite liberal licenses. What does yours offer that they don't?

...

To put it a little more bluntly: I do believe that XML is one of those topics where almost everybody has a strong opinion, in one direction or another. The syntax does look simple, and so everybody 'knows' how to do it best. That's a typical bike-shed question (http://www.bikeshed.com/).

We already have two quite satisfactory bike sheds (see above). There is not much point (IMHO) in building another to essentially the same design. If we're going to build another bike shed it should be a "modern bike shed for the next millennium". :-) Regards, Phil.

Stefan Seefeld

12:39 p.m.

Phil Endecott wrote:

...

Stefan Seefeld wrote:

[snip stuff about spirit-like matching]

...
That's all interesting to consider, but it is different from my proposal, so I'd like to get us back on focus.

Yes this thread is diverging in many directions, but I think it's important to consider how people would want to use an XML API. As you say, once you have the right basic API you can implement other stuff on top of it, but if you have the wrong API you can't. For example, I might need to efficiently find all the <foo> nodes in a large tree. How would I do that with your library? I'd like to write something

You'd do an XPath query. (See the XPath docs for the exact semantics of what can and cannot be queried and returned.)
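An XPath expression such as `//foo` selects every `foo` element in a document. The prototype's XPath interface is not shown in this thread, so the following is only a sketch of what such a query computes below the root element. The `Element` type and `find_descendants` are hypothetical stand-ins, not the library's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy element type; NOT the Boost.XML prototype's node classes.
struct Element {
    std::string name;
    std::vector<std::unique_ptr<Element>> children;

    explicit Element(std::string n) : name(std::move(n)) {}

    // Appends a new child and returns a reference to it.
    Element& append(const std::string& child_name) {
        children.push_back(std::make_unique<Element>(child_name));
        return *children.back();
    }
};

// Collects all descendants of `root` named `name` in document order,
// roughly what "//name" selects below the root element.
void find_descendants(const Element& root, const std::string& name,
                      std::vector<const Element*>& out) {
    for (const auto& child : root.children) {
        if (child->name == name) out.push_back(child.get());
        find_descendants(*child, name, out);
    }
}

// Builds <root><foo><bar><foo/></bar></foo><baz/></root> and counts the foos.
std::size_t count_foo_example() {
    Element root("root");
    Element& foo = root.append("foo");
    foo.append("bar").append("foo");
    root.append("baz");
    std::vector<const Element*> hits;
    find_descendants(root, "foo", hits);
    return hits.size();
}
```

A full traversal like this visits every node on each query, which is the cost Phil's index idea is meant to avoid.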

...

like this:

    class xml_doc_with_index: public xml_doc {
      ... extends the library's xml_doc to add an index of elements by their name, maintained automatically as elements are added and removed ...
    };

    void f() {
      xml_doc_with_index d;
      d << cin;  // parses XML document from input into d; index is built
      for (node_iterator i = d.elements_by_name.lower_bound("foo");
           i != d.elements_by_name.upper_bound("foo"); ++i) {
        ...
      }
    }

As is often the case, inheritance is the wrong tool for extension. This example is no exception.
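The thread does not say which alternative Stefan has in mind. One common non-inheritance option is to keep the index beside the document and build it with a free function. The sketch below assumes a hypothetical toy `Element` type. Unlike Phil's proposal, the index here is not maintained automatically and has to be rebuilt after the tree changes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy element type; NOT the prototype's classes.
struct Element {
    std::string name;
    std::vector<Element> children;
};

// A name index kept beside the tree instead of inside a derived document class.
using NameIndex = std::multimap<std::string, const Element*>;

// Walks the tree once and records every element under its name.
// The stored pointers are only valid while the tree is not modified.
void build_index(const Element& e, NameIndex& index) {
    index.emplace(e.name, &e);
    for (const Element& child : e.children) build_index(child, index);
}

// Builds <root><foo><foo/></foo><bar/></root> and counts indexed "foo" entries.
std::size_t indexed_foo_count() {
    Element doc{"root", {}};
    doc.children.push_back(Element{"foo", {}});
    doc.children[0].children.push_back(Element{"foo", {}});
    doc.children.push_back(Element{"bar", {}});

    NameIndex index;
    build_index(doc, index);

    std::size_t n = 0;
    const auto range = index.equal_range("foo");
    for (auto it = range.first; it != range.second; ++it) ++n;
    return n;
}
```

The `equal_range` loop plays the role of the `lower_bound`/`upper_bound` pair in Phil's example.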

...

I think that there are a lot of useful techniques that can't be done using a libxml2 backend, including the symbol-table idea that I have mentioned before and the whole business of pointer semantics i.e. deep vs. shallow copy, copy-on-write and so on.

Right. But I don't see that as a limitation of my approach. Rather, I'd use a different way to achieve that anyway.

...

...
However, as I tried to point out numerous times, I do believe it is important to be able to bind highly efficient XML library backends, and not reinvent everything from scratch.

There is certainly an opportunity for a C++ XML library that "binds highly efficiently to an XML library backend", but I feel that that opportunity is already filled by xmlwrapp (http://sourceforge.net/projects/xmlwrapp/) and libxml++ (http://libxmlplusplus.sourceforge.net/) both of which have quite liberal licenses. What does yours offer that they don't?

I did some work on libxml++ some years ago, but then parted ways when it became apparent that libxml++ was to be glued too tightly into GNOME (the choice of unicode string is fixed, i.e. not parametrized as in my proposal), along with other unfortunate decisions. Incidentally, the approach I had taken in the way I bind to libxml2 there was part of an earlier proposal I submitted to boost (many years ago). Then, people suggested alternatives that eventually led me to the current approach. May I suggest that you search the boost ML archives to find more details. (Before starting to contribute to libxml++ I also looked into xmlwrapp, but decided against it for a number of reasons. It has been too long for me to remember exact details, but performance was a problem, or rather the unnecessary copying of data, which led me to believe that performance would eventually turn out to be a bottleneck.) Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Sebastian Redl

12 Jul

7:45 p.m.

Stefan Seefeld wrote:

...

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Hi,

Haven't looked at the reader component yet, so this will be about the node tree part only.

I'm not at all happy with the node tree. It seems to me like it is taking the worst parts of the W3C DOM and leaving out the few advantages it has.

The proposed API shares these problems with the DOM:

1) Very verbose.

2) Indirect node construction. I can't create an element by instantiating the class element - it has a protected constructor. Instances are created through some sort of factory, typically by calling methods of document and element that create children and return them. This is not a very natural syntax.

The API has some additional disadvantages:

1) No real namespace support. To create an element in a given namespace, I have to register a prefix and then use the prefix in the element name. Worse, to find out the namespace of an element, I have to parse the string for the prefix (it's one find and one substr operation, but still) and then look it up to find the full namespace URI. (Depending on the semantics of element::lookup_namespace, I might have to walk the tree for that.) Given that some documents, especially generated ones, sometimes have multiple bindings for the same namespace, this is overly tedious. Namespace URI and local name of an element should be first-class properties. The prefix:local convention is really just a hack - in the Infoset view of the information, it doesn't even exist. (The prefix does, the combined name doesn't. See 2.2 of the XML Infoset spec.)

2) Not an existing standard. Whatever else you can say for the DOM, it is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++ with Xerces - the DOM is, minor variations in capitalization aside, a constant. By providing essentially the same functionality, but through a slightly different interface, you lose the recognition value of the DOM without gaining much. (You're avoiding the full complexity of the DOM, which is a good thing.)

3) Not as extensive. I'm not talking about the annoying multiple redundancy of the DOM here, but of low-level functionality such as preserving entity references in the node tree. For some low-level tasks, this is important stuff. Not that the DOM is really extensive: it provides no way, for example, to modify the document schema. (It allows introspection, at least.)

The API has one clear advantage over the DOM: the use of iterators. The DOM also has many shortcomings that the API, due to its restricted scope, doesn't have. All in all, though, I think the chosen balance between closeness to the DOM and doing something different and interesting is not good.

There are some things I simply consider mistakes:

1) cdata should derive from text. It's basically a special case that only differs in its serialization from the general form.

2) You have a class dtd, but to access it you use document::internal_subset. This dtd class doesn't provide access to the internal subset however - only the document type declaration, after which it is named. (Yes, the document type /definition/ has the same abbreviation. Very unfortunate, that.)

Some more issues:

1) The whole node/node_ptr mess. From reading your earlier posts, I thought that node and friends were value-like classes, that they directly represent the nodes, whereas node_ptr was a special smart pointer that provided the memory management and the shallow copying semantics. Only, upon reading the code, I find that node_ptr contains an instance of its element type, not a pointer to it, which means that node and derived are the smart pointers with shallow copying and memory management. Except that they don't: the pointers are never freed until the entire owner document is destroyed. Or the node is explicitly removed from its parent. Oh, and document is an exception to this convention, because it actually is a value-style class. That's not to say that this isn't a sensible overall strategy. It just is extremely confusing given your naming conventions. As far as I can see, the only thing node_ptr actually does is make access less convenient by requiring indirection - and thus double indirection for node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator written using the Boost.Iterator library?

2) write_to_file as a member of document. This is asymmetric to parse_file being a free factory function. It's also unnecessary and, in my opinion, not a good idea for various reasons. One is the aforementioned asymmetry. Another is the public interface size, as mentioned in one of the Effective C++ books: write_to_file doesn't actually need access to document's internals, because it just serializes the node tree, right? (That it needs access to the contained libxml pointer is a detail that shouldn't affect the interface. Make it a friend if you have to, but implementations should have the option of not making it one.)

There's more inconsistency here. The Efficient XML Interchange working group is overdue again with their first working draft, but a binary, compact serialization for XML _is_ in the works. Once they publish a recommendation, there will be two official serializations of the XML Infoset. And there are several unofficial ones already. Each one of these needs a pair of parse/serialize functions. (Not necessarily provided by the library, of course.) With write_to_file being a member, it enjoys a unique status that it doesn't really deserve. It also enjoys a very ambiguous name, as does parse_file. Even leaving that aside, there's also the option of multiple parse/serialize pairs just for a single format. They could take alternative input sources: a boost::path instead of a std::string for identification, for example. Or a std::istream as a data source/std::ostream as a data sink. Or a boost::url, when such a class is written, together with a pluggable communication framework for transparently fetching network URLs. Or whatever. Point is, all these are simple extensions to the system, but there is inconsistency if one function is a member when no other can be.

Sorry for not being very constructive. I'll take a look at the reader next time I find some time, and make more general comments the time after that. I hope to get there within a week.

Sebastian Redl

Stefan Seefeld

14 Jul

1:01 a.m.

Sebastian Redl wrote:

...

The proposed API shares these problems with the DOM: 1) Very verbose. 2) Indirect node construction. I can't create an element by instantiating the class element - it has a protected constructor. Instances are created through some sort of factory, typically by calling methods of document and element that create children and return them. This is not a very natural syntax.

The reason to delegate to a factory is to let it do a lot of resource management that thus can be hidden from the user. There is a lot to be considered, as each node lives in a particular context given by the document as well as its position in it (think of namespaces, for example). It may of course be possible to hide that by providing stack variables that are merely proxies, so the actual instantiation will be done lazily, once the (proxy) node is inserted into the document. I haven't thought too hard about that, since to me using a factory is a natural means to allow encapsulation.

...

The API has some additional disadvantages: 1) No real namespace support. To create an element in a given namespace, I have to register a prefix and then use the prefix in the element name. Worse, to find out the namespace of an element, I have to parse the string for the prefix (it's one find and one substr operation, but still) and then look it up to find the full namespace URI. (Depending on the semantics of element::lookup_namespace, I might have to walk the tree for that.) Given that some documents, especially generated ones, sometimes have multiple bindings for the same namespace, this is overly tedious. Namespace URI and local name of an element should be first-class properties. The prefix:local convention is really just a hack - in the Infoset view of the information, it doesn't even exist. (The prefix does, the combined name doesn't. See 2.2 of the XML Infoset spec.)

OK, I agree. This can be addressed independently from all the rest, however.
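To make the "first-class properties" request concrete, here is a minimal sketch of an element name stored as a namespace URI plus local name, with the prefix used only during resolution. `QName`, `Bindings` and `resolve` are hypothetical names chosen for illustration and are not part of the prototype. The in-scope bindings are passed in as a flat map, whereas a real implementation would gather them by walking up the tree.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical type for illustration; not part of the prototype.
struct QName {
    std::string namespace_uri;  // e.g. "http://www.w3.org/2000/svg"
    std::string local_name;     // e.g. "rect"
};

// Two names are equal when URI and local name match; the prefix plays no part.
bool operator==(const QName& a, const QName& b) {
    return a.namespace_uri == b.namespace_uri && a.local_name == b.local_name;
}

// Prefix-to-URI bindings in scope for one element. The default namespace,
// if declared, is stored under the empty prefix.
using Bindings = std::map<std::string, std::string>;

// Splits an element name of the form "prefix:local" or "local" and resolves
// the prefix. Returns false (leaving `out` untouched) for an unbound prefix.
bool resolve(const std::string& raw, const Bindings& in_scope, QName& out) {
    std::string prefix;  // empty prefix selects the default namespace
    std::string local = raw;
    const std::string::size_type colon = raw.find(':');
    if (colon != std::string::npos) {
        prefix = raw.substr(0, colon);
        local = raw.substr(colon + 1);
    }
    const auto it = in_scope.find(prefix);
    if (it != in_scope.end()) {
        out = QName{it->second, local};
        return true;
    }
    if (!prefix.empty()) return false;  // unbound prefix
    out = QName{std::string(), local};  // unprefixed, no default namespace
    return true;
}

// The same element name written with two different prefixes compares equal.
void qname_example() {
    const Bindings a{{"svg", "http://www.w3.org/2000/svg"}};
    const Bindings b{{"s", "http://www.w3.org/2000/svg"}};
    QName x, y;
    assert(resolve("svg:rect", a, x));
    assert(resolve("s:rect", b, y));
    assert(x == y);
}
```

With names stored this way, documents that bind several prefixes to one namespace need no string parsing at lookup time.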

...

2) Not an existing standard. Whatever else you can say for the DOM, it is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++ with Xerces - the DOM is, minor variations in capitalization aside, a constant. By providing essentially the same functionality, but through a slightly different interface, you lose the recognition value of the DOM without gaining much. (You're avoiding the full complexity of the DOM, which is a good thing.)

Sorry, that argument I don't accept. Yes, I deliberately chose not to use the API as obtained from using the CORBA C++ bindings of the OMG IDL DOM. The hope is to get something better, much more naturally tied to modern C++ idioms. Whether or not I achieve that is to be discussed, and can be criticized, but the lack of conformance to existing DOM APIs in itself is hardly an argument worth debating.

...

3) Not as extensive. I'm not talking about the annoying multiple redundancy of the DOM here, but of low-level functionality such as preserving entity references in the node tree. For some low-level tasks, this is important stuff. Not that the DOM is really extensive: it provides no way, for example, to modify the document schema. (It allows introspection, at least.)

OK, the API represents the Infoset, and thus has no idea of what an entity is. I'm not sure whether that would be worth adding. And if so, it might be some hook into the XML writer (the XML parser already has it). I don't understand what you are aiming at in your comment about the 'document schema'.

...

The API has one clear advantage over the DOM: the use of iterators. The DOM also has many shortcomings that the API, due to its restricted scope, doesn't have. All in all, though, I think the chosen balance between closeness to the DOM and doing something different and interesting is not good.

There are some things I simply consider mistakes: 1) cdata should derive from text. It's basically a special case that only differs in its serialization from the general form.

That's an implementation detail (IMO). Semantically, a text node and a cdata node are distinct, and so visitors shouldn't give users access to a cdata node as a text node. (And what else would the ISA relationship be good for ?)

...

2) You have a class dtd, but to access it you use document::internal_subset. This dtd class doesn't provide access to the internal subset however - only the document type declaration, after which it is named. (Yes, the document type /definition/ has the same abbreviation. Very unfortunate, that.)

I'm sure this can be refined. (In fact, I don't think DTDs will play any significant role in the future, as other document type definitions become more popular, such as relaxng).

...

Some more issues: 1) The whole node/node_ptr mess. From reading your earlier posts, I thought that node and friends were value-like classes, that they directly represent the nodes, whereas node_ptr was a special smart pointer that provided the memory management and the shallow copying semantics. Only, upon reading the code, I find that node_ptr contains an instance of its element type, not a pointer to it, which means that node and derived are the smart pointers with shallow copying and memory management. Except that they don't: the pointers are never freed until the entire owner document is destroyed. Or the node is explicitly removed from its parent. Oh, and document is an exception to this convention, because it actually is a value-style class. That's not to say that this isn't a sensible overall strategy. It just is extremely confusing given your naming conventions. As far as I can see, the only thing node_ptr actually does is make access less convenient by requiring indirection - and thus double indirection for node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator written using the Boost.Iterator library?

OK, I understand that I need to rethink how to represent things. To me it is clear, however, what I want: encapsulate nodes and their management such that the user doesn't have to care for allocation / deallocation, but instead accesses (dereferences) nodes via node_ptr proxies.

...

2) write_to_file as a member of document. This is asymmetric to parse_file being a free factory function. It's also unnecessary and, in my opinion, not a good idea for various reasons. One is the aforementioned asymmetry. Another is the public interface size, as mentioned in one of the Effective C++ books: write_to_file doesn't actually need access to document's internals, because it just serializes the node tree, right? (That it needs access to the contained libxml pointer is a detail that shouldn't affect the interface. Make it a friend if you have to, but implementations should have the option of not making it one.)

That's a good point. I will make write_to_file a free-standing function.

...

There's more inconsistency here. The Efficient XML Interchange working group is overdue again with their first working draft, but a binary, compact serialization for XML _is_ in the works. Once they publish a recommendation, there will be two official serializations of the XML Infoset. And there are several unofficial ones already. Each one of these needs a pair of parse/serialize functions. (Not necessarily provided by the library, of course.) With write_to_file being a member, it enjoys a unique status that it doesn't really deserve. It also enjoys a very ambiguous name, as does parse_file. Even leaving that aside, there's also the option of multiple parse/serialize pairs just for a single format. They could take alternative input sources: a boost::path instead of a std::string for identification, for example. Or a std::istream as a data source/std::ostream as a data sink. Or a boost::url, when such a class is written, together with a pluggable communication framework for transparently fetching network URLs. Or whatever. Point is, all these are simple extensions to the system, but there is inconsistency if one function is a member when no other can be.

Right.
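As a rough illustration of the free-function design agreed on here, the sketch below serializes a document through a `std::ostream` sink without being a member or friend. `Document`, `escape`, `write_xml` and `to_xml_string` are hypothetical names, and the one-element document is a stand-in for the prototype's node tree.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical minimal document: one root element holding text.
struct Document {
    std::string root_name;
    std::string text;
};

// Escapes the characters that must not appear literally in element content.
std::string escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Free function: it needs only the public interface, so it is neither a
// member nor a friend. Other sinks or formats can be added the same way.
void write_xml(const Document& doc, std::ostream& os) {
    os << '<' << doc.root_name << '>' << escape(doc.text)
       << "</" << doc.root_name << '>';
}

// Convenience overload layered on the stream version.
std::string to_xml_string(const Document& doc) {
    std::ostringstream os;
    write_xml(doc, os);
    return os.str();
}

void serialize_example() {
    assert(to_xml_string(Document{"a", "x < y"}) == "<a>x &lt; y</a>");
}
```

A file-based or binary writer would sit next to `write_xml` as another free function, so no single serialization gets special status.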

...

Sorry for not being very constructive. I'll take a look at the reader next time I find some time, and make more general comments the time after that. I hope to get there within a week.

Thanks for your comments. I will try to address them, if only by working on documentation that gives a rationale for the various choices I have made. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Sebastian Redl

12:12 p.m.

Stefan Seefeld wrote:

...

The reason to delegate to a factory is to let it do a lot of resource management that thus can be hidden from the user. There is a lot to be considered, as each node lives in a particular context given by the document as well as its position in it (think of namespaces, for example).

OK, makes sense.

...

Sorry, that argument I don't accept. Yes, I deliberately chose not to use the API as obtained from using the CORBA C++ bindings of the OMG IDL DOM. The hope is to get something better, much more naturally tied to modern C++ idioms. Whether or not I achieve that is to be discussed, and can be criticized, but the lack of conformance to existing DOM APIs in itself is hardly an argument worth debating.

But I don't mean the argument to stand by itself. As I said, you diverged slightly from the DOM API. The problem that I see is that it neither matches existing knowledge, nor brings sufficient advantage to justify this loss. In other words, if you leave the beaten track, you shouldn't walk right alongside it, among stones but without any advantage - you should at least take a shortcut through the woods.

...

OK, the API represents the Infoset, and thus has no idea of what an entity is. I'm not sure whether that would be worth adding. And if, it may be some hook into the XML writer (the XML parser already has it).

Possibly, yes. Such low-level functionality might indeed better be limited to the low-level stream representation. On the other hand, rare as their use is, there are also unparsed entities. They very much exist in the high-level representation. And not all parsed entities can be expanded, especially by non-validating parsers.

...

I don't understand what you are aiming at in your comment about the 'document schema'.

I mean that there is no way in the current API to obtain information about the document type, beyond its name and location.

...

That's an implementation detail (IMO).

No, it's not. It's a matter of public derivation of classes and thus very much an interface issue.

...

Semantically, a text node and a cdata node are distinct,

Also not true, at least as far as Infoset is concerned. See also Appendix D of the Infoset spec, item 19.

...

and so visitors shouldn't give users access to a cdata node as a text node. (And what else would the ISA relationship be good for ?)

But that's exactly what they should do (if the user wants to ignore the difference). CDATA, as I said before, is a serialization issue, and completely irrelevant to a user who just wants to know what text the document contains.
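The derivation Sebastian argues for can be sketched as follows: `cdata` publicly derives from `text`, and a visitor that does not care about the difference receives CDATA sections through its `text` overload. The class and member names below are assumptions made for the example; the prototype's visitor interface may look different.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct text;
struct cdata;

// Hypothetical visitor; the prototype's visitor interface may differ.
struct visitor {
    virtual ~visitor() = default;
    virtual void visit(const text& t) = 0;
    // By default a CDATA section is handled as ordinary text.
    virtual void visit(const cdata& c);
};

struct text {
    std::string content;
    explicit text(std::string s) : content(std::move(s)) {}
    virtual ~text() = default;
    virtual void accept(visitor& v) const { v.visit(*this); }
};

// cdata IS-A text: same content, different serialization.
struct cdata : text {
    using text::text;
    void accept(visitor& v) const override { v.visit(*this); }
};

// Default: forward to the text overload.
void visitor::visit(const cdata& c) { visit(static_cast<const text&>(c)); }

// A visitor that only cares about character data.
struct text_collector : visitor {
    using visitor::visit;  // keep the cdata overload visible
    std::string collected;
    void visit(const text& t) override { collected += t.content; }
};

// Collects the content of a text node followed by a CDATA section.
std::string collect_example() {
    text a("1 ");
    cdata b("< 2");
    text_collector c;
    a.accept(c);
    b.accept(c);
    return c.collected;
}
```

A serializer would override `visit(const cdata&)` to emit a `<![CDATA[...]]>` section, so the distinction stays available to the code that needs it.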

...

I'm sure this can be refined. (In fact, I don't think DTDs will play any significant role in the future, as other document type definitions become more popular, such as relaxng).

True. Perhaps an API for generalized schema access can be devised. Sebastian Redl

Stefan Seefeld

3:34 p.m.

Sebastian Redl wrote:

...

Stefan Seefeld wrote:

...

...
I don't understand what you are aiming at in your comment about the 'document schema'.

I mean that there is no way in the current API to obtain information about the document type, beyond its name and location.

But the DTD may not be available itself. And in fact, for XSchema, RelaxNG, etc., there isn't even a reference to those in the document, so not even the name is available. (But this discussion suggests that in fact I may remove the DTD-related accessors from the document interface and make it freestanding. It may then evolve independently.)

...

...
That's an implementation detail (IMO). No, it's not. It's a matter of public derivation of classes and thus very much an interface issue. Semantically, a text node and a cdata node are distinct, Also not true, at least as far as Infoset is concerned. See also Appendix D of the Infoset spec, item 19.

OK, will read that.

...

...
and so visitors shouldn't give users access to a cdata node as a text node. (And what else would the ISA relationship be good for ?)

But that's exactly what they should do (if the user wants to ignore the difference). CDATA, as I said before, is a serialization issue, and completely irrelevant to a user who just wants to know what text the document contains.

OK.

...

...
I'm sure this can be refined. (In fact, I don't think DTDs will play any significant role in the future, as other document type definitions become more popular, such as relaxng).

True. Perhaps an API for generalized schema access can be devised.

Yes, but that would be a separate API, independent from my current proposal. (Which is a good thing, as it favors modularity. Users who don't need validation don't have to pay for it.) Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Sebastian Redl

5:34 p.m.

Stefan Seefeld wrote:

...

But the DTD may not be available itself.

In which case, no information is available.

...

And in fact, for XSchema, RelaxNG, etc., there isn't even a reference to those in the document, so not even the name is available.

Actually, XSchema references the schemas through the schemaLocation and noNamespaceSchemaLocation attributes (in the schema instance namespace) of the root element. I don't know anything about RelaxNG.

...

(But this discussion suggests that in fact I may remove the DTD-related accessors from the document interface and make it freestanding. It may then evolve independently.)

Yes, that sounds good.

...

...
True. Perhaps an API for generalized schema access can be devised.

Yes, but that would be a separate API, independent from my current proposal. (Which is a good thing, as it favors modularity. Users who don't need validation don't have to pay for it.)

Of course not. Furthermore, documents which are not valid should still be parseable, as long as they're well-formed. Sebastian Redl

Stuart Dootson

13 Jul

7:18 p.m.

On 06/07/07, Stefan Seefeld <seefeld@sympatico.ca> wrote:

...

Hello,

over the last couple of years we have discussed possible XML APIs for inclusion into boost. As I already had an early prototype for such an API, I kept evolving it, based on feedback from those discussions. A couple of weeks ago I actually checked it into the sandbox (http://svn.boost.org/trac/boost/browser/sandbox/xml). Today, I adjusted the source layout to conform to the sandbox layout we agreed on, including a boost.build - based build-system.

I would appreciate if anybody interested into a future boost.xml submission would have a look, provide feedback, or even get involved into the (ongoing) development.

Best regards, Stefan

PS: The current scope of the project is described in http://svn.boost.org/trac/boost/browser/sandbox/xml/README

Stefan - I've spent a few hours today porting a small (500 line) C# application to C++ using your libxml2 wrapper code. Overall, for what I've done, I liked it - there did seem to be some bugs (but that could be due to deficiencies in VC8), and I could see some opportunities for some conveniences. However, it enabled me to translate my 500 lines of C# to 500 lines of C++, produce the same results and gain a 6x performance increase in the process (woot!).

What was I doing - well, mainly iterating through nodes and evaluating attribute values. For these functions, it would be nice to have some form of filtering/transforming iterator layered on top of an element's children_iterator that allows you to specify some subset of the children that you want to see. I wrote a modified for_each that performed an equivalent function (yes, I should have used Boost's iterator adaptors). Also, it would be nice to have a typed attribute value retrieval function. I knocked something up using lexical_cast that did that - it'd be nice if it was in the library.

I also performed some XML tree construction - it was a pretty flat structure, and I found the provided functions adequate for that.

As for the 'bugs' - well, I had to modify your code in places to get my code to build (the code is at work, so it'll be Monday before I can get at it again and post diffs if you're interested). I also had issues compiling the dom.cpp file with VC7.1 and VC8 - to get VC8 to build it, I had to move the factory and ptr_factory functions out of the details namespace - VC8 just wouldn't admit to seeing those functions otherwise.

Stuart Dootson
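Stuart's code is not included in the thread, so the following is only a guess at what a typed attribute accessor might look like. He used `boost::lexical_cast`; to keep the sketch free of dependencies it uses `std::istringstream` instead. `Attributes` and `get_attribute` are hypothetical names, and a plain map stands in for the prototype's attribute storage.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical attribute storage; the prototype exposes attributes differently.
using Attributes = std::map<std::string, std::string>;

// Converts the named attribute to T. Returns false, leaving `value`
// untouched, if the attribute is missing or its value does not parse
// completely as a T.
template <typename T>
bool get_attribute(const Attributes& attrs, const std::string& name, T& value) {
    const auto it = attrs.find(name);
    if (it == attrs.end()) return false;
    std::istringstream in(it->second);
    T parsed;
    if (!(in >> parsed)) return false;
    in >> std::ws;                // allow trailing whitespace only
    if (!in.eof()) return false;  // reject trailing garbage such as "12px"
    value = parsed;
    return true;
}

void typed_attribute_example() {
    const Attributes attrs{{"width", "42"}, {"scale", "1.5"}};
    int width = 0;
    double scale = 0.0;
    assert(get_attribute(attrs, "width", width) && width == 42);
    assert(get_attribute(attrs, "scale", scale) && scale == 1.5);
}
```

Returning a flag rather than throwing is a design choice made for this example; `lexical_cast` itself reports failure with a `bad_lexical_cast` exception.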

Stefan Seefeld

7:45 p.m.

Stuart Dootson wrote:

...

Stefan - I've spent a few hours today porting a small (500 line) C# application to C++ using your libxml2 wrapper code. Overall, for what I've done, I liked it - there did seem to be some bugs (but that could be due to deficiencies in VC8), and I could see some opportunities for some conveniences. However, it enabled me to translate my 500 lines of C# to 500 lines of C++, produce the same results and gain a 6x performance increase in the process (woot!).

That's excellent news (and a welcome change in this thread ;-) ) !

...

What was I doing - well, mainly iterating through nodes and evaluating attribute values. For these functions, it would be nice to have some form of filtering/transforming iterator layered on top of an element's children_iterator that allows you to specify some subset of the children that you want to see. I wrote a modified for_each that performed an equivalent function (yes, I should have used Boost's iterator adaptors). Also, it would be nice to have a typed attribute value retrieval function. I knocked something up using lexical_cast that did that - it'd be nice if it was in the library.

Have you had a look at the 'traversal.cpp' example ? That may do what you want. (Some generalization of that may become part of the public API.)

...

I also performed some XML tree construction - it was a pretty flat structure, and I found the provided functions adequate for that.

As for the 'bugs' - well, I had to modify your code in places to get my code to build (the code is at work, so it'll be Monday before I can get at it again and post diffs if you're interested). I also had issues compiling the dom.cpp file with VC7.1 and VC8 - to get VC8 to build it, I had to move the factory and ptr_factory functions out of the details namespace - VC8 just wouldn't admit to seeing those functions otherwise.

Yes, I'd appreciate any patches, especially if they explain the proposed changes. I'm not sure what the best way to do that is, though. Can trac be configured to allow sandbox-project-specific issues ? Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Stuart Dootson

16 Jul

3:31 p.m.

On 13/07/07, Stefan Seefeld <seefeld@sympatico.ca> wrote:

...

Stuart Dootson wrote:

...
What was I doing - well, mainly iterating through nodes and evaluating attribute values. For these functions, it would be nice to have some form of filtering/transforming iterator layered on top of an element's children_iterator that allows you to specify some subset of the children that you want to see. I wrote a modified for_each that performed an equivalent function (yes, I should have used Boost's iterator adaptors). Also, it would be nice to have a typed attribute value retrieval function. I knocked something up using lexical_cast that did that - it'd be nice if it was in the library.

Have you had a look at the 'traversal.cpp' example ? That may do what you want. (Some generalization of that may become part of the public API.)

Yes, I've had a look. I'd still rather have STL style iterators :-) The other technique I've used, when iterating through elements where I may encounter one of several different elements, was to use a std::map, with the element name as key and a boost::function as the value, i.e. look up the element name and call the associated function. I initialise the map using Boost.Assign. It works quite nicely.
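The dispatch-table technique described above might look roughly like this. Stuart used `boost::function` and Boost.Assign; this sketch substitutes `std::function` and a braced initializer list so that it builds without Boost. The handler signature and the element names are invented for the example.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Handler signature is an assumption; a real handler would take the element.
using Handler = std::function<void(std::vector<std::string>& log)>;

// Maps element names to handlers. The names are made up for illustration.
const std::map<std::string, Handler>& handlers() {
    static const std::map<std::string, Handler> table = {
        {"circle", [](std::vector<std::string>& log) { log.push_back("drew circle"); }},
        {"rect",   [](std::vector<std::string>& log) { log.push_back("drew rect"); }},
    };
    return table;
}

// Looks up the handler for `element_name` and calls it.
// Returns false for element names without a registered handler.
bool dispatch(const std::string& element_name, std::vector<std::string>& log) {
    const auto it = handlers().find(element_name);
    if (it == handlers().end()) return false;
    it->second(log);
    return true;
}

void dispatch_example() {
    std::vector<std::string> log;
    assert(dispatch("circle", log));
    assert(!dispatch("polygon", log));  // no handler registered
    assert(log.size() == 1);
}
```

In a traversal loop, `dispatch` would be called with each child element's name, which replaces a chain of string comparisons with one map lookup.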

...

Yes, I'd appreciate any patches, especially if they explain the proposed changes. I'm not sure what the best way to do that is, though. Can trac be configured to allow sandbox-project-specific issues?

Thanks, Stefan

I've attached a patch (generated by TortoiseSVN) and 'patch explanation' files. The diffs are a combination of a) what I needed to do to get the dom.cpp file to compile under MSVC8, and b) some changes I made so that the things I wanted to do would build. Hope they're of use. Stuart Dootson

Aaron W. LaFramboise

13 Jul

7:34 p.m.

What about a SAX API? In the past, most of the time when I wanted to parse XML, I've actually wanted something closer to SAX, not DOM. (However, I don't think SAX is really appropriate for C++, because of the way it eats control flow. You can't, for example, pause a SAX parse mid-parse, exit the SAX context, and come back to it later. SAX, in essence, either dominates your application's control flow, which is unacceptable in a GUI, or requires multithreading.) A stream-based XML reader would be even more welcome to me than the DOM-based API. One could also argue that stream-based processing is more in the tradition of C++. Full validation would also be a major plus, as most stream-based XML readers don't completely validate.
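The control-flow point can be illustrated with a toy pull-style reader, where the caller asks for the next event and is free to stop, do other work, and resume later. This sketch is not the sandbox project's XMLReader API. `pull_reader`, `event` and `event_kind` are hypothetical names, and the reader only recognizes `<name>`, `</name>` and character data, with no attributes, comments, entities or validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Event kinds produced by the toy reader.
enum class event_kind { start_element, end_element, text, end_of_input };

struct event {
    event_kind kind;
    std::string value;  // element name or character data
};

// Toy pull reader: the parse position lives in the object, so the caller
// decides when (and whether) to ask for the next event.
class pull_reader {
public:
    explicit pull_reader(std::string input) : input_(std::move(input)), pos_(0) {}

    event next() {
        if (pos_ >= input_.size()) return {event_kind::end_of_input, ""};
        if (input_[pos_] == '<') {
            std::size_t close = input_.find('>', pos_);
            if (close == std::string::npos) {  // unterminated tag: give up
                pos_ = input_.size();
                return {event_kind::end_of_input, ""};
            }
            bool is_end = input_[pos_ + 1] == '/';
            std::size_t name_begin = pos_ + (is_end ? 2 : 1);
            std::string name = input_.substr(name_begin, close - name_begin);
            pos_ = close + 1;
            return {is_end ? event_kind::end_element : event_kind::start_element, name};
        }
        // Character data runs up to the next '<' or the end of the input.
        std::size_t lt = input_.find('<', pos_);
        if (lt == std::string::npos) lt = input_.size();
        std::string chars = input_.substr(pos_, lt - pos_);
        pos_ = lt;
        return {event_kind::text, chars};
    }

private:
    std::string input_;
    std::size_t pos_;
};
```

Because each call to `next()` returns to the caller, a GUI event loop could process a few events per idle callback without handing control to the parser or starting a second thread, which is the limitation of callback-driven SAX described above.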

Stefan Seefeld

7:46 p.m.

Aaron W. LaFramboise wrote:

...

What about a SAX API? In the past, most of the time when I wanted to parse XML, I've actually wanted something closer to SAX, not DOM.

(However, I don't think SAX is really appropriate for C++, because of the way it eats control flow. You can't, for example, pause a SAX parse mid-parse, exit the SAX context, and come back to it later. SAX, in essence, either dominates your application's control flow, which is unacceptable in a GUI, or requires multithreading.)

A stream-based XML reader would be even more welcome to me than the DOM-based API. One could also argue that stream-based processing is more in the tradition of C++. Full validation would also be a major plus, as most stream-based XML readers don't completely validate.

I agree, and the sandbox xml project does in fact contain a (rudimentary) XMLReader API. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

76 comments

participants (18)

Aaron W. LaFramboise
David Abrahams
Doug Gregor
Felipe Magno de Almeida
gchen
Jake Voytko
John Moeller
Martin Wille
Mathias Gaunard
Michael Caisse
Michael Marcin
Oliver.Kowalke＠qimonda.com
Peder Holt
Peter Dimov
Phil Endecott
Sebastian Redl
Stefan Seefeld
Stuart Dootson