[GSoC] Boost.XML

Ilie Halip

20 Mar 2010

10:54 a.m.

Hi!

My name is Ilie Halip, and I'm a student at the Faculty of Computer Science in Iasi, Romania. I'm really interested in this year's Summer of Code, and stumbled upon your list of ideas yesterday.

I have a few questions about the Boost.XML project.

First, what actually needs to be done? The project proposal isn't clear about that. If it's about parsing using DOM/SAX, using XPath queries, then the project linked from the page already does that, right? So the project would involve working with the existing sources and adding validation, tests etc (as mentioned in the README)?

How much working knowledge should a student have before trying to work on such a project? I must admit I haven't used Boost, nor libxml2 in the past, so I guess I should start reading documentation about both. I am good in C++, and I've been working with XML APIs for years, so that's not a problem. But the other two (Boost+libxml2) are?

Is there a coding style I should follow if I were to work for this project to conform to Boost standards? Where can I find such info?

Do you have any requirements from students working with GSoC? Something like: weekly reports, blogging about their experience, maintaining a wiki page about the status of the project?

And last but not least... I'm actually employed right now, but my superiors are willing to give me more than 3 months of time off to be able to work for GSoC. Is that ok? Should I ask for more? I was looking over the timeline, and thought about getting a vacation between May 15th and August 30th. Would that be enough? Is it also alright if I stop working on the project for about a week in early June? Because I have a few exams then.

Regards,
Ilie.

P.S. You can usually find me on IRC too (nickname: ihalip).

Andrew Sutton

20 Mar

1:46 p.m.

Hi Ilie,

...

My name is Ilie Halip, and I'm a student at the Faculty of Computer Science in Iasi, Romania. I'm really interested in this year's Summer of Code, and stumbled upon your list of ideas yesterday.

I have a few questions about the Boost.XML project.

Hopefully Stefan Seefeld will be able to provide more information about a Boost.XML project. He did some work on this several years ago, but it was never finished. This is a feature that a number of people have requested over the last couple of years.

...

First, what actually needs to be done? The project proposal isn't clear about that. If it's about parsing using DOM/SAX, using XPath queries, then the project linked from the page already does that, right? So the project would involve working with the existing sources and adding validation, tests etc (as mentioned in the README)?

I suspect that much of the initial work will be involved in actually parsing the raw XML format in order to support DOM/SAX APIs. Actually, after looking at the previous work [1], I would say that it's all about parsing :) I think writing a Boost SAX library would be a good start for a SoC project.

...

How much working knowledge should a student have before trying to work on such a project? I must admit I haven't used Boost, nor libxml2 in the past

I'm not sure that using libxml2 is a viable option for this project. Boost has very few external dependencies. Besides, if you use libxml2, then your work is mostly done for you.

...

Is there a coding style I should follow if I were to work for this project to conform to Boost standards? Where can I find such info?

Yes. See [2]. You could also start looking at other Boost library code.

...

Do you have any requirements from students working with GSoC? Something like: weekly reports, blogging about their experience, maintaining a wiki page about the status of the project?

We haven't defined specific student requirements yet, but you will minimally be required to provide weekly status updates to your mentor. We may add more requirements, such as keeping a blog or wiki page or providing actual releases of your code.

...

And last but not least... I'm actually employed right now, but my superiors are willing to give me more than 3 months of time off to be able to work for GSoC. Is that ok? Should I ask for more? I was looking over the timeline, and thought about getting a vacation between May 15th and August 30th. Would that be enough? Is it also alright if I stop working on the project for about a week in early June? Because I have a few exams then.

I don't think we've ever had a student request time off from a job before. GSoC runs May 20 thru Aug 20, so that's about 3 months.

Hope that helps,

Andrew Sutton
andrew.n.sutton@gmail.com

[1] https://svn.boost.org/svn/boost/sandbox/xml/
[2] http://www.boost.org/development/requirements.html

Rene Rivera

2:03 p.m.

Andrew Sutton wrote:

...

My name is Ilie Halip.. And last but not least... I'm actually employed right now, but my superiors

...
are willing to give me more than 3 months of time off to be able to work for GSoC. Is that ok? Should I ask for more? I was looking over the timeline, and thought about getting a vacation between May 15th and August 30th. Would that be enough? Is it also alright if I stop working on the project for about a week in early June? Because I have a few exams then.

I don't think we've ever had a student request time off from a job before. GSoC runs May 20 thru Aug 20, so that's about 3 months.

We have had such circumstances come up before. The usual arrangement is that students need to start early to make up for the lost time.

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org (msn) - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim,yahoo,skype,efnet,gmail

Ilie Halip

2:20 p.m.

On Sat, Mar 20, 2010 at 4:03 PM, Rene Rivera <grafikrobot@gmail.com> wrote:

...

Andrew Sutton wrote:

...
My name is Ilie Halip..

And last but not least... I'm actually employed right now, but my superiors

...
are willing to give me more than 3 months of time off to be able to work for GSoC. Is that ok? Should I ask for more? I was looking over the timeline, and thought about getting a vacation between May 15th and August 30th. Would that be enough? Is it also alright if I stop working on the project for about a week in early June? Because I have a few exams then.

I don't think we've ever had a student request time off from a job before. GSoC runs May 20 thru Aug 20, so that's about 3 months.

We have had such circumstances come up before. The usual arrangement is that students need to start early to make up for the lost time.

Well, in the timeline, it says that the coding should begin on May 24th. Until then, I am of course able to read documentation, check examples, use the APIs myself, and get acquainted with everything I need to work on this project. I already did a svn checkout of the boost repository and Stefan's sandbox project, and I looked around the source code.

@Andrew: I only mentioned libxml2 because Stefan's boost::xml is using it too, and I thought that this project involves developing it further. But I think we need his opinion on this issue. :) BTW, will he mentor the project, or somebody else?

Regards,
Ilie.

Jens Weller

4:27 p.m.

Hi, just my thoughts on that. Currently I'm working on a Project where xerces is used, but I'm also familiar with other parsers.

...

How much working knowledge should a student have before trying to work on such a project? I must admit I haven't used Boost, nor libxml2 in the past, so I guess I should start reading documentation about both. I am good in C++, and I've been working with XML APIs for years, so that's not a problem. But the other two (Boost+libxml2) are?

That would be a nice attempt, but I would favor an approach like the one Arabica takes; being able to switch the parser underneath the framework is always good. Also note that Boost already includes another XML parser with property_tree, RapidXML (not sure it's exposed). Also, RapidXML doesn't offer SAX. Also the old parsers (Xerces, libxml2, expat and others) are usually built for single-threaded parsing; maybe with Spirit it would be possible to write a parser that could parse a document in multiple threads.

regards,
Jens Weller

Phil Endecott

11 p.m.

Ilie Halip wrote:

...

I have a few questions about the Boost.XML project.

First, what actually needs to be done?

Shall we have another thread about what a good C++ XML library would look like? It's been a while since the last one... I have done a couple of projects using rapidxml, and until recently my feeling was that it was close to the best design. If you're not familiar with it, it holds the XML in memory (e.g. as a memory-mapped file) and does a single-pass parse that builds up a tree that points into the original XML for the strings. This is fast and reasonably memory-efficient. However, recently I needed something that used less memory. I wanted to process a very large file without ever having all of it in memory (imagine e.g. loading a database). So I wrote something where the element and attribute iterators (etc.) are just pointers into the (memory-mapped) XML source. When an iterator is incremented it steps through the source looking (textually) for the start of the next element or attribute (etc.). The result is something that uses almost no memory and is fast for the sorts of access pattern that I needed. An interesting observation is that both a rapidxml-like method and my new method could have very similar interfaces, albeit with different complexity (c.f. std::vector vs. std::list). So it is interesting to consider whether something like an XPath engine could be designed in terms of an interface to multiple back-end "XML containers", if they shared the same interface. In fact, something "XPath-like" but also more "C++-like" would be the next step to improve the "user" code in my application. Currently I have too much verbose iteration looking for the elements that I want. It would be great to have a XPath-like DSL for finding these elements. (An application for Proto?) Regards, Phil.
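
For readers who want something concrete, the pointer-into-source idea described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code from Phil's library: the names `element_cursor` and `element_names` are made up here, the cursor walks start tags in document order rather than siblings of one element, and comments, CDATA sections and DTDs that contain a `<` are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: the "iterator" is just a pointer into the XML text.
// Advancing it scans forward textually for the next start tag, so no tree
// is ever built and memory use stays constant.
class element_cursor {
public:
    element_cursor(const char* begin, const char* end) : pos_(begin), end_(end) {
        if (!at_start_tag()) advance();
    }

    bool done() const { return pos_ == end_; }

    // Name of the element whose start tag the cursor currently points at.
    std::string name() const {
        assert(!done());
        const char* first = pos_ + 1;  // skip '<'
        const char* last = first;
        while (last != end_ && *last != '>' && *last != '/' &&
               !std::isspace(static_cast<unsigned char>(*last))) {
            ++last;
        }
        return std::string(first, last);
    }

    // Step to the next start tag in document order.
    void advance() {
        if (pos_ != end_) ++pos_;
        while (pos_ != end_ && !at_start_tag()) ++pos_;
    }

private:
    bool at_start_tag() const {
        if (pos_ == end_ || *pos_ != '<') return false;
        const char* next = pos_ + 1;
        if (next == end_) return false;
        // "</" closes an element; "<!" and "<?" introduce markup we skip.
        return *next != '/' && *next != '!' && *next != '?';
    }

    const char* pos_;
    const char* end_;
};

// Convenience wrapper: collect every element name in document order.
inline std::vector<std::string> element_names(const std::string& xml) {
    std::vector<std::string> names;
    const char* begin = xml.data();
    for (element_cursor c(begin, begin + xml.size()); !c.done(); c.advance()) {
        names.push_back(c.name());
    }
    return names;
}
```

With a memory-mapped file, `begin` and `end` would simply be the bounds of the mapping; the cursor itself never allocates until a name is copied out.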

Mathias Gaunard

21 Mar

1:35 a.m.

Phil Endecott wrote:

...

However, recently I needed something that used less memory. I wanted to process a very large file without ever having all of it in memory (imagine e.g. loading a database). So I wrote something where the element and attribute iterators (etc.) are just pointers into the (memory-mapped) XML source. When an iterator is incremented it steps through the source looking (textually) for the start of the next element or attribute (etc.). The result is something that uses almost no memory and is fast for the sorts of access pattern that I needed.

This is exactly the kind of approach I find interesting. Could you share this code?

Phil Endecott

1:25 p.m.

Mathias Gaunard wrote:

...

Phil Endecott wrote:

...
However, recently I needed something that used less memory. I wanted to process a very large file without ever having all of it in memory (imagine e.g. loading a database). So I wrote something where the element and attribute iterators (etc.) are just pointers into the (memory-mapped) XML source. When an iterator is incremented it steps through the source looking (textually) for the start of the next element or attribute (etc.). The result is something that uses almost no memory and is fast for the sorts of access pattern that I needed.

This is exactly the kind of approach I find interesting. Could you share this code?

Oh dear, I feared someone would ask that. OK, as long as you promise to remember that I wrote this in a hurry, you can find it here:

http://svn.chezphil.org/libpbe/trunk/include/rxml.hh
http://svn.chezphil.org/libpbe/trunk/src/rxml.cc

That currently says "GPL" but I could change that to Boost if people want.

Regards, Phil.

Jose

1:40 a.m.

On Sun, Mar 21, 2010 at 12:00 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:

...

In fact, something "XPath-like" but also more "C++-like" would be the next step to improve the "user" code in my application. Currently I have too much verbose iteration looking for the elements that I want. It would be great to have a XPath-like DSL for finding these elements. (An application for Proto?)

Have you tried pugixml? http://code.google.com/p/pugixml/

Ilie Halip

7:44 a.m.

...

So I wrote something where the element and attribute iterators (etc.) are just pointers into the (memory-mapped) XML source. When an iterator is incremented it steps through the source looking (textually) for the start of the next element or attribute (etc.). The result is something that uses almost no memory and is fast for the sorts of access pattern that I needed.

That's a great idea! If possible, I'd also like to have a look at the code.

...

Also note that Boost already includes another XML parser with property_tree, RapidXML (not sure it's exposed). Also, RapidXML doesn't offer SAX.

There's been a talk on IRC about property_tree. The main point was that property_tree reads an xml, but Boost.XML is all about XML specs and related APIs.

...

maybe with spirit it would be possible to write a Parser that could parse a document in multiple threads.

What would be the approach here? For a DOM parser, I can imagine each child of the current element being parsed in separate thread (with a max. of 10 threads, for example). But a SAX parser - how would it work multithreaded? I'm stumped with all your great ideas, and right now I don't know if all of them can be done by only one person in 3 months. In this case, I think it's best we ask the mentor for this project, which I believe is Stefan. I'll ask him to join the discussion if I see him on IRC. 8 days from now, the student application period begins, and until then I want to be sure what's expected of me, so I can write a good proposal. And by the way, is it ok if I ask you for suggestions writing it? Regards, Ilie.

Andrew Sutton

3:18 p.m.

...

I'm stumped with all your great ideas, and right now I don't know if all of them can be done by only one person in 3 months. In this case, I think it's best we ask the mentor for this project, which I believe is Stefan. I'll ask him to join the discussion if I see him on IRC. 8 days from now, the student application period begins, and until then I want to be sure what's expected of me, so I can write a good proposal. And by the way, is it ok if I ask you for suggestions writing it?

I would strongly encourage asking for suggestions and asking for reviews of draft proposals. Hopefully, this will give you a better idea of what you should be doing and give the potential mentors an idea of what you want to be doing.

I would start by writing down a list of ideas that have been discussed so far (multiple parsing backends, interesting parsing techniques, supported APIs, etc.). This will help you get a feel for the breadth of possible projects. That should help you (and us!) figure out the best approach for a 3 month project.

Andrew Sutton
andrew.n.sutton@gmail.com

Phil Endecott

1:46 p.m.

Jose wrote:

...

On Sun, Mar 21, 2010 at 12:00 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:

...
In fact, something "XPath-like" but also more "C++-like" would be the next step to improve the "user" code in my application. Currently I have too much verbose iteration looking for the elements that I want. It would be great to have a XPath-like DSL for finding these elements. (An application for Proto?)

Have you tried pugixml?

http://code.google.com/p/pugixml/

Thanks; I have seen it, but I hadn't noticed that it has an XPath implementation. (In other respects, it is similar in concept to RapidXML i.e. it needs O(N) memory.) From its examples:

    // You can use a sometimes convenient path function
    cout << doc.first_element_by_path("bookstore/book/price").child_value() << endl;

    // And you can use powerful XPath expressions
    cout << doc.select_single_node("/bookstore/book[@title = 'ShaderX']/price").node().child_value() << endl;

    // Compile query that prints total price of all Gems book in store
    xpath_query query("sum(/bookstore/book[contains(@title, 'Gems')]/price)");
    cout << query.evaluate_number(doc) << endl;

I'm not convinced that a full XPath implementation is really useful (e.g. that final example using the XPath sum() function and the contains() test; I think it would be better to do those in C++). But some subset, implemented as a DSL, would be good:

    element e = doc.first_element_matching("bookstore" / "book" / "price");

That could expand to something that assumed doc modelled an 'XML container' concept, as in:

    element_iterator i = find(doc.child_elements().begin(), doc.child_elements().end(), "bookstore");
    if (i == doc.child_elements().end()) throw NotFound();
    element_iterator j = find(i->child_elements().begin(), i->child_elements().end(), "book");
    if (j == i->child_elements().end()) throw NotFound();
    element_iterator k = find(j->child_elements().begin(), j->child_elements().end(), "price");
    if (k == j->child_elements().end()) throw NotFound();
    element e = *k;

I guess that if there's something I want to propose, it's that we consider what this XML container concept should look like. In particular, I think that DOM is not what we want because it's too dissimilar to other C++ concepts (i.e. iterators). (An adaptor for DOM would be possible, however.)

Regards, Phil.
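
One way the path-style lookup above might be made to compile is sketched below. Everything here is hypothetical: `element`, `path`, `not_found` and the free function `first_element_matching` are illustrative names, the toy `element` struct merely stands in for whatever would model the "XML container" concept, and the first step is wrapped in `path(...)` because C++ cannot overload `/` for two string literals. Like the hand-written expansion, it follows only the first matching child at each step and does not backtrack.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy in-memory element (an assumption for this sketch, not a proposed API).
struct element {
    std::string name;
    std::string text;
    std::vector<element> children;
};

// A path is an ordered list of element names, built up with operator/.
struct path {
    std::vector<std::string> steps;
    explicit path(const std::string& first) : steps(1, first) {}
};

inline path operator/(path lhs, const std::string& rhs) {
    lhs.steps.push_back(rhs);
    return lhs;
}

struct not_found : std::runtime_error {
    explicit not_found(const std::string& step)
        : std::runtime_error("no child element named " + step) {}
};

// Descends one step at a time, taking the first child whose name matches.
// Throws not_found as soon as a step has no matching child.
inline const element& first_element_matching(const element& root, const path& p) {
    const element* current = &root;
    for (std::size_t s = 0; s < p.steps.size(); ++s) {
        const element* next = 0;
        for (std::size_t c = 0; c < current->children.size(); ++c) {
            if (current->children[c].name == p.steps[s]) {
                next = &current->children[c];
                break;
            }
        }
        if (!next) throw not_found(p.steps[s]);
        current = next;
    }
    return *current;
}
```

A call would then read `first_element_matching(doc, path("bookstore") / "book" / "price")`. A Proto-based version could presumably build the same step list at compile time, but that is beyond this sketch.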

Stefan Seefeld

22 Mar

7:02 p.m.

On 03/20/2010 07:00 PM, Phil Endecott wrote:

...

Ilie Halip wrote:

...
I have a few questions about the Boost.XML project.

First, what actually needs to be done?

Shall we have another thread about what a good C++ XML library would look like? It's been a while since the last one...

I have done a couple of projects using rapidxml, and until recently my feeling was that it was close to the best design. If you're not familiar with it, it holds the XML in memory (e.g. as a memory-mapped file) and does a single-pass parse that builds up a tree that points into the original XML for the strings. This is fast and reasonably memory-efficient.

How does it deal with input needing "preprocessing", such as entity substitution, or (X)inclusion ? Also, this clearly only works with immutable input.

...

However, recently I needed something that used less memory. I wanted to process a very large file without ever having all of it in memory (imagine e.g. loading a database). So I wrote something where the element and attribute iterators (etc.) are just pointers into the (memory-mapped) XML source. When an iterator is incremented it steps through the source looking (textually) for the start of the next element or attribute (etc.). The result is something that uses almost no memory and is fast for the sorts of access pattern that I needed.

But to what degree is that really XML ? In addition to the above concerns, there are other aspects that may result in the generated infoset to differ from the XML storage. For example, attributes with default values (as per DTD), should arguably still be "seen" by an attribute iterator, while a naive iteration over the explicitly spelled-out attributes won't. Etc. In short, I can certainly see and appreciate cases where your approach has advantages. The same is true for lots of other approaches. However, none of these should really claim to be an XML API, if it doesn't allow to support the full spec.

...

An interesting observation is that both a rapidxml-like method and my new method could have very similar interfaces, albeit with different complexity (c.f. std::vector vs. std::list). So it is interesting to consider whether something like an XPath engine could be designed in terms of an interface to multiple back-end "XML containers", if they shared the same interface.

In fact, something "XPath-like" but also more "C++-like" would be the next step to improve the "user" code in my application. Currently I have too much verbose iteration looking for the elements that I want. It would be great to have a XPath-like DSL for finding these elements. (An application for Proto?)

The same applies here. XPath is a well defined specification. While I can definitely see not everyone needing all its features, I think it's a very bad idea to even consider going down that route where you get tons of "XPath-like" APIs, all mutually incompatible in their features and approaches. Stefan -- ...ich hab' noch einen Koffer in Berlin...

Phil Endecott

8:35 p.m.

Hi Stefan,

First let me say that I fully understand that there are many different applications of XML. I get the feeling that you and I have probably encountered different subsets of them. My belief is that there are different legitimate types of XML library to support the different kinds of application. An open question is whether a common API, or at least a common API-subset or collection of concepts, could support those different libraries.

My recollection from before was that I felt a libxml2 wrapper could not be usefully-compatible with the approaches that I preferred, but you disagreed. I don't think it would be useful to re-visit that "bikeshed discussion" now, not least because I have forgotten most of the details...

Stefan Seefeld wrote:

...

On 03/20/2010 07:00 PM, Phil Endecott wrote:

...
I have done a couple of projects using rapidxml, and until recently my feeling was that it was close to the best design. If you're not familiar with it, it holds the XML in memory (e.g. as a memory-mapped file) and does a single-pass parse that builds up a tree that points into the original XML for the strings. This is fast and reasonably memory-efficient.

How does it deal with input needing "preprocessing", such as entity substitution, or (X)inclusion ?

"it" here meaning rapidxml. In one mode of operation, it replaces entities (i.e. &lt;) during parsing; this obviously breaks the idea of not using much RAM since the mmapped file will copy-on-write pages as this happens. In another mode it doesn't do this and leaves it as a job for the user.

In my library I have an iterator that processes a text node decoding entities as they are encountered. This currently only recognises the "default" entities i.e. lt, gt, amp, quot, apos and numerics. It would be possible to extend this to decode entities declared in the document, if that were necessary, but it's not something I've ever needed to do.

I believe that a lot of XML features like entity declarations and namespaces declared not in the root element are painful precisely because they are tedious to implement, detrimental to performance, and never used in real-world XML documents. My guess is that you would disagree with that.

Neither rapidxml nor my library supports xinclude. In my case, I can imagine adding it by modifying the element iterator such that dereferencing an xi:include element would open the referenced document and return its root element.

...

Also, this clearly only works with immutable input.

I think rapidxml lets you modify a document; it must allocate storage for the new strings somewhere and update its tree to point to them. My library does not allow this. I don't think I've ever needed to modify an XML document: I have only either read in or written out a file.

...

...
However, recently I needed something that used less memory. I wanted to process a very large file without ever having all of it in memory (imagine e.g. loading a database). So I wrote something where the element and attribute iterators (etc.) are just pointers into the (memory-mapped) XML source. When an iterator is incremented it steps through the source looking (textually) for the start of the next element or attribute (etc.). The result is something that uses almost no memory and is fast for the sorts of access pattern that I needed.

But to what degree is that really XML ? In addition to the above concerns, there are other aspects that may result in the generated infoset to differ from the XML storage. For example, attributes with default values (as per DTD), should arguably still be "seen" by an attribute iterator, while a naive iteration over the explicitly spelled-out attributes won't. Etc.

Default attribute values defined in a DTD are an excellent example of an XML misfeature not used in any XML application that I care about that simply result in XML processors being more complex and slower than they would otherwise need to be. (Please feel free to list any XML applications that make use of them.) However, I wouldn't say that these features are fundamentally incompatible with my approach in this library. It's only necessary that when you look up an attribute, the returned range somehow includes pseudo-elements corresponding to the default attributes.
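
To make the default-attribute point concrete, here is a minimal sketch of how explicitly written attributes could be combined with defaults taken from a DTD. It is an eager merge into a map rather than the lazy range described above, and the names `attribute_map` and `effective_attributes` are hypothetical; how the defaults get extracted from the DTD is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<std::string, std::string> attribute_map;

// Returns the attributes as an infoset consumer would see them:
// values written in the start tag win, DTD defaults fill in the rest.
inline attribute_map effective_attributes(const attribute_map& explicit_attrs,
                                          const attribute_map& dtd_defaults) {
    attribute_map result = explicit_attrs;
    // std::map::insert never overwrites an existing key, so explicit
    // values are preserved and only missing attributes are added.
    result.insert(dtd_defaults.begin(), dtd_defaults.end());
    return result;
}
```

An attribute iterator built on top of this would not need to distinguish the two sources, at the cost of one map copy per element.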

...

In short, I can certainly see and appreciate cases where your approach has advantages. The same is true for lots of other approaches. However, none of these should really claim to be an XML API, if it doesn't allow to support the full spec.

OK, I won't call it an XML API if that will make you happy :-) For the record, rapidxml doesn't even support namespaces. Nor does pugixml. I do support namespaces but it involves work on the user-side if you want to recognise namespace declarations below the root element. Pugixml, like my library, will not even successfully skip over the DOCTYPE in some cases due to its complex syntax.

...

...
An interesting observation is that both a rapidxml-like method and my new method could have very similar interfaces, albeit with different complexity (c.f. std::vector vs. std::list). So it is interesting to consider whether something like an XPath engine could be designed in terms of an interface to multiple back-end "XML containers", if they shared the same interface.

In fact, something "XPath-like" but also more "C++-like" would be the next step to improve the "user" code in my application. Currently I have too much verbose iteration looking for the elements that I want. It would be great to have a XPath-like DSL for finding these elements. (An application for Proto?)

The same applies here. XPath is a well defined specification. While I can definitely see not everyone needing all its features, I think it's a very bad idea to even consider going down that route where you get tons of "XPath-like" APIs, all mutually incompatible in their features and approaches.

OK, I won't call it an "XPath-like" API. I'll just call it a convenient syntax for extracting the interesting elements from a XMLwithoutthemisfeatures document. Regards, Phil.

Stefan Seefeld

9:08 p.m.

On 03/22/2010 04:35 PM, Phil Endecott wrote:

...

Hi Stefan,

First let me say that I fully understand that there are many different applications of XML. I get the feeling that you and I have probably encountered different subsets of them. My belief is that there are different legitimate types of XML library to support the different kinds of application.

While I agree with that, that wasn't quite my point. Rather, I tried to point out that you couldn't only support a subset of XML, and still claim to provide an XML library.

...

...
How does it deal with input needing "preprocessing", such as entity substitution, or (X)inclusion ?

"it" here meaning rapidxml. In one mode of operation, it replaces entities (i.e. &lt;) during parsing; this obviously breaks the idea of not using much RAM since the mmapped file will copy-on-write pages as this happens. In another mode it doesn't do this and leaves it as a job for the user.

If the user is exposed to it, I would argue this is not a sufficient API to call itself "XML bindings". The spec has some rather specific discussion on what ought to be done at parsing, and what the result would be (e.g. http://www.w3.org/TR/xml-infoset/). I strongly object to an "XML library" that offers something else. (To be clear: I certainly don't object to such libraries in themselves, but please don't confuse "XML" with "XML-like".)

...

In my library I have an iterator that processes a text node decoding entities as they are encountered. This currently only recognises the "default" entities i.e. lt, gt, amp, quot, apos and numerics. It would be possible to extend this to decode entities declared in the document, if that were necessary, but it's not something I've ever needed to do.

Fine. Again, the XML spec clearly defines when and how entities ought to be handled (http://www.w3.org/TR/REC-xml/#entproc). And to the degree that this processing is specified, an XML library ought to honor it.
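
As a rough illustration of the limited decoding discussed here (the five predefined entities plus numeric character references, and nothing declared in a DTD), a sketch might look like the following. The function names are invented for this example, output is assumed to be UTF-8, surrogate code points are not rejected, and unknown or malformed references are passed through unchanged, which a conforming processor would instead report as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of code point cp (assumed <= 0x10FFFF) to out.
inline void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline int digit_value(char c, bool hex) {
    if (c >= '0' && c <= '9') return c - '0';
    if (hex && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (hex && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// ref is the text between '&' and ';', e.g. "#65" or "#x41".
inline bool parse_char_ref(const std::string& ref, unsigned long& cp) {
    if (ref.size() < 2 || ref[0] != '#') return false;
    const bool hex = (ref[1] == 'x');
    std::size_t i = hex ? 2 : 1;
    if (i >= ref.size()) return false;
    cp = 0;
    for (; i < ref.size(); ++i) {
        const int v = digit_value(ref[i], hex);
        if (v < 0) return false;
        cp = cp * (hex ? 16 : 10) + static_cast<unsigned long>(v);
        if (cp > 0x10FFFF) return false;
    }
    return true;
}

// Decodes lt, gt, amp, quot, apos and numeric references in a text node.
inline std::string decode_entities(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '&') { out += in[i++]; continue; }
        const std::size_t semi = in.find(';', i);
        if (semi == std::string::npos) {  // stray '&': copy the rest as-is
            out.append(in, i, std::string::npos);
            break;
        }
        const std::string ref = in.substr(i + 1, semi - i - 1);
        unsigned long cp = 0;
        if (ref == "lt") out += '<';
        else if (ref == "gt") out += '>';
        else if (ref == "amp") out += '&';
        else if (ref == "quot") out += '"';
        else if (ref == "apos") out += '\'';
        else if (parse_char_ref(ref, cp)) append_utf8(out, cp);
        else out.append(in, i, semi - i + 1);  // unknown: pass through
        i = semi + 1;
    }
    return out;
}
```

Supporting entities declared in the document would mean replacing the fixed `if` chain with a lookup table filled while reading the DTD.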

...

I believe that a lot of XML features like entity declarations and namespaces declared not in the root element are painful precisely because they are tedious to implement, detrimental to performance, and never used in real-world XML documents. My guess is that you would disagree with that.

I don't disagree, but I think that the world doesn't need yet another library that supports some Not-Quite-XML.

...

Neither rapidxml nor my library supports xinclude. In my case, I can imagine adding it by modifying the element iterator such that dereferencing an xi:include element would open the referenced document and return its root element.

That, too, is not conforming to the XML spec (http://www.w3.org/TR/xinclude/#processing)

...

...
Also, this clearly only works with immutable input.

I think rapidxml lets you modify a document; it must allocate storage for the new strings somewhere and update its tree to point to them. My library does not allow this. I don't think I've ever needed to modify an XML document: I have only either read in or written out a file.

Again: that's fine, and I agree it would be great for a boost.xml library to optimize for that case. However, I don't think it should optimize for it by disallowing the infoset to be modified.

...

Default attribute values defined in a DTD are an excellent example of an XML misfeature not used in any XML application that I care about that simply result in XML processors being more complex and slower than they would otherwise need to be. (Please feel free to list any XML applications that make use of them.)

Same argument. You may not care, but others do.

...

However, I wouldn't say that these features are fundamentally incompatible with my approach in this library. It's only necessary that when you look up an attribute, the returned range somehow includes pseudo-elements corresponding to the default attributes.

I certainly expect an attribute iterator to make no distinction between explicitly specified attributes and default attributes. The XML spec has a clear definition of an InfoSet, and what of an XML file actually is semantically relevant and what is not. I want boost.xml to honor those semantics. Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld

5:43 p.m.

Ilie,

After having written that Boost.XML GSoC idea, I have started looking more into the existing (sandbox) project. Doing this I realized that the project is in better shape than I feared. Thus, I think most of the work required there is about documenting the existing design and API, and then following through and submitting it for review as a new boost component. I think that is not something that may be done as part of a GSoC...

Of course, bringing up the topic of XML support in C++ (boost, specifically) will typically draw in lots of attention from all corners, and result in a big bikeshed discussion. I'd rather try to avoid getting into those arguments again, as I feel like I have wasted enough time on that in the past.

The approach I have taken is to provide a (thin) wrapper around an existing (well tested and supported, and very fast) library (libxml2), so I don't have to reinvent those wheels again. Please note that none of this should leak through the API, i.e. the API can be re-implemented differently without users having to notice.

If you would like to get involved, I would be happy. But I shall retract it from the list of GSoC ideas, for the aforementioned reasons.

Thanks,
Stefan

-- ...ich hab' noch einen Koffer in Berlin...

Andrew Sutton

6:55 p.m.

...

The approach I have taken is to provide a (thin) wrapper around an existing (well tested and supported, and very fast) library (libxml2), so I don't have to reinvent those wheels again. Please note that none of this should leak through the API, i.e. the API can be re-implemented differently without users having to notice.

This is similar to the direction of the BigInt proposals, which also suffer from bikeshed discussions :) I think that this is the right direction. Abstracting the interface leaves room for lots of other (more user-specific) parsers or frameworks.

...

If you would like to get involved, I would be happy. But I shall retract it from the list of GSoC ideas, for the aforementioned reasons.

I hope you'll reconsider. I think having a student build another back end or two and work on polishing the interface would make a pretty good summer project. Andrew Sutton andrew.n.sutton@gmail.com

Stefan Seefeld

7:11 p.m.

On 03/22/2010 02:55 PM, Andrew Sutton wrote:

...

If you would like to get involved, I would be happy. But I shall retract it from the list of GSoC ideas, for the aforementioned reasons.

I hope you'll reconsider. I think having a student build another back end or two and work on polishing the interface would make a pretty good summer project.

Hmm, that does indeed sound like a good idea. How can we phrase that to be productive, though ? What I would certainly support (and mentor) is a project where another XML "backend" is provided next to libxml2, and by doing this helping us make sure the (prospective) boost.xml API is indeed robust enough for such a switch (similar to boost.mpi, perhaps). How does this sound ? Is there any interest in this ? Such an endeavor may also help rebut some of the arguments against the current approach, and thus help move us further towards an acceptable and official solution. Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Andrew Sutton

9:42 p.m.

...

...
...
I hope you'll reconsider. I think having a student build another back end or two and work on polishing the interface would make a pretty good summer project.

Hmm, that does indeed sound like a good idea. How can we phrase that to be productive, though ?

What I would certainly support (and mentor) is a project where another XML "backend" is provided next to libxml2, and by doing this helping us make sure the (prospective) boost.xml API is indeed robust enough for such a switch (similar to boost.mpi, perhaps).

I think that Apache's xerces would make a good second such target. MSXML would probably be a better option. Expat... Maybe. I think that the project depends on the interface that you want to expose. I'd like to see a uniform method for opening XML files (preparing them to be parsed?), opening buffers for parsing, saving files?, etc. I'm guessing that a SAX-based interface wouldn't be too hard. I can't imagine that SAX providers expose wildly different APIs.

...

How does this sound ? Is there any interest in this ? Such an endeavor may also help rebut some of the arguments against the current approach, and thus help move us further towards an acceptable and official solution.

Well, I'm interested :) I think design by committee is the wrong way to go for these libraries (XML, BigInt, etc.). I say wrap up the stuff that's out there and figure out where the problems are as we go. Nothing's ever perfect the first time you do it, and this isn't one of those times where it needs to be. Andrew Sutton andrew.n.sutton@gmail.com

Stefan Seefeld

9:53 p.m.

On 03/22/2010 05:42 PM, Andrew Sutton wrote:

...

I think that Apache's xerces would make a good second such target. MSXML would probably be a better option. Expat... Maybe.

OK. The only problem I can see with MSXML is that it's a platform-specific library, which constrains our ability to use (and test) it.

...

I think that the project depends on the interface that you want to expose. I'd like to see a uniform method for opening XML files (preparing them to be parsed?), opening buffers for parsing, saving files?, etc.

My DOM API has one function to generate a DOM-tree from an XML file, and one to generate an XML file from a DOM-tree. I'm not sure anything else is needed, as far as parsing is concerned. Of course, SAX and XMLReader are a different story...

...

I'm guessing that a SAX-based interface wouldn't be too hard. I can't imagine that SAX providers expose wildly different APIs.

Right, though there are non-SAX (post-SAX ?) APIs that are considered better than SAX, notably XMLReader.

...

How does this sound ? Is there any interest in this ? Such an endeavor may also help rebut some of the arguments against the current approach, and thus help move us further towards an acceptable and official solution.

Well, I'm interested :) I think design by committee is the wrong way to go for these libraries (XML, BigInt, etc.). I say wrap up the stuff that's out there and figure out where the problems are as we go. Nothing's ever perfect the first time you do it, and this isn't one of those times where it needs to be.

True. OK, let me put it this way: The goal here should be to define an API. I have an existing implementation that wraps libxml2. I welcome any proposal for alternate implementations / backends. Be warned, though, that I consider XML compliance a prerequisite, not an optional feature. :-) Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Andrew Sutton

10:32 p.m.

...

Well, I'm interested :)

...
True. OK, let me put it this way: The goal here should be to define an API. I have an existing implementation that wraps libxml2. I welcome any proposal for alternate implementations / backends. Be warned, though, that I consider XML compliance a prerequisite, not an optional feature. :-)

Well, apparently my background knowledge isn't up to par :) but I still think this is a good GSoC project. I'm curious to see what Ilie is taking from this discussion. Hopefully we're not scaring people away ;) Andrew Sutton andrew.n.sutton@gmail.com

Ilie Halip

23 Mar 2010

9:14 a.m.

...

I'm curious to see what Ilie is taking from this discussion. Hopefully we're not scaring people away ;)

No, not really - though I do admit the talk around here has been a little confusing. :) I didn't have time to reply yesterday, sorry. I'm not sure this kind of project would have enough complexity to actually work on for more than 2-3 weeks. Except for the actual calls to the underlying XML parser, there are few modifications to be done (API change to support multiple backends, maybe detecting them at runtime?, setting options if needed). There will be problems when the backend parser doesn't support a necessary feature (like TinyXML not supporting validation), when implementation details make it difficult to use (like MSXML requiring COM, which means CoInitialize() calls on each thread that's using it), or when the parsers with enough features have huge binary packages. A first step would be researching different 3rd party XML libraries and making some choices. And I fear it's me who has to make those choices, because I have to write the proposal. Any hints in this direction are greatly appreciated. Regards, Ilie.

Stefan Seefeld

12:34 p.m.

On 03/23/2010 05:14 AM, Ilie Halip wrote:

...

...
I'm curious to see what Ilie is taking from this discussion. Hopefully we're not scaring people away ;)

No, not really - though I do admit the talk around here has been a little confusing. :) I didn't have time to reply yesterday, sorry.

I'm not sure this kind of project would have enough complexity to actually work on for more than 2-3 weeks. Except for the actual calls to the underlying XML parser, there are few modifications to be done (API change to support multiple backends, maybe detecting them at runtime?, setting options if needed).

Let me write down a little "shopping list" of things that I consider worthwhile doing, based on the existing boost.xml sandbox project:

* Add tests, to provide better coverage (different string types, different input; test error reporting, etc.)
* Add examples, to show the different features
* Add at least one new backend (implementation), and make any required modifications to the API to support this abstraction. (I made every effort to keep the current API agnostic to the backend used, but since I didn't really attempt to work with something other than libxml2, there may well be aspects where the backend "leaks" through.)

...

There will be problems when the backend parser doesn't support a necessary feature (like TinyXML not supporting validation), when implementation details make it difficult to use (like MSXML requiring COM, which means CoInitialize() calls on each thread that's using it), or when the parsers with enough features have huge binary packages.

I don't expect the result of the additional backend binding to necessarily be very appealing. After all, there is a reason why I chose libxml2: At the time I found this to be the best choice. The main incentive for this work is to make sure the backend *may* be switched without affecting the API.

...

A first step would be researching different 3rd party XML libraries and making some choices. And I fear it's me who has to make those choices, because I have to write the proposal. Any hints in this direction are greatly appreciated.

Yes, doing this research should be part of the GSoC project. One rather obvious candidate seems to be Xerces (http://xerces.apache.org/xerces-c/). It has been around for a long time (as long as libxml2), and thus can be assumed stable and feature-complete. Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Andrew Sutton

2 p.m.

...

I'm not sure this kind of project would have enough complexity to actually work on for more than 2-3 weeks. Except for the actual calls to the underlying XML parser, there are few modifications to be done (API change to support multiple backends, maybe detecting them at runtime?, setting options if needed).

You may be underestimating the amount of effort required to help make a library Boost-ready :) I think this project has the potential to yield good results for Boost. Obviously, the library wouldn't be reviewed for a year or so, but with the new distributed development model, that may be less of an issue. Also, if accepted, there's nothing to prevent you from expanding the scope once your mentor feels that you've completed your core goals. Andrew Sutton andrew.n.sutton@gmail.com

Ilie Halip

24 Mar 2010

10:24 a.m.

...

Let me write down a little "shopping list" of things that I consider worthwhile doing, based on the existing boost.xml sandbox project:

* Add tests, to provide better coverage (different string types, different input; test error reporting, etc.)
* Add examples, to show the different features
* Add at least one new backend (implementation), and make any required modifications to the API to support this abstraction. (I made every effort to keep the current API agnostic to the backend used, but since I didn't really attempt to work with something other than libxml2, there may well be aspects where the backend "leaks" through.)

Those sound ok. Are you also thinking about switching backends at runtime? Because to me it makes more sense like that, rather than linking to 2-3 specific libraries. Plus, other parsers could be easily integrated by any boost::xml user. In any case, I'll come up with a proposal draft sometime in the next few days. Hope you guys can help me with suggestions. :) Regards, Ilie.

Stefan Seefeld

10:57 a.m.

On 03/24/2010 06:24 AM, Ilie Halip wrote:

...

Are you also thinking about switching backends at runtime?

No.

...

Because to me it makes more sense like that, rather than linking to 2-3 specific libraries. Plus, other parsers could be easily integrated by any boost::xml user.

I think you may have misunderstood. The choice of backend needs to be made at compile-time. Being able to switch at runtime would incur a significant runtime overhead, and I can't see any use-case for that. The exercise here is to make sure the boost.xml interface hides backend details sufficiently well that it is possible to have different implementations. Aside from that, the wrapper should be as thin as possible, to preserve performance and small memory footprint.
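As an illustration of compile-time backend selection, here is a minimal C++ sketch. It is written under assumptions and is not taken from the boost.xml sandbox: backend_a and backend_b are toy stand-ins for real bindings, and basic_document, document and the macro XML_SKETCH_USE_BACKEND_B are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for real bindings (hypothetical names). A real backend
// would call into libxml2 or Xerces-C++ instead of storing the text.
struct backend_a
{
    typedef std::string handle;     // backend-specific document type
    static const char* name() { return "backend_a"; }
    static handle parse(const std::string& text) { return text; }
    static std::string serialize(const handle& h) { return h; }
};

struct backend_b
{
    struct handle { std::string text; };
    static const char* name() { return "backend_b"; }
    static handle parse(const std::string& text)
    {
        handle h;
        h.text = text;
        return h;
    }
    static std::string serialize(const handle& h) { return h.text; }
};

// The user-facing class forwards to the backend through static calls that
// the compiler can inline: no virtual dispatch and no runtime switch.
template <typename Backend>
class basic_document
{
public:
    explicit basic_document(const std::string& text)
        : handle_(Backend::parse(text))
    {
        assert(!text.empty());      // this sketch expects non-empty input
    }
    std::string str() const { return Backend::serialize(handle_); }
    static const char* backend_name() { return Backend::name(); }
private:
    typename Backend::handle handle_;
};

// The build picks exactly one backend; user code only names `document`.
#if defined(XML_SKETCH_USE_BACKEND_B)   // hypothetical configuration macro
typedef basic_document<backend_b> document;
#else
typedef basic_document<backend_a> document;
#endif
```

User code written against document compiles unchanged with either backend, which is the property the exercise is meant to verify. A runtime-switchable design would instead need an abstract base class and a virtual call on every operation.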

...

In any case, I'll come up with a proposal draft sometime in the next few days. Hope you guys can help me with suggestions. :)

Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Ilie Halip

28 Mar 2010

9:26 a.m.

Well, I've looked over a few projects, but I haven't been able to find a candidate better than Xerces. I made a list of desired features (such as validation, error reporting, ability to parse XML chunks from strings, availability of a DOM-like interface) and then browsed documentation, looked at examples, and read through header files... Specifically, I looked at TinyXml, expat, ezXml, mini-xml, Parsifal, and 2-3 others I can't remember. Aside from a multitude of features which others lack, Xerces also prides itself on conformance with a number of W3C standards and recommendations. So I guess this is the way to go. I downloaded the 3.1.0 version, and already started looking through the samples. Also, I had 2 other questions. 1. Except the student-mentor discussions, reports, code reviews... is there anything else expected from me? I know other organizations require students to maintain wiki pages about their progress, or blog about it. 2. Is it a problem if I also apply for another project of another organization? Because I will be competing with other students, and I might not succeed. Thanks, Ilie.

Stefan Seefeld

7:25 p.m.

On 03/28/2010 05:26 AM, Ilie Halip wrote:

...

Well, I've looked over a few projects, but I haven't been able to find a candidate better than Xerces. I made a list of desired features (such as validation, error reporting, ability to parse XML chunks from strings, availability of a DOM-like interface) and then browsed documentation, looked at examples, and read through header files... Specifically, I looked at TinyXml, expat, ezXml, mini-xml, Parsifal, and 2-3 others I can't remember.

For avoidance of doubt: we are talking about a backend other than the existing libxml2, right ? I.e., the goal is not to replace libxml2, but to complement it (so users can choose).

...

Aside from a multitude of features which others lack, Xerces also prides itself on conformance with a number of W3C standards and recommendations. So I guess this is the way to go. I downloaded the 3.1.0 version, and already started looking through the samples.

Good.

...

Also, I had 2 other questions. 1. Except the student-mentor discussions, reports, code reviews... is there anything else expected from me? I know other organizations require students to maintain wiki pages about their progress, or blog about it.

I'm not aware of an official requirement to have a blog or similar.

...

2. Is it a problem if I also apply for another project of another organization? Because I will be competing with other students, and I might not succeed.

Right. I think it is fine to submit multiple applications, as long as you keep all prospective mentors / mentoring orgs aware of this, so they can plan accordingly. Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...


participants (8)

Andrew Sutton
Ilie Halip
Jens Weller
Jose
Mathias Gaunard
Phil Endecott
Rene Rivera
Stefan Seefeld
