[XML] Searching for an XML parsing library in boost

Shibu Bera

27 Feb 2007

8:17 a.m.

I would like to know if any library for parsing XML files is available in Boost or not. I searched the boost.org site for one, but I couldn't find any such library.

...

If library is present, please let me know the details.

...

If not, please let me know if any work is under way on this topic.

Thank you, -- Regards, Shibu


Sebastian Redl

27 Feb

9:31 a.m.

Shibu Bera wrote:

...

...
if any library for parsing XML file is available in boost or not.

I wonder why Thunderbird treats your entire post as a quote.

Anyway, no, no XML library is currently available in Boost, although various plans are made every now and then. Sebastian Redl

Péter Szilágyi

1:08 p.m.

...

Anyway, no, no XML library is currently available in Boost, although various plans are made every now and then.

A little side question: is there any Unicode character support in boost? STL includes the wstring but that is 4 byte unicode (and quite hard to mingle with simple strings). But is there a way to use UTF 8/16/32? An XML parser would need to know at least these 3 basic sets. Peter

Sebastian Redl

2:30 p.m.

...

Péter Szilágyi wrote:

A little side question: is there any Unicode character support in boost?

No. Regex, when built with Unicode support, requires ICU for that. Boost doesn't have its own Unicode stuff. (There's something in the vault, though.)

...

STL includes the wstring but that is 4 byte unicode (and quite hard to mingle with simple strings).

Actually, it is whatever the compiler decides it should be. On Linux systems with a default GCC, yes, that's UTF-32, but under Windows it's typically UCS-2 or UTF-16 (with or without surrogate support, that is).

But is there a way to use UTF 8/16/32?

I'm working on one. But I don't know how well it will be received. I hope I can show a preliminary version soon.

Sebastian Redl

Stefan Seefeld

2:48 p.m.

Sebastian Redl wrote:

...

...
Péter Szilágyi wrote:

A little side question: is there any Unicode character support in boost?

No. Regex, when built with Unicode support, requires ICU for that. Boost doesn't have its own Unicode stuff. (There's something in the vault, though.)

...
STL includes the wstring but that is 4 byte unicode (and quite hard to mingle with simple strings).

Actually, it is whatever the compiler decides it should be. On Linux systems with a default GCC, yes, that's UTF-32, but under Windows it's typically UCS-2 or UTF-16 (with or without surrogate support, that is).

More specifically, 'wchar_t' and derived types have nothing to do with Unicode. The two are orthogonal concepts. As per the spec, wchar_t has to be large enough to hold the extended character set specified by the supported locales. There is no mention in the spec that this character set has to be Unicode. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...
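The point about wchar_t can be checked on any given toolchain. The following is a minimal sketch, assuming a C++11 compiler (newer than the compilers discussed in this thread); the helper names `wchar_bits` and `wchar_holds_any_code_point` are made up for illustration. It only reports the storage width; it says nothing about which encoding a program actually puts into a std::wstring.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on this implementation.
// The standard does not fix this value; 16 and 32 are merely common.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if one wchar_t can store any Unicode code point (up to U+10FFFF).
// This is a statement about storage only, not about encoding: whether a
// wstring holds UTF-32, UTF-16 or something else is up to the program.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

On a platform with a 16-bit wchar_t the second function returns false, which is the case where a single wchar_t cannot represent every code point.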

Péter Szilágyi

2:55 p.m.

...

No. Regex, when built with Unicode support, requires ICU for that. Boost doesn't have its own Unicode stuff. (There's something in the vault, though.)

I ask because I was thinking about writing an XML lib for the Boost collection, but that requires specific Unicode string handling, which I don't have time to implement. So if there would be some (at least basic) UTF support (and of course the need for such), I would consider trying to implement the core XML parser (+some core extensions) as a Google Summer Of Code project.

Stefan Seefeld

3:04 p.m.

Péter Szilágyi wrote:

...

...
No. Regex, when built with Unicode support, requires ICU for that. Boost doesn't have its own Unicode stuff. (There's something in the vault, though.)

I ask because I was thinking about writing an XML lib for the Boost collection, but that requires specific Unicode string handling, which I don't have time to implement. So if there would be some (at least basic) UTF support (and of course the need for such), I would consider trying to implement the core XML parser (+some core extensions) as a Google Summer Of Code project.

Are you aware of the work that went into boost XML 'bindings' in the past ? I submitted an XML library supporting a DOM API, as well as some xmlreader. (The implementation was based on libxml2, as I believe it would be foolish to attempt to reinvent that particular wheel.) My strategy for dealing with Unicode has been to delegate that to the user, i.e. all classes are parametrized for a (Unicode) string type. Users then plug in their own Unicode library. I believe this to be the only viable option, since often users want to use only Unicode, or only XML (e.g. if it is clear that the content is all ASCII), so there is no reason to lump both together. It would be great to get some momentum to review all past and present ideas and build something on top of that. I may be able to help, if you are interested. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...
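The "parametrized for a string type" strategy can be sketched roughly as follows. This is not the code from the vault submission; the namespace `sketch`, the class `element` and its members are hypothetical, and C++11 is assumed. The only point is that the tree type never interprets the string itself, so the user decides which string (and thus which Unicode library) to plug in.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical element type; S is whatever string type the user chooses.
template <typename S>
class element {
public:
    explicit element(S name) : name_(std::move(name)) {}

    const S& name() const { return name_; }

    // Append a child element and return a reference to it.
    element& append_child(S name) {
        children_.push_back(
            std::unique_ptr<element>(new element(std::move(name))));
        return *children_.back();
    }

    std::size_t child_count() const { return children_.size(); }

    // Bounds-checked access to the i-th child.
    const element& child(std::size_t i) const { return *children_.at(i); }

private:
    S name_;
    std::vector<std::unique_ptr<element>> children_;
};

} // namespace sketch
```

With this shape, `sketch::element<std::string>` serves content known to be ASCII or UTF-8, while `sketch::element<std::u16string>` or a third-party Unicode string type can be used without changing the tree code.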

Christian Henning

3:05 p.m.

Hi Peter,

I was just wondering if you know about Arabica http://www.jezuk.co.uk/cgi-bin/view/arabica

It does not implement its own xml parser but rather wraps around some xml parser implementation, like Xerces, libxml, MSxml, etc.

Christian

On 2/27/07, Péter Szilágyi <peterke@gmail.com> wrote:

...

...
No. Regex, when built with Unicode support, requires ICU for that. Boost doesn't have its own Unicode stuff. (There's something in the vault, though.)

I ask because I was thinking about writing an XML lib for the Boost collection, but that requires specific Unicode string handling, which I don't have time to implement. So if there would be some (at least basic) UTF support (and of course the need for such), I would consider trying to implement the core XML parser (+some core extensions) as a Google Summer Of Code project. _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Péter Szilágyi

3:41 p.m.

...

Hi Peter, I was just wondering if you know about Arabica

http://www.jezuk.co.uk/cgi-bin/view/arabica It does not implement its own xml parser but rather wraps around some

...

xml parser implementation, like Xerces, libxml, MSxml, etc.

No, I haven't heard about this wrapper before, but I did hear of the other parsers. My only problem with them is that I don't really like including too many separate libraries into a project. That's why I turned to the boost collection (being in the TR1). I don't mind using multiple libs from here, but I definitely don't like mixing libs from multiple sources. They become hard to keep up to date. (Plus I've seen that the TR2 proposals contained both Unicode and XML support (although too late for that now)). That's the reason I was thinking about somehow putting together these 2 libs.

Are you aware of the work that went into boost XML 'bindings' in the past ?

...

I submitted an XML library supporting a DOM API, as well as some xmlreader. (The implementation was based on libxml2, as I believe it would be foolish to attempt to reinvent that particular wheel.)

No, actually, being quite a new member, I don't know about previous attempts (I didn't really find anything either on the site or in the mailing lists).

It would be great to get some momentum to review all past and present ideas

...

and build something on top of that. I may be able to help, if you are interested.

Where could I find these ideas/implementations that you mentioned? Peter

Stefan Seefeld

4:05 p.m.

Péter Szilágyi wrote:

...

Are you aware of the work that went into boost XML 'bindings' in the past ?

...
I submitted an XML library supporting a DOM API, as well as some xmlreader. (The implementation was based on libxml2, as I believe it would be foolish to attempt to reinvent that particular wheel.)

No, actually, being quite a new member, I don't know about previous attempts (I didn't really find anything either on the site or in the mailing lists).

It would be great to get some momentum to review all past and present ideas

...
and build something on top of that. I may be able to help, if you are interested.

Where could I find these ideas/implementations that you mentioned?

Mailing list archives are at http://boost.org/more/mailing_lists.htm. Specifically, you may search for 'XML API' at http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/boost/.

One point of contention is about what functionality an 'XML API' provides (A mere parser ? Access to a DOM-like tree with various ways to build & manipulate it, e.g. XPath, XInclude, validation, etc. ? An XML Reader API ?)

Another concern is / was about whether or not this may rely on existing libraries, as well as how flexible it should be (e.g. in terms of the supported character types). (I'd dare to say that those who propose to re-implement everything inside boost either suffer the NotInventedHere syndrome, don't have a good understanding of what XML is, or grossly underestimate the required work, not only to implement it, but also to make it reasonably efficient.)

One (unofficial) submission I did is in http://boost-consulting.com/vault/ under 'Programming Interfaces'. That is a bit dated, though. I can send you a more up-to-date version I have off-list, if you are interested.

Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Christian Henning

4:21 p.m.

...

One (unofficial) submission I did is in http://boost-consulting.com/vault/ under 'Programming Interfaces'. That is a bit dated, though. I can send you a more up-to-date version I have off-list, if you are interested.

I would be interested in the new version. Can you put it in the vault? Thanks, Christian

Stefan Seefeld

4:32 p.m.

Christian Henning wrote:

...

...
One (unofficial) submission I did is in http://boost-consulting.com/vault/ under 'Programming Interfaces'. That is a bit dated, though. I can send you a more up-to-date version I have off-list, if you are interested.

I would be interested in the new version. Can you put it in the vault?

Sure, I can. If there are a number of people who would like to get involved, maybe we can drum up enough momentum and get some space in the sandbox ? (Right after the switch to subversion... :-) ) Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Sebastian Redl

4:36 p.m.

Stefan Seefeld wrote:

...

Sure, I can. If there are a number of people who would like to get involved, maybe we can drum up enough momentum and get some space in the sandbox ? (Right after the switch to subversion... :-) )

Count me in on that. Sebastian Redl

Christian Henning

4:42 p.m.

Not sure what you mean with getting space in the sandbox. When I was young I just made space. ;-))

Anyway, I just tried to compile the test code and it failed on MSVC 7.1. Is that a known issue?

c:\boost\boost\xml\dom\node.hpp(40) : error C2063: 'boost::xml::dom::detail::factory' : not a function
c:\boost\boost\xml\dom\node.hpp(54) : see reference to class template instantiation 'boost::xml::dom::node<S>' being compiled
c:\boost\boost\xml\dom\node.hpp(51) : error C2942: 'boost::xml::dom::node<S>' : template-class-id redefined as a formal argument of a function
c:\boost\boost\xml\dom\attribute.hpp(18) : error C2063: 'boost::xml::dom::detail::factory' : not a function
c:\boost\boost\xml\dom\attribute.hpp(27) : see reference to class template instantiation 'boost::xml::dom::attribute<S>' being compiled

Stefan Seefeld

4:58 p.m.

Christian Henning wrote:

...

Not sure what you mean with getting space in the sandbox. When I was young I just made space. ;-))

:-)

...

Anyway, I just tried to compile the test code and it failed on MSVC 7.1 . Is that a known issue?

No. I haven't even tried to compile that with MSVC, only with my system g++. There is clearly room for improvement, i.e. to make the code as portable as possible.

...

c:\boost\boost\xml\dom\node.hpp(40) : error C2063: 'boost::xml::dom::detail::factory' : not a function
c:\boost\boost\xml\dom\node.hpp(54) : see reference to class template instantiation 'boost::xml::dom::node<S>' being compiled
c:\boost\boost\xml\dom\node.hpp(51) : error C2942: 'boost::xml::dom::node<S>' : template-class-id redefined as a formal argument of a function
c:\boost\boost\xml\dom\attribute.hpp(18) : error C2063: 'boost::xml::dom::detail::factory' : not a function
c:\boost\boost\xml\dom\attribute.hpp(27) : see reference to class template instantiation 'boost::xml::dom::attribute<S>' being compiled

That looks bogus. Surely boost::xml::dom::detail::factory() is a function (template). I think MSVC is confused. :-) Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Christian Henning

9:01 p.m.

Hi Stefan,

...

...
Anyway, I just tried to compile the test code and it failed on MSVC 7.1 . Is that a known issue?

No. I haven't even tried to compile that with MSVC, only with my system g++. There is clearly room for improvement, i.e. to make the code as portable as possible.

I got the code to compile on MSVC 7.1. I think the compiler was confused because of the detail namespace. If you move the factory function out of the detail namespace and change the code accordingly, it does compile. OK, not quite, I also need to include the assert.h header. Do you want to change that in your code?

Also, the test code is crashing. I have only tried dom.cpp so far, but it seems to me that there are some bugs in the test code.

I don't think it makes much sense to follow up on these issues since you have a new library version ready.

Christian

Stefan Seefeld

9:16 p.m.

Christian Henning wrote:

...

Hi Stefan,

...
...
Anyway, I just tried to compile the test code and it failed on MSVC 7.1 . Is that a known issue? No. I haven't even tried to compile that with MSVC, only with my system g++. There is clearly room for improvement, i.e. to make the code as portable as possible.

I got the code to compile on MSVC 7.1. I think the compiler was confused because of the detail namespace. If you move the factory function out of the detail namespace and change the code accordingly it does compile. OK, not quite, I also need to include the assert.h header. Do you want to change that in your code?

Also, the test code is crashing. I have only tried dom.cpp so far, but it seems to me that there are some bugs in the test code.

I don't think it makes much sense to follow up on these issues since you have a new library version ready.

OK. Thanks for the feedback though ! I'm going to clean up my local version a bit and put it into the vault.

In the not-so short term I think we should ask to get write-permission to the sandbox CVS, and then set a project up there. Douglas (noting you are project owner), is that possible ? (I'm 'stefan' on sf.net...)

On a related note, what's the status of the subversion move ? While the main repository is certainly on hold until the next release, there is no such constraint for the sandbox. Can't that be moved already (if the plan to migrate isn't obsoleted yet) ?

Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Doug Gregor

9:37 p.m.

On Feb 27, 2007, at 4:16 PM, Stefan Seefeld wrote:

...

In the not-so short term I think we should ask to get write-permission to the sandbox CVS, and then set a project up there. Douglas (noting you are project owner), is that possible ? (I'm 'stefan' on sf.net...)

You're all set.

...

On a related note, what's the status of the subversion move ? While the main repository is certainly on hold until the next release, there is no such constraint for the sandbox. Can't that be moved already (if the plan to migrate isn't obsoleted yet) ?

I, personally, want to integrate the two repositories together (with suitable access controls), so that libraries that make it into Boost can just be "svn move"'d over without losing/copying history. So, we're going to wait for 1.34.0 before we setup the Subversion repository. Cheers, Doug

Jeff Garland

10 p.m.

New subject: svn repository (was [XML] Searching for an XML parsing library in boost)

Doug Gregor wrote:

...

I, personally, want to integrate the two repositories together (with suitable access controls), so that libraries that make it into Boost can just be "svn move"'d over without losing/copying history. So, we're going to wait for 1.34.0 before we setup the Subversion repository.

Perhaps we should consider using the sandbox as an initial test of the conversion and setup process? The advantages I see are that we wouldn't have to wait for 1.34 and any issues in the process are shaken out before the 'mission critical' repository is involved. Jeff

Stefan Seefeld

11:13 p.m.

Doug Gregor wrote:

...

On Feb 27, 2007, at 4:16 PM, Stefan Seefeld wrote:

...
In the not-so short term I think we should ask to get write-permission to the sandbox CVS, and then set a project up there. Douglas (noting you are project owner), is that possible ? (I'm 'stefan' on sf.net...)

You're all set.

Thanks !

...

...
On a related note, what's the status of the subversion move ? While the main repository is certainly on hold until the next release, there is no such constraint for the sandbox. Can't that be moved already (if the plan to migrate isn't obsoleted yet) ?

I, personally, want to integrate the two repositories together (with suitable access controls), so that libraries that make it into Boost can just be "svn move"'d over without losing/copying history. So, we're going to wait for 1.34.0 before we setup the Subversion repository.

That sounds reasonable, in particular, as 1.34 should be out Real Soon Now, right ? ;-) Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Péter Szilágyi

9:01 p.m.

...

Sure, I can. If there are a number of people who would like to get involved, maybe we can drum up enough momentum and get some space in the sandbox ? (Right after the switch to subversion... :-) )

That would be great.

...

Actually, it is what *you* put into it. Compiler decides what the size of wchar_t should be. As long as your code points fit into that size, you will be fine. For example you can store UTF-16 characters in 4-byte wchar_t.

Well that's true, but wouldn't that be a waste? The other problem is that in UTF-32 every single "character" is an actual separate entity. In UTF-16 and especially in UTF-8, entities are made up of multiple "characters", so you would need to "decode" them to their 32-bit representation in order to use them correctly. (Actually, doing it this way would lead to quite a flexible lib... only the reader and the writer must be aware of the conversions and internally a wstring will suffice...)
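The "decode at the boundary" idea can be sketched as follows: the reader turns UTF-8 bytes into one 32-bit value per code point, and everything after that works on fixed-width units. This is a minimal illustration, not code from any library mentioned in the thread. It assumes C++11 and uses std::u32string rather than std::wstring, because wchar_t is not guaranteed to be 32 bits wide; the function name `decode_utf8` is made up.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into UTF-32 code points.
// Throws std::runtime_error on malformed input: bad lead or continuation
// bytes, truncated sequences, overlong forms, surrogates, or values
// above U+10FFFF.
inline std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs a sequence of this length.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid UTF-8 code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The trade-off raised in the thread still applies: the decoded form gives one unit per code point and cheap indexing, at the cost of up to four times the memory of the UTF-8 input for mostly-ASCII text.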

...

(I'd dare to say that those who propose to re-implement everything inside boost

...
either suffer the NotInventedHere syndrome, don't have a good understanding of what XML is, or grossly underestimate the required work, not only to implement it, but also to make it reasonably efficient.)

I'd second that. One middle-ground option would be to include a small XML parser

How much functionality do you mean by "small XML parser"? Peter

Stefan Seefeld

9:27 p.m.

Péter Szilágyi wrote:

...

Actually, it is what *you* put into it. Compiler decides what the size of wchar_t should be. As long as your code points fit into that size, you will be fine. For example you can store UTF-16 characters in 4-byte wchar_t.

Well that's true, but wouldn't that be a waste? The other problem is that in UTF-32 every single "character" is an actual separate entity. In UTF-16 and especially in UTF-8, entities are made up of multiple "characters", so you would need to "decode" them to their 32-bit representation in order to use them correctly. (Actually, doing it this way would lead to quite a flexible lib... only the reader and the writer must be aware of the conversions and internally a wstring will suffice...)

I think you are missing the point. It's not an argument for any particular encoding. Rather, the point is that there is no pre-defined mapping between Unicode (or other) encoding and any C++ character type.

...

...
(I'd dare to say that those who propose to re-implement everything inside boost

...
either suffer the NotInventedHere syndrome, don't have a good understanding of what XML is, or grossly underestimate the required work, not only to implement it, but also to make it reasonably efficient.) I'd second that. One middle-ground option would be to include a small XML parser

How much functionality do you mean by "small XML parser"?

That's a good question. Also, it would still be a parser only, as opposed to any in-memory representation (tree ?) with assorted APIs. Such a parser may be sufficient if all you have in mind is an XMLReader-like API, but it surely isn't if what you want is a DOM, with XPath-based lookup, incremental validation, etc., etc. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Péter Szilágyi

9:47 p.m.

...

That's a good question. Also, it would still be a parser only, as opposed to any in-memory representation (tree ?) with assorted APIs. Such a parser may be sufficient if all you have in mind is an XMLReader-like API, but it surely isn't if what you want is a DOM, with XPath-based lookup, incremental validation, etc., etc.

In my opinion in order for an XML library to be useful, it should support parsing and generating XML documents, in-memory representation, construction and modification support, as well as at least basic validation.

...

I think you are missing the point. It's not an argument for any particular encoding. Rather, the point is that there is no pre-defined mapping between Unicode (or other) encoding and any C++ character type.

I understand this, I was just thinking about how the different encodings could be represented as wstrings while keeping the string's base functionality (one wchar_t truly one char, not just part of it). Sincerely, Peter

Stefan Seefeld

10:18 p.m.

Péter Szilágyi wrote:

...

...
That's a good question. Also, it would still be a parser only, as opposed to any in-memory representation (tree ?) with assorted APIs. Such a parser may be sufficient if all you have in mind is an XMLReader-like API, but it surely isn't if what you want is a DOM, with XPath-based lookup, incremental validation, etc., etc.

In my opinion in order for an XML library to be useful, it should support parsing and generating XML documents, in-memory representation, construction and modification support, as well as at least basic validation.

I would think different XML APIs can co-exist (possibly sharing implementation). Some use cases really only require XML streaming / parsing, and such users shouldn't be forced to see a full DOM API. I think the way the XML specs are defined allows us to make such APIs rather modular / orthogonal.

...

I think you are missing the point. It's not an argument for any particular encoding. Rather, the point is that there is no pre-defined mapping between Unicode (or other) encoding and any C++ character type.

I understand this, I was just thinking about how the different encodings could be represented as wstrings while keeping the string's base functionality (one wchar_t truly one char, not just part of it).

Right. But, as you have pointed out in your previous mail, only fixed-sized encodings can be used like this. Often you don't need / want random access, making UTF-8 a better choice. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Péter Szilágyi

10:43 p.m.

...

I would think different XML APIs can co-exist (possibly sharing implementation). Some use cases really only require XML streaming / parsing, and such users shouldn't be forced to see a full DOM API. I think the way the XML specs are defined allows us to make such APIs rather modular / orthogonal.

I agree that those who do not need DOM support shouldn't have to work with complex functionality, but in my opinion the API would be much more powerful if the two "functionalities" were implemented in a single library/module/whatever, providing separate functions for those who need just basic XML support and for those who want to go wild. What I'm trying to point out here is that maintaining two separate APIs isn't really the best solution... instead I would suggest one module with "simple" + "expert" methods... and everyone can use what they prefer the most (even mixing the two). So internally it could use a DOM representation, but if the user chooses to stick with the basics, then there are the methods for it and he will never know about the DOM. If on the other hand he wants full control, then he can have access to everything as well.

Sebastian Redl

10:54 p.m.

Péter Szilágyi wrote:

...

What I'm trying to point out here is that maintaining two separate APIs isn't really the best solution... instead I would suggest one module with "simple" + "expert" methods...

That's still effectively two APIs. Whether it's two implementations is a different question.

So internally it could use a DOM representation, but if the user chooses to stick with the basics,

Just pointing out: implementing stream parsers on top of a DOM representation is about as inefficient as you can get.

Sebastian Redl

Boris Kolpackov

28 Feb

7:12 a.m.

"Péter Szilágyi" <peterke@gmail.com> writes:

...

...
I'd second that. One middle-ground option would be to include a small XML parser

How much functionality do you mean by "small XML parser"?

Enough to be able to implement all other APIs on top of it. Since you want the base to be efficient, SAX2 would be a good candidate except it could be tricky to implement xmlreader API on top of that. -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding
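The layering suggested here, an efficient event-based core with richer APIs built on top, can be sketched as below. The interface is only loosely modelled on SAX2 and is not the real SAX2 API; `content_handler`, `node` and `tree_builder` are hypothetical names, C++11 is assumed, and no actual parser is included. The calling code stands in for a parser by emitting the events itself.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical push-style event interface (the "small parser" would call this).
struct content_handler {
    virtual ~content_handler() {}
    virtual void start_element(const std::string& name) = 0;
    virtual void end_element(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Minimal in-memory tree node.
struct node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<node>> children;
};

// A tree API layered on the event API: it consumes events and builds nodes.
class tree_builder : public content_handler {
public:
    void start_element(const std::string& name) override {
        std::unique_ptr<node> n(new node);
        n->name = name;
        node* raw = n.get();
        if (stack_.empty()) root_ = std::move(n);                  // document element
        else stack_.back()->children.push_back(std::move(n));     // nested element
        stack_.push_back(raw);
    }
    void end_element(const std::string&) override {
        if (!stack_.empty()) stack_.pop_back();
    }
    void characters(const std::string& text) override {
        if (!stack_.empty()) stack_.back()->text += text;
    }
    const node* root() const { return root_.get(); }

private:
    std::unique_ptr<node> root_;
    std::vector<node*> stack_;  // open elements, innermost last
};

} // namespace sketch
```

Users who only need streaming would implement `content_handler` directly and never pay for the tree. The harder direction noted above, a pull-style xmlreader API on top of push events, is not addressed by this sketch.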

Jerry Lawson

2:03 p.m.

My 2 cents: what about starting with the tinyxml project: http://sourceforge.net/projects/tinyxml and adding the various APIs?

Péter Szilágyi

4:25 p.m.

Any suggestions how things should/could be started/planned? Sincerely, Peter

Robert Ramey

4:37 p.m.

A while ago I made a suggestion about using the spirit parser with its associated xml grammars. No one has commented on this. I'm curious why this idea doesn't seem to be attractive to anyone else. I used it with very good results in the serialization library. It created a much more robust and maintainable parser than I could have done by hand. What am I missing here?

Robert Ramey

Péter Szilágyi wrote:

...

Any suggestions how things should/could be started/planned?

As far as I can tell, it's already done - there's nothing to do.

...

Sincerely, Peter

Stefan Seefeld

4:52 p.m.

Robert Ramey wrote:

...

A while ago I made a suggestion about using the spirit parser with its associated xml grammars.

No one has commented on this.

I think I have. Maybe not explicitly, but my comments about XML not only being about parsing were definitely directed at that suggestion. (And nobody had any arguments against that...)

...

I'm curious why this idea doesn't seem to be attractive to anyone else.. I used it with very good results in the serialization library. It created a much more robust and maintainable parser than I could have done by hand. What am I missing here?

I agree, though that comparison only compares two alternatives (roll-your-own vs. a spirit parser). As I keep reiterating, though, an XML API is *much* more than a parser. Even for XML streaming, you likely want to add support for URL lookup (requiring some support for http and other protocols), and possibly incremental validation. And we haven't even talked about a DOM-like API yet.

Yes, that could all be built 'by hand' on top of existing libs (bgl ?), but there is enough domain-specific stuff that would need to be added (XPath, say) that would qualify such an approach as 'done by hand' much like what you criticise above yourself.

One of the best (free, portable, efficient) XML libraries around these days is libxml2. Having watched that evolve I can somewhat appreciate all the hard work that went into that. I'm not foolish enough to want to start from scratch.

Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Mateusz Loskot

4:58 p.m.

Robert Ramey wrote:

...

A while ago I made a suggestion about using the spirit parser with its associated xml grammars.

No one has commented on this. I'm curious why this idea doesn't seem to be attractive to anyone else.. I used it with very good results in the serialization library. It created a much more robust and maintainable parser than I could have done by hand. What am I missing here?

Robert, Indeed, it's a very interesting question. I like the idea of Spirit parser, not only for XML processing, but from time to time I got some comments that Spirit performance for parsing big input is not very good. Although, I've never seen any numbers presenting bad performance of Spirit, I think people may be scared of using it for parsing big data sets. Cheers -- Mateusz Loskot http://mateusz.loskot.net

Boris Kolpackov

1 Mar

5:59 p.m.

Hi Robert, "Robert Ramey" <ramey@rrsd.com> writes:

...

A while ago I made a suggestion about using the spirit parser with its associated xml grammars.

No one has commented on this. I'm curious why this idea doesn't seem to be attractive to anyone else.. I used it with very good results in the serialization library. It created a much more robust and maintainable parser than I could have done by hand. What am I missing here?

The question is whether it is a conforming XML parser? That means support for:

- namespaces
- character references
- entity references
- CDATA
- DTD well-formedness checking, entity declaration processing and replacement, substitution of default attribute values, etc.

My uneducated guess is that "spirit-based XML grammar" is not a conforming XML parser. The next question is how much effort it will take to fix it up and whether it will still be as robust, maintainable, and efficient (I doubt it very much).

The reason why you had good results with serialization library is because you control both production and consumption of the instances so you can easily restrict yourself to a subset of XML. Once you need to process *any* valid XML things get a lot more complicated.

hth, -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding
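To give a feel for how much detail hides behind even one item on that list, here is a minimal sketch of resolving a numeric character reference under the XML 1.0 rules as I understand them (decimal `&#N;` or hexadecimal `&#xN;` with a lowercase `x`, and a result that must be a legal XML character). The function name `parse_char_ref` is made up, and this is not taken from any of the parsers discussed.

```cpp
#include <cstddef>
#include <string>

// Returns the code point named by a numeric character reference such as
// "&#65;" or "&#x41;", or -1 if the text is not a well-formed reference
// to a character allowed by the XML 1.0 Char production.
inline long parse_char_ref(const std::string& s) {
    if (s.size() < 4 || s[0] != '&' || s[1] != '#' || s[s.size() - 1] != ';')
        return -1;

    std::size_t i = 2;
    int base = 10;
    if (s[i] == 'x') { base = 16; ++i; }  // only lowercase 'x' is allowed
    if (i == s.size() - 1) return -1;     // no digits at all

    long cp = 0;
    for (; i + 1 < s.size(); ++i) {
        const char c = s[i];
        int d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (base == 16 && c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (base == 16 && c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;
        cp = cp * base + d;
        if (cp > 0x10FFFF) return -1;     // beyond Unicode; also avoids overflow
    }

    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    const bool legal = cp == 0x9 || cp == 0xA || cp == 0xD ||
                       (cp >= 0x20 && cp <= 0xD7FF) ||
                       (cp >= 0xE000 && cp <= 0xFFFD) ||
                       (cp >= 0x10000 && cp <= 0x10FFFF);
    return legal ? cp : -1;
}
```

Entity references, namespaces and DTD processing each involve considerably more state than this, which is the substance of the conformance concern.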

Robert Ramey

6:58 p.m.

Boris Kolpackov wrote:

...

Hi Robert,

"Robert Ramey" <ramey@rrsd.com> writes:

...
A while ago I made a suggestion about using the spirit parser with its associated xml grammars.

No one has commented on this. I'm curious why this idea doesn't seem to be attractive to anyone else.. I used it with very good results in the serialization library. It created a much more robust and maintainable parser than I could have done by hand. What am I missing here?

The question is whether it is a conforming XML parser? That means support for:

Actually I don't think that's the question at all. The question is about the strategy of development. To parse a grammar one can use a grammar driven parser (yacc, bison, spirit, etc) or one can write code to explicitly parse the grammar.

...

- namespaces
- character references
- entity references
- CDATA
- DTD well-formedness checking, entity declaration processing and replacement, substitution of default attribute values, etc.

...

My uneducated guess is that "spirit-based XML grammar" is not a conforming XML parser.

Not relevant. The question isn't which features the particular XML parser included with spirit supports. Any missing features could be added to the grammar without too much problem - that's the appeal of using a grammar driven approach.

...

The next question is how much effort it will take to fix it up

Much less than hand coding yet another xml parser.

...

and whether it will still be as robust, maintainable,

A parser generated from a formal grammar is going to be much more robust and maintainable. The grammar can be verified independently of the implementation.

...

and efficient (I doubt it very much).

This might be a legitimate concern. Some tests suggest that a hand coded parser can be made more efficient than a machine generated one. But of course it would really depend on the quality of the hand coding itself, which is hard to speculate on. In any case this would strike me as premature optimization. If it were my problem, I would start with the most expedient way to make a robust and maintainable parser. If I found it to be "too slow", that module could well be replaced with a hand coded equivalent.

...

The reason why you had good results with serialization library is because you control both production and consumption of the instances so you can easily restrict yourself to a subset of XML.

The reason I had good results with spirit with serialization library is that it's good, robust, well designed and well documented code. I built on that.

...

Once you need to process *any* valid XML things get a lot more complicated.

Which is even more reason to avoid a hand coded parser. Robert Ramey
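To make the "write code to explicitly parse the grammar" option concrete, here is a minimal sketch of a hand-coded recursive-descent recognizer. It assumes a deliberately tiny toy grammar (letters-only names, no attributes, no references, no DTD), so it is nowhere near XML 1.0; the class name `ToyParser` and the grammar in the comment are made up for illustration and come from neither Spirit nor the serialization library.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Toy grammar (an assumption for this sketch, NOT XML 1.0):
//   element ::= '<' name '/>' | '<' name '>' content '</' name '>'
//   content ::= (element | text)*
//   name    ::= letter+
class ToyParser {
public:
    explicit ToyParser(const std::string& input) : in_(input), pos_(0) {}

    // True if the whole input is exactly one well-nested element.
    bool parse() { return element() && pos_ == in_.size(); }

private:
    // Consume the literal s if it is next in the input.
    bool literal(const char* s)
    {
        assert(pos_ <= in_.size());  // invariant: never past the end
        const std::string lit(s);
        if (in_.compare(pos_, lit.size(), lit) != 0) return false;
        pos_ += lit.size();
        return true;
    }

    // Consume one or more letters into out.
    bool name(std::string& out)
    {
        const std::string::size_type start = pos_;
        while (pos_ < in_.size() &&
               std::isalpha(static_cast<unsigned char>(in_[pos_])))
            ++pos_;
        out = in_.substr(start, pos_ - start);
        return !out.empty();
    }

    // One function per grammar rule: this is the "hand-coded" part.
    bool element()
    {
        std::string open, close;
        if (!literal("<") || !name(open)) return false;
        if (literal("/>")) return true;  // empty-element tag
        if (!literal(">")) return false;
        // content: nested elements or text, up to the next "</"
        while (pos_ < in_.size() && in_.compare(pos_, 2, "</") != 0) {
            if (in_[pos_] == '<') {
                if (!element()) return false;
            } else {
                ++pos_;  // plain text character
            }
        }
        return literal("</") && name(close) && literal(">") && close == open;
    }

    std::string in_;
    std::string::size_type pos_;
};
```

With this sketch, `ToyParser("<a><b/>text</a>").parse()` accepts the input, while mismatched nesting such as `<a><b></a></b>` is rejected. Each grammar rule turns into one member function, and every feature added to the grammar means more hand-written control flow; that is the trade-off being argued over in this exchange.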

Stefan Seefeld

7:40 p.m.

Robert Ramey wrote:

...

Boris Kolpackov wrote:

...

...
My uneducated guess is that "spirit-based XML grammar" is not a conforming XML parser.

Not relevant.

The question isn't which features the particular XML parser included with spirit supports. Any missing features could be added to the grammar without too much problem - that's the appeal of using a grammar driven approach.

That is also the appeal of "Let's reinvent it !". There are XML parsers out there that support all the above. Why not use them ?

...

...
The next question is how much effort it will take to fix it up

Much less than hand coding yet another xml parser.

True. But much more than using an existent (compliant) parser.

...

...
and whether it will still be as robust, maintainable,

A parser generated from a formal grammar is going to be much more robust and maintainable. The grammar can be verified independently of the implementation.

The XML 'grammar' is in fact the most trivial aspect of it.

...

...
The reason why you had good results with serialization library is because you control both production and consumption of the instances so you can easily restrict yourself to a subset of XML.

The reason I had good results with spirit with serialization library is that it's good, robust, well designed and well documented code. I built on that.

What features of XML are you using ? External subsets or any other URLs that need to be looked up ? XInclude support to make documents modular ? Etc., etc.

...

...
Once you need to process *any* valid XML things get a lot more complicated.

Which is even more reason to avoid a hand coded parser.

I think you (still) totally miss the point. Sorry. Stefan -- ...ich hab' noch einen Koffer in Berlin...

Robert Ramey

8:51 p.m.

Stefan Seefeld wrote:

...

I think you (still) totally miss the point. Sorry.

LOL - I guess so. The first message in the thread asks the question:

...

I would like to know, if any library for parsing XML file is available in boost or not. I searched in the boost.org site for the same but, I couldn't get any library as such.

which is the question I've addressed. I'm not sure what else we're referring to here.

...

That is also the appeal of "Let's reinvent it !". There are XML parsers out there that support all the above. Why not use them ?

Perhaps that's a good question to the original poster.

...

What features of XML are you using ? External subsets or any other URLs that need to be looked up ? XInclude support to make documents modular ? Etc., etc.

How are these questions related to parsing XML syntax? Robert Ramey

Joel de Guzman

10:41 p.m.

Robert Ramey wrote:

...

Stefan Seefeld wrote:

...
I think you (still) totally miss the point. Sorry.

LOL - I guess so. The first message in the thread asks the question:

...
I would like to know, if any library for parsing XML file is available in boost or not. I searched in the boost.org site for the same but, I couldn't get any library as such.

which is the question I've addressed. I'm not sure what else we're referring to here.

...
That is also the appeal of "Let's reinvent it !". There are XML parsers out there that support all the above. Why not use them ?

Perhaps that's a good question to the original poster.

Robert (as do I) probably represents a big chunk of C++ programmers out there who wish for a simple XML parser without most of the bells and whistles described in this thread. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Péter Szilágyi

2 Mar

11:31 a.m.

...

Robert (as do I) probably represents a big chunk of C++ programmers out there who wish for a simple XML parser without most of the bells and whistles described in this thread.

I don't agree with this. I consider this one of the greatest weaknesses of C++: there is no support for new technologies, and where there is, it's very limited. On every larger C++ project, where you have to mix different technologies, make it cross-platform, etc., it's extremely annoying that you cannot proceed with the program itself, because the functionality provided by the language / standard libs is so limited that you must first write the tools with which to write your program.

This is why people tend to move to Java/C#. In C++, if I want to write a conf. file in XML, it takes me days to find a suitable library, insert it into the code, read the specifications, etc. Most of the time is taken up by writing the base components of the project and not the program logic itself, exactly because of the lack of the bells and the whistles. There are many, many libs out there, all providing base functionality, but when it comes to something more complex it turns out that none of them can handle it. Thus, across the whole project, implementing the logic itself is less work than implementing the base technologies that are needed but unavailable.

For example, large projects often need networking (with encryption), XML parsers (documents, validation, transformation), a powerful GUI, database connectivity (ideally for multiple types of databases), image manipulation, etc., and all of this cross-platform. In Java and C# everything is included; in C++ not exactly. You have to look through hundreds of libs to find the one that is most suitable, probably trying several before settling on one, because of the lack of the bells on one and the whistles on the other.

C++ desperately needs new technology support to survive, but it also needs to support it fully. Today's projects are all about mixing all kinds of new things together to produce more powerful programs, but there isn't anything to mix. Thus companies prefer newer languages, where they can produce value (even at the cost of speed) and not just code that's already implemented in other languages.

Ok, this got a bit long and especially off topic :)... I just wanted to point out that, in reality, the bells are the most important part of a lib; they make a technology powerful, and not just limited to a few uses. Of course this is only my opinion :)

Sincerely, Peter

Stefan Seefeld

1 Mar

10:46 p.m.

Robert Ramey wrote:

...

Stefan Seefeld wrote:

...
I think you (still) totally miss the point. Sorry.

LOL - I guess so. The first message in the thread asks the question:

...
I would like to know, if any library for parsing XML file is available in boost or not. I searched in the boost.org site for the same but, I couldn't get any library as such.

which is the question I've addressed. I'm not sure what else we're referring to here.

Well, we debated how to fill that gap, i.e. in particular, what exactly needs to be added to boost. I pointed out that I had already submitted an API (implemented in terms of libxml2), while you suggested that using spirit with some formal XML grammar might be a good starting point.

...

...
That is also the appeal of "Let's reinvent it !". There are XML parsers out there that support all the above. Why not use them ?

Perhaps that's a good question to the original poster.

I don't think so, as the quest for a boost.xml API is still valid IMO. There is no standardized way to handle XML input in C++. The point I'm trying to make here is that proposing a C++ API for XML and suggesting to (re-)implement XML handling are two quite different things.

...

...
What features of XML are you using ? External subsets or any other URLs that need to be looked up ? XInclude support to make documents modular ? Etc., etc.

How are these questions related to parsing XML syntax?

Think of C++. What does it take to 'parse C++ syntax' ? Quite a lot, as it turns out. Quite a bit of semantic analysis, to disambiguate the syntax. Now, XML grammar is (fortunately) much simpler, however I have never come across the need to only 'parse XML syntax' without doing that extra work required to do the rest. What do you expect an XML parser to return ? The XML spec has quite a clear definition of that (http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#intro). That's what Boris was referring to as a 'conforming XML parser'. Anything non-conformant shouldn't be called an 'XML parser'. There already is way too much non-conformance in the wild. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Joel de Guzman

10:44 p.m.

Boris Kolpackov wrote:

...

Hi Robert,

...

The question is whether it is a conforming XML parser? That means support for:

- namespaces

yes

...

- character references

yes

...

- entity references

yes

...

- CDATA

yes

...

- DTD well-formedness checking, entity declaration processing and replacement, substitution of default attribute values, etc.

no. it is not validating. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Sebastian Redl

2 Mar

10:22 a.m.

Joel de Guzman wrote:

...

Boris Kolpackov wrote:

...
- DTD well-formedness checking, entity declaration processing and replacement, substitution of default attribute values, etc.

no. it is not validating.

Wrong. The internal subset of a DTD must be checked for well-formedness and be processed for entity replacement even by a non-validating parser. Sebastian Redl

Joel de Guzman

11:42 a.m.

Sebastian Redl wrote:

...

Joel de Guzman wrote:

...
Boris Kolpackov wrote:

...
- DTD well-formedness checking, entity declaration processing and replacement, substitution of default attribute values, etc.

no. it is not validating.

Wrong. The internal subset of a DTD must be checked for well-formedness and be processed for entity replacement even by a non-validating parser.

Ah yes. Duh! I always mix that up. Neither is done ATM. Come Spirit-2, I intend to do both; not because I want to reinvent the wheel, but rather because I want to have practical examples for Spirit. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Boris Kolpackov

5:31 p.m.

Joel de Guzman <joel@boost-consulting.com> writes:

...

...
- entity references

yes

Well, if you don't do DTD parsing then you can't possibly support entity references. They are defined in DTD.

...

...
- DTD well-formedness checking, entity declaration processing and replacement, substitution of default attribute values, etc.

no. it is not validating.

As already pointed out, in order to be a conforming non-validating XML parser it still needs to parse and perform a number of tasks on the DTD. This is all spelled out very clearly in the spec. Though I am impressed you support namespaces and character references. -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding
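The dependency between entity references and the DTD can be illustrated with a small sketch. Given a document whose internal subset contains `<!ENTITY corp "Example Corp">`, a reference such as `&corp;` in the content can only be replaced if the parser has read that declaration. The code below assumes the declarations have already been collected into a map; the function name `expand_entities` and the `declared` parameter are hypothetical, and the sketch deliberately ignores character references, parameter entities, nested references inside replacement text, and external entities, all of which a conforming processor has to deal with.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper: replace general entity references (&name;) in text.
// `declared` stands in for the <!ENTITY name "value"> declarations that a
// parser would have collected from the DTD internal subset.
// Not handled in this sketch: character references (&#65;), parameter
// entities, references nested inside replacement text, external entities.
inline std::string expand_entities(
    const std::string& text,
    const std::map<std::string, std::string>& declared)
{
    // The five entities that are predefined by XML itself.
    static const std::map<std::string, std::string> predefined = {
        {"lt", "<"}, {"gt", ">"}, {"amp", "&"}, {"apos", "'"}, {"quot", "\""}};

    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        if (text[pos] != '&') {
            out += text[pos++];
            continue;
        }
        const std::string::size_type end = text.find(';', pos);
        if (end == std::string::npos)
            throw std::runtime_error("unterminated reference");
        assert(end > pos);  // the search started at the '&' itself
        const std::string name = text.substr(pos + 1, end - pos - 1);

        auto it = predefined.find(name);
        if (it == predefined.end()) {
            it = declared.find(name);
            if (it == declared.end())
                throw std::runtime_error("undeclared entity: " + name);
        }
        out += it->second;
        pos = end + 1;
    }
    return out;
}
```

With `declared` holding `corp` mapped to `Example Corp`, the input `&corp; &amp; friends` expands to `Example Corp & friends`. With an empty map, the same input throws, which is the situation a parser is in when it skips the DTD entirely.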

Stefan Seefeld

29 Mar

1:32 p.m.

Stefan Seefeld wrote:

...

Christian Henning wrote:

...
...
One (unofficial) submission I did is in http://boost-consulting.com/vault/ under 'Programming Interfaces'. That is a bit dated, though. I can send you a more up-to-date version I have off-list, if you are interested.

I would be interested for the new version. Can you put it in the vault?

Sure, I can. If there are a number of people who would like to get involved, maybe we can drum up enough momentum and get some space in the sandbox ? (Right after the switch to subversion... :-) )

I have now checked in my prototype into the sandbox (boost-sandbox/xml in boost-sandbox.cvs.sourceforge.net:/cvsroot/boost-sandbox), and would very much appreciate any feedback or even collaboration ! At present I only use a very minimal 'build system' using Makefiles. I would appreciate help to set up a bbv2 build infrastructure. I have put some notes into README and other files. Again, any feedback is more than welcome ! Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Boris Kolpackov

27 Feb

7:55 p.m.

Hi Stefan, Stefan Seefeld <seefeld@sympatico.ca> writes:

...

(I'd dare to say that those who propose to re-implement everything inside boost either suffer the NotInventedHere syndrome, don't have a good understanding of what XML is, or grossly underestimate the required work, not only to implement it, but also to make it reasonably efficient.)

I'd second that. One middle-ground option would be to include a small XML parser (e.g., Expat) verbatim into boost and then build the rest from that. Expat is a SAX2 parser so it might be non-trivial to build the so-called xmlreader API from it. One possibility would be to use the parser suspension for that. Not sure what impact on performance it's going to have, though. hth, -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding

Boris Kolpackov

7:50 p.m.

Sebastian Redl <sebastian.redl@getdesigned.at> writes:

...

Actually, it is whatever the compiler decides it should be. On Linux systems with a default GCC, yes, that's UTF-32, but under Windows it's typically UCS-2 or UTF-16 (with or without surrogate support, that is).

Actually, it is what *you* put into it. Compiler decides what the size of wchar_t should be. As long as your code points fit into that size, you will be fine. For example you can store UTF-16 characters in 4-byte wchar_t. -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding
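As a minimal sketch of that last point, the function below encodes a code point as UTF-16 code units and stores each unit in one `wchar_t`. It assumes C++11 (for `char32_t`) and a `wchar_t` of at least 16 bits, which holds on the common Windows and Linux toolchains but is not guaranteed by the standard; the name `append_utf16` is made up for this example.

```cpp
#include <cassert>
#include <string>

// Append one Unicode code point to out as UTF-16 code units.
// Every code unit is at most 0xFFFF, so it fits in wchar_t whether
// wchar_t is 2 bytes (typical on Windows) or 4 bytes (typical with
// GCC on Linux). Assumes wchar_t has at least 16 bits.
inline void append_utf16(std::wstring& out, char32_t cp)
{
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        // Supplementary planes: a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

Appending U+0041 and then U+1D11E yields three `wchar_t` elements: 0x0041, 0xD834, 0xDD1E. On a platform with a 4-byte `wchar_t` the upper bits of each element are simply unused, which wastes space but is otherwise harmless as long as every consumer of the string agrees that it holds UTF-16.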

Robert Ramey

3:56 p.m.

Note that the serialization library uses spirit - a boost library - to parse xml files. Spirit includes a fairly complete xml grammar and is used to parse both narrow and wide char files. Robert Ramey

Shibu Bera wrote:

...

...
I would like to know, if any library for parsing XML file is available in boost or not. I searched in the boost.org site for the same but, I couldn't get any library as such.

...
If library is present, please let me know the details.

...
If not, please allow me to know, if any work is going in the concern topic.

Thank you,

46 comments

participants (12)

Boris Kolpackov
Christian Henning
Doug Gregor
Jeff Garland
Jerry Lawson
Joel de Guzman
Mateusz Loskot
Péter Szilágyi
Robert Ramey
Sebastian Redl
Shibu Bera
Stefan Seefeld