Testing Property Tree xml parser

Jose

26 Apr 2006

4:49 p.m.

Hi,

I am trying to finish my review of the Property Tree. I find the scope of the library for program configuration very useful.

I am currently testing only the read_xml parsing, and although it is only meant for very simple xml files, I find its xml support very sketchy. I've performed multiple experiments with RSS feed files, which are very simple xml files.

This is the loop I am using to test different xml paths using a command line utility:

    BOOST_FOREACH(ptree::value_type &v, pt.get_child( argv[2] ))
        cout << "value: " << v.second.data() << endl;

Below is a list of my experiments:

1. Parsing the artima.com spotlight feed
Result: FAILED
The path is rdf:RDF.item.title and I get "invalid character entity". I think the parser should support the colon within the tag name, given that in many cases the config files might be generated by real xml programs which use namespaces, and it should be able to read them even if it does not support saving them.

2. Parsing the MSDN visual c++ feed
Result: FAILED
The path is rss.channel.item.title and I get an "xml parse error". Is there a possibility of getting more meaningful errors?

3. Parsing the main CNN feed
Result: FAILED
The path is rss.channel.item.title. This query fails with no error, but if the path is shortened to rss.channel.item it dumps all the values within item, even though there is no value at that level (only nested tags).

4. Parsing the Google News RSS feed
Result: FAILED
The path is rss.channel.item.title. I get "Invalid character entity error". A more meaningful error should be possible, with the position in the file where the entity occurs.

5. Parsing the Google News Atom feed
Result: FAILED
The path is feed.entry.title. I get "Invalid character entity error".

regards
jose

Sebastian Redl

26 Apr

5:48 p.m.

Jose wrote:

...

I am currently testing only the read_xml parsing, and although it is only meant for very simple xml files, I find its xml support very sketchy.

<snip>

1. parsing the artima.com spotlight feed

Result: FAILED

The path is rdf:RDF.item.title and I get "invalid character entity". I think the parser should support the colon within the tag name, given that in many cases the config files might be generated by real xml programs which use namespaces, and it should be able to read them even if it does not support saving them.

The problem here is not the colon. The problem is the &quot; entity, which is a required part of XML but is not supported by boost::property_tree::xml_parser::decode_char_entities() in detail/xml_parser_utils.hpp, lines 62-87. Also not supported is the &apos; entity, which is also required. Definitely a bug in PropTree.
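
For reference, XML predefines five named entities: &lt; &gt; &amp; &quot; and &apos;. The following is a minimal sketch of a decoder that handles all five, using only the standard library. The name decode_predefined_entities is hypothetical; this is neither the code in xml_parser_utils.hpp nor the attached patch, and numeric character references are deliberately left out.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT PropTree's decode_char_entities():
// replaces the five predefined XML entities with their characters.
// Numeric references such as &#10; are not handled in this sketch.
std::string decode_predefined_entities(const std::string& in) {
    static const struct { const char* name; char ch; } table[] = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'}, {"quot", '"'}, {"apos", '\''},
    };
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] != '&') { out += in[i]; continue; }
        const std::string::size_type semi = in.find(';', i);
        if (semi == std::string::npos)
            throw std::runtime_error("unterminated entity");
        const std::string name = in.substr(i + 1, semi - i - 1);
        bool found = false;
        for (const auto& e : table) {
            if (name == e.name) { out += e.ch; found = true; break; }
        }
        if (!found)
            throw std::runtime_error("invalid character entity: &" + name + ";");
        i = semi;  // resume after the terminating ';'
    }
    return out;
}
```

Including the offending entity name in the exception text is one cheap way to make the error more informative, in the spirit of the request above.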

...

2. parsing the MSDN visual c++ feed

Result: FAILED

The path is rss.channel.item.title and I get an "xml parse error". Is there a possibility of getting more meaningful errors?

Do you have a URL?

...

3. parsing the main CNN feed

Result: FAILED

The path is rss.channel.item.title. This query fails with no error but if the path is shortened to rss.channel.item it dumps all the values within item, but there is no value at that level (only nested tags)

You misunderstand your own program. A node has only one value. What your loop does is retrieve all the children of the node you select with the path and print their values. So for the path rss.channel.item.title, you get the title element of the first item element in the channel. This element has no children, so the loop is never entered. In your second test you specify rss.channel.item, so you get the item element. This element has four children: the title, link, description and pubDate elements. For each of these children, the value (content) is printed. The test succeeded.
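
The behaviour described here can be reproduced with a toy model. This is only a sketch: Node, find_first, child_values and sample_feed are hypothetical names, the sample data is made up, and none of it is the ptree API. It resolves a dotted path by taking the first match at each step and then lists the values of the selected node's direct children, which is what the test loop prints.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a tree node; hypothetical, not boost::property_tree::ptree.
struct Node {
    std::string key;
    std::string data;            // the node's single value
    std::vector<Node> children;  // several children may share one key
};

// Resolves a dotted path, taking the FIRST matching child at each step.
// Returns nullptr when a step has no match.
const Node* find_first(const Node& root, const std::string& path) {
    const Node* cur = &root;
    std::istringstream steps(path);
    std::string step;
    while (std::getline(steps, step, '.')) {
        const Node* next = nullptr;
        for (const Node& c : cur->children)
            if (c.key == step) { next = &c; break; }
        if (!next) return nullptr;
        cur = next;
    }
    return cur;
}

// What the test loop prints: the value of every direct child of n.
std::vector<std::string> child_values(const Node& n) {
    std::vector<std::string> out;
    for (const Node& c : n.children) out.push_back(c.data);
    return out;
}

// Made-up one-item feed with the four children named in the reply above.
Node sample_feed() {
    Node item{"item", "", {{"title", "T", {}}, {"link", "L", {}},
                           {"description", "D", {}}, {"pubDate", "P", {}}}};
    Node channel{"channel", "", {item}};
    Node rss{"rss", "", {channel}};
    return Node{"", "", {rss}};
}
```

With this model, selecting rss.channel.item.title yields a node with no children, so a loop over its children runs zero times, while selecting rss.channel.item yields four child values.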

...

4. Parsing the Google News RSS feed

Result: FAILED

The path is rss.channel.item.title. I get "Invalid character entity error". A more meaningful error should be possible with the position in the file where the entity occurs.

Again, the problem seems to be the &quot; entity.

...

5. Parsing the Google News Atom feed

Result: FAILED

The path is feed.entry.title. I get "Invalid character entity error".

Same.

Attached is a patch that fixes the bug.

Sebastian Redl

Jose

6:13 p.m.

On 4/26/06, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:

...

...
2. parsing the MSDN visual c++ feed

Result: FAILED

The path is rss.channel.item.title and I get an "xml parse error". Is there a possibility of getting more meaningful errors?

Do you have a URL?

http://msdn.microsoft.com/visualc/rss.xml

...

3. parsing the main CNN feed

...
Result: FAILED

The path is rss.channel.item.title. This query fails with no error but if the path is shortened to rss.channel.item it dumps all the values within item, but there is no value at that level (only nested tags)

You misunderstand your own program. A node has only one value. What your loop does is retrieve all the children of the node you select with the path and print their values. So for the path rss.channel.item.title, you get the title element of the first item element in the channel. This element has no children, so the loop is never entered. In your second test you specify rss.channel.item, so you get the item element. This element has four children: the title, link, description and pubDate elements. For each of these children, the value (content) is printed. The test succeeded.

So, what is the code to read the multiple titles? This is my oversight for not looking at this in more detail.

...

4. Parsing the Google News RSS feed

...
Result: FAILED

The path is rss.channel.item.title. I get "Invalid character entity error". A more meaningful error should be possible with the position in the file where the entity occurs.

Again, the problem seems to be the &quot; entity.

...
5. Parsing the Google News Atom feed

Result: FAILED

The path is feed.entry.title. I get "Invalid character entity error".

Same.

Attached is a patch that fixes the bug.

Sebastian Redl

Thanks, I think the multiple rss feeds are good for testing as they can expose multiple issues. I am glad you tested this as well.

Jose

Sebastian Redl

6:31 p.m.

Jose wrote:

...

http://msdn.microsoft.com/visualc/rss.xml

Parses and displays fine for me. That's without my patch.

...

So, what is the code to read the multiple titles ? This is my oversight for not looking at this in more detail

It's rather complicated. This is not what PropTree was built for, as far as I can see - support for multiple nodes with the same name is rather weak. Path resolution always takes the first that comes up. It's not an XPath engine - not by far.

The steps to take in this specific case would be:

1) Get the child "rss.channel".
2) Call sort() on the channel. (Marcin, is sort() stable, i.e. would it preserve the relative order of the item elements here?)
3) Use std::equal_range() with "item" to get the start and end iterator of the item elements.
4) For each of the items, get the child "title".

Sebastian Redl
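
Steps 2 to 4 can be sketched on a plain vector, as a stand-in for the children of "rss.channel". This is an illustration under stated assumptions, not ptree code: Child, KeyLess and item_titles are hypothetical names, each child is reduced to a (key, title) pair, and std::stable_sort is used so that the question about the stability of ptree's own sort() does not arise here.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Toy model: the children of "rss.channel" as (key, title) pairs
// in document order. Hypothetical; a real ptree child is a subtree.
using Child = std::pair<std::string, std::string>;

// Orders children by key only, so entries with equal keys are equivalent.
struct KeyLess {
    bool operator()(const Child& a, const Child& b) const {
        return a.first < b.first;
    }
};

// Sort by key, locate the run of "item" entries, collect their titles.
// The vector is taken by value so the caller's order is left untouched.
std::vector<std::string> item_titles(std::vector<Child> channel) {
    // stable_sort keeps equal keys in their original relative order.
    std::stable_sort(channel.begin(), channel.end(), KeyLess());
    const auto range = std::equal_range(channel.begin(), channel.end(),
                                        Child("item", ""), KeyLess());
    std::vector<std::string> titles;
    for (auto it = range.first; it != range.second; ++it)
        titles.push_back(it->second);
    return titles;
}
```

Sorting costs O(n log n) and reorders the working copy, which is the main drawback of this approach compared with a single pass that compares keys.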

Jose

6:46 p.m.

On 4/26/06, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:

...

Jose wrote:

...
http://msdn.microsoft.com/visualc/rss.xml

Parses and displays fine for me. That's without my patch.

Hi Sebastian,

I get this:

Error: /root/vc.xml(1): xml parse error

I am running the latest FC4 on x86_64. What platform are you using?

...

So, what is the code to read the multiple titles ? This is my oversight for not looking at this in more detail

It's rather complicated. This is not what PropTree was built for, as far as I can see - support for multiple nodes with the same name is rather weak. Path resolution always takes the first that comes up. It's not an XPath engine - not by far.

I would think this is needed to query a configuration file with a normal amount of settings.

The steps to take in this specific case would be:

...

1) Get the child "rss.channel".
2) Call sort() on the channel. (Marcin, is sort() stable, i.e. would it preserve the relative order of the item elements here?)
3) Use std::equal_range() with "item" to get the start and end iterator of the item elements.
4) For each of the items, get the child "title".

I would like to get a method to handle all value matches given a path string. One cannot afford to write code for a simple case like this. I am not asking for full XPath, but the library is probably incomplete if it cannot handle a basic scenario like this. Marcin, is this on the todo list?

In many cases I want to use an external xml file as the configuration file and then use a path query to get the results that I need as configuration input. In this case I don't want to load the whole xml into a configuration object but just the subset (I misunderstood the library, as I quickly hacked the debug_settings example and thought this was a no-brainer).

Sebastian Redl

...

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Sebastian Redl

7:44 p.m.

Jose wrote:

...

Hi Sebastian,

I get this:

Error: /root/vc.xml(1): xml parse error

I am running latest FC4 on x86_64. what platform are you using ?

Gentoo on x86_64, so there should be no difference. Can you send me your input file?

...

I would think this is needed to query a configuration file with a normal amount of settings

No, it is not, actually. As long as each collection of elements is always wrapped by an element containing nothing else, it is easy to iterate over these things. Alternatively, one can iterate over all subnodes and handle them according to key (it->first). For most configurations, that will suffice.

...

In many cases I want to use an external xml file as the configuration file and then use a path query to get the results that I need as configuration input. In this case I don't want to load the whole xml in a configuration object but just the subset (I misunderstood the library as I quickly hacked the debug_settings example and thought this was no brainer)

I don't think it's possible to load just a subset. At least not with the current parser.

Sebastian Redl

Marcin Kalicinski

10:04 p.m.

...

...
http://msdn.microsoft.com/visualc/rss.xml

Parses and displays fine for me. That's without my patch.

...
So, what is the code to read the multiple titles ? This is my oversight for not looking at this in more detail

It's rather complicated. This is not what PropTree was built for, as far as I can see - support for multiple nodes with the same name is rather weak.

I would not say it was not built for it - it is a legitimate use. It may rather indicate that the ptree interface is lacking something with regard to multiple nodes with the same name. But maybe it's not that bad, see below.

...

Path resolution always takes the first that comes up. It's not an XPath engine - not by far. The steps to take in this specific case would be: 1) Get the child "rss.channel". 2) Call sort() on the channel. (Marcin, is sort() stable, i.e. would it

Sort is not stable. But I think there is a better algorithm: iterate over all nodes and compare keys. It is O(n) instead of O(n log n), plus it preserves the sequence of nodes. This will list all the titles:

    BOOST_FOREACH(boost::property_tree::ptree::value_type &v, pt.get_child( argv[2] ))
        if (v.first == "item")
            std::cout << "title: " << v.second.get("title", "no title") << '\n';

Best regards,
Marcin

Jose

27 Apr

10:56 a.m.

On 4/27/06, Marcin Kalicinski <kalita@poczta.onet.pl> wrote:

...

Sort is not stable. But I think there is a better algorithm: iterate over all nodes and compare keys. It is O(n) instead of O(n log n), plus it preserves the sequence of nodes. This will list all the titles:

    BOOST_FOREACH(boost::property_tree::ptree::value_type &v, pt.get_child( argv[2] ))
        if (v.first == "item")
            std::cout << "title: " << v.second.get("title", "no title") << '\n';

Yes, this is what I was looking for. What I think is missing is a method to construct the full query path, so that you don't have to code for "item" and "title", something like pt.get_match("rss.channel.item.title"). Also, the fact that it preserves the ordering is important (and performance is less important in this case).
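
To make the request concrete, here is a rough sketch of what such a lookup might do, written against a toy tree rather than ptree. get_match is only the name proposed above and does not exist in the library; Node and collect are likewise hypothetical. At every step of the dotted path the sketch follows all children with a matching key instead of only the first, and it visits them in document order.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a tree node; hypothetical, not boost::property_tree::ptree.
struct Node {
    std::string key;
    std::string data;
    std::vector<Node> children;
};

// Appends to out the data of every node reachable from n along the dotted
// path, descending into ALL children whose key matches the current step.
void collect(const Node& n, const std::string& path,
             std::vector<std::string>& out) {
    if (path.empty()) { out.push_back(n.data); return; }
    const std::string::size_type dot = path.find('.');
    const std::string head = path.substr(0, dot);
    const std::string rest =
        dot == std::string::npos ? std::string() : path.substr(dot + 1);
    for (const Node& c : n.children)
        if (c.key == head) collect(c, rest, out);
}

// Hypothetical free-function version of the proposed pt.get_match(path).
std::vector<std::string> get_match(const Node& root, const std::string& path) {
    std::vector<std::string> out;
    collect(root, path, out);
    return out;
}
```

Because children are visited in the order they are stored, the results keep the ordering of the source file, which is the property asked for above.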

Marcin Kalicinski

28 Apr

12:54 p.m.

...

Yes, this is what I was looking for. What I think is missing is a method to construct the full query path, so that you don't have to code for "item" and "title", something like pt.get_match("rss.channel.item.title").

This starts to look like XPath/XSLT and will quickly get really, really complicated. I don't think the world needs another standard competing with the W3C ones. The power of ptree is that you can write such queries in C++, and it takes just 3 lines of code to do it!

Best regards,
Marcin

Marcin Kalicinski

26 Apr

10:05 p.m.

...

The path is rdf:RDF.item.title and I get "invalid character entity".

I just added the missing &quot; and &apos; entities. This was a definite bug.

...

Result: FAILED

The path is rss.channel.item.title and I get an "xml parse error". Is there a possibility of getting more meaningful errors?

The error should have a line number on it. I tried the URL you posted later, and it parses fine for me. Perhaps you saved the xml file in UTF-8? UTF-8 files sometimes have a binary prefix (the byte order mark), which will make the parsing fail unless you are using a locale with a UTF-8 codecvt facet. Try saving the file in ASCII.

...

3. parsing the main CNN feed
4. Parsing the Google News RSS feed
5. Parsing the Google News Atom feed

Can you post URLs for these (or better, the xml files you tried to parse)? I will investigate.

Best regards,
Marcin

Jose

11:55 p.m.

On 4/27/06, Marcin Kalicinski <kalita@poczta.onet.pl> wrote:

...

...
3. parsing the main CNN feed
4. Parsing the Google News RSS feed
5. Parsing the Google News Atom feed

Can you post URLs for these (or better xml files you tried to parse?). I will investigate.

3. http://rss.cnn.com/rss/cnn_topstories.rss
4. http://news.google.com/?output=rss
5. http://news.google.com/?output=atom

Thanks

Best regards,

...

Marcin

Jose

27 Apr

11:28 a.m.

On 4/27/06, Marcin Kalicinski <kalita@poczta.onet.pl> wrote:

...

The error should have a line number on it. I tried the URL you posted later, and it parses fine for me. Perhaps you saved the xml file in UTF-8? UTF-8 files sometimes have a binary prefix (the byte order mark), which will make the parsing fail unless you are using a locale with a UTF-8 codecvt facet. Try saving the file in ASCII.

I removed the binary prefix and it works fine. Shouldn't the PT library handle this automatically? UTF-8 is the most common encoding nowadays.
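
The binary prefix in question is the UTF-8 byte order mark, the three bytes EF BB BF at the start of the file. A minimal sketch of removing it in code rather than by hand follows; strip_utf8_bom is a hypothetical helper and not part of the Property Tree library.

```cpp
#include <string>

// Hypothetical helper, not part of PropTree: returns text without a
// leading UTF-8 byte order mark (bytes EF BB BF). Input that does not
// start with the mark is returned unchanged.
std::string strip_utf8_bom(const std::string& text) {
    if (text.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return text.substr(3);
    return text;
}
```

One possible use, assuming the version under review offers a stream overload of read_xml, is to read the whole file into a string, strip the mark, and parse from a std::istringstream built on the result.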
