[spirit] qi xml parser


Michael Powell

30 Jun 2014

8:14 p.m.

Hello,

I am building out a general-use XML parser, including attributes, an arbitrary number of elements, and so on.

So far so good; parsing names and so forth makes sense. However, how do you handle element content? It could be either a string or zero or more other elements (basically of the same rule as the enclosing element rule).

It would seem you need a terminus, the empty-element tag, in such a way that it populates the parent (initial) element and its children (of the same element kind).

I'll be adapting structs to capture the results. I am also using a couple of helpful references, for instance:

http://www.w3.org/TR/xml11/
http://stackoverflow.com/questions/9473843/boost-spirit-how-to-extend-xml-pa...

Also, I'm not quite sure how to capture the adapted parts at strategic rule opportunities.

My domain model will look something like this, keeping it as simple as possible:

    struct xattribute {
        std::string name;
        std::string value;
    };

    typedef std::vector<xattribute> xattribute_vector;

    struct xelement;

    typedef std::vector<xelement> xelement_vector;

    struct xelement {
        std::string name;
        std::string content;
        xattribute_vector attributes;
        xelement_vector children;
    };

Thanks...

Best regards,

Michael Powell


Michael Powell

30 Jun 2014

9:44 p.m.

On Mon, Jun 30, 2014 at 3:14 PM, Michael Powell <mwpowellhtx@gmail.com> wrote:

...


After reading the XML specification and some Boost tickets from several years ago, I'm not sure why the following couldn't represent content:

    content %= *(chars_ - chars_("<&")) | *(comment | child_element);

Here comment is defined as expected, and child_element is the potential for recursion into the element grammar where content is defined: basically a member variable of the same type as the containing struct (the element grammar).
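One thing worth checking in a rule of this shape: Spirit's | is an ordered choice and a Kleene star may match zero items, so the first alternative succeeds even when the content starts with '<', and the second alternative is then never tried. The sketch below is not Spirit code. It emulates only those two operators with the standard library to show the effect. All names (parser, star, alt, as_posted, reordered) are made up for illustration, and lit("<a/>") is just a stand-in for comment | child_element.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// A parser advances pos on success and leaves it untouched on failure.
using parser = std::function<bool(const std::string&, std::size_t&)>;

// Any single character that is not listed in `excluded`.
parser char_except(std::string excluded) {
    return [excluded](const std::string& in, std::size_t& pos) {
        if (pos < in.size() && excluded.find(in[pos]) == std::string::npos) {
            ++pos;
            return true;
        }
        return false;
    };
}

// One literal string.
parser lit(std::string text) {
    return [text](const std::string& in, std::size_t& pos) {
        if (in.compare(pos, text.size(), text) != 0) return false;
        pos += text.size();
        return true;
    };
}

// Kleene star: zero or more repetitions, so it never fails.
parser star(parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        std::size_t before = pos;
        while (p(in, pos) && pos != before) before = pos;
        return true;
    };
}

// One or more repetitions.
parser plus(parser p) {
    return [p](const std::string& in, std::size_t& pos) {
        return p(in, pos) && star(p)(in, pos);
    };
}

// Ordered choice: b is tried only if a fails.
parser alt(parser a, parser b) {
    return [a, b](const std::string& in, std::size_t& pos) {
        const std::size_t start = pos;
        if (a(in, pos)) return true;
        pos = start;
        return b(in, pos);
    };
}

// Shape of the posted rule: *text | *child
parser as_posted() { return alt(star(char_except("<&")), star(lit("<a/>"))); }

// Alternative shape: *(child | +text)
parser reordered() { return star(alt(lit("<a/>"), plus(char_except("<&")))); }

// How many characters a parser consumes from the start of the input.
std::size_t consumed(const parser& p, const std::string& in) {
    std::size_t pos = 0;
    p(in, pos);
    return pos;
}

void example() {
    assert(consumed(as_posted(), "<a/><a/>") == 0);  // text branch "succeeds" on nothing
    assert(consumed(reordered(), "<a/><a/>") == 8);  // both children are consumed
}
```

Under those assumptions the text branch wins with an empty match whenever the content starts with a tag. Folding both kinds of content into a single repetition, with the text run requiring at least one character, avoids that and also permits mixed content.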


Michael Powell

10:07 p.m.

On Mon, Jun 30, 2014 at 4:44 PM, Michael Powell <mwpowellhtx@gmail.com> wrote:

...


Indeed, I cooked up a simple(ish) example, and I get the error:

    Error 3 error C2460: 'xml::xml_element_grammar<std::_String_const_iterator<std::_String_val<std::_Simple_types<char>>>,boost::spirit::ascii::space_type>::child_element' : uses 'xml::xml_element_grammar<std::_String_const_iterator<std::_String_val<std::_Simple_types<char>>>,boost::spirit::ascii::space_type>', which is being defined i:\source\kingdom software\cppxml\xml\xiparser.h 187 1 xml

Nothing fancy, fairly plain-old XML there:

    using boost::spirit::qi::phrase_parse;
    using boost::spirit::ascii::space;

    std::string txt = "<test><one /><two>2</two><three att=\"3\"/></test>";

    xml::xml_element_grammar<> g;
    xml::xelement element;

    bool result = phrase_parse(txt.cbegin(), txt.cend(), g, space, element);

How do you model the case where a parent needs to look like a child, depending on the direction of the grammar's rule? In other words, the defining rule is a "parent", but when it's done parsing, it could very well operate like a child to a containing parent.
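For context, C2460 is what MSVC reports when a class has a by-value member of its own type, which would make the object infinitely large. With Spirit the recursion is normally expressed at the rule level instead: a qi::rule may refer to itself, or to a sibling rule, in its own definition, as the mini_xml examples shipped with Boost.Spirit do. The sketch below shows the same shape without Spirit, as a hand-written recursive-descent reader over the domain model from the first post. xml_reader, parse_xml and the helper names are hypothetical, and comments, entities and whitespace handling inside content are left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct xattribute { std::string name; std::string value; };

struct xelement {
    std::string name;
    std::string content;
    std::vector<xattribute> attributes;
    std::vector<xelement> children;  // recursion lives in the data, through a container
};

class xml_reader {
public:
    explicit xml_reader(const std::string& text) : in_(text), pos_(0) {}

    // element := '<' name attribute* ( '/>' | '>' (element | text)* '</' name '>' )
    xelement element() {
        xelement e;
        expect('<');
        e.name = name();
        for (skip_space(); peek() != '/' && peek() != '>'; skip_space())
            e.attributes.push_back(attribute());
        if (peek() == '/') {  // empty-element tag: the terminus of the recursion
            expect('/');
            expect('>');
            return e;
        }
        expect('>');
        while (!at_end_tag()) {
            if (peek() == '<') e.children.push_back(element());  // same function, one level down
            else e.content += take();
        }
        expect('<');
        expect('/');
        if (name() != e.name) throw std::runtime_error("mismatched end tag");
        skip_space();
        expect('>');
        return e;
    }

private:
    char peek() const { return pos_ < in_.size() ? in_[pos_] : '\0'; }
    char take() {
        if (pos_ >= in_.size()) throw std::runtime_error("unexpected end of input");
        return in_[pos_++];
    }
    void expect(char c) {
        if (take() != c) throw std::runtime_error(std::string("expected ") + c);
    }
    bool at_end_tag() const { return in_.compare(pos_, 2, "</") == 0; }
    void skip_space() {
        while (std::isspace(static_cast<unsigned char>(peek()))) ++pos_;
    }
    std::string name() {
        std::string n;
        while (std::isalnum(static_cast<unsigned char>(peek()))) n += take();
        if (n.empty()) throw std::runtime_error("expected a name");
        return n;
    }
    xattribute attribute() {
        xattribute a;
        a.name = name();
        expect('=');
        expect('"');
        while (peek() != '"') a.value += take();
        expect('"');
        return a;
    }

    const std::string& in_;
    std::size_t pos_;
};

xelement parse_xml(const std::string& text) {
    xml_reader reader(text);
    return reader.element();
}

void example() {
    const xelement root = parse_xml("<test><one /><two>2</two><three att=\"3\"/></test>");
    assert(root.name == "test" && root.children.size() == 3);
    assert(root.children[1].content == "2");
    assert(root.children[2].attributes[0].value == "3");
}
```

The point of the sketch is that element() produces an xelement regardless of who called it: the caller decides whether the result is the document root or gets pushed into a parent's children, so "parent" and "child" are roles rather than separate types.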


Michael Powell

1 Jul 2014

2:49 a.m.

On Mon, Jun 30, 2014 at 5:07 PM, Michael Powell <mwpowellhtx@gmail.com> wrote:

...


I made it a little way past this part. I focused on the simpler parts and got those parsing fine.

    typedef boost::make_recursive_variant<
        boost::variant<std::string, std::vector<boost::recursive_variant_> >
    >::type tag_soup;

I'm not positive, but I think the best possible way to represent what XML content can be, either a vector of xelement or a std::string, is to represent that fork in the road as a recursive_variant_.

There's still the parent/child nature to resolve, though: xelement[child] can have an xelement[parent], and xelement[parent] has children.
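As a rough illustration of that fork in the road, here is the same idea written with std::variant from C++17 rather than boost::make_recursive_variant. That is a deliberate substitution to keep the sketch free of dependencies, not what the post uses. The names xcontent, count_elements, all_text and sample are invented for the example, and attributes are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

struct xelement;  // forward declaration, completed below

// Content is either plain text or a list of child elements, never both.
using xcontent = std::variant<std::string, std::vector<xelement>>;

struct xelement {
    std::string name;
    xcontent content;
};

// Count this element plus every descendant.
std::size_t count_elements(const xelement& e) {
    std::size_t total = 1;
    if (const auto* kids = std::get_if<std::vector<xelement>>(&e.content))
        for (const xelement& child : *kids) total += count_elements(child);
    return total;
}

// Concatenate all text in document order.
std::string all_text(const xelement& e) {
    if (const auto* text = std::get_if<std::string>(&e.content)) return *text;
    std::string out;
    for (const xelement& child : std::get<std::vector<xelement>>(e.content))
        out += all_text(child);
    return out;
}

// Hand-built equivalent of <test><one/><two>2</two><three/></test>.
xelement sample() {
    xelement one{"one", std::string{}};
    xelement two{"two", std::string{"2"}};
    xelement three{"three", std::string{}};
    return xelement{"test", std::vector<xelement>{one, two, three}};
}

void example() {
    const xelement root = sample();
    assert(count_elements(root) == 4);
    assert(all_text(root) == "2");
}
```

One consequence of modelling content as "string or children" is that mixed content (text interleaved with child elements) has no representation. If that matters, a vector whose items are themselves a variant of text and element is one possible alternative.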


Gavin Lambert

7 Jul 2014

4:13 a.m.

On 1/07/2014 08:14, Michael Powell wrote:

...

So far so good, makes sense parsing names and so forth. However, how do you handle element content? Which could either be a string, or zero or more other elements (basically of the same rule as the enclosing element rule).

It would seem you need a terminus, the empty element tag. In such a way that populates the parent (initial) element, and its children (of the same element kind).

Usually this sort of thing is resolved by using nodes -- a node is the basic base type of everything, and the entire tree can be treated as a homogeneous tree of nodes. But specific derived types of nodes vary in behaviour, e.g. an element node represents a tag, an attribute node modifies its parent element, a text node can only contain basic text (no sub-nodes), a whitespace node can only contain whitespace (and can be discarded or reinserted unless it's a significant-whitespace node), etc.

You could either use actual inheritance or you could make "node" a variant type over the various "subtypes". Which one makes more sense may depend on how much implementation they end up wanting to have in common.
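A minimal sketch of the inheritance flavour of that suggestion, assuming C++14 for std::make_unique. The class names (node, text_node, element_node) and the to_xml method are hypothetical, attributes are stored as plain name/value pairs on the element rather than as nodes of their own, and no escaping is done.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common base: every part of the document is a node.
class node {
public:
    virtual ~node() = default;
    virtual std::string to_xml() const = 0;
};

// Leaf holding plain text; it has no children.
class text_node : public node {
public:
    explicit text_node(std::string text) : text_(std::move(text)) {}
    std::string to_xml() const override { return text_; }

private:
    std::string text_;
};

// A tag with attributes and an ordered list of child nodes of any kind.
class element_node : public node {
public:
    explicit element_node(std::string name) : name_(std::move(name)) {}

    void set_attribute(std::string name, std::string value) {
        attributes_.emplace_back(std::move(name), std::move(value));
    }
    void add_child(std::unique_ptr<node> child) { children_.push_back(std::move(child)); }

    std::string to_xml() const override {
        std::string out = "<" + name_;
        for (const auto& a : attributes_) out += " " + a.first + "=\"" + a.second + "\"";
        if (children_.empty()) return out + "/>";
        out += ">";
        for (const auto& c : children_) out += c->to_xml();  // homogeneous traversal
        return out + "</" + name_ + ">";
    }

private:
    std::string name_;
    std::vector<std::pair<std::string, std::string>> attributes_;
    std::vector<std::unique_ptr<node>> children_;
};

// Builds the sample document from earlier in the thread and serialises it.
std::string sample_xml() {
    element_node root("test");
    root.add_child(std::make_unique<element_node>("one"));
    auto two = std::make_unique<element_node>("two");
    two->add_child(std::make_unique<text_node>("2"));
    root.add_child(std::move(two));
    auto three = std::make_unique<element_node>("three");
    three->set_attribute("att", "3");
    root.add_child(std::move(three));
    return root.to_xml();
}

void example() {
    assert(sample_xml() == "<test><one/><two>2</two><three att=\"3\"/></test>");
}
```

Because children are held as pointers to the base type, an element can contain text and other elements in any order, and code that walks the tree needs only the node interface.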

Michael Powell

1:38 p.m.

On Sun, Jul 6, 2014 at 11:13 PM, Gavin Lambert <gavinl@compacsort.com> wrote:

...


It's an interesting question; thank you for the insight. I get what you mean, and yes, it is all about trade-offs between the utility of the domain model and developing a grammar that will help, and not hinder, the domain model. The interesting questions have to do with how and when to allow children (e.g. likely not in a comment node). It could be that generally everything can be modeled as a tree, i.e. xobject children of an xobject parent, with appropriate specializations and/or treatment of references, pointers, iterators, etc. Then the trade-off is trusting that the end user is responsible in his/her usage of the domain model.

...

_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

Gavin Lambert

11:09 p.m.

On 8/07/2014 01:38, Michael Powell wrote:

...


Yes, that's the idea: you just have a single tree of objects, but each one is specialised with its particular quirks. (Have a look at the XML DOM definitions.)

If you use actual inheritance, you could have the base class provide the common functionality (tracking parents and children, with add/remove methods) but have them influenced by protected abstract policy methods implemented by the specific type classes, e.g. such that a comment node can refuse to have children added to it.

Also, while I haven't looked at it myself, I recall recently seeing this post, which you might want to investigate further: http://lists.boost.org/boost-users/2014/05/82093.php
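The policy-method idea could be sketched roughly as follows, assuming C++14. The names accepts_child, element_node and comment_node are hypothetical and only the add path is shown. The base class does the parent/child bookkeeping, and each derived type answers one protected question.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class node {
public:
    virtual ~node() = default;

    // Common bookkeeping lives here; a rejected child is simply destroyed.
    bool add_child(std::unique_ptr<node> child) {
        if (!child || !accepts_child(*child)) return false;
        child->parent_ = this;
        children_.push_back(std::move(child));
        return true;
    }
    const node* parent() const { return parent_; }
    std::size_t child_count() const { return children_.size(); }
    const node& child(std::size_t i) const { return *children_[i]; }

protected:
    // Policy hook: each concrete node type decides what it may contain.
    virtual bool accepts_child(const node& candidate) const = 0;

private:
    node* parent_ = nullptr;
    std::vector<std::unique_ptr<node>> children_;
};

class element_node : public node {
public:
    explicit element_node(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }

protected:
    bool accepts_child(const node&) const override { return true; }

private:
    std::string name_;
};

class comment_node : public node {
public:
    explicit comment_node(std::string text) : text_(std::move(text)) {}
    const std::string& text() const { return text_; }

protected:
    bool accepts_child(const node&) const override { return false; }  // comments are leaves

private:
    std::string text_;
};

// Usage: an element takes a comment as a child; a comment takes nothing.
void example() {
    element_node root("test");
    const bool accepted = root.add_child(std::make_unique<comment_node>("note"));
    comment_node leaf("note");
    const bool refused = !leaf.add_child(std::make_unique<element_node>("x"));
    assert(accepted && refused);
    assert(root.child_count() == 1 && leaf.child_count() == 0);
    assert(root.child(0).parent() == &root);
}
```

Returning false is only one way to signal a refusal; throwing an exception would be another. Since each child stores a raw pointer to its parent, this sketch assumes a parent is not moved or copied after children have been added.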
