
Hi, I am at present using spirit for parsing unipen files as specified in http://www.unipen.org/dataformats.html . That is great and quite easy to parse, thanks to spirit. Now I am looking to use spirit for parsing a specific xml file (w3c inkml file). So my intention is not to have a generic xml parser, but rather a specific xml parser (which also has some BNF grammar). Has anyone used spirit for domain specific xml parsing? I believe using spirit will make it faster. Also I am interested in parsing only a portion of the whole document at a time, and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB). My xml file is something like,
<page>
  <trace ref = "1"> 0,1,2,3,4,3,4,3,5,4,3,4 </trace>
  ...
</page>
note that inside <trace> the grammar is a BNF (comma sep float pairs mostly). Any comments, or snippets to show how to do it?
--
Abir Basak, Member IEEE
Software Engineer, Read Ink Technologies
B. Tech, IIT Kharagpur
email: abir@abirbasak.com
homepage: www.abirbasak.com

On 2/12/07, abir basak wrote:
Hi, I am at present using spirit for parsing unipen files as specified in http://www.unipen.org/dataformats.html . That is great and quite easy to parse, thanks to spirit. Now I am looking to use spirit for parsing a specific xml file (w3c inkml file). So my intention is not to have a generic xml parser, but rather a specific xml parser (which also has some BNF grammar). Has anyone used spirit for domain specific xml parsing? I believe using spirit will make it faster. Also I am interested in parsing only a portion of the whole document at a time, and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB).
Spirit is a great parser but if what you are aiming for is speed it's probably not for you. It can be quite slow compared to hand-written parsers (especially with complex grammars). If you still want to use it though, I think I remember one of the spirit examples involving parsing some basic XML. Libxml2's xmlreader gives a forward-only reader that doesn't generate any DOM, with speed that will be hard to beat. It can also validate using a schema, which can be useful if your app ever has a chance of being given an invalid inkml file. -- Cory Nelson

Hi Cory,
Cory Nelson writes:
Libxml2's xmlreader gives a forward-only reader that doesn't generate any DOM, with speed that will be hard to beat.
Any decent SAX2 parser beats xmlreader. Expat and Libxml2's own SAX2 do.
It can also validate using a schema, which can be useful if your app ever has a chance of being given an invalid inkml file.
Validation is a good idea though Libxml2's XML Schema implementation is far from complete. -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding

Cory Nelson wrote:
On 2/12/07, abir basak wrote:
Hi, I am at present using spirit for parsing unipen files as specified in http://www.unipen.org/dataformats.html . That is great and quite easy to parse, thanks to spirit. Now I am looking to use spirit for parsing a specific xml file (w3c inkml file). So my intention is not to have a generic xml parser, but rather a specific xml parser (which also has some BNF grammar). Has anyone used spirit for domain specific xml parsing? I believe using spirit will make it faster. Also I am interested in parsing only a portion of the whole document at a time, and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB).
Spirit is a great parser but if what you are aiming for is speed it's probably not for you. It can be quite slow compared to hand-written parsers (especially with complex grammars). If you still want to use it though, I think I remember one of the spirit examples involving parsing some basic XML.
We are hoping to address the performance concerns with 2.0 (under development). Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Note that your task requirements are similar to those of xml serialization, where a spirit parser was used with good results and a good programmer experience. The relevant file is boost/archive/xml_grammar.*
Robert Ramey
abir basak wrote:
Hi, I am at present using spirit for parsing unipen files as specified in http://www.unipen.org/dataformats.html . That is great and quite easy to parse, thanks to spirit. Now I am looking to use spirit for parsing a specific xml file (w3c inkml file). So my intention is not to have a generic xml parser, but rather a specific xml parser (which also has some BNF grammar). Has anyone used spirit for domain specific xml parsing? I believe using spirit will make it faster. Also I am interested in parsing only a portion of the whole document at a time, and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB). My xml file is something like, <page> <trace ref = "1"> 0,1,2,3,4,3,4,3,5,4,3,4 </trace> ... </page> note that inside <trace> the grammar is a BNF (comma sep float pairs mostly). Any comments, or snippets to show how to do it?

Hi Abir,
abir basak writes:
Now I am looking to use spirit for parsing a specific xml file (w3c inkml file). So my intention is not to have a generic xml parser, but rather a specific xml parser (which also has some BNF grammar). Has anyone used spirit for domain specific xml parsing?
Trust me you don't want to go this route. Parsing XML is a lot more than finding opening and closing tags. To implement a conforming XML parser you will need to handle namespaces, entity references, CDATA, etc. This is a lot harder to get right than most people think. The only time it makes sense to have a domain specific XML parser is when you have control over all your XML instances and can make sure that only a subset of XML 1.0 is used. This is normally done for performance reasons.
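To make one of these points concrete: even character data as simple as "a &lt; b" has to be decoded before it can be used. The following is only a minimal sketch in plain standard C++; decode_predefined_entities is a hypothetical helper name, and it covers nothing beyond the five entity references predefined by XML 1.0. Numeric character references, DTD-declared entities, CDATA sections and namespaces are deliberately left out, which gives a feel for how much a conforming parser still has to do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from any library): replace the five entity
// references predefined by XML 1.0 with the characters they stand for.
// Anything else starting with '&' (numeric references such as &#65;,
// DTD-declared entities, stray ampersands) is copied through unchanged.
std::string decode_predefined_entities(const std::string& in)
{
    static const struct { const char* name; char value; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };

    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto& e : table) {
                const std::string name(e.name);
                // compare() clamps the length, so a truncated reference
                // at the end of the input simply fails to match.
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.value;
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += in[i++];   // ordinary character: copy as-is
    }
    return out;
}
```

The input is decoded in a single pass, so the output of one replacement is never rescanned: "&amp;lt;" becomes "&lt;", not "<".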
I believe using spirit will make it faster.
Highly unlikely since most of the XML parsers are hand-coded.
Also I am interested in parsing only a portion of the whole document at a time, and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB). My xml file is something like,
[...]
note that inside <trace> the grammar is a BNF (comma sep float pairs mostly)
You can use a SAX2 parser (e.g., Expat or Xerces-C++) to handle XML and then use Spirit-based parser to handle the data. hth, -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding
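For illustration, the data half of that split (the text of one <trace> element, as a SAX character-data callback would deliver it) can be handled in a few lines. This is a minimal sketch in plain standard C++ rather than Spirit, so that it stays self-contained; parse_trace is a hypothetical name, and the payload is assumed to be the comma-separated numbers from the example in the original post. A Spirit grammar for the same payload would take its place in a real setup.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parse a payload such as " 0,1,2,3 " into doubles.
// Values are appended to `out`. Returns false on malformed input (empty
// field, trailing comma, stray characters); in that case `out` may
// already hold the values parsed before the error.
bool parse_trace(const std::string& text, std::vector<double>& out)
{
    const char* p = text.c_str();
    for (;;) {
        char* end = nullptr;
        const double v = std::strtod(p, &end);  // skips leading whitespace
        if (end == p) return false;             // no number where one was expected
        out.push_back(v);

        p = end;
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') ++p;

        if (*p == '\0') return true;            // clean end of payload
        if (*p != ',') return false;            // anything but a separator is an error
        ++p;                                    // consume the comma
    }
}
```

Since the numbers are described as pairs, a caller would presumably also check that out.size() is even before grouping them; that check is left out here because the exact pairing rules are not given in the thread.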

Boris Kolpackov wrote:
Hi Abir,
abir basak writes:
Now I am looking to use spirit for parsing a specific xml file (w3c inkml file). So my intention is not to have a generic xml parser, but rather a specific xml parser (which also has some BNF grammar). Has anyone used spirit for domain specific xml parsing?
Trust me you don't want to go this route. Parsing XML is a lot more than finding opening and closing tags. To implement a conforming XML parser you will need to handle namespaces, entity references, CDATA, etc. This is a lot harder to get right than most people think.
Yes, I know the full xml grammar is really hard to implement. I had a tough time implementing it in ANTLR :( Here my intention is not to use the full xml grammar but to define a subset of it and test how it performs, especially since I know all the tags and which attributes they can contain. So generalized validation is not needed either, as the grammar itself will validate the file. Moreover, the file is not purely xml; it also contains BNF grammar (like SVG, or the one I gave as an example). I will surely use a full-fledged xml parser if the situation demands it. But for now I am in the mood to experiment with this particular subset of xml (a w3c format known as InkML, or even a subset of InkML).
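As a rough idea of what such an experiment could start from, here is a minimal sketch of a forward-only scan that hands over one <trace> element at a time without building any tree. It is plain standard C++, not Spirit, and for_each_trace is a hypothetical name. It only works under strong assumptions that the producer of the files would have to guarantee: no comments, CDATA sections or entity references, no '>' inside attribute values, no nested or self-closing <trace> elements, and the whole document already in memory (reasonable at 4-20 MB). It is not a conforming XML parser.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: call `visit(attrs, payload)` for every
// <trace ...>payload</trace> in `doc`, in document order, and return
// how many were visited. `attrs` is the raw text between "<trace" and
// '>', `payload` the raw text between the tags.
std::size_t for_each_trace(
    const std::string& doc,
    const std::function<void(const std::string& attrs,
                             const std::string& payload)>& visit)
{
    static const std::string open_tag  = "<trace";
    static const std::string close_tag = "</trace>";

    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = doc.find(open_tag, pos)) != std::string::npos) {
        const std::size_t after = pos + open_tag.size();
        if (after >= doc.size()) break;

        // The element name must end here; this skips longer names that
        // merely start with "trace" (for example <traceGroup>).
        const char c = doc[after];
        if (c != '>' && c != ' ' && c != '\t' && c != '\n' && c != '\r') {
            pos = after;
            continue;
        }

        const std::size_t tag_end = doc.find('>', after);
        if (tag_end == std::string::npos) break;
        const std::size_t body_end = doc.find(close_tag, tag_end + 1);
        if (body_end == std::string::npos) break;

        visit(doc.substr(after, tag_end - after),
              doc.substr(tag_end + 1, body_end - tag_end - 1));
        ++count;
        pos = body_end + close_tag.size();
    }
    return count;
}
```

The callback is where a grammar for the trace data (Spirit or otherwise) would plug in, so only the portion of the document currently being visited needs to be turned into data.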
The only time it makes sense to have a domain specific XML parser is when you have control over all your XML instances and can make sure that only a subset of XML 1.0 is used. This is normally done for performance reasons.
Yes, the grammar of the file format is specific, just like xhtml or mathml doesn't need to match all nodes.
I believe using spirit will make it faster.
Highly unlikely since most of the XML parsers are hand-coded.
Not sure why! I have always found specific xml parsers written in Antlr (the widely used language recognition tool) to be faster than generic ones.
Also I am interested in parsing only a portion of the whole document at a time, and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB). My xml file is something like,
[...]
note that inside <trace> the grammar is a BNF (comma sep float pairs mostly)
You can use a SAX2 parser (e.g., Expat or Xerces-C++) to handle XML and then use Spirit-based parser to handle the data.
At present I am using the Qt SAX parser, which is a good one. The spirit idea is just a thought specific to this particular task.
hth, -boris
And thanks for suggestions ... -- Abir Basak, Member IEEE Software Engineer, Read Ink Technologies B. Tech, IIT Kharagpur email: abir@abirbasak.com homepage: www.abirbasak.com

Hi Abir,
abir basak writes:
Yes, the grammar of the file format is specific, just like xhtml or mathml doesn't need to match all nodes.
XHTML is an XML 1.0 vocabulary, which means it can contain all kinds of valid XML constructs, including entity references, CDATA, etc. I believe InkML is the same. I think unless you control the production of the XML and can restrict the feature set used (as boost serialization does, for example), you are really forcing yourself into a corner, since you won't be able to handle all valid instances of your vocabulary.
Highly unlikely since most of the XML parsers are hand-coded.
Not sure why! I have always found specific xml parsers written in Antlr (the widely used language recognition tool) to be faster than generic ones.
It is possible that you can come up with an Antlr-generated parser for a subset of XML that is faster than the general-purpose parser. Though I still doubt it and will believe it when I see the benchmark results ;-). Of course we are talking about comparing high-performance parsers such as SAX2 here. hth, -boris -- Boris Kolpackov Code Synthesis Tools CC http://www.codesynthesis.com Open-Source, Cross-Platform C++ XML Data Binding
participants (5):
- abir basak
- Boris Kolpackov
- Cory Nelson
- Joel de Guzman
- Robert Ramey