
Boris Kolpackov wrote:
Hi Abir,
abir basak writes:
Now I am looking to use Spirit to parse a specific XML file (a W3C InkML file). My intention is not to write a generic XML parser but a specific one (for a format that also has some BNF grammar). Has anyone used Spirit for domain-specific XML parsing?
Trust me, you don't want to go this route. Parsing XML is a lot more than finding opening and closing tags. To implement a conforming XML parser you will need to handle namespaces, entity references, CDATA, etc. This is a lot harder to get right than most people think.
Yes, I know the full XML grammar is really hard to implement. I had a tough time implementing it in ANTLR :( Here my intention is not to use the full XML grammar but a subset of it, and to test how it performs, especially since I know all the tags and which attributes they can contain. Generalized validation is not needed either, as the grammar will validate the file. Moreover, the file is not purely XML; it also contains BNF-style grammar (like SVG, or the one I gave as an example). I will certainly use a full-fledged XML parser if the situation demands it. But for now I am in the mood to experiment with this particular subset of XML (a W3C format known as InkML, or even a subset of InkML).
The only time it makes sense to have a domain specific XML parser is when you have control over all your XML instances and can make sure that only a subset of XML 1.0 is used. This is normally done for performance reasons.
Yes, the grammar of the file format is specific, just as a parser for XHTML or MathML doesn't need to match every possible node.
I believe using Spirit will make it faster.
Highly unlikely, since most XML parsers are hand-coded.
Not sure why! My specific XML parsers written in ANTLR (the widely used language recognition tool) have always been faster than the generic ones.
Also, I am interested in parsing only a portion of the whole document at a time and generating data from that portion only, rather than generating data for the whole DOM (the files are large, typically 4-20 MB). My XML file is something like:
[...]
Note that inside <trace> the grammar is a BNF (mostly comma-separated float pairs).
You can use a SAX2 parser (e.g., Expat or Xerces-C++) to handle the XML and then use a Spirit-based parser to handle the data.
At present I am using the Qt SAX parser, and it is a good one. The Spirit idea is just a thought specific to this particular task.
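To make the suggested split concrete, here is a rough sketch of the second half only: a small parser for the character data that a SAX handler would receive for one <trace> element. It assumes each point is two whitespace-separated numbers and points are separated by commas (e.g. "10 0, 9.5 14"), which is a guess based on the "comma-separated float pairs" description above and not the full InkML trace grammar. It uses std::strtod from the standard library in place of a Spirit grammar so that it has no dependencies; Point and parse_trace are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical value type for one sample of a trace.
struct Point {
    double x;
    double y;
};

// Hypothetical helper. Assumed grammar (not the full InkML one):
//   trace := point (',' point)*
//   point := number number
// Returns true if the whole input matches; `out` then holds the points.
// On failure `out` may contain the points parsed before the error.
bool parse_trace(const std::string& text, std::vector<Point>& out) {
    const char* p = text.c_str();
    char* end = nullptr;
    out.clear();
    for (;;) {
        // std::strtod skips leading whitespace; end == p means no number.
        double x = std::strtod(p, &end);
        if (end == p) return false;
        p = end;
        double y = std::strtod(p, &end);
        if (end == p) return false;
        p = end;
        out.push_back({x, y});

        // Skip whitespace before the separator or the end of input.
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') ++p;
        if (*p == '\0') return true;   // consumed everything
        if (*p != ',') return false;   // unexpected character
        ++p;                           // step over the comma
    }
}
```

A SAX handler would typically accumulate the text between the start and end of a <trace> element and pass it to parse_trace once the end tag arrives, so only one trace is held in memory at a time. Note that std::strtod is more permissive than a strict grammar would be (it also accepts hexadecimal floats, "inf" and "nan"), so a Spirit rule or a hand-written scanner may be preferable if the input needs to be validated tightly.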
hth, -boris
And thanks for the suggestions ...
--
Abir Basak, Member IEEE
Software Engineer, Read Ink Technologies
B. Tech, IIT Kharagpur
email: abir@abirbasak.com
homepage: www.abirbasak.com