
On Thu, September 7, 2006 11:04 am, loufoque wrote:
It could also be possible to make the push and pull parsers more or less independent, so that each one can be as efficient as it can be.
True. Basically, by making a direct push parser implementation, you can avoid the overhead of state saving that a pull parser requires. However, that effectively means duplicating the work, so the library should be written in such a way that the client can easily substitute the push parser that is implemented on top of a pull parser with a direct push parser.
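To illustrate the layering I have in mind, here is a minimal sketch. All names (Event, PullParser, PushParser, PullBackedPushParser, Handler, Recorder) are made up for this mail and are not a proposed interface, and the "pull parser" simply replays a prepared list of events instead of tokenizing anything. The point is only that client code talks to the abstract PushParser, so a direct push implementation could later replace the pull-backed one without the client noticing.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type; a real parser would carry far more detail.
struct Event {
    enum Kind { StartElement, EndElement, Text, EndOfDocument } kind;
    std::string data;
};

// Hypothetical pull interface: the client asks for the next event.
// This stand-in replays a prepared list instead of parsing real input.
class PullParser {
public:
    explicit PullParser(std::vector<Event> events) : events_(std::move(events)) {}

    Event next() {
        if (pos_ < events_.size()) return events_[pos_++];
        return Event{Event::EndOfDocument, ""};
    }

private:
    std::vector<Event> events_;
    std::size_t pos_ = 0;
};

// Receiver of pushed events.
struct Handler {
    virtual ~Handler() = default;
    virtual void on_event(const Event& e) = 0;
};

// What the client programs against. A direct push parser would be a
// second implementation of this same interface.
struct PushParser {
    virtual ~PushParser() = default;
    virtual void parse(Handler& handler) = 0;
};

// Push parser layered on a pull parser: it drains the pull parser and
// forwards each event until the document ends.
class PullBackedPushParser : public PushParser {
public:
    explicit PullBackedPushParser(PullParser pull) : pull_(std::move(pull)) {}

    void parse(Handler& handler) override {
        for (;;) {
            const Event e = pull_.next();
            if (e.kind == Event::EndOfDocument) break;
            handler.on_event(e);
        }
    }

private:
    PullParser pull_;
};

// Example handler that records the data of every event it receives.
struct Recorder : Handler {
    std::vector<std::string> seen;
    void on_event(const Event& e) override { seen.push_back(e.data); }
};
```

A client would hold a PushParser& and call parse() with its own Handler; which concrete parser sits behind the reference is then a construction-time decision.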
- Since it needs to access resources from various sources, typically specified as URLs, it needs a flexible and runtime-switchable input system.
- In particular, it should be possible to plug scheme resolvers in at runtime, so that program extensions can provide support for, say, the ftp: scheme.
That would be the work of another library, one that provides a way to read any kind of resource from a URL, a bit like what PHP has. That kind of library would be very useful outside of the XML library, too.
True, and I have no intention of providing such a library within the Xml library. But it is something that needs to be kept in mind when thinking about I/O.
Maybe a more low-level approach like what Boost.Asio provides could be interesting, especially since this model also provides asynchronous I/O.
I'm not sure how useful async I/O is for a parser. Parsing of incomplete data seems more important. If you have that and you want asynchronous parsing events, you can start the async I/O and have a handler that parses the newly received data, posting events to the async queue. The Xml library could provide such a handler, but that would be a very independent feature. The problem with the really low-level approach is the one you mention next.
Since XML needs good Unicode support and the like, maybe there is work to be done in that area first in boost.
Oh, yes, the Unicode problem. It would take an examination of systems in use, but my impression is that most programs use either UTF-16 or UTF-8 as their internal coding. With that in mind, I think it might be best to have the XML library internally support exactly these two encodings (perhaps as two template specializations) and interact with the user only in these two encodings. The transcoding of whatever external character set/encoding is used would then be an issue for the I/O interface. However, such transcoding requires the I/O interface to be sufficiently abstractable to provide it transparently - which is an obstacle for the low-level approach you suggest above.
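As a rough illustration of the "two template specializations" idea: the sketch below uses hypothetical tag types (utf8_tag, utf16_tag) and a hypothetical encoding_traits template that is deliberately left undefined for any other encoding. The encode() helper stands in for the point where already-transcoded code points enter the library; it assumes valid code points and does no checking for surrogates or values above U+10FFFF.

```cpp
#include <string>

// Hypothetical tags naming the two supported internal encodings.
struct utf8_tag {};
struct utf16_tag {};

// Primary template is declared but not defined, so any other
// encoding fails at compile time.
template <typename Encoding>
struct encoding_traits;

template <>
struct encoding_traits<utf8_tag> {
    using string_type = std::string;

    // Appends one code point as 1 to 4 UTF-8 code units.
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

template <>
struct encoding_traits<utf16_tag> {
    using string_type = std::u16string;

    // Appends one code point as 1 or 2 UTF-16 code units.
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
        }
    }
};

// Encodes a sequence of code points in the chosen internal encoding.
template <typename Encoding>
typename encoding_traits<Encoding>::string_type
encode(const std::u32string& code_points) {
    typename encoding_traits<Encoding>::string_type out;
    for (char32_t cp : code_points) encoding_traits<Encoding>::append(out, cp);
    return out;
}
```

The parser itself would then be a template over the same tag, and the user would see std::string or std::u16string depending on which one they pick.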
The ability to parse partial content would be a great plus.
Yes, that seems to be important. However, it should be at the discretion of the user to switch it off, enabling the parser to work with a single lookahead character. (To support partial content, either the parser needs to support extremely complex state saving, or cache content until a complete event has been generated.)
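The caching variant is the easier of the two to picture. Below is a toy sketch of it, with a hypothetical ChunkedTokenizer that is nowhere near a real XML tokenizer: it treats everything from '<' to the next '>' as a tag, so comments, CDATA sections and '>' inside attribute values would break it. What it does show is the principle that input is cached until a complete token is available, and that text is only reported once the following '<' proves it has ended.

```cpp
#include <string>
#include <vector>

// Hypothetical toy tokenizer that accepts input in arbitrary chunks.
class ChunkedTokenizer {
public:
    // Appends a chunk and returns every token that is now complete.
    // Anything incomplete stays cached for the next call.
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> tokens;
        std::string::size_type pos = 0;
        while (pos < buffer_.size()) {
            if (buffer_[pos] == '<') {
                const std::string::size_type end = buffer_.find('>', pos);
                if (end == std::string::npos) break;  // tag not finished yet
                tokens.push_back(buffer_.substr(pos, end - pos + 1));
                pos = end + 1;
            } else {
                const std::string::size_type lt = buffer_.find('<', pos);
                if (lt == std::string::npos) break;   // text may still continue
                tokens.push_back(buffer_.substr(pos, lt - pos));
                pos = lt;
            }
        }
        buffer_.erase(0, pos);  // drop what has been reported
        return tokens;
    }

    // The cached, not yet reported input.
    const std::string& pending() const { return buffer_; }

private:
    std::string buffer_;
};
```

Feeding "<a>he", then "llo</", then "a>" reports "<a>", then "hello", then "</a>". The cost is the buffer itself, which is exactly what a user who switches partial-content support off would avoid.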
Writing a complete XML solution is a lot of work, especially if you want to support all XML technologies (XMLSchema, RelaxNG, XPath, XLink, XInclude, XPointer...)
It is. However, it is something that, I think, can be done very well in steps, i.e. first release supports only pull parsing, second release adds push parsing, third adds a DOM, fourth another technology, etc. As long as you consider all possible technologies when implementing the basic ones, this ought to be feasible. And yes, it's a lot of work. I'm willing to put a lot of work into it.
Maybe it could be interesting to reuse libxml2, which is under the MIT license, to build something on top of it. Of course, first we need to weigh the gains of a new C++ implementation.
See my reply to Stefan Seefeld. I think that within Boost, depending on an external library, no matter what license, is a very bad idea. I also think that an implementation intended from the ground up to work with C++ is a better choice.

Thank you for your comments. They've given me some ideas.